
CI Failure (segment replaced with no change in size) in TieredStorageTest.test_tiered_storage #13753

Closed
abhijat opened this issue Sep 28, 2023 · 4 comments · Fixed by #13896
Labels: area/cloud-storage (Shadow indexing subsystem), ci-failure, kind/bug (Something isn't working), sev/medium (Bugs that do not meet criteria for high or critical, but are more severe than low)

Comments

@abhijat
Contributor

abhijat commented Sep 28, 2023

https://buildkite.com/redpanda/redpanda/builds/37795

Module: rptest.tests.tiered_storage_model_test
Class: TieredStorageTest
Method: test_tiered_storage
Arguments: {
    "test_case": {
        "name": "(TS_Read == True, SpilloverManifestUploaded == True)"
    },
    "cloud_storage_type": 2
}
test_id:    TieredStorageTest.test_tiered_storage
status:     FAIL
run time:   108.422 seconds

<BadLogLines nodes=docker-rp-19(8),docker-rp-11(8),docker-rp-12(8) example="ERROR 2023-09-28 03:52:28,062 [shard 1:main] cloud_storage - partition_manifest.cc:1024 - [{kafka/topic-vfoseiytae/0}] New replacement segment has the same size as replaced segment: new_segment: {o=29720-30744 t={timestamp: 1695873131347}-{timestamp: 1695873142470}}, replaced_segment: {o=29720-30744 t={timestamp: 1695873131347}-{timestamp: 1695873142470}}">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/usr/local/lib/python3.10/dist-packages/ducktape/mark/_mark.py", line 481, in wrapper
    return functools.partial(f, *args, **kwargs)(*w_args, **w_kwargs)
  File "/root/tests/rptest/services/cluster.py", line 142, in wrapped
    redpanda.raise_on_bad_logs(
  File "/root/tests/rptest/services/redpanda.py", line 1244, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-19(8),docker-rp-11(8),docker-rp-12(8) example="ERROR 2023-09-28 03:52:28,062 [shard 1:main] cloud_storage - partition_manifest.cc:1024 - [{kafka/topic-vfoseiytae/0}] New replacement segment has the same size as replaced segment: new_segment: {o=29720-30744 t={timestamp: 1695873131347}-{timestamp: 1695873142470}}, replaced_segment: {o=29720-30744 t={timestamp: 1695873131347}-{timestamp: 1695873142470}}">
@abhijat abhijat added kind/bug Something isn't working ci-failure area/cloud-storage Shadow indexing subsystem labels Sep 28, 2023
@abhijat
Contributor Author

abhijat commented Sep 28, 2023

Also seen on:

https://buildkite.com/redpanda/redpanda/builds/37810

Module: rptest.tests.cloud_storage_usage_test
Class:  CloudStorageUsageTest
Method: test_cloud_storage_usage_reporting_with_partition_moves
====================================================================================================
test_id:    rptest.tests.cloud_storage_usage_test.CloudStorageUsageTest.test_cloud_storage_usage_reporting_with_partition_moves
status:     FAIL
run time:   1 minute 28.912 seconds


    <BadLogLines nodes=docker-rp-20(4) example="ERROR 2023-09-28 06:54:56,162 [shard 1:main] cloud_storage - partition_manifest.cc:1024 - [{kafka/test-topic-2/0}] New replacement segment has the same size as replaced segment: new_segment: {o=347-382 t={timestamp: 1695884063937}-{timestamp: 1695884065937}}, replaced_segment: {o=347-382 t={timestamp: 1695884063937}-{timestamp: 1695884065875}}">
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 184, in _do_run
    data = self.run_test()
  File "/usr/local/lib/python3.10/dist-packages/ducktape/tests/runner_client.py", line 269, in run_test
    return self.test_context.function(self.test)
  File "/root/tests/rptest/services/cluster.py", line 142, in wrapped
    redpanda.raise_on_bad_logs(
  File "/root/tests/rptest/services/redpanda.py", line 1244, in raise_on_bad_logs
    raise BadLogLines(bad_lines)
rptest.services.utils.BadLogLines: <BadLogLines nodes=docker-rp-20(4) example="ERROR 2023-09-28 06:54:56,162 [shard 1:main] cloud_storage - partition_manifest.cc:1024 - [{kafka/test-topic-2/0}] New replacement segment has the same size as replaced segment: new_segment: {o=347-382 t={timestamp: 1695884063937}-{timestamp: 1695884065937}}, replaced_segment: {o=347-382 t={timestamp: 1695884063937}-{timestamp: 1695884065875}}">

@Lazin Lazin added the sev/medium Bugs that do not meet criteria for high or critical, but are more severe than low. label Sep 28, 2023
@Lazin Lazin self-assigned this Sep 28, 2023
@VladLazar
Contributor

Sorry, Evgeny. Hadn't seen you'd self-assigned.

Here's what happens:
The upload loop picks 4 candidates, uploads them to cloud storage, and replicates a batch with 4 new segments. After the replication command succeeds, the topic retention is updated, which restarts the archiver. When the archiver restarts, it picks the same 4 segments and replicates the addition commands again. Finally, the partition manifest detects this, refuses the commands, and prints the error log.

TODO: add logs
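
For reference, here's a minimal, self-contained sketch of the kind of guard that produces the log line above. The names and structure are illustrative stand-ins, not the actual `partition_manifest` implementation:

```cpp
// Illustrative only: a simplified stand-in for the duplicate-replacement
// check that emits "New replacement segment has the same size as replaced
// segment". The types and member names here are hypothetical, not
// Redpanda's real partition_manifest API.
#include <cstdint>
#include <iostream>
#include <map>

struct segment_meta {
    int64_t base_offset;
    int64_t committed_offset;
    uint64_t size_bytes;
};

class toy_partition_manifest {
public:
    // Refuses (and logs) a "replacement" that is byte-for-byte the same
    // size as the segment it would replace, i.e. a replayed add command.
    bool add(const segment_meta& new_seg) {
        auto it = _segments.find(new_seg.base_offset);
        if (it != _segments.end()
            && it->second.size_bytes == new_seg.size_bytes) {
            std::cerr << "New replacement segment has the same size as "
                         "replaced segment: o=" << new_seg.base_offset
                      << '-' << new_seg.committed_offset << '\n';
            return false;
        }
        _segments[new_seg.base_offset] = new_seg;
        return true;
    }

private:
    std::map<int64_t, segment_meta> _segments;
};

int main() {
    toy_partition_manifest manifest;
    segment_meta seg{29720, 30744, 1 << 20};
    manifest.add(seg); // first add command is applied
    manifest.add(seg); // the replayed command is refused and logged
}
```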

@VladLazar VladLazar assigned VladLazar and unassigned Lazin Sep 28, 2023
@abhijat
Contributor Author

abhijat commented Sep 29, 2023

> Sorry, Evgeny. Hadn't seen you'd self-assigned.
>
> Here's what happens: The upload loop picks 4 candidates, uploads them to cloud storage, and replicates a batch with 4 new segments. After the replication command succeeds, the topic retention is updated, which restarts the archiver. When the archiver restarts, it picks the same 4 segments and replicates the addition commands again. Finally, the partition manifest detects this, refuses the commands, and prints the error log.
>
> TODO: add logs

This sounds pretty similar to #12846

@VladLazar
Contributor

We should fix this. In this specific case we handle it gracefully, but there's no guarantee we'll do so for all commands.

@redpanda-data redpanda-data deleted a comment from abhijat Sep 29, 2023
VladLazar pushed a commit to VladLazar/redpanda that referenced this issue Oct 3, 2023
Previously, the archival STM used `persisted_stm::sync` in order to
ensure that its log is up to date before issuing new commands. This
works fine for leadership changes since they imply a term change.

However, the archiver may need to sync during the same term. We've seen
scenarios like this:

> Upload loop picks 4 candidates, uploads to cloud storage and
replicates a batch with 4 new segments. After the replication command
succeeds, the topic retention is updated. This re-starts the archiver.
When the archiver restarts it picks the same 4 segments and replicates
the addition commands again. Finally, the partition manifest detects
this, refuses the commands and prints the error log.

The previous sync method does not support such cases since they happen in
the same term. To fix this, the archival STM now implements its own sync
method. If it detects that the term has changed, it will maintain the
pre-existing behaviour: sync up to the latest term. Otherwise, it will
use the cached last offset of the last replicated batch and wait on its
application.

Fixes redpanda-data#13753
VladLazar pushed a commit that referenced this issue Nov 1, 2023
Previously, the archival STM used `persisted_stm::sync` in order to
ensure that its log is up to date before issuing new commands. This
works fine for leadership changes since they imply a term change.

However, the archiver may need to sync during the same term. We've seen
scenarios like this:

> Upload loop picks 4 candidates, uploads to cloud storage and
replicates a batch with 4 new segments. After the replication command
succeeds, the topic retention is updated. This re-starts the archiver.
When the archiver restarts it picks the same 4 segments and replicates
the addition commands again. Finally, the partition manifest detects
this, refuses the commands and prints the error log.

The previous sync method does not support such cases since they happen in
the same term. To fix this, the archival STM now implements its own sync
method. If it detects that the term has changed, it will maintain the
pre-existing behaviour: sync up to the latest term. Otherwise, it will
wait for any ongoing replications to finish and then wait for the
committed offset to be applied.

Fixes #13753
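
Sketching the behaviour the commit message describes, purely as illustration: on a term change the old path is kept (sync up to the latest term), otherwise the STM waits for in-flight replications and for the committed offset to be applied. All names below are hypothetical stand-ins, not the real archival STM interfaces:

```cpp
// Illustrative sketch of the fix described above, not the real archival STM.
#include <cstdint>
#include <iostream>

class toy_archival_stm {
public:
    bool sync() {
        if (_current_term != _insync_term) {
            // Leadership changed: pre-existing path, sync up to the latest term.
            return sync_to_latest_term();
        }
        // Same term: wait for any replication we started ourselves, then
        // make sure everything up to the committed offset is applied.
        wait_for_inflight_replications();
        return wait_until_applied(_committed_offset);
    }

    // Hook used only by the example below.
    void set_terms(int64_t current, int64_t insync) {
        _current_term = current;
        _insync_term = insync;
    }

private:
    bool sync_to_latest_term() {
        std::cout << "term changed: syncing up to the latest term\n";
        _insync_term = _current_term;
        return true;
    }
    void wait_for_inflight_replications() {
        std::cout << "same term: waiting for in-flight replications\n";
    }
    bool wait_until_applied(int64_t offset) {
        std::cout << "waiting until offset " << offset << " is applied\n";
        return true;
    }

    int64_t _current_term{0};
    int64_t _insync_term{0};
    int64_t _committed_offset{0};
};

int main() {
    toy_archival_stm stm;
    stm.set_terms(2, 1);
    stm.sync(); // takes the term-change path (previous behaviour)
    stm.set_terms(2, 2);
    stm.sync(); // takes the new same-term path introduced by the fix
}
```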