CI Failure (segment replaced with no change in size) in TieredStorageTest.test_tiered_storage
#13753
Comments
Also seen on: https://buildkite.com/redpanda/redpanda/builds/37810
Sorry, Evgeny. Hadn't seen you'd self-assigned. Here's what happens: TODO: add logs

This sounds pretty similar to #12846

We should fix this. In this specific case, we handle it gracefully, but there's no guarantee we'll do this for all commands.
Previously, the archival STM used `persisted_stm::sync` to ensure that its log was up to date before issuing new commands. This works fine for leadership changes since they imply a term change. However, the archiver may need to sync during the same term. We've seen scenarios like this:

> The upload loop picks 4 candidates, uploads them to cloud storage and replicates a batch with 4 new segments. After the replication command succeeds, the topic retention is updated. This restarts the archiver. When the archiver restarts, it picks the same 4 segments and replicates the addition commands again. Finally, the partition manifest detects this, refuses the commands and prints the error log.

The previous sync method does not support such cases since they happen in the same term.

To fix this, the archival STM now implements its own sync method. If it detects that the term has changed, it maintains the pre-existing behaviour: sync up to the latest term. Otherwise, it uses the cached offset of the last replicated batch and waits for that offset to be applied.

Fixes redpanda-data#13753
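
The decision described in this commit message can be modelled in a few lines. The following is only a toy sketch in Python, not the Redpanda C++ implementation: the class `ArchivalStmModel`, its fields and the `latest_offset_in_term` parameter are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ArchivalStmModel:
    """Toy model of the state the sync decision depends on (hypothetical names)."""

    current_term: int       # term currently reported by the consensus layer
    last_synced_term: int   # term in which the last successful sync happened
    applied_offset: int     # highest offset already applied to the manifest
    last_replicated_offset: Optional[int] = None  # cached offset of our last batch

    def sync_target(self, latest_offset_in_term: int) -> int:
        """Return the offset that must be applied before issuing new commands."""
        if self.current_term != self.last_synced_term:
            # Term changed (e.g. leadership change): keep the old behaviour
            # and sync up to the latest offset of the new term.
            return latest_offset_in_term
        if self.last_replicated_offset is None:
            # Nothing was replicated in this term, so there is nothing to wait for.
            return self.applied_offset
        # Same term: wait for our own last replicated batch to be applied.
        return self.last_replicated_offset

    def is_synced(self, latest_offset_in_term: int) -> bool:
        return self.applied_offset >= self.sync_target(latest_offset_in_term)


# Same-term restart: the batch at offset 104 was replicated but not yet applied.
stm = ArchivalStmModel(current_term=3, last_synced_term=3,
                       applied_offset=100, last_replicated_offset=104)
print(stm.is_synced(latest_offset_in_term=200))  # False until offset 104 is applied
```

In this model, a restarted archiver that checks `is_synced` before picking upload candidates would not re-add the same four segments, because it first waits for its own earlier batch to reach the manifest.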
Previously, the archival STM used `persisted_stm::sync` to ensure that its log was up to date before issuing new commands. This works fine for leadership changes since they imply a term change. However, the archiver may need to sync during the same term. We've seen scenarios like this:

> The upload loop picks 4 candidates, uploads them to cloud storage and replicates a batch with 4 new segments. After the replication command succeeds, the topic retention is updated. This restarts the archiver. When the archiver restarts, it picks the same 4 segments and replicates the addition commands again. Finally, the partition manifest detects this, refuses the commands and prints the error log.

The previous sync method does not support such cases since they happen in the same term.

To fix this, the archival STM now implements its own sync method. If it detects that the term has changed, it maintains the pre-existing behaviour: sync up to the latest term. Otherwise, it waits for any ongoing replications to finish and then waits for the committed offset to be applied.

Fixes redpanda-data#13753
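
The same-term path in this version of the message (wait for in-flight replications, then wait for the committed offset to be applied) can be sketched with `asyncio`. This is an illustrative model only, assuming a hypothetical `SameTermSync` class; the sleep stands in for replication, and none of the names come from the Redpanda code base.

```python
import asyncio


class SameTermSync:
    """Toy model of the same-term sync path (all names are hypothetical)."""

    def __init__(self):
        self.committed_offset = 0
        self.applied_offset = 0
        self._inflight = set()                 # replication tasks still running
        self._applied = asyncio.Condition()    # signalled when applied_offset moves

    def replicate(self, new_offset, delay=0.01):
        async def _do():
            await asyncio.sleep(delay)         # stand-in for the replication round trip
            self.committed_offset = max(self.committed_offset, new_offset)

        task = asyncio.ensure_future(_do())
        self._inflight.add(task)
        task.add_done_callback(self._inflight.discard)
        return task

    async def apply_up_to(self, offset):
        async with self._applied:
            self.applied_offset = max(self.applied_offset, offset)
            self._applied.notify_all()

    async def sync(self, timeout=1.0):
        # Step 1: wait for any ongoing replications to finish.
        if self._inflight:
            _, pending = await asyncio.wait(set(self._inflight), timeout=timeout)
            if pending:
                return False
        # Step 2: wait until the committed offset has been applied.
        target = self.committed_offset

        async def _wait_applied():
            async with self._applied:
                await self._applied.wait_for(lambda: self.applied_offset >= target)

        try:
            await asyncio.wait_for(_wait_applied(), timeout)
            return True
        except asyncio.TimeoutError:
            return False


async def demo():
    stm = SameTermSync()
    stm.replicate(4)                           # batch still in flight when sync starts

    async def applier():                       # background apply loop, runs later
        await asyncio.sleep(0.05)
        await stm.apply_up_to(stm.committed_offset)

    asyncio.ensure_future(applier())
    return await stm.sync(), stm.applied_offset


print(asyncio.run(demo()))
```

Compared with caching the offset of the last replicated batch, this variant reads the target offset only after in-flight replications complete, so a sync issued mid-replication still covers the batch that is about to commit.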
https://buildkite.com/redpanda/redpanda/builds/37795