MDEV-36226 Stall and crash when page cleaner fails to generate free pages during Async flush #3885
base: 10.6
Conversation
The change to the logic looks reasonable to me, but in the diagnostic output I’d avoid excessive numbers of calls to operator<<() and use the common logging functions sql_print_warning() or sql_print_information().
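For illustration, a hedged before/after of that suggestion; ib::info() is InnoDB’s stream-style logger, and the LSN_PF format macro is the one used later in this review. The message text itself is made up:

// before: the message is built from several operator<<() calls
//   ib::info() << "LSN " << lsn << ", checkpoint " << clsn;
// after: one call to the common printf-style logging function
sql_print_information("InnoDB: LSN " LSN_PF ", checkpoint " LSN_PF,
                      lsn, clsn);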
Force-pushed from e190624 to de96813.
storage/innobase/buf/buf0flu.cc (Outdated)
sql_print_information("Innodb: LSN flush parameters\n" | ||
"-------------------\n" | ||
"System LSN : %" PRIu64 "\n" | ||
"Checkpoint LSN: %" PRIu64 "\n" | ||
"Flush ASync LSN: %" PRIu64 "\n" | ||
"Flush Sync LSN: %" PRIu64 "\n" | ||
"-------------------", | ||
lsn, clsn, buf_flush_async_lsn, buf_flush_sync_lsn); |
Usually all InnoDB messages are prefixed by InnoDB: (note the case). Do we need this many rows for the output? You need to write uint64_t{buf_flush_async_lsn} or similar to avoid compilation errors:

error: cannot pass object of non-trivial type 'Atomic_relaxed<lsn_t>' (aka 'Atomic_relaxed<unsigned long long>') through variadic function; call will abort at runtime [-Wnon-pod-varargs]
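A standalone sketch of that error and its fix; Atomic_relaxed here is a simplified stand-in for the InnoDB template, kept only to show why the explicit uint64_t{...} conversion is needed:

#include <atomic>
#include <cinttypes>
#include <cstdint>
#include <cstdio>

template <typename T> struct Atomic_relaxed /* simplified stand-in */
{
  std::atomic<T> m;
  Atomic_relaxed(T v) : m(v) {}
  operator T() const { return m.load(std::memory_order_relaxed); }
};

int main()
{
  Atomic_relaxed<uint64_t> buf_flush_async_lsn{42};
  /* std::printf("%" PRIu64, buf_flush_async_lsn);  // -Wnon-pod-varargs */
  std::printf("%" PRIu64 "\n", uint64_t{buf_flush_async_lsn}); /* OK */
  return 0;
}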
ulint lru_size= UT_LIST_GET_LEN(LRU);
ulint dirty_size= UT_LIST_GET_LEN(flush_list);
ulint free_size= UT_LIST_GET_LEN(free);
ulint dirty_pct= lru_size ? dirty_size * 100 / (lru_size + free_size) : 0;
dirty_pct seems to be redundant information that can be calculated from the rest. It could also be totally misleading, because we were reading these fields without proper mutex or flush_list_mutex protection.
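For reference, a hedged sketch of reading those lengths under the corresponding latches, assuming this code lives in a buf_pool_t member function as the hunk suggests (whether acquiring the mutexes is safe in this diagnostic path is a separate question):

mysql_mutex_lock(&mutex);
const ulint lru_size= UT_LIST_GET_LEN(LRU);
const ulint free_size= UT_LIST_GET_LEN(free);
mysql_mutex_unlock(&mutex);
mysql_mutex_lock(&flush_list_mutex);
const ulint dirty_size= UT_LIST_GET_LEN(flush_list);
mysql_mutex_unlock(&flush_list_mutex);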
MDEV-36226 Stall and crash when page cleaner fails to generate free pages during Async flush

During regular iteration the page cleaner flushes from the flush list with some flush target and then goes on to generate free pages from the LRU tail. When asynchronous flush is triggered, i.e. when 7/8th of the LSN margin is filled in the redo log, the flush target for the flush list is set to innodb_io_capacity_max. If it could flush everything, the flush bandwidth for LRU flush is currently set to zero. If the LRU tail has dirty pages, the page cleaner ends up freeing no pages in one iteration. The scenario could repeat across multiple iterations until the async flush target is reached. During this time the DB system is starved of free pages, resulting in an apparent stall and in some cases a dict_sys latch fatal error.

Fix: In the page cleaner iteration, before the LRU flush, ensure we provide enough flush limit so that freeing pages is not blocked by dirty pages at the LRU tail. Log the IO and flush state if the doublewrite flush wait is long.

Reviewed by: Marko Mäkelä
Force-pushed from de96813 to 3ab4934.
The logic looks OK to me, but I would like to see an update from our performance testing team. Hopefully they will have some luck reproducing the original issue and validating this fix, which is basically based on an educated guess and a careful review of the existing logic. Please consider my suggestions about the formatting and the timeout logic in buf_dblwr_t::flush_buffered_writes(). There seems to be a format string mismatch that affects 64-bit Microsoft Windows.
qr/InnoDB: Long wait \([0-9]+ seconds\) for double-write buffer flush/,
qr/Slave: Unknown table 't1' .* 1051/,
qr/Slave SQL:.*(Internal MariaDB error code: [[:digit:]]+|Query:.*)/,
qr/slave SQL thread aborted/,
The added pattern seems to be incorrect and should therefore not have the desired effect of preventing sporadic test failures in a heavily loaded environment. The regular expression group delimiters () are metacharacters in Perl regular expressions, as you can see 2 lines below the addition.
sql_print_information("Innodb: Double Write State\n" | ||
"-------------------\n" | ||
"Batch running : %s\n" | ||
"Active Slot - first_free: %lu reserved: %lu\n" | ||
"Flush Slot - first_free: %lu reserved: %lu\n" | ||
"-------------------", | ||
(batch_running ? "true" : "false"), | ||
active_slot->first_free, active_slot->reserved, | ||
flush_slot->first_free, flush_slot->reserved); |
Error log messages should start with InnoDB: in the correct capitalization. This output could be packed more tightly. I am not sure that we should start outputting multi-line output without starting each line with a timestamp. Even the Linux kernel dmesg prefixes each line with a timestamp, also when outputting structured output, such as a process table when the out-of-memory killer kicks in. I would suggest the following:
sql_print_information("InnoDB: doublewrite %s (active: %zu, %zu) (flush: %zu, %zu)",
batch_running ? "active" : "idle",
active_slot->first_free, active_slot->reserved,
flush_slot->first_free, flush_slot->reserved);
There is a type mismatch between %lu and size_t a.k.a. ulint. On LLP64 targets, such as Microsoft Windows, sizeof(unsigned long) would be 4 while sizeof(size_t) is 8. The correct format is %zu or " ULINTPF ".
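A minimal standalone illustration of that mismatch, with a stand-in variable instead of the doublewrite fields: %lu reads a 4-byte unsigned long on LLP64 targets while the argument is an 8-byte size_t, so %zu (or ULINTPF) is the portable choice.

#include <cstddef>
#include <cstdio>

int main()
{
  size_t first_free= 42;   /* stand-in for active_slot->first_free */
  /* std::printf("%lu\n", first_free);  // undefined on LLP64 (Windows) */
  std::printf("%zu\n", first_free);     /* matches size_t everywhere */
  return 0;
}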
More verbose output is not necessarily any more readable to an end user. When analyzing the output, it should not be much of an extra hurdle to consult the source code for decoding the output.
const ulong max_count= 60 * 60;
const ulong first_log_count= 30;
const ulong fatal_threshold= srv_fatal_semaphore_wait_threshold;
ulong log_count= first_log_count;

for (ulong count= 0;;)
Any use of the types ulong, unsigned long or long should be generally avoided, because it is introducing an unnecessary incompatibility between LP64 and LLP64 targets. I would suggest unsigned or size_t.
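A tiny standalone check of the LP64/LLP64 difference behind this advice; on 64-bit Linux both values print as 8, on 64-bit Windows long is 4 while size_t is 8:

#include <cstdio>

int main()
{
  std::printf("sizeof(long)=%zu sizeof(size_t)=%zu\n",
              sizeof(long), sizeof(size_t));
  return 0;
}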
my_cond_wait(&cond, &mutex.m_mutex); | ||
} | ||
|
||
timespec abstime; | ||
set_timespec(abstime, 1); | ||
my_cond_timedwait(&cond, &mutex.m_mutex, &abstime); |
Shouldn’t we avoid the timeout handling when my_cond_timedwait() returns 0?

if (!my_cond_timedwait(&cond, &mutex.m_mutex, &abstime))
{
  /* buf_dblwr_t::write_completed() had been invoked */
  count= 0;
  continue;
}
Can we set a longer timeout than 1 second? Could it be somehow based on srv_fatal_semaphore_wait_threshold, in such a way that one of the subsequent conditions on count will always hold? It would seem that we could choose the limit to be std::min(5UL, srv_fatal_semaphore_wait_threshold).
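A hedged sketch combining both points, assuming the variables from the quoted hunks (the final patch may differ): the timeout is derived from srv_fatal_semaphore_wait_threshold, and the counter is reset whenever the wait is signalled rather than timed out.

const ulong timeout= std::min(5UL, srv_fatal_semaphore_wait_threshold);
for (ulong count= 0;;)
{
  timespec abstime;
  set_timespec(abstime, timeout);
  if (!my_cond_timedwait(&cond, &mutex.m_mutex, &abstime))
  {
    /* buf_dblwr_t::write_completed() was invoked; not a timeout */
    count= 0;
    continue;
  }
  count+= timeout; /* keep counting seconds, as with the 1-second wait */
  /* ... the existing log_count / fatal_threshold checks follow ... */
}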
/* If we have exhausted flush quota, it is likely we exited before
generating enough free pages. Call once more with 0 flush to generate
free pages immediately as required. */
if (pages == max_n)
  buf_do_LRU_batch(0, &n);
For safety, could the condition be pages >= max_n?
sql_print_information("Innodb: Buffer Pool pages \n" | ||
"-------------------\n" | ||
"LRU Pages: %lu\n" | ||
"Free Pages: %lu\n" | ||
"Dirty Pages: %lu : %lu%%\n" | ||
"-------------------", | ||
lru_size, free_size, dirty_size, dirty_pct); |
Like I pointed out about buf_dblwr_t::print_info(), I am not a friend of this kind of multi-line output. I’m not sure if we should extend the SHOW ENGINE INNODB STATUS type spamming in the server error log. Maybe we can argue that this is output very rarely, and having a more verbose and structured-looking output would make it stand out from the server error log.
Because this very often should precede a crash report, it would be natural to model this after output_core_info(). It is outputting various things (also multi-line output) with my_safe_printf_stderr(), which turns out to be a simple wrapper around my_vsnprintf() and my_write_stderr(), which in turn is simply invoking write(2). I assume that we need this in a signal handler, because a signal could be sent while the object stderr is in an inconsistent state. Because we’re outside any signal handling code here, it could be simplest to invoke fprintf(stderr, …) like you did in your original revision of this. (Sorry, I should have checked this earlier.)
It seems that each call of buf_pool.print_flush_info() is immediately followed by a timestamp and InnoDB:. Therefore, it could be simplest to just have a few fprintf(stderr, …) here, without any timestamp or InnoDB prefix.
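A hedged sketch of what that could look like, with field names taken from the hunks quoted in this review (the actual member function and its const-ness are assumptions); the lengths are read without latches, which is tolerable for purely diagnostic output:

void buf_pool_t::print_flush_info() const
{
  /* no timestamp or InnoDB: prefix; the surrounding log lines carry them */
  fprintf(stderr, "buf_pool: LRU %zu, free %zu, dirty %zu\n",
          size_t{UT_LIST_GET_LEN(LRU)},
          size_t{UT_LIST_GET_LEN(free)},
          size_t{UT_LIST_GET_LEN(flush_list)});
}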
sql_print_information("Innodb: LSN flush parameters\n" | ||
"-------------------\n" | ||
"System LSN : %" PRIu64 "\n" | ||
"Checkpoint LSN: %" PRIu64 "\n" | ||
"Flush ASync LSN: %" PRIu64 "\n" | ||
"Flush Sync LSN: %" PRIu64 "\n" | ||
"-------------------", | ||
lsn, clsn, buf_flush_async_lsn.load(), buf_flush_sync_lsn.load()); | ||
|
||
lsn_t age= lsn - clsn; | ||
lsn_t age_pct= log_sys.max_checkpoint_age | ||
? age * 100 / log_sys.max_checkpoint_age : 0; | ||
sql_print_information("Innodb: LSN age parameters\n" | ||
"-------------------\n" | ||
"Current Age : %" PRIu64 " : %" PRIu64 "%%\n" | ||
"Max Age(Async): %" PRIu64 "\n" | ||
"Max Age(Sync) : %" PRIu64 "\n" | ||
"Capacity : %" PRIu64 "\n" | ||
"-------------------", | ||
age, age_pct, log_sys.max_modified_age_async, log_sys.max_checkpoint_age, | ||
log_sys.log_capacity); | ||
|
||
sql_print_information("Innodb: Pending IO count\n" | ||
"-------------------\n" | ||
"Pending Read: %zu\n" | ||
"Pending Write: %zu\n" | ||
"-------------------", | ||
os_aio_pending_reads_approx(), os_aio_pending_writes_approx()); |
I’d still prefer packing as much information per line as possible, and I would omit any math that can be calculated from the numbers afterwards. These would include dirty_pct, age, and age_pct. Maybe with that, we could have all LSN information in one line:
sql_print_information("InnoDB: LSN " LSN_PF "; checkpoint " LSN_PF
", flush " LSN_PF "," LSN_PF
"; max age " LSN_PF "," LSN_PF
"; capacity " LSN_PF,
lsn, clsn,
buf_flush_async_lsn.load(), buf_flush_sync_lsn.load(),
log_sys.max_modified_age_async,
log_sys.max_checkpoint_age,
log_sys.log_capacity);
sql_print_information("InnoDB: Pending page reads: %zu, writes: %zu",
os_aio_pending_reads_approx(), os_aio_pending_writes_approx());
These numbers are not saying much to an end user anyway; a subject matter expert would be needed for decoding this, and the code that produces it is open source. So, we would not seem to lose that much by having slightly less verbose output. Having more data per line would make it easier to filter the output with tools like grep, sed, awk or perl.
Description
During regular iteration the page cleaner flushes from the flush list with some flush target and then goes on to generate free pages from the LRU tail. When asynchronous flush is triggered, i.e. when 7/8th of the LSN margin is filled in the redo log, the flush target for the flush list is set to innodb_io_capacity_max. If it could flush everything, the flush bandwidth for LRU flush is currently set to zero. If the LRU tail has dirty pages, the page cleaner ends up freeing no pages in one iteration. The scenario could repeat across multiple iterations until the async flush target is reached. During this time the DB system is starved of free pages, resulting in an apparent stall and in some cases a dict_sys latch fatal error.
Fix: In the page cleaner iteration, before the LRU flush, ensure we provide enough flush limit so that freeing pages is not blocked by dirty pages at the LRU tail. Log the IO and flush state if the doublewrite flush wait is long.
Impact: It could result in increased IO due to LRU flush in specific cases.
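For intuition only, a tiny standalone illustration of the 7/8th async-flush trigger described above; the capacity and age values are made up, and the real thresholds live in log_sys:

#include <cstdint>
#include <cstdio>

int main()
{
  const uint64_t capacity= 1ULL << 30;            /* assumed redo margin */
  const uint64_t async_trigger= capacity / 8 * 7; /* 7/8 of the margin */
  const uint64_t age= 950000000;                  /* LSN - checkpoint LSN */
  std::printf("async flush %s: age %llu vs trigger %llu\n",
              age >= async_trigger ? "triggered" : "idle",
              (unsigned long long) age, (unsigned long long) async_trigger);
  return 0;
}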
Release Notes
None
How can this PR be tested?
Regular InnoDB tests should cover the path. Performance and stress tests should be run to judge the possible impact.
Reproducing the base issue would require a large buffer pool, a long run, and synchronization between foreground and InnoDB background threads.
Basing the PR against the correct MariaDB version
This is a bug fix, and the PR is based against the 10.6 branch rather than the main branch.