
MDEV-36226 Stall and crash when page cleaner fails to generate free p… #3885

Open · wants to merge 1 commit into base: 10.6

Conversation

mariadb-DebarunBanerjee
Contributor

  • The Jira issue number for this PR is: MDEV-36226

Description

During a regular iteration, the page cleaner flushes from the flush list with some flush target and then generates free pages from the LRU tail. When asynchronous flush is triggered, i.e. when 7/8 of the LSN margin in the redo log is filled, the flush target for the flush list is set to innodb_io_capacity_max. If the flush-list pass could flush that many pages, the bandwidth left for LRU flushing is currently zero. If the LRU tail holds dirty pages, the page cleaner then frees no pages in that iteration. The scenario can repeat across multiple iterations until the async flush target is reached. During this time the system is starved of free pages, resulting in an apparent stall and, in some cases, a dict_sys latch fatal error.

Fix: In the page cleaner iteration, before the LRU flush, ensure we provide a large enough flush limit so that freeing pages is not blocked by dirty pages in the LRU tail. Log I/O and flush state if the doublewrite flush wait is long.

Impact: It could result in increased I/O due to LRU flushing in specific cases.

Release Notes

None

How can this PR be tested?

Regular InnoDB tests should cover the path. Performance and stress tests should be run to judge the possible impact.

Reproducing the base issue would require a large buffer pool, a long run, and synchronization between foreground and InnoDB background threads.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


@dr-m dr-m left a comment


The change to the logic looks reasonable to me, but in the diagnostic output I’d avoid excessive numbers of calls to operator<<() and use the common logging functions sql_print_warning() or sql_print_information().

Comment on lines 2697 to 2704
sql_print_information("Innodb: LSN flush parameters\n"
"-------------------\n"
"System LSN : %" PRIu64 "\n"
"Checkpoint LSN: %" PRIu64 "\n"
"Flush ASync LSN: %" PRIu64 "\n"
"Flush Sync LSN: %" PRIu64 "\n"
"-------------------",
lsn, clsn, buf_flush_async_lsn, buf_flush_sync_lsn);

Usually all InnoDB messages are prefixed by InnoDB: (note the case). Do we need this many rows for the output? You need to write uint64_t{buf_flush_async_lsn} or similar to avoid compilation errors:

error: cannot pass object of non-trivial type 'Atomic_relaxed<lsn_t>' (aka 'Atomic_relaxed<unsigned long long>') through variadic function; call will abort at runtime [-Wnon-pod-varargs]

ulint lru_size= UT_LIST_GET_LEN(LRU);
ulint dirty_size= UT_LIST_GET_LEN(flush_list);
ulint free_size= UT_LIST_GET_LEN(free);
ulint dirty_pct= lru_size ? dirty_size * 100 / (lru_size + free_size) : 0;

dirty_pct seems to be redundant information that can be calculated from the rest. It could also be totally misleading, because we were reading these fields without proper mutex or flush_list_mutex protection.

…ages during Async flush

During regular iteration the page cleaner does flush from flush list
with some flush target and then goes for generating free pages from LRU
tail. When asynchronous flush is triggered i.e. when 7/8 th of the LSN
margin is filled in the redo log, the flush target for flush list is
set to innodb_io_capacity_max. If it could flush all, the flush
bandwidth for LRU flush is currently set to zero. If the LRU tail has
dirty pages, page cleaner ends up freeing no pages in one iteration.
The scenario could repeat across multiple iterations till async flush
target is reached. During this time the DB system is starved of free
pages resulting in apparent stall and in some cases dict_sys latch
fatal error.

Fix: In page cleaner iteration, before LRU flush, ensure we provide
enough flush limit so that freeing pages is not blocked by dirty pages
in LRU tail. Log IO and flush state if double write flush wait is long.

Reviewed by: Marko Mäkelä

@dr-m dr-m left a comment


The logic looks OK to me, but I would like to see an update from our performance testing team. Hopefully they will have some luck reproducing the original issue and validating this fix, which is basically based on an educated guess and a careful review of the existing logic.

Please consider my suggestions about the formatting and the timeout logic in buf_dblwr_t::flush_buffered_writes(). There seems to be a format string mismatch that affects 64-bit Microsoft Windows.

Comment on lines +4467 to 4470
qr/InnoDB: Long wait \([0-9]+ seconds\) for double-write buffer flush/,
qr/Slave: Unknown table 't1' .* 1051/,
qr/Slave SQL:.*(Internal MariaDB error code: [[:digit:]]+|Query:.*)/,
qr/slave SQL thread aborted/,

The added pattern seems to be incorrect and should therefore not have the desired effect of preventing sporadic test failures in a heavily loaded environment. The regular expression group delimiters () are metacharacters in Perl regular expressions, as you can see 2 lines below the addition.

Comment on lines +610 to +618
sql_print_information("Innodb: Double Write State\n"
"-------------------\n"
"Batch running : %s\n"
"Active Slot - first_free: %lu reserved: %lu\n"
"Flush Slot - first_free: %lu reserved: %lu\n"
"-------------------",
(batch_running ? "true" : "false"),
active_slot->first_free, active_slot->reserved,
flush_slot->first_free, flush_slot->reserved);

Error log messages should start with InnoDB: in the correct capitalization. This output could be packed more tightly. I am not sure that we should start outputting multi-line output without starting each line with a timestamp. Even the Linux kernel dmesg prefixes each line with a timestamp, also when outputting structured output, such as a process table when the out-of-memory killer kicks in. I would suggest the following:

  sql_print_information("InnoDB: doublewrite %s (active: %zu, %zu) (flush: %zu, %zu)",
                        batch_running ? "active" : "idle",
                        active_slot->first_free, active_slot->reserved,
                        flush_slot->first_free, flush_slot->reserved);

There is a type mismatch between %lu and size_t a.k.a. ulint. On LLP64 targets, such as Microsoft Windows, sizeof(unsigned long) would be 4 while sizeof(size_t) is 8. The correct format is %zu or " ULINTPF ".

More verbose output is not necessarily any more readable to an end user. When analyzing the output, it should not be much of an extra hurdle to consult the source code for decoding the output.

Comment on lines +626 to +631
const ulong max_count= 60 * 60;
const ulong first_log_count= 30;
const ulong fatal_threshold= srv_fatal_semaphore_wait_threshold;
ulong log_count= first_log_count;

for (ulong count= 0;;)

Any use of the types ulong, unsigned long or long should be generally avoided, because it is introducing an unnecessary incompatibility between LP64 and LLP64 targets. I would suggest unsigned or size_t.

Comment on lines -616 to +640
my_cond_wait(&cond, &mutex.m_mutex);
}

timespec abstime;
set_timespec(abstime, 1);
my_cond_timedwait(&cond, &mutex.m_mutex, &abstime);

Shouldn’t we avoid the timeout handling when my_cond_timedwait() returns 0?

  if (!my_cond_timedwait(&cond, &mutex.m_mutex, &abstime))
  {
    /* buf_dblwr_t::write_completed() had been invoked */
    count= 0;
    continue;
  }

Can we set a longer timeout than 1 second? Could it be somehow based on srv_fatal_semaphore_wait_threshold, in such a way that one of the subsequent conditions on count will always hold? It would seem that we could choose the limit to be std::min(5UL, srv_fatal_semaphore_wait_threshold).

Comment on lines +1771 to +1775
/* If we have exhausted flush quota, it is likely we exited before
generating enough free pages. Call once more with 0 flush to generate
free pages immediately as required. */
if (pages == max_n)
buf_do_LRU_batch(0, &n);

For safety, could the condition be pages >= max_n?

Comment on lines +2687 to +2693
sql_print_information("Innodb: Buffer Pool pages \n"
"-------------------\n"
"LRU Pages: %lu\n"
"Free Pages: %lu\n"
"Dirty Pages: %lu : %lu%%\n"
"-------------------",
lru_size, free_size, dirty_size, dirty_pct);

Like I pointed out about buf_dblwr_t::print_info(), I am not a friend of this kind of multi-line output. I'm not sure if we should extend the SHOW ENGINE INNODB STATUS type of spamming in the server error log. Maybe we can argue that this output is produced very rarely, and that a more verbose and structured-looking output would make it stand out from the server error log.

Because this very often should precede a crash report, it would be natural to model this after output_core_info(). It is outputting various things (also multi-line output) with my_safe_printf_stderr(), which turns out to be a simple wrapper around my_vsnprintf() and my_write_stderr(), which in turn is simply invoking write(2). I assume that we need this in a signal handler, because a signal could be sent while the object stderr is in an inconsistent state. Because we’re outside any signal handling code here, it could be simplest to invoke fprintf(stderr, …) like you did in your original revision of this. (Sorry, I should have checked this earlier.)

It seems that each call of buf_pool.print_flush_info() is immediately followed by a timestamp and InnoDB:. Therefore, it could be simplest to just have a few fprintf(stderr, …) calls here, without any timestamp or InnoDB prefix.

Comment on lines +2697 to +2724
sql_print_information("Innodb: LSN flush parameters\n"
"-------------------\n"
"System LSN : %" PRIu64 "\n"
"Checkpoint LSN: %" PRIu64 "\n"
"Flush ASync LSN: %" PRIu64 "\n"
"Flush Sync LSN: %" PRIu64 "\n"
"-------------------",
lsn, clsn, buf_flush_async_lsn.load(), buf_flush_sync_lsn.load());

lsn_t age= lsn - clsn;
lsn_t age_pct= log_sys.max_checkpoint_age
? age * 100 / log_sys.max_checkpoint_age : 0;
sql_print_information("Innodb: LSN age parameters\n"
"-------------------\n"
"Current Age : %" PRIu64 " : %" PRIu64 "%%\n"
"Max Age(Async): %" PRIu64 "\n"
"Max Age(Sync) : %" PRIu64 "\n"
"Capacity : %" PRIu64 "\n"
"-------------------",
age, age_pct, log_sys.max_modified_age_async, log_sys.max_checkpoint_age,
log_sys.log_capacity);

sql_print_information("Innodb: Pending IO count\n"
"-------------------\n"
"Pending Read: %zu\n"
"Pending Write: %zu\n"
"-------------------",
os_aio_pending_reads_approx(), os_aio_pending_writes_approx());

I’d still prefer packing as much information per line as possible, and I would omit any math that can be calculated from the numbers afterwards. These would include dirty_pct, age, and age_pct. Maybe with that, we could have all LSN information in one line:

  sql_print_information("InnoDB: LSN " LSN_PF "; checkpoint " LSN_PF
                        ", flush " LSN_PF "," LSN_PF
                        "; max age " LSN_PF "," LSN_PF
                        "; capacity " LSN_PF,
                        lsn, clsn,
                        buf_flush_async_lsn.load(), buf_flush_sync_lsn.load(),
                        log_sys.max_modified_age_async,
                        log_sys.max_checkpoint_age,
                        log_sys.log_capacity);
  sql_print_information("InnoDB: Pending page reads: %zu, writes: %zu",
                        os_aio_pending_reads_approx(), os_aio_pending_writes_approx());

These numbers are not saying much to an end user anyway; a subject matter expert would be needed for decoding this, and the code that produces it is open source. So, we would not seem to lose that much by having slightly less verbose output. Having more data per line would make it easier to filter the output with tools like grep, sed, awk or perl.
