
MDEV-36226 Stall and crash when page cleaner fails to generate free p… #3885

Open · wants to merge 1 commit into base: 10.6

Conversation

mariadb-DebarunBanerjee
Contributor

  • The Jira issue number for this PR is: MDEV-36226

Description

During a regular iteration, the page cleaner flushes from the flush list with some flush target and then generates free pages from the LRU tail. When asynchronous flush is triggered, i.e. when 7/8 of the LSN margin in the redo log is filled, the flush target for the flush list is set to innodb_io_capacity_max. If the flush-list pass could flush that many pages, the bandwidth left for LRU flushing is currently zero. If the LRU tail holds dirty pages, the page cleaner then frees no pages in that iteration. The scenario can repeat across multiple iterations until the async flush target is reached. During this time the system is starved of free pages, resulting in an apparent stall and, in some cases, a dict_sys latch fatal error.

Fix: In the page cleaner iteration, before the LRU flush, ensure we provide a large enough flush limit so that freeing pages is not blocked by dirty pages in the LRU tail. Log I/O and flush state if the doublewrite flush wait is long.

Impact: It could result in increased I/O due to LRU flushing in specific cases.

Release Notes

None

How can this PR be tested?

Regular InnoDB tests should cover the path. Performance and stress tests should be run to judge the possible impact.

Reproducing the base issue would require a large buffer pool, a long run, and synchronization between foreground and InnoDB background threads.

Basing the PR against the correct MariaDB version

  • This is a new feature or a refactoring, and the PR is based against the main branch.
  • This is a bug fix, and the PR is based against the earliest maintained branch in which the bug can be reproduced.

PR quality check

  • I checked the CODING_STANDARDS.md file and my PR conforms to this where appropriate.
  • For any trivial modifications to the PR, I am ok with the reviewer making the changes themselves.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.


@dr-m dr-m left a comment


The change to the logic looks reasonable to me, but in the diagnostic output I’d avoid excessive numbers of calls to operator<<() and use the common logging functions sql_print_warning() or sql_print_information().

Comment on lines 2697 to 2704
sql_print_information("Innodb: LSN flush parameters\n"
"-------------------\n"
"System LSN : %" PRIu64 "\n"
"Checkpoint LSN: %" PRIu64 "\n"
"Flush ASync LSN: %" PRIu64 "\n"
"Flush Sync LSN: %" PRIu64 "\n"
"-------------------",
lsn, clsn, buf_flush_async_lsn, buf_flush_sync_lsn);

Usually all InnoDB messages are prefixed by InnoDB: (note the case). Do we need this many rows for the output? You need to write uint64_t{buf_flush_async_lsn} or similar to avoid compilation errors:

error: cannot pass object of non-trivial type 'Atomic_relaxed<lsn_t>' (aka 'Atomic_relaxed<unsigned long long>') through variadic function; call will abort at runtime [-Wnon-pod-varargs]

ulint lru_size= UT_LIST_GET_LEN(LRU);
ulint dirty_size= UT_LIST_GET_LEN(flush_list);
ulint free_size= UT_LIST_GET_LEN(free);
ulint dirty_pct= lru_size ? dirty_size * 100 / (lru_size + free_size) : 0;

dirty_pct seems to be redundant information that can be calculated from the rest. It could also be totally misleading, because we were reading these fields without proper mutex or flush_list_mutex protection.

…ages during Async flush

During regular iteration the page cleaner does flush from flush list
with some flush target and then goes for generating free pages from LRU
tail. When asynchronous flush is triggered i.e. when 7/8 th of the LSN
margin is filled in the redo log, the flush target for flush list is
set to innodb_io_capacity_max. If it could flush all, the flush
bandwidth for LRU flush is currently set to zero. If the LRU tail has
dirty pages, page cleaner ends up freeing no pages in one iteration.
The scenario could repeat across multiple iterations till async flush
target is reached. During this time the DB system is starved of free
pages resulting in apparent stall and in some cases dict_sys latch
fatal error.

Fix: In page cleaner iteration, before LRU flush, ensure we provide
enough flush limit so that freeing pages is not blocked by dirty pages
in LRU tail. Log IO and flush state if double write flush wait is long.

Reviewed by: Marko Mäkelä

@dr-m dr-m left a comment


The logic looks OK to me, but I would like to see an update from our performance testing team. Hopefully they will have some luck reproducing the original issue and validating this fix, which is basically based on an educated guess and a careful review of the existing logic.

Please consider my suggestions about the formatting and the timeout logic in buf_dblwr_t::flush_buffered_writes(). There seems to be a format string mismatch that affects 64-bit Microsoft Windows.

Comment on lines +4467 to 4470
qr/InnoDB: Long wait \([0-9]+ seconds\) for double-write buffer flush/,
qr/Slave: Unknown table 't1' .* 1051/,
qr/Slave SQL:.*(Internal MariaDB error code: [[:digit:]]+|Query:.*)/,
qr/slave SQL thread aborted/,

The added pattern seems to be incorrect and should therefore not have the desired effect of preventing sporadic test failures in a heavily loaded environment. The regular expression group delimiters () are metacharacters in Perl regular expressions, as you can see 2 lines below the addition.

Comment on lines +610 to +618
sql_print_information("Innodb: Double Write State\n"
"-------------------\n"
"Batch running : %s\n"
"Active Slot - first_free: %lu reserved: %lu\n"
"Flush Slot - first_free: %lu reserved: %lu\n"
"-------------------",
(batch_running ? "true" : "false"),
active_slot->first_free, active_slot->reserved,
flush_slot->first_free, flush_slot->reserved);

Error log messages should start with InnoDB: in the correct capitalization. This output could be packed more tightly. I am not sure that we should start outputting multi-line output without starting each line with a timestamp. Even the Linux kernel dmesg prefixes each line with a timestamp, also when outputting structured output, such as a process table when the out-of-memory killer kicks in. I would suggest the following:

  sql_print_information("InnoDB: doublewrite %s (active: %zu, %zu) (flush: %zu, %zu)",
                        batch_running ? "active" : "idle",
                        active_slot->first_free, active_slot->reserved,
                        flush_slot->first_free, flush_slot->reserved);

There is a type mismatch between %lu and size_t a.k.a. ulint. On LLP64 targets, such as Microsoft Windows, sizeof(unsigned long) would be 4 while sizeof(size_t) is 8. The correct format is %zu or " ULINTPF ".

More verbose output is not necessarily any more readable to an end user. When analyzing the output, it should not be much of an extra hurdle to consult the source code for decoding the output.

Comment on lines +626 to +631
const ulong max_count= 60 * 60;
const ulong first_log_count= 30;
const ulong fatal_threshold= srv_fatal_semaphore_wait_threshold;
ulong log_count= first_log_count;

for (ulong count= 0;;)

Any use of the types ulong, unsigned long or long should be generally avoided, because it is introducing an unnecessary incompatibility between LP64 and LLP64 targets. I would suggest unsigned or size_t.

Comment on lines -616 to +640
my_cond_wait(&cond, &mutex.m_mutex);
}

timespec abstime;
set_timespec(abstime, 1);
my_cond_timedwait(&cond, &mutex.m_mutex, &abstime);

Shouldn’t we avoid the timeout handling when my_cond_timedwait() returns 0?

  if (!my_cond_timedwait(&cond, &mutex.m_mutex, &abstime))
  {
    /* buf_dblwr_t::write_completed() had been invoked */
    count= 0;
    continue;
  }

Can we set a longer timeout than 1 second? Could it be somehow based on srv_fatal_semaphore_wait_threshold, in such a way that one of the subsequent conditions on count will always hold? It would seem that we could choose the limit to be std::min(5UL, srv_fatal_semaphore_wait_threshold).

Comment on lines +1771 to +1775
/* If we have exhausted flush quota, it is likely we exited before
generating enough free pages. Call once more with 0 flush to generate
free pages immediately as required. */
if (pages == max_n)
buf_do_LRU_batch(0, &n);

For safety, could the condition be pages >= max_n?

Comment on lines +2687 to +2693
sql_print_information("Innodb: Buffer Pool pages \n"
"-------------------\n"
"LRU Pages: %lu\n"
"Free Pages: %lu\n"
"Dirty Pages: %lu : %lu%%\n"
"-------------------",
lru_size, free_size, dirty_size, dirty_pct);

Like I pointed out about buf_dblwr_t::print_info(), I am not a friend of this kind of multi-line output. I'm not sure if we should extend the SHOW ENGINE INNODB STATUS type of spamming in the server error log. Maybe we can argue that this output is produced very rarely, and that a more verbose and structured-looking output would make it stand out from the server error log.

Because this very often should precede a crash report, it would be natural to model this after output_core_info(). It is outputting various things (also multi-line output) with my_safe_printf_stderr(), which turns out to be a simple wrapper around my_vsnprintf() and my_write_stderr(), which in turn is simply invoking write(2). I assume that we need this in a signal handler, because a signal could be sent while the object stderr is in an inconsistent state. Because we’re outside any signal handling code here, it could be simplest to invoke fprintf(stderr, …) like you did in your original revision of this. (Sorry, I should have checked this earlier.)

It seems that each call of buf_pool.print_flush_info() is immediately followed by a timestamp and InnoDB:. Therefore, it could be simplest to just have a few fprintf(stderr, …) calls here, without any timestamp or InnoDB prefix.

Comment on lines +2697 to +2724
sql_print_information("Innodb: LSN flush parameters\n"
"-------------------\n"
"System LSN : %" PRIu64 "\n"
"Checkpoint LSN: %" PRIu64 "\n"
"Flush ASync LSN: %" PRIu64 "\n"
"Flush Sync LSN: %" PRIu64 "\n"
"-------------------",
lsn, clsn, buf_flush_async_lsn.load(), buf_flush_sync_lsn.load());

lsn_t age= lsn - clsn;
lsn_t age_pct= log_sys.max_checkpoint_age
? age * 100 / log_sys.max_checkpoint_age : 0;
sql_print_information("Innodb: LSN age parameters\n"
"-------------------\n"
"Current Age : %" PRIu64 " : %" PRIu64 "%%\n"
"Max Age(Async): %" PRIu64 "\n"
"Max Age(Sync) : %" PRIu64 "\n"
"Capacity : %" PRIu64 "\n"
"-------------------",
age, age_pct, log_sys.max_modified_age_async, log_sys.max_checkpoint_age,
log_sys.log_capacity);

sql_print_information("Innodb: Pending IO count\n"
"-------------------\n"
"Pending Read: %zu\n"
"Pending Write: %zu\n"
"-------------------",
os_aio_pending_reads_approx(), os_aio_pending_writes_approx());

I’d still prefer packing as much information per line as possible, and I would omit any math that can be calculated from the numbers afterwards. These would include dirty_pct, age, and age_pct. Maybe with that, we could have all LSN information in one line:

  sql_print_information("InnoDB: LSN " LSN_PF "; checkpoint " LSN_PF
                        ", flush " LSN_PF "," LSN_PF
                        "; max age " LSN_PF "," LSN_PF
                        "; capacity " LSN_PF,
                        lsn, clsn,
                        buf_flush_async_lsn.load(), buf_flush_sync_lsn.load(),
                        log_sys.max_modified_age_async,
                        log_sys.max_checkpoint_age,
                        log_sys.log_capacity);
  sql_print_information("InnoDB: Pending page reads: %zu, writes: %zu",
                        os_aio_pending_reads_approx(), os_aio_pending_writes_approx());

These numbers are not saying much to an end user anyway; a subject matter expert would be needed for decoding this, and the code that produces it is open source. So, we would not seem to lose that much by having slightly less verbose output. Having more data per line would make it easier to filter the output with tools like grep, sed, awk or perl.
