
gh-112532: Improve mimalloc page visiting #114133

Merged (1 commit) on Jan 22, 2024


@colesbury colesbury commented Jan 16, 2024

This adds support for visiting abandoned pages in mimalloc and improves the performance of the page visiting code. Abandoned pages contain memory blocks from threads that have exited; other threads may reclaim them later. The free-threaded GC still needs to visit those pages because they contain live objects.

This also reduces the overhead of visiting mimalloc pages:

  • Special-case full pages, empty pages, and pages containing only a single block.
  • Fix free_map to use one bit instead of one byte per block.
  • Use a fast division-by-constant algorithm when computing block offset from block size and index.
  • Loop over the bitmap faster.
  • Fix a bug where blocks in the delayed-free list were erroneously visited as if they were live.
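
To make two of the items above concrete, here is a minimal sketch in C of a one-bit-per-block free map that is scanned a word at a time, together with a multiply-and-shift division by a fixed block size. It is not the code from this PR: the names `fast_div_t`, `fast_div_init`, `fast_div`, `free_map_set`, `visit_live_blocks`, `block_visitor`, and `count_block` are all hypothetical, the callback type is deliberately simpler than mimalloc's `mi_block_visit_fun`, and the division uses the textbook round-up method for 32-bit operands, which may differ in detail from what mimalloc does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORD_BITS 64

/* Hypothetical callback type for this sketch (not mi_block_visit_fun). */
typedef bool (*block_visitor)(size_t block_index, void *arg);

/* Precomputed constants for dividing 32-bit values by a fixed divisor. */
typedef struct {
    uint64_t magic;
    unsigned shift;
} fast_div_t;

/* Compute shift = ceil(log2(divisor)) and the matching magic multiplier. */
static fast_div_t fast_div_init(uint32_t divisor) {
    assert(divisor != 0);
    fast_div_t fd = { 0, 0 };
    while (((uint64_t)1 << fd.shift) < divisor) {
        fd.shift++;
    }
    fd.magic = ((((uint64_t)1 << 32) *
                 (((uint64_t)1 << fd.shift) - divisor)) / divisor) + 1;
    return fd;
}

/* n / divisor using one multiply, one add, and two shifts. */
static uint32_t fast_div(uint32_t n, fast_div_t fd) {
    return (uint32_t)(((((uint64_t)n * fd.magic) >> 32) + n) >> fd.shift);
}

/* Mark a block as free: one bit per block, 64 blocks per word. */
static void free_map_set(uint64_t *free_map, size_t block_index) {
    free_map[block_index / WORD_BITS] |=
        (uint64_t)1 << (block_index % WORD_BITS);
}

/* Portable count of trailing zero bits; x must be non-zero. */
static unsigned ctz64(uint64_t x) {
    unsigned n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Call `visit` for every block whose free bit is clear. Words whose
   blocks are all free are skipped without looking at individual bits. */
static bool visit_live_blocks(const uint64_t *free_map, size_t block_count,
                              block_visitor visit, void *arg) {
    for (size_t base = 0; base < block_count; base += WORD_BITS) {
        uint64_t live = ~free_map[base / WORD_BITS];
        size_t remaining = block_count - base;
        if (remaining < WORD_BITS) {
            live &= ((uint64_t)1 << remaining) - 1; /* drop bits past the end */
        }
        while (live != 0) {
            if (!visit(base + ctz64(live), arg)) {
                return false; /* visitor asked to stop */
            }
            live &= live - 1; /* clear the lowest set bit */
        }
    }
    return true;
}

/* Example visitor: counts the live blocks it is shown. */
static bool count_block(size_t block_index, void *arg) {
    (void)block_index;
    (*(size_t *)arg)++;
    return true;
}
```

A caller would typically fill the free map by walking the page's free lists, turning each free block's byte offset into an index with `fast_div`, and then hand a visitor such as `count_block` to `visit_live_blocks`. A real implementation would likely replace `ctz64` with a compiler intrinsic.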


@DinoV DinoV left a comment


Just curious about the unused function(s), otherwise LGTM!

}

// Visit all blocks in abandoned segments
bool _mi_abandoned_pool_visit_blocks(mi_abandoned_pool_t* pool, uint8_t page_tag, bool visit_blocks, mi_block_visit_fun* visitor, void* arg) {

This (and therefore the previous 2 functions) doesn't seem to be used anywhere?

@colesbury (Contributor Author) replied:

These will be used in the upcoming GC PR. Here is an example usage:

https://github.com/colesbury/cpython/blob/8314c7c1d9d9670d4a83b9dc12f23611493c8eaa/Python/gc_free_threading.c#L226-L227

I put them in this PR because:

  1. Keeping the mimalloc changes separate makes them a bit easier to track and upstream
  2. The GC PR will be big and doing this first makes the upcoming PR a bit smaller

@colesbury

@DinoV, would you please merge this when you are ready?

@DinoV DinoV merged commit 412920a into python:main Jan 22, 2024
@colesbury colesbury deleted the 112532-visitor branch January 22, 2024 21:14
aisk pushed a commit to aisk/cpython that referenced this pull request Feb 11, 2024
daanx added a commit to microsoft/mimalloc that referenced this pull request Jun 3, 2024
Glyphack pushed a commit to Glyphack/cpython that referenced this pull request Sep 2, 2024
clrpackages pushed a commit to clearlinux-pkgs/mimalloc that referenced this pull request Jan 16, 2025
…3.0.1

ArtSin (1):
      Fix int and long handling and the use of (u)intptr_t in _mi_vsnprintf

Daan (65):
      bump version to v1.8.8 for further development
      typo in stress test
      fix pthread initalization of mutexes
      fix c++ compilation
      decrease meta allocation zone to 4k (to reduce .bss)
      increase thread data cache to32 entries
      whitespace
      remove old mi_abandoned_await_readers
      re-enable tsan test in azure pipelines
      add reference to page_malloc_zero in C++ build
      increase iterations for tsan test
      reduce UBSAN parameters to stay within pipeline limits
      rename arena-abandoned to arena-abandon
      update aligned documentation
      add js for docs
      add search js files for docs
      add docs svg's
      add heap tag to area descriptor
      update docs
      update docs
      fix count/size order in mi_heap_alloc_new_n, issue #906
      initial work on guarded objects
      fix UINT32_MAX constant (see issue #913)
      set lower parameters for guarded test
      add guarded build to test pipeline
      increase test timeout for azure pipeline
      increase TSAN test to 400 iterations
      add cmake option to add C pre processor definitions more easily
      allow certain options to have defaults set via the pre-processor at build time -- see issue #945
      add test for issue #944
      fix MI_EXTRA_CPPDEFS setting
      reorganize primitives for process initialization; use special data segment on Windows for thread termination by default on Windows now (issue #869)
      add cmake option to fall back on the fiber api do detect thread termination on windows
      fix build on windows
      fix duplicate definition on windows
      fix win32 compilation
      fix fast divisor for 32-bit platforms
      cleanup process init/done
      fix issue where searching for abandoned blocks would skip the first one
      add missing mi_thread_done definition
      improve windows static library initialization to account for thread local destructors (issue #944)
      fix assertion check
      do not reclaim segments if free-ing from a thread with an already abandoned heap (issue #944)
      update mimalloc redirect to v1.2 to handle static destructors that free memory (issue #944)
      update mimalloc-redirect
      update comments, set constructor priority to 101 on macOS
      add 0 byte to canary to prevent spurious read overflow to read the canary (issue #951, pr #953)
      disable aligned hinting or SV39 mmu's, issue #939, and pr #949
      remove wrong assertion
      update test file
      update mimalloc-redirect to potentially fix issue #957
      allow build time setting of sample rate
      small fixes for macOS
      various fixes for test pipeline
      fix debug build of MI_GUARDED
      fix missing void
      fix macos 15 OS name
      temporarily add macOS 13 and 12 for testing
      fix for macOS 14 and earlier
      use non-null tld in heap_init
      fix assertion
      fix TLS slot on macOS
      add neon code for bit clear
      add neon version for chunk_is_clear
      Update readme.md to fix links (issue #978)

Daan Leijen (61):
      add initial primitive api for locks
      move lock code to atomic.h
      fix warnings
      shuffle for 128 bit
      set compile as C++ in VS IDE
      clean up guarded allocation
      add comments
      fix use_guarded signature
      use enqueue_from_full, and keep inserting at the end
      fix std malloc compile of the stress test
      add windows arm64 target to vs2022
      add redirection dll for windows on arm64
      add minject for windows arm64
      add Windows arm64 support in cmame; name the mimalloc dll 'mimalloc-override.dll' on Windows with cmake (to match the IDE and minject
      update readme
      update arm64 redirection
      testing on arm64
      make timeout for tests in the pipeline up to 4 min
      better stats for commit on overcommit systems (by not counting on-demand commit upfront)
      add support for arm64ec
      update redirection modules to v1.3
      add _base test for redirection
      update redirection readme
      fix cmake for visual studio on arm64
      update readme for cmake on windows
      add link for VS generator
      revert back to generating mimalloc.dll instead of mimalloc-override.dll
      don't prefer high used candidate if it is too full
      update IDE settings to match cmake output; in particular mimalloc-override.dll -> mimalloc.dll
      add updated minject v1.2 that defaults to mimalloc.dll instead of mimalloc-override.dll
      update readme to use mimalloc.dll (instead of mimalloc-override.dll)
      fix cmake to generate mimalloc.dll on windows
      don't override a page candidate with a page that is too full
      insert full pages at the end of the queue; only override page candidate if the page is not too full
      fix build
      fix max va bits on unix
      fix issue #976
      fix initializer warning on clang-18
      rename segment_map_destroy to segment_map_unsafe_destroy
      add filters for vs projects
      remove older vs projects as they became stale
      update vs project filter
      avoid accessing heap->tld after running thread_done (pr #974)
      fix potentially warning on gcc (pr #935)
      add newline
      fix alignment for mi_manage_os_memory (issue #980)
      add thread_local for c++
      disable large pages by default
      fix signedness warning
      fix initialization warning on gcc
      combine flags and xthread_id
      nicer logic in free
      merge from dev3-bin
      update to v1.8.8
      bump version to 3.0.0
      allow large OS pages on Linux by default (but not on Android)
      fix link in readme
      bump version to 3.0.1 for further development
      bump version to 1.8.9 for further development
      fix large OS page behaviour on Linux; default is now 2 which only uses large OS pages (not huge) through madvise
      display full version during cmake

Daisuke Fujimura (fd0) (1):
      Build on cygwin

Danny Lin (1):
      Change macOS mmap tag to fix conflict with IOAccelerator

David Carlier (1):
      _mi_memcpy/_mi_memzero: tighten criteria for intrinsics for windows.

Diego Russo (1):
      Fix illegal instruction for older Arm architectures

Ikko Eltociear Ashimine (1):
      docs: update readme.md

Javier Blazquez (1):
      free segment map when destroy_on_exit is set

Jim-Wang (1):
      fix build error on linux

Joris van der Geer (1):
      readme - describe how to run under Valgrind with dynamic override

Michael Neumann (1):
      Fix build on FreeBSD-derivate DragonFly

Philip Brown (1):
      Musl needs __libc* functions too

QuarticCat (1):
      fix typos

Rui Ueyama (1):
      Add a missing #include

Zhihua Lai (1):
      Fix typo

daanx (240):
      prevent UB in arena reservation
      fix spelling
      increase max arenas
      add support for sub-processes (to support python/cpython#113717)
      add initial support for visiting abandoned segments per subprocess, upstream for python/cpython#114133
      add support to visit _all_ abandoned segment blocks per sub-process, upstream for python/cpython#114133
      optimize heap walks, by Sam Gross, upstream of python/cpython#114133
      fix leak in abandoned block visiting
      only reclaim for exclusive heaps in their associated arena
      revise the segment map to only apply to OS allocated segments and reduce the .BSS footprint
      fix cast; make segment map static
      reduce delayed output from redirection to 16KiB to reduce the .bss size
      use EFAULT if a target heap tag cannot be found on reclaim
      always include sys/prctl.h on linux to disable THP if large_os_pages are not enabled
      switch between OS and arena allocation in stress test
      more aggressive reclaim from free for OS blocks
      revisit atomic reclaim for abandoned segments
      push os abandoned blocks at the tail end
      maintain count of the abandoned os list
      fix leak where OS abandoned blocks were not always reclaimed
      refactor arena abandonment in a separate file
      refactor arena-abandoned to be an include for backward compat with existing build scripts
      fix vs 2022 ide
      don't reset a segment thread id when iterating
      don't reset a segment thread id when iterating
      fix asan tracking by explicitly setting memory to undefined before a free
      fix potential race on subproc field in the segment
      update documentation
      update doxyfile
      add extra assertions to check that blocks are always aligned to MI_MAX_ALIGN_SIZE
      fix alignment test
      initial working guarded pages
      fix multi-threaded free to unprotect guarded blocks
      clean up guarded pages code
      don't consider memory as large OS pages if only madvise'd
      prefer pages that do not expand
      search N pages for a best fit
      insert full pages that became unfull, at the start of the page queue to increase potential reuse
      revert back to unfull at the end of queues as it slows down some benchmarks (like alloc-test1)
      reduce page search to 8
      add virtual address bits and physical memory to the mem config
      add address hint to primitive allocation API
      update guarded implementation to use block tags
      rename mi_debug_guarded_  to mi_guarded_
      add sampling for guarded objects
      add guarded objects that are sampled (and fit a size range). guarded sample rate etc can be set per heap as well as defaulted with options
      fix asan with MI_GUARDED
      update azure pipeline to use sample rate of 1000 for guarded objects
      Extend azure pipeline with Ubuntu 24 & 20, windows 2019, and macOS 15
      fix azure pipeline
      add target_segments_per_thread option
      clean up candidate search; add mi_collect_reduce
      ensure forced abandoned pages can be accessed after free
      wip: initial work on mimalloc3 without segments
      wip: further progress on removing segments
      wip: further progress on segment removal; arena allocation
      wip: further progress on segment removal; arena allocation
      can compile without missing functions
      wip: update any_set
      wip: can run initial test
      wip: bug fixes
      wip: bug fixes
      wip: bug fixes
      wip: add generic find_and_xset
      wip: rename arena blocks to slices
      compile with clang and gcc
      wip
      first version that passes the make test
      pass all debug tests
      bug fixes
      wip: cannot compile
      wip: use epoch with 512bit chunks
      wip: can run mstress
      fix free stats
      add base and size to OS memid
      can run basic test
      can run the full test suite
      revise free reclaim; ensure unown cannot race with a free
      fix assertions
      increase MAX_OBJ_SLICES to a full chunk (32MiB)
      wip: initial large bitmaps
      large bitmaps working; lock on arena_reserve
      small fixes
      more documentation; better pairmap find_and_set_to_busy, busy flag is now 0x10
      small adjustments
      change to full_page_retain
      tune free-ing and abandoning
      initial no more pairmap
      working simplified version without pairmaps and bitmap epoch
      record max_clear bit
      fix page info size and order; atomic page flags
      compile for 32-bit as well
      small fixes
      Add MI_ARCHOPT option to enable architecture specific optimizations
      revise visiting arenas, better bitmap scanning
      Add MI_ARCHOPT support for msvc
      arch specific optimizations
      check heaptag on abandonded page allocation
      specialize bitmap operations for common page sizes
      check for running in a threadpool to disable page reclaim
      only enable architecture specific optimization for armv8.1
      update bit primitives
      fix spelling
      update optimization on haswell
      delete old files
      add dedicated meta data allocation for threads and tld
      comments
      fix write to empty heap in mi_guarded build
      remove os_tld and stats parameters to os interface
      fix bug where only the first chunkmap field would be considered
      set default arena reserve back to 1GiB
      various improvements
      add cast to avoid errors on clang 7
      add cast to avoid errors on clang 7
      fix 32 bit multiply in generic ctz/clz
      add bsf/bsr for compilation with older compilers (clang 7)
      improve generic ctz/clz
      add extra checks for valid pointers in the pagemap, add max_vabits and debug_commit_full_pagemap options
      fix generic ctz/clz
      improve popcount
      fix MI_GUARDED build
      better block alignment
      add asan/ubsan/tsan and valgrind to default debug build
      heap meta data always uses mi_meta_zalloc
      ensure incompatible heaps are not absorbed
      fix comments in types; fix guarded alignment bug
      small updates
      fix build error
      use frac 8 for reclaim_on_free and reabandon; halve full_page_retain if running in a threadpool
      wip: allow arena (re)loading
      maintain pages set for arenas; improve arena load/unload
      space out threads when searching for free pages
      use thread spacing for reclaim as well
      use thread spacing for reclaim as well
      lower full page retain more aggressively in a threadpool
      fix free bug for meta data
      add debug output for page map; free tld on thread exit
      comment
      nicer debug output
      wip: start on purge
      enable purging of free committed slices from arenas
      clean up bitmap api
      fix avx2 bug with atomics
      flexible clearN_ that can start at any index
      fix concurrent mi_tld access bug
      small fixes
      wip: binned bitmap for the free slices
      more bbin size classes, bug fixes
      remove maxaccessed from general bitmaps
      add delay to purg'ing; call collect_retired every N generic allocs
      comments
      comments
      fix infoslices needed calculation
      fix bug in bitmap_forall_ranges
      fix purging with ranges
      atomically clear purge bits when visiting
      update minject to v1.1
      add ajust stats to compensate for double counting
      adjust stats more clearly to avoid double counting commits
      adjust stats more clearly to avoid double counting commits
      update stat adjustment for purging
      update arch detection in cmake
      syntax error
      add comments/doc
      fix MI_ARCH test
      add specialized is_set for 1 bit
      small fixes; max object size  1/8th of a pages
      remove busy wait for arena reservation
      use srw lock on windows
      subprocesses own arena's
      fix lock recursion
      make stats part of a subproc
      merge subproc stats on delete
      track os abandoned pages in a list
      allocate heaps associated with an arena in that arena
      add initial load/unload for heaps
      update lock primitive; fix arena exclusive allocation
      remove req_arena parameter to arena_reserve
      limit purgeing to one purge cycle per purge delay
      fix build error
      limit candidate search to 4
      merge from dev
      re-add deferred free and heap retired collect
      enable collecting from the full page queue
      fix signed/unsigned; fix heap_destroy assert failure
      initial work on a two-level page-map
      fix page_map initialization
      revert back to flat address map
      add -mtune=native with opt arch
      experiment with 2 level pagemap
      improving level 2 page-map
      small fixes
      rename option pagemap_commit; always commit the page map on macos (for now)
      support full secure build
      clean up
      cleanup, some renaming
      cleanup
      old purge delay
      merge from dev3
      fix recursive tls access on macOS <= 14
      document way to use a TLS slot on windows
      add abandoned_visit_blocks
      commit 2level page-map on over-commit systems
      remove is_large member (and use is_pinned for this)
      add _mi_os_guard_page_size
      fix guard page size
      fix purge delay check for arenas
      double arena per 4; large page objects 1/8 of large page size
      max obj size 1/8 of a page
      commit page on demand
      improve page commit on demand
      fix assertion for huge pages
      fix huge page allocation size
      fix rounding issue with huge size allocations
      rename page options
      nice colors for heap maps
      remove is_expandable requirement on page candidates
      fix build warning
      fix page commit-on-demand setting
      commit page-map within one allocation
      wip: merging from upstream
      improve commit stats
      small fixes
      fix debug_show_arenas parameters
      fix constructor re-initialization on subproc_main
      fix c++ initializer warning
      renamed vcxproj
      add comments about TLS
      add attr_noexept for better codegen on msvc
      add declspec hidden to improve codegen on arm64
      use fixed tls on windows with static linking
      merge from dev3
      add comments
      make bitmap scan cross bfields for NX; disable the use of large object pages
      fix debug output
      fix scan of NX
      fix NX test in try_find_and_clearN
      fix pointer alignment for zero-sized large alignment case
      search size bins from small to large
      fix enable large pages