Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

gh-113993: Allow interned strings to be mortal, and fix related issues #120520

Merged
merged 79 commits into from
Jun 21, 2024

Conversation

encukou
Copy link
Member

@encukou encukou commented Jun 14, 2024

I've spent too much time looking at this myself, it wants more eyes :)

I spent a week learning about the string interning mechanism, and wrote up how I think it should work in an InternalDocs file I'm adding here.

I found a bunch of ... quirks if not outright bugs. For example, we have duplicate singletons (e.g. _Py_ID(a) and the latin1 short string a). I don't think I can bring back mortal interned strings without getting my idea of the de# sync with the code, so, this ended up being a big PR.


  • Add an InternalDocs file describing how interning should work and how to use it.
    (Please review this first!)

  • Add internal functions to explicitly request what kind of interning is done:

    • _PyUnicode_InternMortal
    • _PyUnicode_InternImmortal
    • _PyUnicode_InternStatic
  • Switch uses of PyUnicode_InternInPlace to those.

  • Disallow using _Py_SetImmortal on strings directly.
    You should use _PyUnicode_InternImmortal instead:

    • Strings should be interned before immortalization, otherwise you're possibly
      interning a immortalizing copy.
    • _Py_SetImmortal doesn't handle the SSTATE_INTERNED_MORTAL to
      SSTATE_INTERNED_IMMORTAL update, and those flags can't be changed in
      backports, as they are now part of public API and version-specific ABI.
  • Add private _only_immortal argument for sys.XXX, used in refleak test machinery.

  • Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:

    • _Py_ID
    • _Py_STR (including the empty string)
    • one-character latin-1 singletons

    Now, when you intern a singleton, that exact singleton will be interned.

  • Add a _Py_LATIN1_CHR macro, use it instead of _Py_ID/_Py_STR for one-character latin-1 singletons everywhere (including Clinic).

  • Intern _Py_STR singletons at startup.

    Try this in 3.12: (click to expand)
    import sys
    
    a = sys.intern('<module>')  # normal string
    print('a', id(a), sys.getrefcount(a))
    try:
        raise Exception()
    except Exception as err:
        b = err.__traceback__.tb_frame.f_code.co_name  # same string via _Py_STR
    
    assert sys.intern(a) is sys.intern(b)

    In 3.13 the reproducer doesn't work but I don't think the underlying unsoundness was fixed.

  • For free-threaded builds, intern _Py_LATIN1_CHR singletons at startup.

  • Beef up the tests. Cover internal details (marked with @cpython_only).

  • Add lots of assertions

encukou added 30 commits June 14, 2024 21:24
Also, the `PyUnicode_InternImmortal` function (with a public-looking name)
is switched to use this.

(I picked a relatively inconsequential module on purpose.)
AFAIUI, this is now handled by the `statically_allocated` flag.
…rned

Mortal interned strings don't count references from the interned_dict in
their refcount. This menans a -2 at interning time, a +2 in ClearInterned,
and special handling in unicode_dealloc.

Note that unicode_dealloc will currently immortalize an interned string.
That means we shouldn't get refleaks *yet*.
bpo-40521 (pythonGH-84701) is now long solved.
It's marshal and the compiler that immortalize strings, not the
code object.
@bedevere-bot bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Jun 21, 2024
@encukou encukou added needs backport to 3.12 only security fixes needs backport to 3.13 bugs and security fixes labels Jun 21, 2024
@encukou
Copy link
Member Author

encukou commented Jun 21, 2024

The buildbot failures are unrelated; I'll merge.

Thank you for the reviews!

@encukou encukou merged commit 6f1d448 into python:main Jun 21, 2024
126 of 129 checks passed
@miss-islington-app
Copy link

Thanks @encukou for the PR 🌮🎉.. I'm working now to backport this PR to: 3.12, 3.13.
🐍🍒⛏🤖

@encukou encukou deleted the immortal-interned branch June 21, 2024 15:19
@miss-islington-app
Copy link

Sorry, @encukou, I could not cleanly backport this to 3.13 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6f1d448bc110633eda110310fd833bd46e7b30f2 3.13

@miss-islington-app
Copy link

Sorry, @encukou, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 6f1d448bc110633eda110310fd833bd46e7b30f2 3.12

@ericsnowcurrently
Copy link
Member

Thanks for taking care of this, @encukou!

encukou added a commit to encukou/cpython that referenced this pull request Jun 24, 2024
…related issues (pythonGH-120520)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@bedevere-app
Copy link

bedevere-app bot commented Jun 24, 2024

GH-120945 is a backport of this pull request to the 3.13 branch.

@bedevere-app bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jun 24, 2024
encukou added a commit that referenced this pull request Jun 24, 2024
…d issues (GH-120520) (GH-120945)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
mrahtz pushed a commit to mrahtz/cpython that referenced this pull request Jun 30, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@eduardo-elizondo
Copy link
Contributor

This is great! Quick follow-up @encukou, should we rebase this one now: #113601?

noahbkim pushed a commit to hudson-trading/cpython that referenced this pull request Jul 11, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
estyxx pushed a commit to estyxx/cpython that referenced this pull request Jul 17, 2024
… issues (pythonGH-120520)


* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* For free-threaded builds, intern `_Py_LATIN1_CHR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-Authored-By: Eric Snow <ericsnowcurrently@gmail.com>
@encukou
Copy link
Member Author

encukou commented Jul 22, 2024

@eduardo-elizondo Yup, I've updated it, and left some notes about the docs.
If you want to delegate the writing, let me know and I'll take over the PR :)

@eduardo-elizondo
Copy link
Contributor

@eduardo-elizondo Yup, I've updated it, and left some notes about the docs. If you want to delegate the writing, let me know and I'll take over the PR :)

Fixing them now! Let's close out in the other PR

encukou added a commit to encukou/cpython that referenced this pull request Aug 16, 2024
…related issues (pythonGH-120520)

* Add an InternalDocs file describing how interning should work and how to use it.

* Add internal functions to *explicitly* request what kind of interning is done:
  - `_PyUnicode_InternMortal`
  - `_PyUnicode_InternImmortal`
  - `_PyUnicode_InternStatic`

* Switch uses of `PyUnicode_InternInPlace` to those.

* Disallow using `_Py_SetImmortal` on strings directly.
  You should use `_PyUnicode_InternImmortal` instead:
  - Strings should be interned before immortalization, otherwise you're possibly
    interning a immortalizing copy.
  - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
    `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
    backports, as they are now part of public API and version-specific ABI.

* Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

* Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
  - `_Py_ID`
  - `_Py_STR` (including the empty string)
  - one-character latin-1 singletons

  Now, when you intern a singleton, that exact singleton will be interned.

* Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

* Intern `_Py_STR` singletons at startup.

* Beef up the tests. Cover internal details (marked with `@cpython_only`).

* Add lots of assertions

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
Yhg1s pushed a commit that referenced this pull request Sep 27, 2024
…H-121903, GH-122303) (#123065)

This backports several PRs for gh-113993, making interned strings mortal so they can be garbage-collected when no longer needed.

* Allow interned strings to be mortal, and fix related issues (GH-120520)

  * Add an InternalDocs file describing how interning should work and how to use it.

  * Add internal functions to *explicitly* request what kind of interning is done:
    - `_PyUnicode_InternMortal`
    - `_PyUnicode_InternImmortal`
    - `_PyUnicode_InternStatic`

  * Switch uses of `PyUnicode_InternInPlace` to those.

  * Disallow using `_Py_SetImmortal` on strings directly.
    You should use `_PyUnicode_InternImmortal` instead:
    - Strings should be interned before immortalization, otherwise you're possibly
      interning a immortalizing copy.
    - `_Py_SetImmortal` doesn't handle the `SSTATE_INTERNED_MORTAL` to
      `SSTATE_INTERNED_IMMORTAL` update, and those flags can't be changed in
      backports, as they are now part of public API and version-specific ABI.

  * Add private `_only_immortal` argument for `sys.getunicodeinternedsize`, used in refleak test machinery.

   Make sure the statically allocated string singletons are unique. This means these sets are now disjoint:
    - `_Py_ID`
    - `_Py_STR` (including the empty string)
    - one-character latin-1 singletons

    Now, when you intern a singleton, that exact singleton will be interned.

  * Add a `_Py_LATIN1_CHR` macro, use it instead of `_Py_ID`/`_Py_STR` for one-character latin-1 singletons everywhere (including Clinic).

  * Intern `_Py_STR` singletons at startup.

  * Beef up the tests. Cover internal details (marked with `@cpython_only`).

  * Add lots of assertions

* Don't immortalize in PyUnicode_InternInPlace; keep immortalizing in other API (GH-121364)

  * Switch PyUnicode_InternInPlace to _PyUnicode_InternMortal, clarify docs

  * Document immortality in some functions that take `const char *`

  This is PyUnicode_InternFromString;
  PyDict_SetItemString, PyObject_SetAttrString;
  PyObject_DelAttrString; PyUnicode_InternFromString;
  and the PyModule_Add convenience functions.

  Always point out a non-immortalizing alternative.

  * Don't immortalize user-provided attr names in _ctypes

* Immortalize names in code objects to avoid crash (GH-121903)

* Intern latin-1 one-byte strings at startup (GH-122303)

There are some 3.12-specific changes, mainly to allow statically allocated strings in deepfreeze. (In 3.13, deepfreeze switched to the general `_Py_ID`/`_Py_STR`.)

Co-authored-by: Eric Snow <ericsnowcurrently@gmail.com>
@@ -0,0 +1,122 @@
# String interning
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a link to this file from the index in InternalDocs/README.md.

# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants