Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Fix _legalize_path types #5224

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Fix _legalize_path types #5224

wants to merge 3 commits into from

Conversation

snejus
Copy link
Member

@snejus snejus commented May 4, 2024

Description

Part 2 of the work fixing types in beets.util.__init__ #5215.

Mypy was not happy here because _legalize_stage function implementation concatenates path and extension parameters, implying that their types need to match.

You can see that initially path parameter was defined as a str while extension was bytes.

In reality, depending on the fragment parameter value, extension was sometimes provided as a str and sometimes as bytes. The same parameter decided whether path gets converted into bytes within _legalize_stage implementation. No surprise that mypy was confused here.

_legalize_stage is only used within Item.destination method implementation which is where fragment is defined. I determined that the fragment parameter controls the form of the output path:

  • fragment=False returned absolute path as bytes (default)
  • fragment=True returned path relative to the library directory as str.

Given the above, the change

  1. Renames fragment parameter to relative_to_libdir for clarity

  2. Makes Item.destination to return the same type in both cases. I picked bytes since that's the type that majority of the code using this method expects.

    I converted the output path to str for the code that has been expecting a string there.

  3. Decouples _legalize_stage and _legalize_path implementations from the relative_to_libdir. The logic now uses str type only.

@snejus snejus self-assigned this May 4, 2024
@snejus snejus requested a review from wisp3rwind May 4, 2024 12:35

This comment was marked as resolved.

@snejus snejus force-pushed the fix-most-types-in-util-init branch 4 times, most recently from 79cf3fd to 6a1f684 Compare May 6, 2024 08:30
@snejus snejus force-pushed the fix-most-types-in-util-init branch 4 times, most recently from 80c3daa to 860daa9 Compare September 13, 2024 22:06
Base automatically changed from fix-most-types-in-util-init to master September 18, 2024 12:40
@snejus snejus force-pushed the fix-legalize-path branch 2 times, most recently from b53b089 to 48f6d71 Compare December 2, 2024 01:00
@snejus
Copy link
Member Author

snejus commented Dec 2, 2024

@wisp3rwind this is ready for a review - trying to revive this work

Copy link
Member

@wisp3rwind wisp3rwind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice seeing some progress here!

I think it's a great observation that there's only really one caller to legalize_path (i.e. Item.destination), which makes it straightforward to adapt the signature of legalizat_path to get rid of all its weirdness. This seems like the right thing to do!

It seems correct that the input to legalize_path should always be Unicode: The source it always a template evaluation, which is performed using Unicode strings, rather than something filesystem-specific or bytes. The output is most naturally bytes, since it's our canonical representation for things supposed to be passed to the filesystem.

At a relatively superficial glance, this PR looks reasonable to me (the one inline comment that I left is probably out-of-scope).

One thing I'm unsure about is the use truncate (but I'm not sure it made sense before, either): If the purpose of truncation is to deal with filesystem limitations, this should model as closely as possible what matters to the underlying filesystem. Previously, truncation could be performed on str or bytes depending on the fragment flag (notably, not depending on platform or filesystem). Now, it's always based on Unicode code points: In one sense, this is an improvement, since the result will always be valid UTF-8. On the other hand, the byte-length of the result is not bounded by the specified length. Ideally, I feel like this should attempt to truncate at Unicode graphemes boundaries, but measure length in terms of encoded bytes or Unicode code points depending on platform/filesystem.

It's also unexpected that the previous implementation used displayable_path in the legalization procedure: That is meant to remove non-printable characters, but does that align with what makes sense to be part of a filename? Maybe it would be interesting to check the git blame whether it has some rationale for its use here?

In any case, the previous solution doesn't seem to be ideal either (or, at least, it lacks sufficient documentation to make sense of it for me), so perfecting the truncation might be worth opening a tracking issue for, but out-of-scope in this PR. Do we have any bug reports that might be related to path truncation and legalization?

beets/library.py Outdated Show resolved Hide resolved
@snejus
Copy link
Member Author

snejus commented Dec 10, 2024

It's also unexpected that the previous implementation used displayable_path in the legalization procedure: That is meant to remove non-printable characters, but does that align with what makes sense to be part of a filename? Maybe it would be interesting to check the git blame whether it has some rationale for its use here?

I followed your suggestion and had a look at the history of this functionality: this PR and this issue may be the most informative. Adrian nicely summarised the point of having to use bytes in this comment:

Thank you! This is related to #496, where truncation was also creating invalid filenames by bypassing the replace rules.

To resolve this, you could imagine doing one of two things:

Do the truncation on pre-encoded, Unicode character strings.
Somehow ensure, after truncation on the byte string, that we trim partial encoded characters.
The trouble with option 1 is that, on Unix, paths are byte strings—so a filesystem's maximum length is specified in bytes, not characters. (Not sure about Windows; it's probably characters there.) And the difficulty with 2 is that finding the character boundaries amounts to re-decoding back to Unicode.

See the rest of the thread around this comment. It seems like this may be less relevant in Python 3 ;)

@snejus
Copy link
Member Author

snejus commented Dec 10, 2024

All of this functionality was written back in the day when some useful libraries did not exist yet. Take pathvalidate, for example, which could be helpful for us here.

Background
  The `_legalize_stage` function was causing issues with Mypy due to
  inconsistent type usage between the `path` and `extension` parameters.
  This inconsistency stemmed from the `fragment` parameter influencing the
  types of these variables.

Key issues
  1. `path` was defined as `str`, while `extension` was `bytes`.
  2. Depending on `fragment`, `extension` could be either `str` or `bytes`.
  3. `path` was sometimes converted to `bytes` within `_legalize_stage`.

Item.destination` method
  - The `fragment` parameter determined the output format:
    - `False`: Returned absolute path as bytes (default)
    - `True`: Returned path relative to library directory as str

Thus
  - Rename `fragment` parameter to `relative_to_libdir` for clarity
  - Ensure `Item.destination` returns `bytes` in all cases
  - Code expecting strings now converts the output to `str`
  - Use only `str` type in `_legalize_stage` and `_legalize_path`
    functions
  - These functions are no longer dependent on `relative_to_libdir`
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants