
Steal pygments regexes #34

Merged
merged 20 commits
Jul 7, 2024
Conversation

@Moosems (Member) commented Jul 3, 2024

Fixes #33

Not sure how reliable this is. @leycec If you choose to try using Salve, would you be able to test this for me, help me find cases where it fails, and tell me whether it's slow in your experience? At some point I will probably optimize the proper_docstring_tokens function, as it's currently just trying to make things work. I'm sure there are hundreds of better ways than how I did this; I'm just too tired to find them.

@Moosems Moosems self-assigned this Jul 3, 2024
@leycec commented Jul 3, 2024

Ho, ho, ho. Merry Canada Day. As a loyal Canadian, it is my lawful duty to do stuff on GitHub. Sadly, I too am too tired to do stuff on GitHub. It's all Canada Day's fault. I had no choice but to bicycle 50 km of rough back-country trails with my wife today and am now busted. Thanks a lot, Canada Day!

Therefore, testing Salve may never actually happen. I should do that, but I should also maintain @beartype. But I can barely maintain @beartype. Therefore, nothing else tends to happen. That said...

I Code-Reviewed This Anyway! Hurrah!

...actually, I just speed-read the thing. Massive stuff here, bro. Super-intense. I'd be exhausted, too. I don't actually see any big deals – just small annoying efficiency and compatibility stuff that will annoy you. Let's begin with the biggest smell.

If efficiency matters, non-trivial PEP 526-compliant variable assignment type hints like regexes: list[str] = ... and tokens: tuple[_TokenType | Callable, ...] = ... can be a bit problematic. Technically, that approach works. Pragmatically, that approach also reinstantiates those type hints each time those variable assignments are performed. In the case of tokens: tuple[_TokenType | Callable, ...] = ..., you've nested that assignment inside a doubly-nested O(n**2)-style for loop. That's a lot of reinstantiations.

Why are these reinstantiations happening? Because here's what @beartype import hooks like beartype_this_package() are actually doing under the hood:

# beartype_this_package() silently transforms this single statement...
tokens: tuple[_TokenType | Callable, ...] = [
    cell.cell_contents
    for cell in token_tuple[1].__closure__  # type: ignore
][0]

# ...into this pair of statements. I trust we see the problem.
tokens = [
    cell.cell_contents
    for cell in token_tuple[1].__closure__  # type: ignore
][0]
die_if_unbearable(tokens, tuple[_TokenType | Callable, ...])

Right? Each iteration of that doubly-nested O(n**2)-style for loop is reinstantiating the non-trivial tuple[_TokenType | Callable, ...] type hint, increasing both space and time complexity – especially time complexity.

"B-b-but... this is horrible!", you may now be thinking.

...heh. It's not so bad. Really. Believe my convenient lies. Preserve speed by just globalizing all non-trivial type hints that are annotating variable assignments as so-called type aliases: e.g.,

_ListOfStrs = list[str]
'''
Dumb type alias to avoid inefficient reinstantiation of this type hint
each time the :func:`.get_pygments_comment_regexes` function is called.
'''

def get_pygments_comment_regexes(
    lexer: RegexLexer,
) -> list[tuple[str, _TokenType]]:
    useful_toks = {
        StringToken.Doc,
        StringToken.Heredoc,
        # StringToken.Double,
        CommentToken,
        CommentToken.Multiline,
    }
    #regexes: list[str] = []  # <-- INSTEAD OF THIS...
    regexes: _ListOfStrs = []  # <-- ...just do this. Easy-peasy, huh? High five.

Honestly, this is only really a problem in that doubly-nested O(n**2)-style for loop.

Let's see... what else. Oh. Right:

  • No comments and docstrings scares me. I get it. You're exhausted in body and soul. Still, this is frightening code that gives me a rictus grimace. Consider commenting and docstring-ing this shadow madness whilst you still can, before it plummets over the event horizon of forgetfulness and you realize with dismay you no longer grasp how this improvised explosive device even works anymore. "Did it ever work?", you think to yourself. You begin to harbour second thoughts about life and decide to just play video games instead.
  • Comments like # No idea whats going on here inspire confidence. Let's choose to believe that there's a unit test exercising this edge case. Optimism!
  • Tests like if not tokens: continue seem like noops. Is that actually doing anything? Probably not. Try dropping that entirely and see what happens. Optimism intensifies!
  • Conditionals like if not isinstance(token_tuple[1], Callable): can probably be reduced to just if not callable(token_tuple[1]):. Probably. The latter is a little more readable. Annoying! Next.
  • Privacy violations will blow up on you. The only reason @beartype is still standing on one wooden peg leg after all these years is that @beartype never violates privacy encapsulation. This is what I'm concerned about:
from pygments.token import _TokenType

That's super-dangerous and guaranteed to explode all over your lap when you least expect it. Instead of directly accessing private types that will be renamed or moved in a subsequent pygments release, dynamically define that type by calling the type() builtin on an appropriate object. For some reason (okay, it's autism), I've actually tested this as working on my local system:

# It may not look like it, but this is a lot less fragile than
# directly trying to import the private "_TokenType" type.
from pygments.lexers import PythonLexer
_TokenType = type(PythonLexer().tokens["root"][0][1])  # <-- ...heh. can't believe that works
'''
Private :class:`pygments.token._TokenType` type safely deduced from an
inscrutable code snippet some autistic dude wrote. Who even knows anymore.
'''
  • The type hint list[tuple[str, _TokenType]] is reinstantiated in multiple places. Space complexity goes brrr. But that doesn't really matter. That's just me being annoying. What really matters is the Don't Repeat Yourself (DRY) violation here. If that data structure ever changes (and it will. oh, how it will.), you're in for a dry Friday. Again, I'd strongly consider globalizing all non-trivial type hints that are repeatedly referenced in multiple locations as so-called type aliases: e.g.,
_PygmentsCommentRegexes = list[tuple[str, _TokenType]]
'''
Smart type alias to avoid Don't Repeat Yourself (DRY) violations.
'''

def get_pygments_comment_regexes(
    lexer: RegexLexer,
) -> _PygmentsCommentRegexes:
    ...

def proper_docstring_tokens(lexer: RegexLexer, full_text: str) -> list[Token]:
    proper_highlight_regexes: _PygmentsCommentRegexes = (
        get_pygments_comment_regexes(lexer))

If you find yourself defining enough global type aliases like this, consider stuffing them all in a common private salve._muhtypealiases submodule or something. Aaaaaaand...

I'm spent. Let's pretend this was a worthwhile expenditure of somebody's Tuesday. Exhausted cat, I invoke thee!


tuesday do be like that

Moosems added 2 commits July 3, 2024 16:46
@leycec I tried using the idea you gave for private types but got Union type errors for some reason I couldn't really understand. Thank you so much for your time and thoughts and congrats on the bike ride, that's a long stretch!
@Moosems (Member Author) commented Jul 3, 2024

@rdbende @Akuli super big request, but would either of you be willing to give a code review or share your thoughts on this?

@Moosems Moosems added this to the v.0.6.0 milestone Jul 3, 2024
@Moosems (Member Author) commented Jul 5, 2024

Docstring section needs to be moved above the links and hidden chars

@Moosems (Member Author) commented Jul 5, 2024

proper_docstring_tokens should be optimized to take a list of strings, which would prevent full_text.splitlines() from being called multiple times but would make finding the locations of matches harder.

@Moosems (Member Author) commented Jul 6, 2024

@leycec
funny leycec meme

@Moosems Moosems changed the base branch from master to v0.6.0 July 6, 2024 05:19
Moosems added a commit that referenced this pull request Jul 7, 2024
* Bump version and add note to README

Prepare for release tomorrow and move forward for v0.7.0 and v0.8.0 release

* Steal pygments regexes (#34)

* test.py

* Move to server functions and set up proper tests

* Reformat

* Reformat

* Change to beartype typing

* Even more formatting

* Remove regex stealer test

Runs different locally than on gh runner and not worth the time or effort.

* Get docstring areas

* Make function work

* Add type annotation

* format

* Add lots of comments but don't remove private type

@leycec I tried using the idea you gave for private types but got Union type errors for some reason I couldn't really understand. Thank you so much for your time and thoughts and congrats on the bike ride, thats a long stretch!

* Fix a small bug

* Improve highlighting functions significantly

Now it ignores whitespace at the front of the line!

* Stop using private variable

* Format for black and ruff

* Move docstring tokens up

* Update tests

* Fix line number for docstring tokens

Was 1 behind

* Reformat
Moosems added a commit that referenced this pull request Jul 7, 2024
* Use 3.11 everywhere (#38)

* v0.6.0 (#42) (#45)

* Bump version

* Implement token overwriting (#49)

* Display the problem

* Get working test

* Better overlap checking

* Better tests

* Sort and remove duplicates

* Remove old vestige and format

* Move token merging to highlight file

* Format

* Use overwrite_and_merge_tokens

* Cache important functions

* Remove old file

* Format
Moosems added a commit that referenced this pull request Jul 7, 2024
* Use 3.11 everywhere (#38)

* v0.6.0 (#42) (#45)

* Use caches (#53)

* Cache important functions

* Remove old file

* Format

* Explain python version requirement plans (#54)

* Check that all tokens are in text range
Moosems added a commit that referenced this pull request Jul 8, 2024
* Tokens outside text range (#56)

* Create Windows tests (#57)

* Start small

Use pathlib.Path() and change from ubuntu-latest to Windows-latest

* Format

* Handle different Windows import

* Fix server.py (same issue as last)

* Update names

* Try printing out returned tokens

* Give both outputs

* Update README to note the issue

* Try breaking up different lines

* Use bullet points

* Refactor highlight (#58)

* Change highlight code structure

Moved into multiple separate files (could become a submodule)

* Move to submodule
@Akuli commented Jul 8, 2024

only subscriptable under Python >= 3.12

I see that you already solved this, but from __future__ import annotations may help in cases like this. It makes Python ignore all type annotations (actually, treat them as strings), so they can contain things that would throw errors.
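A minimal sketch of what that future import does (a standalone example; the function here is illustrative, not from this PR):

from __future__ import annotations

# With the future import, annotations are stored as plain strings rather
# than being evaluated at definition time, so a hint that would error on
# an older Python never actually runs.
def first_line(lines: list[str] | None) -> str | None:
    return lines[0] if lines else None

# The stored annotations are the literal source text:
print(first_line.__annotations__)
# {'lines': 'list[str] | None', 'return': 'str | None'}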

@Moosems (Member Author) commented Jul 8, 2024

Very interesting

@leycec commented Jul 9, 2024

I see that you already solved this, but from __future__ import annotations may help in cases like this.

...heh. You have invoked the Ultimate Evil, @Akuli. PEP 563 (i.e., She Whose Name Shall Not Be Uttered) is what from __future__ import annotations conditionally enables. The name is deceptive. PEP 563 is, in fact, not the future; PEP 563 is so fundamentally broken that it erupted a CPython dev crisis delaying the release of CPython 3.10 by several months. PEP 563 should never have happened. But it did, but only because no one in the typing community was thinking the real-world runtime implications of PEP 563 through to their obvious conclusions. It's best to ignore the existence of PEP 563. Everyone else does.

PEP 649 – Deferred Evaluation Of Annotations Using Descriptors is the future. PEP 649 has nothing to do with strings. Indeed, anyone attempting to reduce type hints to strings has already committed a grievous affront to nature. Ignore those people.

It makes Python ignore all type annotations (actually, treat them as strings), so they can contain things that would throw errors.

...heh. That's awful. Runtime type-checkers like @beartype and typeguard evaluate type hints at runtime. If your type hint contains "things that would throw errors," your type hints are broken at runtime. @beartype and typeguard raise exceptions on attempting to evaluate broken type hints. Right? Thus, awful.
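For instance, here's a minimal sketch of that failure mode (the function and type names are hypothetical):

from typing import get_type_hints

# Defining this succeeds, because the stringified annotation is never
# evaluated at definition time...
def muh_func(muh_arg: "TotallyMissingType") -> None:
    pass

# ...but anything that evaluates annotations at runtime now explodes:
get_type_hints(muh_func)  # NameError: name 'TotallyMissingType' is not defined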

The only exception to this is forward references (i.e., stringified references to user-defined classes that have yet to be defined). For the moment, everybody does still have to use stringified forward references: e.g.,

def muh_func(muh_arg: 'UndefinedType') -> None:
    print(f'Why you gotta be so undefined, {muh_arg}?')

class UndefinedType(object):
    muh_attr: int = 0xFEEDFACE

Right? Stringified forward references are fine... for now. You should know, however, that stringified forward references are basically deprecated under Python ≥ 3.14 – so, next year. That's when PEP 649 comes online. At that point, everybody should gradually stop embedding strings in type hints. By 2030, nobody should still be embedding strings in type hints. The Old World of QA was a bad world.

But... enough about all this boring stuff! It is boring. I am yawning and my face is numb. 🥱

@Moosems, To You I Apologize

OMG. The Bernie Sanders meme! So much nostalgic good feels. Canadians and, indeed, most non-American first-worlders were rooting for that tired old guy. Instead, well... if there's One Rule of GitHub to Rule Them All, it's that one does not discuss American politics. Anywheeeeeere.

Sadly, I have done nothing useful tonight other than pontificate uselessly on the perfidious evil of PEP 563 and from __future__ import annotations. So this is how another night dies, huh? Not with a bang, but a GitHub comment. 😮‍💨

@Moosems (Member Author) commented Jul 9, 2024

Heh, as all good nights do. Get some good rest!

@leycec commented Jul 13, 2024

Geh. Finally, I admit to myself I probably won't be able to code review anything else. I'm so sorry, @Moosems. You've thrown so much excruciating volunteerism at this, too. The @beartype 0.19.0 release clock is a ticking time bomb that has me sweating bullets in a shack in the woods.

I do have a theoretical question:

"Regular expressions, huh?"

I get it. I'd steal everything not nailed to the floor from pygments, too. Going forward, though, I see Salve as potentially offering something a lot more interesting and substantial than regular expression-based parsing. Well, theoretically. It's all up to you and how much hardcore Computer Science parsing theory you can stomach around here. The issue, of course, is that...

Regular Expressions Suck

Srsly. Almost all programming languages that matter – including Python itself – are not regular languages. Therefore, they cannot be parsed with regular expressions. Period. Conventional syntax highlighting frameworks like pygments get away with this by failing in common edge cases, which isn't that great.

Almost all programming languages that matter aren't even context-free languages (i.e., the next highest order of languages above regular languages). Almost all programming languages that matter are non-context-free. This means that the only means of actually parsing almost all programming languages that matter in a way that doesn't suck and copiously break down in common edge cases is with...

Parsing Expression Grammars (PEGs)

Almost all programming languages that matter are perfectly parsable with Parsing Expression Grammars (PEGs) – the most popular variant of parser in the modern era. PEGs produce extremely fast parsers that perfectly parse most non-context-free languages.

This includes the official Python language itself. In fact, this includes the official CPython parser baked into the official CPython interpreter. The official CPython parser is actually a PEG-based parser authored by Guido himself. The third-party parsimonious package even makes it mostly trivial to author your own PEG-based parser in pure Python.
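For a taste, here's a minimal sketch of a toy arithmetic grammar written with parsimonious (the grammar and rule names are illustrative, not anything Salve ships):

from parsimonious.grammar import Grammar

# A toy PEG: "/" is ordered choice, "~" introduces a regex terminal.
grammar = Grammar(
    r"""
    expr   = term (ws? op ws? term)*
    term   = number / parens
    parens = "(" ws? expr ws? ")"
    op     = "+" / "-" / "*" / "/"
    number = ~r"[0-9]+"
    ws     = ~r"\s+"
    """
)

tree = grammar.parse("(1 + 2) * 3")  # returns a parse tree of matched nodes
print(tree)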

I guess what I'm saying here is... the one thing that Salve can conceivably do better than anyone else (including pygments) is to perform PEG- rather than regex-based parsing. That would be a "killer app" for Salve – and a huge incentive for downstream editors, libraries, and plugins to adopt, embrace, and extend Salve.

PEG: when you gotta parse, parse it right. Knock the competition down a peg. 💪 😆

@Moosems (Member Author) commented Jul 13, 2024

@leycec By the way, just thought I should let you in on the secret: the v1.0.0 branch now has support for tree sitter and will soon support LSP ;)

@Moosems (Member Author) commented Jul 14, 2024

I'm also noticing parsimonious hasn't been maintained for a while despite PRs and issues being opened. Maybe there's a more maintained fork that could be used, or I could implement my own.

@Akuli commented Jul 14, 2024

...heh. You have invoked the Ultimate Evil, @Akuli.

Thanks for letting me know about how the situation has changed :)

Still, instead of "Ultimate Evil", I see from __future__ import annotations like most other things in software: sometimes it is the thing you want, sometimes it isn't. If you care about runtime type checking (including things like pydantic), it doesn't work. If you use static type checking, it works on all reasonably new Python versions and it allows annotating code without making a mess.

Almost all programming languages that matter – including Python itself – are not regular languages.

This is correct, but in many cases it doesn't matter for syntax highlighting because you need to tokenize, not parse.

For example, one well-known limitation of regular expressions is that they cannot be used to match parentheses. So just don't match parentheses. Every ( or ) can be just an operator token by itself. Instead of creating AST like Call(func="print", args=[String("hello")]), you create tokens like [Name("print"), Op("("), String('"hello"'), Op(")")].
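As a minimal sketch of that flat-token idea (the token names here are made up):

import re

# One regex alternation, one named group per token kind. No nesting, no
# recursion: every "(" or ")" is just an OP token by itself.
TOKEN_RE = re.compile(
    r"""(?P<NAME>[A-Za-z_]\w*)
      | (?P<STRING>"[^"]*")
      | (?P<OP>[()+\-*/=,])
      | (?P<WS>\s+)""",
    re.VERBOSE,
)

def tokenize(text: str) -> list[tuple[str, str]]:
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup != "WS":  # drop whitespace tokens
            tokens.append((match.lastgroup, match.group()))
    return tokens

print(tokenize('print("hello")'))
# [('NAME', 'print'), ('OP', '('), ('STRING', '"hello"'), ('OP', ')')]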

Nested f-strings are similar, they cannot be handled with regex. But if a user needs more than 2 nested f-strings, something is deeply wrong anyway :)

That said, I don't know whether regex is enough to figure out whether Python's match keyword is being used or not (match is also a valid variable name). Probably not? :)

@Moosems (Member Author) commented Jul 14, 2024

match/case are correct as keywords :)

@leycec commented Jul 15, 2024

I'm also noticing parsimonious hasn't been maintained for a while despite PR's and issues being opened.

...heh. Indeed, I see that you too check open PRs, issues, and most recent commit date before committing to yet another mandatory dependency. You're not wrong. parsimonious is dead. Personally, if it was me doing this very salivating Salve thing, I'd fork parsimonious as parsimonioustoo. ...get it? "too"? 2? i'll stop

On the one hand, nobody wants to maintain yet another dead-on-arrival open-source package. Nobody got time, money, or spare "fricks to give", @leycec! On the other hand, parsimonious does have 1.8k GitHub stars. Resurrecting that would be the fast road to instant GitHub stardom – assuming anyone cares about that sort of thing. Resume, blah, blah, late-stage capitalism, blah, blah, who cares.

Still, depending on how ambitious and muscle-bound you feel about this whole thing, I'd lightly consider:

  • Creating a new overarching github.com/salve organization containing at least:
    • This Moosems/salve repository moved under that organization as github.com/salve/salve.
    • A github.com/salve/parsimonioustoo fork or whatevahs.
    • Various IDE-specific Salve plugins. You just know this is gonna happen, @Moosems. I can already see:
      • github.com/salve/vim-salve. So much hype!
      • github.com/salve/neovim-salve. Almost as much hype!
      • github.com/salve/vscode-salve. Boooooooo! Still, VSCode dominates. It is sad, yet reality is sad.

Basically, everybody regrets not making a project organization when they have the chance – and when they have the chance is when a project is young and idealistic. As soon as a project grows up, it becomes increasingly impossible to move. It gets crufty and balding, much like myself. 👨‍🦲

@Moosems (Member Author) commented Jul 15, 2024

The way I see it, when it comes to syntax highlighting options for the users there are three main ones available:

  1. A somewhat slow but also fairly decent highlighter that is flexible and requires almost no setup (the pygments based one)
  2. A faster highlighter that's incredibly custom and flexible but requires a lot more work to set up and use (the parsimonious-like variant)
  3. A very fast syntax highlighter that's a bit less custom/flexible and requires a bit more work than the first option but a lot less than the second (tree sitter)

This work can be mitigated (and I plan to do so) by making the salve dependency hub (a separate repo) hold not only tree sitter highlight languages but also pre-made mappings for certain common languages I use and would like mappings for (e.g. Python). The idea is that salve holds the main API and the nice logic while staying very flexible, and salve dependency hub offers pre-made solutions for the API's needs, making it even simpler than before to get something of good quality with next to no work. That being said, I've considered making this its own organization a lot, and it would make a lot of sense, but I fear the dreaded CI/CD costs. Currently I am able to get free access to GitHub Pro, and it doesn't seem to be tracking my minutes. But because I need to run a variety of quality checks on code before pushing to main (as I lack the ability to get outside reviewers), I use a fair amount of CI/CD minutes per push to any given branch (generally about 8 or 9), and that could quickly burn through the 2,000 free minutes a free organization gets, in just under 200 pushes a month.

@leycec commented Jul 16, 2024

That being said, I've considered making this it's own organization a lot and it would make a lot of sense but I fear the dreaded CI/CD costs.

...heh. I too fear money. Fortunately, it's free! Like public repos, public organizations pay nothing. GitHub Actions-based CI/CD is just as free for public organizations as it is for public repos. Blessings be upon Microsoft. Their generosity trickles down onto us all.

You think @leycec is rolling in bills? I live in a cabin in the Canadian wilderness with three smelly cats and a beautiful scientist wife. The only bills we roll in are the Monopoly money I furiously throw about as I lose yet another late-night gaming session. curse you, park place. curse you.

Of course, you receive what you get. In this case, public organizations only receive two member tiers for controlling access:

  • Owners. That's you and basically nobody else. Owners can destroy and create the Universe itself. I'd avoid promoting Internet strangers to Owners over your organization unless you really get to know them. Interestingly, this happened with @beartype recently. I promoted an awesome ex-Microsoft Dutch dude I've gradually gotten to know through the @beartype organization to Co-Owner and... it went great! You will eventually be so lucky, too.
  • Members. This is everybody else. Members own their own repositories in your organization. Members can destroy and create only within those repositories. Members are a super-safe way to migrate other people's repositories into your organization while still allowing those people to retain ownership over their own repositories. @beartype does this a lot. Assimilate them all, @beartype.

Private organizations have considerably more access control mechanisms. Still, two member tiers is pretty much all I personally have ever needed. Just... yeah. Be cautious about that "Owner" role. Trust is a dead-end alley that mostly ends in knife fights out back behind a dumpster. 🔪 🗑️

I use a fair amount of CI/CD minutes per push to any given branch...

...huh. Really? On public open-source repositories? That doesn't quite seem right. The @beartype test suite is exhaustive, exhausting, and extremely long-winded. It currently runs against Python 3.8 through 3.13 (inclusive) on Windows, macOS, and Linux. I've never hit a single issue with CI/CD minutes or usage limits or... well, anything, honestly. I don't have GitHub Pro, because I don't have money. I haven't paid a single Holy Dollah Bill to Microsoft – which I kinda feel bad about, because Microsoft has gone above and beyond the call of open-source duty here.

Still, I'll start paying Microsoft money when Microsoft starts paying @beartype money. Microsoft does extensively use @beartype throughout their AI workflows, after all. Do the right thing, Satya. Grow @beartype into a disgusting behemoth of money! 🤑

If you ever do hit resource limits for whatevah reason, just inject this tidy one-liner into each job of your .github/workflows/test.yml workflow:

    # Time in minutes to wait on the command pipeline run below to exit
    # *BEFORE* sending a non-graceful termination request (i.e., "SIGTERM"
    # under POSIX-compliant systems).
    timeout-minutes: 10

That's what all @beartype workflows do, actually. That timeout only ever triggers when infinite recursion happens, which it occasionally does – almost always in somebody else's third-party code. @beartype is blameless. I swear it!

I need to run a variety of quality checks on code before pushing to main

Oh. Is that what's happening here? Huh. Didn't know that. Only CI/CD on main is free, huh? Fascinating. Lesson learned.

Still, can't you just run those quality checks locally? pytest and tox are disgusting behemoths of configuration. You can turn almost anything into a pytest unit test, given sufficient obsessive-compulsiveness and a willingness to suffer. 😉

Before I push to @beartype's remote main branch, I just locally run tox – which then dispatches pytest across all supported Python versions on my Linux box. Since I run Gentoo Linux, I already have all supported Python versions installed as system packages; if I was on another Linux distro, I'd just use pyenv to locally install different Python versions. In either case, local testing catches most issues. I then push to @beartype's remote main. If doing so erupts with a GitHub Actions-based CI/CD failure, I just reassess and push out a new "fix-up" commit as soon as feasible.

I wouldn't sweat that sort of thing too much, really. The managerial powers that an organization affords are way more critical to the success of a Big Project like @beartype or Salve than lower-level stuff like CI/CD minutes. You'll figure out the lower-level stuff. It will all flow like spice.

@Moosems (Member Author) commented Jul 16, 2024

I always run locally anyway; it's making sure the Windows tests succeed that's the problem.

@Moosems (Member Author) commented Jul 16, 2024

I'm actually running into some trouble with how Windows handles characters, to be honest, and I'm not sure what to do about it.

@leycec commented Jul 17, 2024

Yikes. Windows UTF-8 encoding hell, huh? The old demon rises again. Perhaps I can help, though. Kindly point me towards a failing unit test and I'll crack my arthritic knuckles for you.

@Moosems (Member Author) commented Jul 17, 2024

Maybe the enumerate doesn't work for graphemes?

@Moosems (Member Author) commented Jul 17, 2024

Not being able to effectively debug this without a second machine is the hardest part.

@leycec commented Jul 18, 2024

Not being able to effectively debug this without a second machine is the hardest part.

...hmm. Personally, I'd never go that far. To paraphrase the once-great now-zombified Meatloaf:

"I'd do anything for open-source, but I won't do that."

Installing Windows is a bridge too far. I won't install Windows even under a virtual machine. Heck, I won't even install FreeBSD under a virtual machine! It's work. You know? Nobody's gainfully employing you to do this. So, nobody's paid enough money for you to torture yourself. Open-source is supposed to be fun. When open-source ceases being fun, I stop doing open-source.

@leycec Here's all the relevant files and the lines:

Thanks so much for that. Would you also happen to have a failing test I can peruse? I've perused your most recent GitHub Actions, but everything looks passing. Clearly, looks are deceiving.

Incidentally, I'd avoid the += operator with respect to mutable containers like lists. Instead, consider explicitly calling the list.extend() method. The += operator is well-known to behave contextually and thus super-weirdly in certain contexts. It probably won't help – but at least the resulting code will be more explicit, readable, and debuggable: e.g.,

    # Instead of this suspiciously smelly code...
    new_tokens += get_special_tokens(full_text, split_text, text_range[0])

    # Consider this less suspicious code that hardly smells at all.
    new_tokens.extend(get_special_tokens(full_text, split_text, text_range[0]))

Monocle face approves. 🧐
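If it helps, here's a tiny standalone demonstration of that contextual behavior (not from Salve's codebase):

a = [1, 2]
b = a
b += [3]     # in-place: list.__iadd__ mutates the one list both names share
print(a)     # [1, 2, 3]  <-- "a" changed too!

c = [1, 2]
d = c
d = d + [3]  # rebinding: "+" builds a new list and "d" now points at it
print(c)     # [1, 2]     <-- "c" is untouched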

@Moosems (Member Author) commented Jul 18, 2024

The tests succeed because in the test_IPC file I just added an if statement that changes the expected output to match what Windows actually gives vs what it should ideally give.

@Moosems (Member Author) commented Jul 18, 2024

The .extend will be added in a commit to the v1.0.0 branch soon :)

@leycec commented Jul 19, 2024

The tests succeed because in the test_IPC file I just added an if statement that changes the expected output to match what Windows actually gives vs what it should ideally give.

Gotcha. That approach is sound. What you're doing is both totally valid and exactly what you have to do when dealing with POSIX-noncompliant platforms like Windows. Every project in every language ever inevitably confronts irregularities with Windows and responds in a similar way:

"Just hack it until unit tests shut up."

Windows is why I still cry at night. It helps that nobody in the Python world cares about Windows.

Windows Python users are generally expected to run Python under the Windows Subsystem for Linux (WSL) – which basically reduces to "virtual machine Ubuntu" for most users. Attempting to run Python directly from either Powershell or CMD.exe (i.e., a "command prompt") is quietly frowned upon. Superficially, most Python packages claim "Windows compatibility;" in actuality, most Python package maintainers deprioritize or even outright close Windows issues with curt responses like "lolbro, just use WSL." Nobody got scarce open-source volunteer time to fix irrational bugs in a closed-source operating system that most of us don't even particularly care about.

The .extend will be added in a commit to the v1.0.0 branch soon :)

🥳

Personally, I love parsing. If I wasn't doing @beartype, I'd do parsing. I hope you're having a blast with it, too. Let's parse!

@Moosems (Member Author) commented Jul 19, 2024

Yeah, for most of the repo's life I just decided I wouldn't support Windows at all, but when I realized it could run on Windows, I decided that I'll say it runs on Windows but I'm not maintaining with Windows in mind, and the only changes for it will come from other contributors, because I don't care to deal with the mess it makes.

@leycec commented Jul 20, 2024

Absolutely. This is the way. Myself, I'm currently banging out an honest-to-goodness Finite State Machine (FSM) efficiently inferring the narrowest collections.abc ABC satisfied by an arbitrary object. Pretty sure I've written a PhD dissertation at this point. Finite state machines: when you know you've gone too far.

Hope you're having a bang-up summer! Let us all bang on the code drum until something gives.

@Moosems (Member Author) commented Jul 20, 2024

See #73

@Moosems (Member Author) commented Aug 29, 2024

Parsing Expression Grammars (PEGs)

Almost all programming languages that matter are perfectly parsable with Parsing Expression Grammars (PEGs) – the most popular variant of parser in the modern era. PEGs produce extremely fast parsers that perfectly parse most non-context-free languages.

This includes the official Python language itself. In fact, this includes the official CPython parser baked into the official CPython interpreter. The official CPython parser is actually a PEG-based parser authored by Guido himself. The third-party parsimonious package even makes it mostly trivial to author your own PEG-based parser in pure Python.

I guess what I'm saying here is... the one thing that Salve can conceivably do better than anyone else (including pygments) is to perform PEG- rather than regex-based parsing. That would be a "killer app" for Salve – and a huge incentive for downstream editors, libraries, and plugins to adopt, embrace, and extend Salve.

PEG: when you gotta parse, parse it right. Knock the competition down a peg. 💪 😆

@leycec For the record, I will be making a PEG parsing implementation someday soon, regardless of how this discussion goes. I just have to ask one question: what exactly does the efficiency of something like a PEG parser look like? Here are the main pros and cons I see in PEG parsing compared to something like Tree Sitter:

PEG Parsing:

Pros:

  • Fairly easy to write
  • Reasonably fast from what I can tell
  • Great at parsing pretty much anything

Cons:

  • Not incremental and I can see efficiency loss there
  • Not super robust when the input doesn't fit the grammar

Tree Sitter

Pros:

  • CRAZY fast
  • Pretty robust

Cons:

  • Terrible to create grammars for
  • Hard to use

Note that I'm not an expert on either of these, so I may be completely wrong about several of these points; this is just what I've found from my many, many hours of google searching, research paper delving, and chat gippitying.
Here's my main goal with the Salve organization: create tools that make it easy to extend my code editors.
The most notable tools I want to make:

  • Good and easy syntax highlighting library
  • Good IPC library for offloading work
  • Good LSP support
  • A collection of homemade tools that can fill these in

So far I have Salve, which is my homemade tools, Collegamento for IPC (being completely rewritten atm), and Albero for Tree Sitter highlighting. That leaves LSP. The issue? I think Tree Sitter isn't the best option that could be made. Tree Sitter is kinda clunky to use, not super well explained anywhere, and hard to write grammars for. What it does excel at is speed. PEG parsers are not very well explained from what I've seen, but they're super nice to write grammars for and pretty fast. If I could combine the speed of one with the easy-to-write, easy-to-use nature of the other, that would be ideal.

TL;DR

What is the efficiency of PEG parsers? How viable are they for syntax highlighting? Could they be made faster?
