Steal pygments regexes #34
Conversation
Runs differently locally than on the GitHub runner and isn't worth the time or effort.
Ho, ho, ho. Merry Canada Day. As a loyal Canadian, it is my lawful duty to do stuff on GitHub. Sadly, I too am too tired to do stuff on GitHub. It's all Canada Day's fault. I had no choice but to bicycle 50 km of rough back-country trails with my wife today and am now busted. Thanks a lot, Canada Day! Therefore, testing Salve may never actually happen. I should do that, but I should also maintain @beartype. But I can barely maintain @beartype. Therefore, nothing else tends to happen. That said... I Code-Reviewed This Anyway! Hurrah! ...actually, I just speed-read the thing. Massive stuff here, bro. Super-intense. I'd be exhausted, too. I don't actually see any big deals – just small, annoying efficiency and compatibility issues that will annoy you. Let's begin with the biggest smell. If efficiency matters, non-trivial PEP 526-compliant variable assignment type hints like the one below come at a hidden cost. Why are these reinstantiations happening? Because here's what @beartype import hooks like beartype_this_package() do under the hood:
```python
# beartype_this_package() silently transforms this single statement...
tokens: tuple[_TokenType | Callable, ...] = [
    cell.cell_contents
    for cell in token_tuple[1].__closure__  # type: ignore
][0]

# ...into this pair of statements. I trust we see the problem.
tokens = [
    cell.cell_contents
    for cell in token_tuple[1].__closure__  # type: ignore
][0]
die_if_unbearable(tokens, tuple[_TokenType | Callable, ...])
```

Right? Each iteration of that doubly-nested loop re-instantiates the tuple[_TokenType | Callable, ...] hint and then re-checks it from scratch.
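For a rough sense of how much that repeated hint instantiation costs, here is a tiny, hedged micro-benchmark sketch – the hint below is a stand-in for the real one, and the numbers are obviously machine-dependent:

```python
# Rough micro-benchmark of hint reinstantiation (illustrative only).
from timeit import timeit
from typing import Callable

# Build the generic-alias hint 100,000 times, as a per-call hint would be.
cost = timeit("tuple[object | Callable, ...]", globals=globals(), number=100_000)
print(f"100,000 hint instantiations took {cost:.3f}s")
```

None of that work happens when the hint is built exactly once at module scope.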
...heh. It's not so bad. Really. Believe my convenient lies. Preserve speed by just globalizing all non-trivial type hints that annotate variable assignments as so-called type aliases: e.g.,

```python
_ListOfStrs = list[str]
'''
Dumb type alias to avoid inefficient reinstantiation of this type hint
each time the :func:`.get_pygments_comment_regexes` function is called.
'''


def get_pygments_comment_regexes(
    lexer: RegexLexer,
) -> list[tuple[str, _TokenType]]:
    useful_toks = {
        StringToken.Doc,
        StringToken.Heredoc,
        # StringToken.Double,
        CommentToken,
        CommentToken.Multiline,
    }
    # regexes: list[str] = []  # <-- INSTEAD OF THIS...
    regexes: _ListOfStrs = []  # <-- ...just do this.
```

Easy-peasy, huh? High five. Honestly, this is only really a problem in that doubly-nested loop. Let's see... what else. Oh. Right:
```python
from pygments.token import _TokenType
```

That's super-dangerous and guaranteed to explode all over your lap when you least expect it. Instead of directly accessing private types that will be renamed or moved in a subsequent pygments release, deduce them dynamically:

```python
# It may not look like it, but this is a lot less fragile than
# directly trying to import the private "_TokenType" type.
from pygments.lexers import PythonLexer

_TokenType = type(PythonLexer().tokens["root"][0][1])  # <-- ...heh. can't believe that works
'''
Private :class:`pygments.token._TokenType` type safely deduced from an
inscrutable code snippet some autistic dude wrote. Who even knows anymore.
'''
```
```python
_PygmentsCommentRegexes = list[tuple[str, _TokenType]]
'''
Smart type alias to avoid Don't Repeat Yourself (DRY) violations.
'''


def get_pygments_comment_regexes(
    lexer: RegexLexer,
) -> _PygmentsCommentRegexes:
    ...


def proper_docstring_tokens(lexer: RegexLexer, full_text: str) -> list[Token]:
    proper_highlight_regexes: _PygmentsCommentRegexes = (
        get_pygments_comment_regexes(lexer))
```

If you find yourself defining enough global type aliases like this, consider stuffing them all in a common private submodule. I'm spent. Let's pretend this was a worthwhile expenditure of somebody's Tuesday. Exhausted cat, I invoke thee!
@leycec I tried using the idea you gave for private types but got Union type errors for some reason I couldn't really understand. Thank you so much for your time and thoughts, and congrats on the bike ride – that's a long stretch!
Docstring section needs to be moved above the links and hidden chars
proper_docstring_tokens should be optimized to take a list of strings, which would prevent full_text.splitlines() from happening multiple times but would make finding the locations of matches harder.
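Something like this hypothetical refactor could work – the names and signature below are illustrative sketches, not the actual Salve API:

```python
# Hypothetical: the caller splits once, then passes lines everywhere,
# instead of each helper calling full_text.splitlines() itself.
def docstring_line_indices(lines: list[str]) -> list[int]:
    """Indices of lines containing a triple-quote delimiter."""
    return [i for i, line in enumerate(lines) if '"""' in line or "'''" in line]


full_text = 'def f():\n    """Docstring."""\n    return 1\n'
lines = full_text.splitlines()  # split exactly once
print(docstring_line_indices(lines))  # [1]
```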
Now it ignores whitespace at the front of the line!
Was 1 behind
* Bump version and add note to README (prepare for release tomorrow and move toward the v0.7.0 and v0.8.0 releases)
* Steal pygments regexes (#34)
* test.py
* Move to server functions and set up proper tests
* Reformat
* Reformat
* Change to beartype typing
* Even more formatting
* Remove regex stealer test (runs differently locally than on the GitHub runner and isn't worth the time or effort)
* Get docstring areas
* Make function work
* Add type annotation
* format
* Add lots of comments but don't remove private type
* Fix a small bug
* Improve highlighting functions significantly (now it ignores whitespace at the front of the line!)
* Stop using private variable
* Format for black and ruff
* Move docstring tokens up
* Update tests
* Fix line number for docstring tokens (was 1 behind)
* Reformat
I see that you already solved this, but …
Very interesting
...heh. You have invoked the Ultimate Evil, @Akuli. PEP 563 (i.e., She Whose Name Shall Not Be Uttered) is what `from __future__ import annotations` enables. PEP 649 – Deferred Evaluation Of Annotations Using Descriptors – is the future. PEP 649 has nothing to do with strings. Indeed, anyone attempting to reduce type hints to strings has already committed a grievous affront to nature. Ignore those people.
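For anyone who hasn't met the Ultimate Evil, a minimal demonstration of what PEP 563 does to annotations at runtime:

```python
# PEP 563 in action: this single import silently stringifies every
# annotation in the module.
from __future__ import annotations


def greet(name: str) -> str:
    return f"Hello, {name}!"


# Annotations are now plain strings rather than types – exactly what
# runtime type-checkers have to fight against.
print(greet.__annotations__)  # {'name': 'str', 'return': 'str'}
```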
...heh. That's awful. Runtime type-checkers like @beartype suffer under stringified type hints. The only exception to this is forward references (i.e., stringified references to user-defined classes that have yet to be defined). For the moment, everybody does still have to use stringified forward references: e.g.,

```python
def muh_func(muh_arg: 'UndefinedType') -> None:
    print(f'Why you gotta be so undefined, {muh_arg}?')


class UndefinedType(object):
    muh_attr: int = 0xFEEDFACE
```

Right? Stringified forward references are fine... for now. You should know, however, that stringified forward references are basically deprecated under Python ≥ 3.14 – so, next year. That's when PEP 649 comes online. At that point, everybody should gradually stop embedding strings in type hints. By 2030, nobody should still be embedding strings in type hints. The Old World of QA was a bad world.

But... enough about all this boring stuff! It is boring. I am yawning and my face is numb. 🥱

@Moosems, To You I Apologize

OMG. The Bernie Sanders meme! So much nostalgic good feels. Canadians and, indeed, most non-American first-worlders were rooting for that tired old guy. Instead, well... if there's One Rule of GitHub to Rule Them All, it's that one does not discuss American politics. Anywheeeeeere. Sadly, I have done nothing useful tonight other than pontificate uselessly on the perfidious evil of PEP 563 and `from __future__ import annotations`.
Heh, as all good nights do. Get some good rest!
Geh. Finally, I admit to myself I probably won't be able to code review anything else. I'm so sorry, @Moosems. You've thrown so much excruciating volunteerism at this, too. The @beartype 0.19.0 release clock is a ticking time bomb that has me sweating bullets in a shack in the woods. I do have a theoretical question:
I get it. I'd steal everything not nailed to the floor from pygments, too.

Regular Expressions Suck

Srsly. Almost all programming languages that matter – including Python itself – are not regular languages. Therefore, they cannot be parsed with regular expressions. Period. Conventional syntax highlighting frameworks like pygments are nonetheless built on regular expressions. Almost all programming languages that matter aren't even context-free languages (i.e., the next highest order of languages above regular languages). Almost all programming languages that matter are non-context-free. This means that the only means of actually parsing almost all programming languages that matter in a way that doesn't suck or copiously break down in common edge cases is with...

Parsing Expression Grammars (PEGs)

Almost all programming languages that matter are perfectly parsable with Parsing Expression Grammars (PEGs) – the most popular variant of parser in the modern era. PEGs produce extremely fast parsers that perfectly parse most non-context-free languages. This includes the official Python language itself. In fact, this includes the official CPython parser baked into the official CPython interpreter. The official CPython parser is actually a PEG-based parser authored by Guido himself. The third-party parsimonious package brings PEG parsing to pure Python. I guess what I'm saying here is... the one thing that Salve can conceivably do better than anyone else (including pygments) is parsing. PEG: when you gotta parse, parse it right. Knock the competition down a peg. 💪 😆
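To make that concrete, here is a minimal PEG sketch using the third-party parsimonious package mentioned above – the toy arithmetic grammar is purely illustrative, not anything Salve would actually ship:

```python
# A tiny PEG grammar with parsimonious; ordered choice ("/") is the
# hallmark that distinguishes PEGs from classic context-free grammars.
from parsimonious.grammar import Grammar

grammar = Grammar(
    """
    expr = term (ws op ws term)*
    term = ~"[0-9]+"
    op   = "+" / "-"
    ws   = ~"[ ]*"
    """
)

tree = grammar.parse("1 + 2 - 3")
print(tree.text)  # the full matched input: "1 + 2 - 3"
```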
@leycec By the way, just thought I should let you in on the secret: the v1.0.0 branch now has support for Tree-sitter and will soon support LSP ;)
I'm also noticing parsimonious hasn't been maintained for a while despite PRs and issues being opened. Maybe there's a more maintained fork that could be used, or I could implement my own.
Thanks for letting me know about how the situation has changed :) Still, instead of "Ultimate Evil", I see …
This is correct, but in many cases it doesn't matter for syntax highlighting, because you need to tokenize, not parse. For example, one well-known limitation of regular expressions is that they cannot be used to match parentheses. So just don't match parentheses: every "(" can be highlighted as its own token without caring about its partner. Nested f-strings are similar; they cannot be handled with regex. But if a user needs more than 2 nested f-strings, something is deeply wrong anyway :) That said, I don't know whether regex is enough to figure out whether Python's …
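A toy sketch of that tokenize-don't-parse idea – the patterns and token names here are illustrative only, not pygments' or Salve's actual rules:

```python
# Each token is matched independently, so balanced-parenthesis matching
# never comes up: "(" would simply be one more single-character token.
import re

TOKEN_RE = re.compile(
    r"""(?P<comment>\#.*)              # line comment
      | (?P<string>'[^']*'|"[^"]*")    # naive single-line string
      | (?P<name>[A-Za-z_]\w*)         # identifier
      | (?P<number>[0-9]+)             # integer literal
    """,
    re.VERBOSE,
)

for match in TOKEN_RE.finditer("print('hi')  # greet"):
    print(match.lastgroup, repr(match.group()))
# name 'print'
# string "'hi'"
# comment '# greet'
```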
...heh. Indeed, I see that you too check open PRs, issues, and most recent commit date before committing to yet another mandatory dependency. You're not wrong. On the one hand, nobody wants to maintain yet another dead-on-arrival open-source package. Nobody got time, money, or spare "fricks to give", @leycec! On the other hand, … Still, depending on how ambitious and muscle-bound you feel about this whole thing, I'd lightly consider: …
Basically, everybody regrets not making a project organization when they had the chance – and when they have the chance is when a project is young and idealistic. As soon as a project grows up, it becomes increasingly impossible to move. It gets crufty and balding, much like myself. 👨‍🦲
The way I see it, when it comes to syntax highlighting options for the users, there are three main ones available: …
This work can be mitigated (and I plan to do so) by making the Salve dependency hub (a separate repo) hold not only Tree-sitter highlight languages but also pre-made mappings for certain common languages I use and would like mappings for (e.g., Python). The idea is that Salve holds the main API and the nice logic but stays very flexible, while the Salve dependency hub offers pre-made solutions for the API's needs, making it even simpler than before to get something of good quality with next to no work.

That being said, I've considered making this its own organization a lot, and it would make a lot of sense, but I fear the dreaded CI/CD costs. Currently I'm able to get free access to GitHub Pro and it doesn't seem to be tracking my minutes, but because I need to run a variety of quality checks on code before pushing to main (as I lack the ability to get outside reviewers), I use a fair amount of CI/CD minutes per push to any given branch (generally about 8–9), which can burn through the 2,000 free minutes a free organization gets in roughly 220–250 pushes in a month.
...heh. I too fear money. Fortunately, it's free! Like public repos, public organizations pay nothing. GitHub Actions-based CI/CD is just as free for public organizations as it is for public repos. Blessings be upon Microsoft. Their generosity trickles down onto us all. You think @leycec is rolling in bills? I live in a cabin in the Canadian wilderness with three smelly cats and a beautiful scientist wife. The only bills we roll in are the Monopoly money I furiously throw about as I lose yet another late-night gaming session. curse you, park place. curse you. Of course, you receive what you get. In this case, public organizations only receive two member tiers for controlling access: ordinary "Member"s with standard access and all-powerful "Owner"s with full administrative control.
Private organizations have considerably more access-control mechanisms. Still, two member tiers are pretty much all I personally have ever needed. Just... yeah. Be cautious about that "Owner" role. Trust is a dead-end alley that mostly ends in knife fights out back behind a dumpster. 🔪 🗑️
...huh. Really? On public open-source repositories? That doesn't quite seem right. The @beartype test suite is exhaustive, exhausting, and extremely long-winded. It currently runs against Python 3.8 through 3.13 (inclusive) on Windows, macOS, and Linux. I've never hit a single issue with CI/CD minutes or resource limits or... well, anything, honestly. I don't have GitHub Pro, because I don't have money. I haven't paid a single Holy Dollah Bill to Microsoft – which I kinda feel bad about, because Microsoft has gone above and beyond the call of open-source duty here. Still, I'll start paying Microsoft money when Microsoft starts paying @beartype money. Microsoft does extensively use @beartype throughout their AI workflows, after all. Do the right thing, Satya. Grow @beartype into a disgusting behemoth of money! 🤑

If you ever do hit resource limits ...for whatevah reason, just inject this tidy one-liner into each job of your GitHub Actions workflows:

```yaml
# Time in minutes to wait on the command pipeline run below to exit
# *BEFORE* sending a non-graceful termination request (i.e., "SIGTERM"
# under POSIX-compliant systems).
timeout-minutes: 10
```

That's what all @beartype workflows do, actually. That timeout only ever triggers when infinite recursion happens, which it occasionally does – almost always in somebody else's third-party code. @beartype is blameless. I swear it!
Oh. Is that what's happening here? Huh. Didn't know that. Only CI/CD on … ? Still, can't you just run those quality checks locally? Before I push to @beartype's remote, I run those same checks locally. I wouldn't sweat that sort of thing too much, really. The managerial powers that an organization affords are way more critical to the success of a Big Project like @beartype or Salve than lower-level stuff like CI/CD minutes. You'll figure out the lower-level stuff. It will all flow like spice.
I always run everything locally anyway; it's making sure the Windows tests succeed that's the problem
I'm actually running into some trouble with how Windows handles characters, to be honest, and I'm not sure what to do about it
Yikes. Windows UTF-8 encoding hell, huh? The old demon rises again. Perhaps I can help, though. Kindly point me towards a failing unit test and I'll crack my arthritic knuckles for you.
Maybe the enumerate doesn't work for graphemes? |
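If it is a grapheme problem, here's a quick illustration of the concern: enumerate() walks code points, not graphemes, so a single on-screen character can span several indices.

```python
# "é" written as 'e' plus a combining accent is one grapheme but two
# code points; enumerate() sees each code point separately.
text = "caf" + "e\u0301"  # renders as "café"
for index, char in enumerate(text):
    print(index, ascii(char))
# 0 'c'
# 1 'a'
# 2 'f'
# 3 'e'
# 4 '\u0301'
```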
The hardest part is that I can't effectively debug this without a second machine
...hmm. Personally, I'd never go that far. To paraphrase the once-great, now-zombified Meat Loaf: I'd do anything for open-source, but I won't do that.
Installing Windows is a bridge too far. I won't install Windows even under a virtual machine. Heck, I won't even install FreeBSD under a virtual machine! It's work. You know? Nobody's gainfully employing you to do this. So, nobody's paid enough money for you to torture yourself. Open-source is supposed to be fun. When open-source ceases being fun, I stop doing open-source.
Thanks so much for that. Would you also happen to have a failing test I can peruse? I've perused your most recent GitHub Actions runs, but everything looks passing. Clearly, looks are deceiving. Incidentally, I'd avoid `+=` with lists here:

```python
# Instead of this suspiciously smelly code...
new_tokens += get_special_tokens(full_text, split_text, text_range[0])

# Consider this less suspicious code that hardly smells at all.
new_tokens.extend(get_special_tokens(full_text, split_text, text_range[0]))
```

Monocle face approves. 🧐
The tests succeed because, in the test_IPC file, I just added an if statement that changes the expected output to match what Windows actually gives versus what it ideally should give
The .extend will be added in a commit to the v1.0.0 branch soon :)
Gotcha. That approach is sound. What you're doing is both totally valid and exactly what you have to do when dealing with POSIX-noncompliant platforms like Windows. Every project in every language ever inevitably confronts irregularities with Windows and responds in a similar way: …
Windows is why I still cry at night. It helps that nobody in the Python world cares about Windows. Windows Python users are generally expected to run Python under the Windows Subsystem for Linux (WSL) – which basically reduces to "virtual-machine Ubuntu" for most users. Attempting to run Python directly from either PowerShell or cmd.exe …
🥳 Personally, I love parsing. If I wasn't doing @beartype, I'd do parsing. I hope you're having a blast with it, too. Let's parse!
Yeah, for most of the repo's life I just decided I wouldn't support Windows at all, but when I realized it could run on Windows, I decided I'd say it runs on Windows while not maintaining with Windows in mind – the only changes for it will come from other contributors, because I don't care to deal with the mess it makes
Absolutely. This is the way. Myself, I'm currently banging out an honest-to-goodness Finite State Machine (FSM) efficiently inferring the narrowest type hint for arbitrary objects. Hope you're having a bang-up summer! Let us all bang on the code drum until something gives.
See #73
@leycec For the record, I will be making a PEG parsing implementation someday soon regardless of how this discussion goes. I just have to ask one question: what exactly does the efficiency of something like a PEG parser look like? Here are the main pros and cons I see in PEG parsing compared to Tree-sitter:

PEG Parsing

Pros: …

Cons: …

Tree-sitter

Pros: …

Cons: …

Note that I'm not an expert on either of these, so I may be completely wrong about several of these points; this is just what I've found from my many, many hours of Google searching, research-paper delving, and chat-gippitying.

So far I have Salve, which is my homemade tooling; Collegamento for IPC (being completely rewritten atm); and Albero for Tree-sitter highlighting. That leaves LSP. The issue? I don't think Tree-sitter is the best option that could be made. Tree-sitter is kinda clunky to use, not super well explained anywhere, and hard to write grammars for. What it does excel at is speed. PEG parsers are not very well explained from what I've seen, but they're super nice to write grammars for and pretty fast. If I could connect the speed and the easy-to-write-and-use natures of the two, that would be ideal.

TL;DR

What is the efficiency of PEG parsers? How viable are they for syntax highlighting? Could they be made faster?
Fixes #33
Not sure how reliable this is. @leycec If you choose to try using Salve, would you be able to test this for me and help me find cases where it fails, and whether it's slow in your experience? I will at some point probably choose to optimize the proper_docstring_tokens function, as it's currently just trying to make it work at all. I'm sure there are hundreds of better ways than how I did this; I'm just too tired to find them.