Allow for context-sensitive parsing #158
Comments
Apologies if what I am asking for isn't related to this issue. Is it possible to get something similar to a regex capture group? Examples (syntax can be anything): Using
In the above, the scope of a captured rule needs to be considered. I am not that familiar with parsing, so I am not sure if I would be of any help regarding this. However, if you need further clarification, or any input, feel free to ask. I will do what I can. :-) Edit: One more thing needs to be considered: to use the same rule for multiple captures in the same scope, a capture index can be specified with the above syntax. Reason: As I mentioned in #215 [1], I am trying to write a grammar for Markdown. I am stuck at writing a rule for fenced code blocks. From the CommonMark spec (above example-124 [2]):
Without a capture group/node, if I am right, the only way I can think of to solve this is to use global state. That will make things more complicated, as I will have to save the source string of some parsed non-code blocks and re-parse some code blocks. I haven't implemented this, so I am not sure how feasible it is. If I am doing this the wrong way, or over-complicating it, please do say so! :-) [1] #215 (comment) |
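To make the fenced code block problem concrete, here is a minimal Python sketch (not Ohm code) of why the closing fence depends on what the opening fence matched. The function name `split_fenced_blocks` and the simplified fence rules (same character, at least as long, up to three leading spaces) are assumptions made for illustration; the full CommonMark rules have more cases.

```python
import re

# Opening fence: up to 3 spaces, then 3+ backticks or tildes, then an info string.
FENCE_OPEN = re.compile(r"^( {0,3})(`{3,}|~{3,})(.*)$")


def split_fenced_blocks(text):
    """Return a list of (info_string, code) tuples for fenced blocks in text."""
    blocks = []
    lines = text.split("\n")
    i = 0
    while i < len(lines):
        opened = FENCE_OPEN.match(lines[i])
        if not opened:
            i += 1
            continue
        fence = opened.group(2)  # the "captured" opening fence
        info = opened.group(3).strip()
        # The closing fence is built from the capture: same character,
        # at least as many repetitions as the opening fence.
        closing = re.compile(
            r"^ {0,3}" + re.escape(fence[0]) + "{" + str(len(fence)) + r",}\s*$"
        )
        body = []
        i += 1
        while i < len(lines) and not closing.match(lines[i]):
            body.append(lines[i])
            i += 1
        blocks.append((info, "\n".join(body)))
        i += 1  # skip the closing fence (or step past EOF if unclosed)
    return blocks
```

The point of the sketch is that `closing` cannot be written as a fixed pattern: it has to be derived from `fence`, which is exactly what a capture group or parser state would provide in a grammar.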
I think the above will also work nicely with indentation-sensitive languages. Simple example:
For this to work, a capture's scope should be such that a rule can access captures in its parent scope/rules, like functions in most programming languages, which can access variables in their outer scope. Therefore, in the above grammar, A random thought. Hope this makes sense! :-) |
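The scoping idea can be sketched outside of any grammar formalism. In the following Python sketch, each block "captures" its indentation from its first line, and a nested block sees the capture of its enclosing block through the `parent_indent` argument. The names `indent_of` and `parse_block` are hypothetical, and the sketch assumes space-only indentation and no blank lines.

```python
def indent_of(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def parse_block(lines, pos=0, parent_indent=-1):
    """Parse lines[pos:] into a list of (text, children) nodes.

    Returns (nodes, next_pos). A block only starts if its first line is
    indented deeper than parent_indent, the capture of the outer scope.
    """
    nodes = []
    if pos >= len(lines) or indent_of(lines[pos]) <= parent_indent:
        return nodes, pos
    indent = indent_of(lines[pos])  # captured once for this scope
    while pos < len(lines):
        current = indent_of(lines[pos])
        if current < indent:
            break  # dedent: this block ends and the outer scope resumes
        if current > indent:
            raise ValueError(f"inconsistent indentation on line {pos + 1}")
        text = lines[pos].strip()
        # The nested call can read this scope's capture via its argument.
        children, pos = parse_block(lines, pos + 1, indent)
        nodes.append((text, children))
    return nodes, pos
```

Passing the capture down as an argument is one way to model "a rule can access captures in its parent scope"; a grammar-level feature would presumably hide this plumbing.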
@MuhammedZakir Thanks for contributing your thoughts here! That's an interesting suggestion, and maybe something like this could work for Ohm. I still need some time to think through all of the details, but this has inspired me to see what I could find in the literature for solving this problem. Here's what I found after a quick search:
I'd like to take the time to read these papers and think through the implications for Ohm. Maybe you're interested in reading these as well! |
A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing Paper: https://dl.acm.org/doi/10.1145/3136014.3136025
All symbol operations in SPEG:
-- Edit: Might be helpful (haven't read this): Is stateful packrat parsing really linear in practice? a counter-example, an improved grammar, and its parsing algorithms |
Any updates/thoughts? 👀 |
@MuhammedZakir Unfortunately I haven't had the time to investigate this deeply yet. But, the SPEG approach seems nice and I could imagine it fitting well into Ohm. I do have some more time in the coming weeks/months so this may be something that I can dig into soon. |
Glad to hear that! :-) FYI: Pest supports context-sensitive parsing: https://docs.rs/pest_derive/latest/pest_derive/#push-pop-drop-and-peek. |
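For readers unfamiliar with the stack-based approach, here is a small Python sketch loosely modeled on the idea behind Pest's PUSH, PEEK, POP and DROP: matched text is pushed onto a stack and later matched again. This is not Pest's implementation; the class `StackMatcher` and the function `match_raw_string` are hypothetical names, and the example input format (Rust-style raw strings such as `r##"..."##`) is chosen only for illustration.

```python
class StackMatcher:
    """A tiny matcher that carries a stack of previously matched strings."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.stack = []

    def literal(self, s):
        """Match the exact string s at the current position."""
        if self.text.startswith(s, self.pos):
            self.pos += len(s)
            return True
        return False

    def push_run(self, ch):
        """Match zero or more ch and push the matched text onto the stack."""
        start = self.pos
        while self.pos < len(self.text) and self.text[self.pos] == ch:
            self.pos += 1
        self.stack.append(self.text[start:self.pos])
        return True

    def peek(self):
        """Match the string on top of the stack without removing it."""
        return self.literal(self.stack[-1])

    def drop(self):
        """Remove the top of the stack without matching anything."""
        self.stack.pop()


def match_raw_string(text):
    """Return the body of a raw string like r##"..."##, or None."""
    m = StackMatcher(text)
    if not m.literal("r"):
        return None
    m.push_run("#")  # remember how many hashes opened the string
    if not m.literal('"'):
        return None
    start = m.pos
    while m.pos < len(text):
        mark = m.pos
        # The string ends at a quote followed by the same run of hashes.
        if m.literal('"') and m.peek():
            m.drop()
            return text[start:mark]
        m.pos = mark + 1  # not the end yet: advance one character
    return None
```

The stack plays the same role as the capture group suggested earlier in the thread: the closing delimiter is whatever the opening delimiter matched.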
I'm trying to extend the grammar of Ohm to include some indentation-specific operators, as specified in the paper Indentation-sensitive parsing for Parsec, also linked above. This will change the following part of the Ohm grammar:
- Seq = Iter*
-
+ Seq = IterWithIndentation*
+
+ IndentationRel = Eq | Ge | Gt | Any
+
+ Eq = "@="
+ Ge = "@>="
+ Gt = "@>"
+ Any = "@*"
+
+ IterWithIndentation
+ = Iter Ge -- indent_ge
+ | Iter Gt -- indent_gt
+ | Iter Eq -- indent_eq
+ | Iter Any -- indent_any
+ | Iter
With these changes, we can write the following grammar for (one of the productions of) the while statement in Python:
We can now write:
while a < 100:
a = a + 1
a = a + 4
while a < 10: a=a+1
b = """e
eee
1
"""
In order to be able to handle this logic, Ohm would need to make two properties available at each parse:
Each parse needs these properties and produces a new set of properties for the next parse. Side note: The suggested operators are two to three characters long, but I believe they read well |
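As a rough illustration of the four proposed relations, the Python sketch below treats each operator as a comparison between the column of an expression and a reference column. The exact semantics in the proposal (and in the Parsec paper) may differ; the names `RELATIONS`, `check_indentation`, `column` and `body_is_well_indented` are hypothetical.

```python
import operator

# Each proposed operator maps to a comparison of (child_column, reference_column).
RELATIONS = {
    "@=": operator.eq,
    "@>=": operator.ge,
    "@>": operator.gt,
    "@*": lambda child, reference: True,  # any indentation is accepted
}


def check_indentation(relation, child_column, reference_column):
    """True if child_column satisfies the relation against reference_column."""
    return RELATIONS[relation](child_column, reference_column)


def column(line):
    """Number of leading spaces on a line."""
    return len(line) - len(line.lstrip(" "))


def body_is_well_indented(header, body):
    """Check a block: body deeper than the header, body lines aligned."""
    if not body:
        return False
    first = column(body[0])
    return check_indentation("@>", first, column(header)) and all(
        check_indentation("@=", column(line), first) for line in body
    )
```

Under these assumptions, the body of a `while` statement would be checked with `@>` against the header and `@=` among its own lines, while the interior of a multi-line string would use `@*`.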
Wouldn't a general method/operator such as the one I mentioned above [1] solve this? [1] #158 (comment) |
@haikyuu Thanks for taking a go at this! This is definitely an interesting experiment. I've been thinking a bit in the past few weeks about how we can handle indentation-sensitive languages and other context-sensitive languages. I'll try to share some more substantial thoughts in the next day or two. For now, I can share my initial thoughts/impressions:
|
@pdubroy I have watched the SPEG video and I find its syntax very powerful and readable.
|
@haikyuu Sounds great! Btw, if I were to experiment with this in Ohm, I'd initially try to do this without changing any syntax at all. I'd create a "dummy" grammar with empty rules. Once it's instantiated in JS, I'd replace the rule bodies with a custom subclass of PExpr. Something like this:
Then of course you'd have to implement the Maybe that's helpful in case you or anyone else ends up experimenting with this. |
Any progress? |
No, I didn't make any progress. Feel free to jump on it if you'd like 🙏 |
It would be helpful to support state during parsing to enable such things as the off-side rule. This seems to be already called out as a planned extension. From the MSA paper: