Allow for context-sensitive parsing #158

rtoal · 2016-09-20T17:35:32Z

It would be helpful to support state during parsing to enable such things as the off-side rule. This seems to be already called out as a planned extension. From the MSA paper:

While we have found Ohm’s expressive power to be sufficient for most tasks, several users have requested better support for parsing context-sensitive languages like HTML and Python. We plan to investigate whether a solution like parsing contexts (as implemented in PetitParser) could be adapted to Ohm.

MuhammedZakir · 2021-08-08T15:39:41Z

Apologiies if what I am asking for isn't releated to this issue.

Is it possible to get something similiar to regex capture group? Examples (syntax can be anything):

Using @rule for capturing, and $rule for using a captured rule.

html_elem = "<" @tag ">" anythingExceptClosing<$tag> "</" $tag ">"

fenced_code_block = @f_c_marker lineWithout<$f_c_marker> $f_c_marker

In the above, the scope of a captured rule need to considered. I am not that familiar with parsing, so I am not sure if I would be of any help regarding this. However, if you need further clarification, or any input, feel free to ask it. I will do what I can. :-)

Edit: One more thing needs to considered: To use same rule for multiple captures in same scope, for the above syntax, captured index can be mentioned.

Reason:

Like I mentioned in #215 [1], I am trying to write a grammar for Markdown. I am stuck at writing a rule for fenced code blocks. From CommonMark spec (above example-124 [2]):

The closing code fence must be at least as long as the opening fence

Without capture group/node, If I am right, the only way I can think of to solve this is to use a global state. That will make this more complicated, as I will have to save source string of some parsed non-code-blocks, and re-parse some code blocks. I haven't implemented this, so I am not sure how feasible this is. If I am doing this in a wrong way, or over-complicating this, please do say so! :-)

[1] #215 (comment)
[2] https://spec.commonmark.org/0.30/#example-124

MuhammedZakir · 2021-08-09T17:46:07Z

I think the above will also work nicely with indentation-sensitive languages.

Simple example:

class = @class_indent class_rule
func = @func_indent func_rule ($func_indent func_rule)*
statement = @stmt_indent statement_rule ($stmt_indent statatement_rule)*

class_indent = indent
// optional indent because both function and statement
// could also be in main scope.
func_indent = $class_indent? indent
stmt_indent = $func_indent? indent

indent = space+ | tab+

For this to work, a capture's scope should be such that the a rule should be able to access captures in it's parent scope/rules - like functions in most programming languages, which can access variables in its' outer scope. Therefore, in the above grammar, class_rule should contain func for func rule to access @class_indent capture, and func_rule should contain statement for statement to access @func_indent capture.

A random thought. Hope this makes sense! :-)

pdubroy · 2021-08-10T08:18:31Z

@MuhammedZakir Thanks for contributing your thoughts here! That's an interesting suggestion, and maybe something like this could work for Ohm. I still need some time to think through all of the details, but this has inspired me to see what I could find in the literature for solving this problem. Here's what I found after a quick search:

Indentation Sensitive Languages by Brunauer and Mühlbacher. They propose a concept called counter binding.
Indentation-Sensitive Parsing for Parsec by Adams and Agacan proposes an extension to PEGs to support indentation-sensitive languages. About Brunauer and Mühlbacher's approach, they write:

While encoding indentation sensitivity this way is formally precise, it comes at a cost. The YAML specification (Ben-Kiki et al. 2009) uses the approach proposed by Brunauer and Mühlbacher (2006) and as a result has about a dozen and a half different nonterminals for various sorts of whitespace and comments. With this encoding, the grammar cannot use a separate tokenizer and must be scannerless, each possible occurrence of whitespace must be explicit in the grammar, and the grammar must carefully track which non-terminals produce or expect what sorts of whitespace. The authors of the YAML grammar establish naming conventions for nonterminals that help manage this, but the result is still a grammar that is difficult to comprehend and even more difficult to modify.

I'd like to take the time to read this papers and think through the implications for Ohm. Maybe you're interested to read these as well!

MuhammedZakir · 2021-08-28T16:35:41Z

A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing

Paper: https://dl.acm.org/doi/10.1145/3136014.3136025
Talk: https://www.youtube.com/watch?v=4QrZxfgZFDg
Implementation: http://nez-peg.github.io/
Example: https://github.com/nez-peg/nez-grammar/blob/master/xml-sym.nez

Edit: add link to paper.

All symol operations in SPEG:

Symbol Table Mutation
- <symbol A>: captures and binds A
- <reset A>: remove all A entries
Symbol Table Scoping
- <block e>: narrowing scope of all symbols in e
Symbol Table Matching (no state mutation)
- <match A>: matches with A
- <exists A>: tests whether A exists
- <is A>: tests whether the capture is A
- <isa A>: tests whether the capture is one of A's entries

--

Edit:

Might be helpful (haven't read this):

Is stateful packrat parsing really linear in practice? a counter-example, an improved grammar, and its parsing algorithms

https://dl.acm.org/doi/10.1145/3377555.3377898

MuhammedZakir · 2022-01-30T16:00:35Z

Any updates/thoughts? 👀

pdubroy · 2022-02-02T14:24:23Z

@MuhammedZakir Unfortunately I haven't had the time to investigate this deeply yet. But, the SPEG approach seems nice and I could imagine it fitting well into Ohm.

I do have some more time in the coming weeks/months so this may be something that I can dig into soon.

MuhammedZakir · 2022-02-24T14:39:54Z

Glad to hear that! :-)

FYI: Pest supports context-sensitive parsing: https://docs.rs/pest_derive/latest/pest_derive/#push-pop-drop-and-peek.

haikyuu · 2022-04-12T05:05:55Z

I'm trying to extend the grammar of Ohm to include some indentation specific operators as specified in the paper Indentation-sensitive parsing for Parsec also linked above.

This will change the following part of Ohm grammar:

-    Seq = Iter*
-
+    Seq = IterWithIndentation*
+
+  IndentationRel = Eq | Ge | Gt | Any
+  
+  Eq = "@="
+  Ge = "@>="
+  Gt = "@>"
+  Any = "@*"
+ 
+  IterWithIndentation
+    = Iter Ge -- indent_ge
+    | Iter Gt -- indent_gt
+    | Iter Eq -- indent_eq
+    | Iter Any -- indent_any
+    | Iter

With these changes, we can write the following grammar for (one of the productions of) while statement in python:

While_stm = "while"@> Test ":"@> Suite

Suite
  = newline@> Block@>
  | Stm_list newline@>

Block = Statement@=*

We can now write

while a < 100:
    a = a + 1
    a = a + 4
while a < 10: a=a+1

The operator ":"@>'s intent is to say that after ":", code is meant to be at a strictly greater indentation in order to be part of the Rule. That's why the second while is detected as a separate statement.
Block = Statement@=*'s intent is to say that the block is a sequence of statements that all have their first token start at the same column
X@>= is an extension of the @> that allows being at the same level of indentation
Y@* allows for tokens to be at any level of indentation. It would be used to specify the following code

b = """e
eee
1
"""

In order to be able to handle this logic, Ohm would need to make two properties available at each parse:

The allowed indentation set: e.g{2, 3 ...}. This covers @>, @>= and @*
A boolean flag for wether the current parse is inside a @= or not

Each parse needs these properties and produces a new set of properties for the next parse.
One thing to note is that terminals default to have @> in order to avoid making it cumbersome to write the grammar. That would be configurable but it's a sane default for an indentation sensitive language.

Side note

The suggested operators are two characters long, but I believe they read well @> :at a greater indentation and we could make the @ light gray in syntax highlighting to make it stand less or maybe even have it displayed like the paper using the mathematical form if the cursor isn't on it. But that's not time for these considerations yet.
Stm*^>

MuhammedZakir · 2022-04-12T17:27:54Z

Wouldn't a general method/operator such as the one I mentioned above [1] solves this?

[1] #158 (comment)

pdubroy · 2022-04-13T15:35:59Z

@haikyuu Thanks for taking a go at this! This is definitely an interesting experiment.

I've been thinking a bit in the past few weeks about how we can handle indentation-sensitive language, and other context-sensitive languages. I'll try to share some more substantial thoughts in the next day or two. For now, I can share my initial thoughts/impressions:

I find the indentation annotations — both in the paper and your example — to be fairly hard to read. What do you think? Do you have an ideas for a syntax that might be more readable? To me this is a pretty important factor.
All things being equal, I'd prefer to add a feature that could be used for other context-sensitive languages, not just indentation-sensitive languages. For example, the paper A Symbol-Based Extension of Parsing Expression Grammars and Context-Sensitive Packrat Parsing describes a more general feature that would be useful for things like parsing matching XML tags.

haikyuu · 2022-04-14T04:38:50Z

@pdubroy I have watched the SPEG video and I find it very powerful and readable syntax.

It is definitely more expressive. I'll read the paper and try to use it to assess its usability and readability.
It seems to be easier to implement as well as it requires adding one shared data structure that only the added operators access and mutate.

pdubroy · 2022-04-14T11:44:34Z

@haikyuu Sounds great! Btw, if I were to experiment with this in Ohm, I'd initially try to do this without changing any syntax at all. I'd create a "dummy" grammar with empty rules. Once it's instantiated in JS, I'd replace the rule bodies with a custom subclass of PExpr. Something like this:

const g = ohm.grammar(`
  SPEG {
    symbol<a> =
    is<a> =
    match<a> =
    ...
  }
`)
g.rules.symbol.body = new SpegSymbol();
g.rules.is.body = new SpegIs();
g.rules.match.body = new SpegMatch();

Then of course you'd have to implement the SpegSymbol, SpegIs, SpegMatch classes. The existing CaseInsensitiveTerminal class is a good example of a self-contained PExpr subclass.

Maybe that's helpful in case you or anyone else ends up experimenting with this.

todwang · 2022-06-22T11:16:52Z

Any progress？

haikyuu · 2022-06-22T11:44:13Z

No, I didn't make any progress. Feel free to jump on it if you will 🙏

pdubroy added the enhancement label Aug 27, 2020

DavidSouther mentioned this issue Jul 17, 2022

Implement Throw and Label operations #387

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow for context-sensitive parsing #158

Allow for context-sensitive parsing #158

rtoal commented Sep 20, 2016 •

edited

Loading

MuhammedZakir commented Aug 8, 2021 •

edited

Loading

MuhammedZakir commented Aug 9, 2021

pdubroy commented Aug 10, 2021

MuhammedZakir commented Aug 28, 2021 •

edited

Loading

MuhammedZakir commented Jan 30, 2022

pdubroy commented Feb 2, 2022

MuhammedZakir commented Feb 24, 2022

haikyuu commented Apr 12, 2022 •

edited

Loading

MuhammedZakir commented Apr 12, 2022

pdubroy commented Apr 13, 2022

haikyuu commented Apr 14, 2022

pdubroy commented Apr 14, 2022 •

edited

Loading

todwang commented Jun 22, 2022

haikyuu commented Jun 22, 2022

Allow for context-sensitive parsing #158

Allow for context-sensitive parsing #158

Comments

rtoal commented Sep 20, 2016 • edited Loading

MuhammedZakir commented Aug 8, 2021 • edited Loading

MuhammedZakir commented Aug 9, 2021

pdubroy commented Aug 10, 2021

MuhammedZakir commented Aug 28, 2021 • edited Loading

MuhammedZakir commented Jan 30, 2022

pdubroy commented Feb 2, 2022

MuhammedZakir commented Feb 24, 2022

haikyuu commented Apr 12, 2022 • edited Loading

Side note

MuhammedZakir commented Apr 12, 2022

pdubroy commented Apr 13, 2022

haikyuu commented Apr 14, 2022

pdubroy commented Apr 14, 2022 • edited Loading

todwang commented Jun 22, 2022

haikyuu commented Jun 22, 2022

rtoal commented Sep 20, 2016 •

edited

Loading

MuhammedZakir commented Aug 8, 2021 •

edited

Loading

MuhammedZakir commented Aug 28, 2021 •

edited

Loading

haikyuu commented Apr 12, 2022 •

edited

Loading

pdubroy commented Apr 14, 2022 •

edited

Loading