One of tree-sitter's notable features is its ability to "recover from errors". Anecdotal evidence suggests that this works well in many cases.
Unfortunately, according to the tree-sitter creator, for the other cases, there does not appear to be much / anything one can do at the moment:
Yeah, there is intentionally not much that you as a grammar author can do to manipulate the error recovery.
though he has mentioned that:
I do think there is a clear path to fixing these things within Tree-sitter, and I do plan to work on this eventually, but it'll take some significant focused time, so it may be a while. [...elided...] I wish I had more time to fix this right now.
One of the explanations of error-recovery given by the tree-sitter creator is:
The basic idea is that the parser accepts any string. If the string does not match the grammar, the parser corrects it to match the grammar. Currently, there are two types of corrections:
- Skip some number of tokens or subtrees
- Insert one missing token
Obviously, for a given syntax error, there are multiple combinations of corrections that would work. The parser handles this decision similarly to how ambiguities are handled by the classic GLR algorithm: the parse stack forks into multiple branches, and tries a different possibility on each branch.
We maintain two quantities associated with each branch of the parse stack:
- error_cost - an integer measure of the number of subtrees skipped, the total size of the subtrees skipped, and the number of missing tokens inserted.
- node_count - the number of valid syntax tree nodes that have been added, since the last error.
There is some heuristic logic for deciding when to discard branches of the parse stack based on their error cost and their node count. When the node count is small, it means that we've just recently encountered an error, so we want to allow a fairly large number of different branches to "compete", to see which one will turn out the best. When the node count is large, it means we have progressed far past the error, so we should allow very few different branches (eventually just one), in order to restore good parsing performance.
At one point, he gave a sketch of an approach to providing some "customization":
Yeah, I'd still like to add support for generating error messages in Tree-sitter. I think it would "fit" in fine to the current library; it'd probably be a function that takes an ERROR node and returns some structured information about the location where the error was first detected, and which state the parser was in when the error was detected. From there, you could call another function to iterate over all of the valid symbols in that parse state.
Similar content, but perhaps nice for fans or those looking for tiny hints? (^^;
- Error recovery (35:15) in talk at GitHub Universe 2017
- Error recovery (31:33) in talk at Strange Loop 2018
- Error recovery (15:35) in talk at FOSDEDM 2018
- Error Detection and Recovery in LR Parsers - via archive.org - (2012?) - listed at tree-sitter.github.io
- Error Recovery for LR Parsers - (1977) - listed at tree-sitter.github.io
- Chapter 3 "Error recovery" - Lukas Diekmann's PhD Thesis "Editing composed languages" - (2019) - may contain bits that didn't influence tree-sitter
- #26 Overhaul error recovery
- #84 Avoid hangs during error recovery
- #98 Tokens that match the empty string can result in infinite loops during error recovery
- #101 Use a simpler approach to error recovery
- #119 Add the ability to recover from errors by inserting missing tokens
- #139 Sublime Syntax compatibility
- #155 Fix some error recovery problems - explanation of some details including pausing and selection of stack to process
- #168 Don't reuse nodes within ambiguities
- #224 Error recovery strategy
- #255 Helpful parser error messages
- #288 parser in the infinite loop
- #568 Unsatisfactory error recovery?
- #921 Generate "incomplete" nodes in the presence of syntax errors?
- #1205 Is there any way to give hints to the error recovery process?
- #1259 tree-sitter calls external scanner with all symbols marked valid
- #1617 Error Recovery at end of file stuck in infinite loop
- #1631 Unable to use tree sitter as parser for compiler - possible future extension
- #1635 Add 'reserved word' construct: improve error recovery by avoiding treating reserved words as other tokens
- #1645 Incorrect incremental parse result found in Python test corpus
- #1783 Allow empty external tokens during error recovery, if they change the scanner's state
- #1870 How does one improve the error recovery of a grammar?
- #1993 Adding keyword extraction breaks MISSING node