Various parsing issues #802
Comments
Handling of some symbols
In clj I get the following results:
The lexer I tested recognized as 'junk':
I expected recognition as
Handling of some number literals
The following are some things that are parsed as
For reference, parcera currently parses numbers like this: https://github.com/carocad/parcera/blob/83cd988e69116b67c620c099f78b693ac5e37233/src/Clojure.g4#L46
tree-sitter-clojure takes a mostly similar approach: https://github.com/sogaiu/tree-sitter-clojure/blob/9df53ae75475e5bdbeb21cd297b8e3160f3b6ed8/grammar.js#L21-L65
Handling of some character literals
Some character literals appear to be split apart and recognized as
I expected a single
Handling of some symbolic values
Symbolic values expressed in a certain form (e.g. space between
Note that:
I expected results like:
This would be similar to what I currently get for:
I don't understand how Calva works well enough to have an opinion about what the result should be, but it seems that
There is this case too:
I currently get:
So maybe there is a case for the first two being something like
Handling of character literal + comment sequence
In
in the local lexer:
It seems that
Similar results were seen for
Regarding:
I think it can be simplified as
(My emphasis.) This non-numeric start rule was, for some reason, not implemented in Calva's Clojure syntax. Adding that constraint immediately made Calva parse
So, it is not general, as in:
clj::user=> (def false#_foo :false#_foo)
#'user/false#_foo
clj::user=> false#_foo
:false#_foo
(Smiling at the irony that the GitHub Clojure syntax disagrees 😄 )
Anyway, super good find! I'll try to find some more time to spend on this awesome list.
One reason to express the discard example as
BTW, Calva was not alone -- there was some discussion concerning this here: carocad/parcera#86
That's very considerate of you. 😄 In which cases did it hang the REPL? I think I managed to fix the
In
Note that there is no prompt printed after the
One can get out of this situation by typing, say, a
I'm not sure I understand why
Maybe I misunderstood, or this is part of Calva being more generous in what it accepts?
This is because Calva treats many quoting, splicing, and similar characters as prefixes to symbols (this is one reason why we call them
Thank you for the explanation. I was able to update my idea of the intent of
Handling of whitespace
I'm not sure if the following will work for copy-paste, but FWIW:
There is a character between
In Python one can enter it like:
Ah, I guess in JS one may enter it in a similar way, so that might be:
What this demonstrates is what Clojure considers whitespace: https://github.com/clojure/clojure/blob/833c924239a818ff1a2563ae88af6dc266b35a61/src/jvm/clojure/lang/LispReader.java#L131
So it's either a comma or what java.lang.Character's
I've summarized my understanding of what counts here: https://github.com/sogaiu/tree-sitter-clojure/blob/f8006afc91296b0cdb09bfa04e08a6b3347e5962/grammar.js#L6-L32
For U+1680, U+2000, U+2001, U+2002, U+2003, U+2004, U+2005, U+2006, U+2008, U+2009, U+200A, U+205F, U+3000, I get
For U+2028 and U+2029, I get an exception / error "Unexpected character" -- maybe JS(?) cannot handle those. My testing was with
For U+001C, U+001D, U+001E, U+001F, I get
For U+000B and
For U+0020 (space),
For an upstream reference there is: https://docs.oracle.com/javase/7/docs/api/java/lang/Character.html#isWhitespace(char)
I looked at that (or possibly a version for another JDK) and then went looking for Unicode docs to figure out what the bits meant. For Unicode info, I looked at:
This is super! Should be pretty easy to throw all that into my
Yes, famous last words. But I figured it out. 😄 Extra good that you found that there are a lot of Unicode characters not matched by
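For readers following the whitespace discussion, here is a minimal Python sketch of the rule as described in this thread: a character counts as whitespace if it is a comma or if java.lang.Character.isWhitespace would accept it. The function names are hypothetical, and the Java behaviour is approximated from the linked Javadoc rather than taken from Calva or Clojure source, so treat it as an illustration only.

```python
import unicodedata

# Assumption: these sets follow the Javadoc description of
# java.lang.Character.isWhitespace; they are not taken from Clojure or Calva.
NON_BREAKING_SPACES = {"\u00a0", "\u2007", "\u202f"}
CONTROL_WHITESPACE = set("\t\n\x0b\x0c\r\x1c\x1d\x1e\x1f")


def is_java_whitespace(ch: str) -> bool:
    """Approximate java.lang.Character.isWhitespace for a single character."""
    if ch in CONTROL_WHITESPACE:
        return True
    # Space, line, and paragraph separators, minus the non-breaking spaces.
    is_separator = unicodedata.category(ch) in {"Zs", "Zl", "Zp"}
    return is_separator and ch not in NON_BREAKING_SPACES


def is_clojure_whitespace(ch: str) -> bool:
    """Hypothetical helper: comma, or whatever the Java check accepts."""
    return ch == "," or is_java_whitespace(ch)
```

A lexer could use a check like this instead of a regex `\s` class, since `\s` in JS and the Java definition do not cover exactly the same code points (for example, the Java definition excludes U+00A0 and includes U+001C through U+001F).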
Reader (symbolic value?) and comment
For cd2ffed, I get:
Looks like the reader has "absorbed" a semicolon.
N-numeric literal split
With 943a4b1 I get:
Thanks! Looking at this I found another one: Calva treats something like
Good catch! I think the generators here will put a leading zero for hex and octal, but not for double or ratio. Perhaps I should add that :) IIUC, multiple leading zeros work (only for double and ratio). (FWIW, my limited testing and understanding suggest that radix numbers can't start with zero.) Looks like if one does
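The leading-zero observations above can be illustrated with a small Python classifier. This is only a sketch under stated assumptions: the patterns and the `classify` name are hypothetical, are loosely modelled on the behaviour described in this comment, and are not copied from Calva, parcera, tree-sitter-clojure, or the Clojure reader.

```python
import re

# Hypothetical, simplified patterns for illustration only.
PATTERNS = [
    ("hex", re.compile(r"[-+]?0[xX][0-9A-Fa-f]+N?")),
    # Radix prefix must start with 1-9, so a leading zero is rejected.
    ("radix", re.compile(r"[-+]?[1-9][0-9]?[rR][0-9A-Za-z]+")),
    # Ratios tolerate leading zeros in the numerator.
    ("ratio", re.compile(r"[-+]?[0-9]+/[0-9]+")),
    ("octal", re.compile(r"[-+]?0[0-7]+N?")),
    ("integer", re.compile(r"[-+]?(?:0|[1-9][0-9]*)N?")),
    # Doubles tolerate leading zeros; a '.' or an exponent is required here.
    ("double", re.compile(
        r"[-+]?[0-9]+(?:\.[0-9]*(?:[eE][-+]?[0-9]+)?|[eE][-+]?[0-9]+)M?")),
]


def classify(token: str):
    """Return the first kind whose pattern matches the whole token, else None."""
    for kind, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return kind
    return None
```

With these assumed patterns, `classify("00.5")` gives `"double"`, `classify("001/2")` gives `"ratio"`, and `classify("036rZ")` gives `None`, which mirrors the three claims in the comment.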
Maybe these have all been addressed? I'll close this for now :)
TBH, I don't know if all has been addressed. But as I recall things, I addressed most of it. It was super valuable to Calva. Belated thanks!
As discussed in #calva, we'll use this issue to collect some parsing issues that are discovered for the time being.
Handling of number before discard / ignore marker
In clj I get the following results:
So it appears that in each case, 1 (or +1) is being recognized as a number, and then starting from #_, there is a discard expression that extends to include the 2.
I tried a similar sequence for Calva's clojure-lexer and found:
It looks like in these cases #_ is being seen as part of what comes immediately before.
Note that:
seems correct, as in clj, one gets:
So far I think for numbers and delimiters of collections, one doesn't need to put a space before #_ for there to be appropriate recognition of a following discard expression.
However, for characters, symbols, keywords, and symbolic values (e.g. ##NaN), not having a space makes a difference in what ends up being recognized.
It may be obvious, but this analysis may not be complete.
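To make the distinction concrete, here is a toy Python tokenizer sketch. It assumes an input such as `1#_2` (the original examples are not reproduced here), and every pattern and name in it is hypothetical; it is not Calva's clojure-lexer and it covers only integers, the discard marker, and simple symbols.

```python
import re

# Hypothetical, heavily simplified token patterns for illustration only.
NUMBER = re.compile(r"[+-]?\d+")
DISCARD = re.compile(r"#_")
# A symbol starts with a non-numeric character; '#' may appear later in it.
SYMBOL = re.compile(r"[A-Za-z*+!\-_?<>=./][\w*+!\-?<>=./#':]*")

PATTERNS = [("number", NUMBER), ("discard", DISCARD), ("symbol", SYMBOL)]


def lex(text: str):
    """Return a list of (kind, text) tokens; unmatched characters become 'junk'."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace() or text[pos] == ",":
            pos += 1
            continue
        for kind, pattern in PATTERNS:
            match = pattern.match(text, pos)
            if match:
                tokens.append((kind, match.group()))
                pos = match.end()
                break
        else:
            tokens.append(("junk", text[pos]))
            pos += 1
    return tokens
```

Under these assumptions, `lex("1#_2")` yields a number, a discard marker, and another number, while `lex("false#_foo")` yields a single symbol, which matches the behaviour reported for clj in this thread. The key design point is that the number pattern stops before `#`, whereas the symbol pattern is allowed to continue through it.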