File gets parsed correctly, but the offset information indicating symbol location seems wrong #238

dmalyuta · 2021-01-29T15:38:47Z

I've been exploring using StaticLint.jl to do static error analysis on my code, but an issue that I'm running into (and I think it has to do with CSTParser) is that the file offsets that CSTParser outputs for where commands are located in the file are not exactly the beginning and end of the commands. Sometimes it seems quite arbitrary. Consider the following super simple file:

using LinearAlgebra



include("foo.jl")



include("bar.jl")

I parse the file using the following code:

using CSTParser

const root_filepath = "/tmp/julia_test/c.jl"
source = read(root_filepath, String)
cst = CSTParser.parse(source, true)

println(cst)

And this is what it outputs:

  1:62  file
  1:23   using
  1:17      1:0   OP: .
  1:17     LinearAlgebra
 24:44   call
 24:30    include
 31:38    STRING: foo.jl
 45:62   call
 45:51    include
 52:59    STRING: bar.jl

My understanding is that on the left we have information of the form <starting offset>:<ending offset> where offset is basically the character number starting from the beginning of the file, that points to the beginning and ending of the corresponding operation/string/variable/etc. So take for example the above output:

LinearAlgebra is said to reside on offsets 1:17. In fact, it is on offsets 7:19
An include command is said to reside on offsets 24:30. This is correct!
The string foo.jl is said to reside on offsets 31:38. In fact, it is on offsets 32:39

So you see - already in this simple file there seem to be three cases, each with different behaviour - a "really bad" error (LinearAlgebra), an error that is an offset of one character (foo.jl), and a correct answer (include).

Perhaps I do not fully understand how CSTParser works, in which case I'd really appreciate an explanation. At the end of the day, I would like to get to a state where StaticLint.jl return the correct file offset pointing to the error in the file. I believe that it is currently not doing so because CSTParser returns this weird-looking information.

My system info:

OS: Ubuntu 20.04.1 LTS
Julia: 1.5.3
CSTParser.jl: [00ebfdb7] CSTParser v3.1.1-DEV https://github.com/julia-vscode/CSTParser.jl#master
Editor (this shouldn't matter, I think): GNU Emacs 27.1.90

The text was updated successfully, but these errors were encountered:

pfitzseb · 2021-01-29T16:57:08Z

This is just an error with printing, basically:

julia> cst = CSTParser.parse(
       """
       using LinearAlgebra



       include("foo.jl")



       include("bar.jl")
       """, true);

julia> function print_cst(cst, offset)
         for a in cst
           if a.args === nothing
             println(offset, ":", offset+a.span, "\t", CSTParser.valof(a))
             offset += a.fullspan
           else
             offset = print_cst(a, offset)
           end
         end
         return offset
       end
print_cst (generic function with 1 method)

julia> print_cst(cst, 1);
1:6     using
7:20    LinearAlgebra
24:31   include
31:32   (
32:40   foo.jl
40:41   )
45:52   include
52:53   (
53:61   bar.jl
61:62   )

- Applied the suggestion in julia-vscode/CSTParser.jl#238

dmalyuta · 2021-01-29T20:35:10Z

Thanks! This seems to work well, however I noticed that it still bugs with unicode characters. Basically, for some reason their contribution to "span" and "fullspan" is +2 for each character instead of +1. I updated your function with the following "hack":

using CSTParser
using CSTParser: EXPR

function fix_string_span_unicode!(e::EXPR)
    val = CSTParser.valof(e)
    if typeof(val)==String && !isascii(val)
        # Hacky length fix
        for c in val
            if !isascii(c)
                e.span -= 1
                e.fullspan -= 1
            end
        end
    end
end

function print_cst(cst, offset)
    for a in cst
        if a.args === nothing
            fix_string_span_unicode!(a)
            println(offset, ":", offset+a.span, "\t", CSTParser.valof(a))
            offset += a.fullspan
        else
            offset = print_cst(a, offset)
        end
    end
    return offset
end

ZacLN · 2021-01-29T20:44:19Z

^^ all size measurements (span/fullspan) are in bytes rather than characters

dmalyuta · 2021-01-29T21:18:31Z

I see. That explains the values, but still it seems like the wrong measurement to use if you want to annotate where the error appears in a string? Unless there is a function that can convert bytes to character position (I'm pretty new to Julia, so I don't know of one).

You could modify the previous code to use:

function fix_string_span_unicode!(e::EXPR)
    val = CSTParser.valof(e)
    if typeof(val)==String
        # Hacky length fix
        diff = sizeof(val)-length(val)
        e.span -= diff
        e.fullspan -= diff
    end
end

davidanthoff · 2021-01-29T22:56:07Z

To get the character index for a given byte index for a string, do this: length(s, 1, byte_index).

Julia string handling, especially when it comes to Unicode, is good, but probably requires you to read the manual section :) It is here and quite helpful.

- Use byte to character count conversion as suggested by julia-vscode/CSTParser.jl#238 (comment) - TODO can improve not reading the file string each time

dmalyuta · 2021-01-30T00:59:34Z

Fantastic! That suggestion works, thanks a lot. Just for your information, I am building a Flycheck parser for Julia in Emacs using StaticLint.jl in this repo (where I incorporated your advice): https://github.com/dmalyuta/julia-staticlint

davidanthoff · 2021-01-30T01:23:45Z

Very nice!

pfitzseb added the bug label Jan 29, 2021

davidanthoff added this to the Backlog milestone Jan 29, 2021

dmalyuta added a commit to dmalyuta/julia-staticlint that referenced this issue Jan 29, 2021

Fix error positions returned by CSTParser

4c9a411

- Applied the suggestion in julia-vscode/CSTParser.jl#238

dmalyuta mentioned this issue Jan 29, 2021

Error position information is incorrect (due to bug in CSTParser) julia-vscode/StaticLint.jl#253

Open

davidanthoff removed this from the Backlog milestone Oct 15, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File gets parsed correctly, but the offset information indicating symbol location seems wrong #238

File gets parsed correctly, but the offset information indicating symbol location seems wrong #238

dmalyuta commented Jan 29, 2021 •

edited

Loading

pfitzseb commented Jan 29, 2021

dmalyuta commented Jan 29, 2021

ZacLN commented Jan 29, 2021

dmalyuta commented Jan 29, 2021

davidanthoff commented Jan 29, 2021

dmalyuta commented Jan 30, 2021 •

edited

Loading

davidanthoff commented Jan 30, 2021

File gets parsed correctly, but the offset information indicating symbol location seems wrong #238

File gets parsed correctly, but the offset information indicating symbol location seems wrong #238

Comments

dmalyuta commented Jan 29, 2021 • edited Loading

pfitzseb commented Jan 29, 2021

dmalyuta commented Jan 29, 2021

ZacLN commented Jan 29, 2021

dmalyuta commented Jan 29, 2021

davidanthoff commented Jan 29, 2021

dmalyuta commented Jan 30, 2021 • edited Loading

davidanthoff commented Jan 30, 2021

dmalyuta commented Jan 29, 2021 •

edited

Loading

dmalyuta commented Jan 30, 2021 •

edited

Loading