How to lex XML comments with regex #421

romamik · 2024-09-13T05:55:31Z

I tried to write a regex that matches XML comments. It works correctly in JS, for example, but fails with Logos.
There are two problems, and I suspect they are connected.

I had to write --[^->]|-{3,}[^->] instead of just -{2,}[^->], because the shorter form did not work.
The pattern ends with -{2,}>, but it only accepts --> as the end of a comment, not ---> or ---->.

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token {
    /*
       <!-{2,} - match start of the comment with any number of hyphens
       (
           [^-] - match anything that is not a hyphen
           |-[^-] - single hyphen
           |--[^->] - two hyphens
           |-{3,}[^->] - three or more hyphens
       )* - match anything that is not a comment end
       -{2,}> - match end of the comment with any number of hyphens (this does not work)
    */
    #[regex(r"<!-{2,}([^-]|-[^-]|--[^->]|-{3,}[^->])*-{2,}>")]
    XmlComment,
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->
			<!--- this is a comment and it failes as it has more than two hypens in the end --->
		"#,
    );

    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
}


jeertmans · 2024-09-13T07:57:55Z

Hello @romamik! I think your issue is related to the fact that Logos doesn't handle "greedy" patterns well. That is, you should usually use negation to find the end of a comment instead. However, you could probably also use callbacks and bump. Here is a rough sketch:

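use logos::{Lexer, Logos};

// MyError was not spelled out in the original sketch; this is an assumed minimal definition.
#[derive(Debug, Default, Clone, PartialEq)]
enum MyError {
    #[default]
    Other,
    UnterminatedComment,
}

fn callback<'source>(lex: &mut Lexer<'source, Token<'source>>) -> Result<&'source str, MyError> {
    let slice = lex.slice(); // e.g., "<!----"
    // Drop the leading "<!" and append '>' to get the matching terminator.
    let pattern = format!("{}>", &slice[2..]); // e.g., "---->"
    let index = lex
        .remainder()
        .find(&pattern)
        .ok_or(MyError::UnterminatedComment)?;
    // Extend the current token up to and including the terminator.
    lex.bump(index + pattern.len());
    Ok(lex.slice())
}

#[derive(Logos, Debug, PartialEq)]
#[logos(error = MyError)]
#[logos(skip r"[ \t\n\f]+")]
enum Token<'source> {
    #[regex("<!-{2,}", callback)]
    Comment(&'source str),
    // ... other tokens
}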
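The search step in that callback can be tried out without Logos. Below is a minimal sketch using only the standard library; comment_len is a made-up helper name, and it assumes the opener is "<!" followed by two or more hyphens, with the terminator built from the same run of hyphens plus '>'.

```rust
/// Returns the byte length of a comment at the start of `input`.
/// Assumption: the terminator repeats the opening run of hyphens, then '>'.
fn comment_len(input: &str) -> Option<usize> {
    let rest = input.strip_prefix("<!")?;
    // Count the opening run of hyphens (ASCII, so bytes and chars agree here).
    let hyphens = rest.bytes().take_while(|&b| b == b'-').count();
    if hyphens < 2 {
        return None;
    }
    let opener_len = 2 + hyphens;
    // Build the terminator, e.g. "--->" for the opener "<!---".
    let terminator = format!("{}>", "-".repeat(hyphens));
    // Search only the text after the opener, like searching the remainder.
    let index = input[opener_len..].find(&terminator)?;
    Some(opener_len + index + terminator.len())
}

fn main() {
    let input = "<!--- note ---> tail";
    match comment_len(input) {
        Some(len) => println!("comment: {}", &input[..len]),
        None => println!("unterminated comment"),
    }
}
```

One consequence of this reading: the closing run has to be at least as long as the opening one, so "<!--- note -->" comes back as None from this helper.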

romamik · 2024-09-15T07:59:29Z

I wish I knew about bump earlier. It is useful not only in this context, but in other situations as well.

jeertmans · 2024-09-15T09:07:59Z

I am happy it solved your issue :-)

@romamik, can you include a minimal code snippet with your solution? So other people coming across this issue will know how you managed it :D

romamik · 2024-09-15T09:13:56Z

use logos::{Lexer, Logos};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment(&'src str),
}

fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Result<&'src str, ()> {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").ok_or(())?;
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Ok(lex.slice())
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->
			<!--- this is a comment and it has more than two hypens in the end --->
			<!-- <!-- nested comment --> -->
		"#,
    );

    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
}

Console output:

[src/main.rs:39:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->",
        ),
    ),
)
[src/main.rs:40:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!--- this is a comment and it has more than two hypens in the end --->",
        ),
    ),
)
[src/main.rs:41:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!-- <!-- nested comment --> -->",
        ),
    ),
)
[src/main.rs:42:5] lex.next() = None
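
To see why both the ---> ending and the nested case come out right, the loop in skip_comment can be restated as a plain function over the text that follows the opening <!--. This is only a sketch using the standard library; nested_comment_len is a made-up name and not part of Logos.

```rust
/// Given the text right after an opening "<!--", returns how many bytes
/// belong to the comment, honouring nested "<!--" ... "-->" pairs.
/// Returns None if the comment is never closed.
fn nested_comment_len(mut rest: &str) -> Option<usize> {
    let mut consumed = 0;
    let mut depth = 1;
    while depth > 0 {
        // Position of the next closer; None means the comment is unterminated.
        let close = rest.find("-->")?;
        // An opener before that closer starts a nested comment.
        let step = match rest[..close].find("<!--") {
            Some(open) => {
                depth += 1;
                open + 4 // skip past "<!--"
            }
            None => {
                depth -= 1;
                close + 3 // skip past "-->"
            }
        };
        consumed += step;
        rest = &rest[step..];
    }
    Some(consumed)
}

fn main() {
    // "--->" contains "-->" starting at offset 1, so the extra hyphen
    // is simply consumed as part of the comment body.
    println!("{:?}", "--->".find("-->"));
    println!("{:?}", nested_comment_len(" <!-- nested comment --> --> x"));
}
```

The returned length plays the role of the total passed to lex.bump across the loop iterations.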

jeertmans · 2024-09-15T09:21:25Z

Thanks @romamik! Can you include the terminal output to show the debug prints?

romamik · 2024-09-15T09:23:19Z

I've edited the message to include debug prints.

jeertmans · 2024-09-15T09:35:22Z

Thank you! Closing now :-)

romamik · 2024-10-24T08:11:29Z

I just discovered that it is possible to skip comments in the lexer itself instead of ignoring them in code later. It is stated in the documentation, but I missed it somehow.

use logos::{Lexer, Logos, Skip};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment,

    #[regex("<[a-zA-Z0-9-]+", |lex| &lex.slice()[1..])]
    TagOpen(&'src str),

    #[token("/>")]
    TagClose,
}

fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Skip {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").expect("unterminated comment");
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Skip
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment -->
        <tag-name
			<!--- this is a comment inside tag (that's weird, but why not?) --->
        />
		"#,
    );

    assert_eq!(lex.next(), Some(Ok(Token::TagOpen("tag-name"))));
    assert_eq!(lex.next(), Some(Ok(Token::TagClose)));
    assert_eq!(lex.next(), None);
}
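
One caveat with the version above: expect("unterminated comment") panics on malformed input. As a sketch of the same skipping logic that reports the problem instead, here is a standard-library-only function; strip_comments is a made-up name, and the error value (the byte offset of the unclosed opener) is just an assumption for illustration. I believe Logos also offers a FilterResult return type for callbacks that can either skip or emit an error, but that is not shown here.

```rust
/// Removes (possibly nested) "<!-- ... -->" comments from `input`.
/// Returns Err with the byte offset of the opener if a comment never closes.
fn strip_comments(input: &str) -> Result<String, usize> {
    let mut out = String::new();
    let mut pos = 0;
    while let Some(start) = input[pos..].find("<!--") {
        let open_at = pos + start;
        // Keep everything before the comment.
        out.push_str(&input[pos..open_at]);
        let mut cursor = open_at + 4;
        let mut depth = 1;
        while depth > 0 {
            let rest = &input[cursor..];
            // Report the error to the caller instead of panicking.
            let close = rest.find("-->").ok_or(open_at)?;
            match rest[..close].find("<!--") {
                Some(open) => {
                    depth += 1;
                    cursor += open + 4;
                }
                None => {
                    depth -= 1;
                    cursor += close + 3;
                }
            }
        }
        pos = cursor;
    }
    // Keep whatever follows the last comment.
    out.push_str(&input[pos..]);
    Ok(out)
}

fn main() {
    println!("{:?}", strip_comments("<tag <!--- c ---> />"));
    println!("{:?}", strip_comments("a <!-- b"));
}
```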

jeertmans added the "question" label (Further information is requested) Sep 15, 2024

jeertmans changed the title ~~xml comment regex~~ How to lex XML comments with regex Sep 15, 2024

jeertmans closed this as completed Sep 15, 2024

This was referenced Sep 26, 2024

Help needed: matching Python-like multiline strings #330

Closed

Runtime stack overflow when lexing certain strings #424

Open
