How to lex XML comments with regex #421

Closed
romamik opened this issue Sep 13, 2024 · 8 comments
Labels
question Further information is requested

Comments

@romamik
Contributor

romamik commented Sep 13, 2024

I tried to create an XML comment regex. It works correctly with JS, for example, but fails with logos.
There are two problems, and I feel they are connected.

  1. I had to write --[^->]|-{3,}[^->] instead of just -{2,}[^->], because the shorter form did not work.
  2. The pattern ends with -{2,}>, but it only accepts --> as the terminator, not ---> or ---->.
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token {
    /*
       <!-{2,} - match start of the comment with any number of hyphens
       (
           [^-] - match anything that is not a hyphen
           |-[^-] - single hyphen
           |--[^->] - two hyphens
           |-{3,}[^->] - three or more hyphens
       )* - match anything that is not a comment end
       -{2,}> - match end of the comment with any number of hyphens (this does not work)
    */
    #[regex(r"<!-{2,}([^-]|-[^-]|--[^->]|-{3,}[^->])*-{2,}>")]
    XmlComment,
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->
			<!--- this is a comment and it failes as it has more than two hypens in the end --->
		"#,
    );

    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
}
@jeertmans
Collaborator

Hello @romamik! I think your issue is generally related to the fact that Logos doesn't like "greedy patterns". I.e., you should usually use negation instead to find the end of a comment. However, you could probably also use callbacks and bumps. Here is some pseudocode:

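use logos::{Lexer, Logos};

// Custom error type; Logos requires it to implement these traits.
#[derive(Debug, Default, Clone, PartialEq)]
enum MyError {
    UnterminatedComment,
    #[default]
    Other,
}

fn callback<'source>(lex: &mut Lexer<'source, Token<'source>>) -> Result<&'source str, MyError> {
    let slice = lex.slice(); // e.g., "<!----"
    // Closing pattern with the same number of hyphens as the opener, e.g., "---->"
    let pattern = format!("{}>", &slice[2..]);
    let index = lex
        .remainder()
        .find(&pattern)
        .ok_or(MyError::UnterminatedComment)?;
    lex.bump(index + pattern.len());
    Ok(lex.slice())
}

#[derive(Logos, Debug, PartialEq)]
#[logos(error = MyError)]
#[logos(skip r"[ \t\n\f]+")]
enum Token<'source> {
    #[regex("<!-{2,}", callback)]
    Comment(&'source str),
    // ...
}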
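
For readers who want to try the idea without pulling in Logos, here is a minimal std-only sketch of the same search: read the opener, build the closer from it, and look for that closer in the rest of the input. The helper name `matching_comment_len` is hypothetical and not part of Logos. Note that this approach assumes the comment closes with the same number of hyphens it opened with, so `<!--- ... -->` is treated as unterminated.

```rust
/// Hypothetical helper (not a Logos API): returns the byte length of the
/// comment at the start of `input`, or `None` if `input` does not start
/// with an opener or no matching closer is found.
fn matching_comment_len(input: &str) -> Option<usize> {
    // The opener is "<!" followed by at least two hyphens.
    let after_bang = input.strip_prefix("<!")?;
    let hyphens = after_bang.bytes().take_while(|&b| b == b'-').count();
    if hyphens < 2 {
        return None;
    }
    let opener_len = 2 + hyphens;

    // Build the closer from the opener: the same run of hyphens plus '>'.
    let closer = format!("{}>", "-".repeat(hyphens));

    // Search only the text after the opener, as `lex.remainder()` would.
    let index = input[opener_len..].find(&closer)?;
    Some(opener_len + index + closer.len())
}

fn main() {
    let text = "<!--- note ---> tail";
    match matching_comment_len(text) {
        Some(len) => println!("comment: {}", &text[..len]),
        None => println!("unterminated comment"),
    }
}
```

In a Logos callback, the returned length minus the opener would be the amount passed to `lex.bump`.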

@romamik
Contributor Author

romamik commented Sep 15, 2024

I wish I knew about bump earlier. It is useful not only in this context, but in other situations as well.

@jeertmans
Collaborator

I am happy it solved your issue :-)

@romamik, can you include a minimal code snippet with your solution? So other people coming across this issue will know how you managed it :D

@jeertmans jeertmans added the question Further information is requested label Sep 15, 2024
@jeertmans jeertmans changed the title xml comment regex How to lex XML comments with regex Sep 15, 2024
@romamik
Contributor Author

romamik commented Sep 15, 2024

use logos::{Lexer, Logos};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment(&'src str),
}

fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Result<&'src str, ()> {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").ok_or(())?;
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Ok(lex.slice())
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->
			<!--- this is a comment and it has more than two hypens in the end --->
			<!-- <!-- nested comment --> -->
		"#,
    );

    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
}

Console output:

[src/main.rs:39:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->",
        ),
    ),
)
[src/main.rs:40:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!--- this is a comment and it has more than two hypens in the end --->",
        ),
    ),
)
[src/main.rs:41:5] lex.next() = Some(
    Ok(
        XmlComment(
            "<!-- <!-- nested comment --> -->",
        ),
    ),
)
[src/main.rs:42:5] lex.next() = None
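
The depth-counting loop in `skip_comment` can also be checked on its own, without Logos. Below is a minimal std-only sketch, assuming the input is the text right after an opening `<!--` (what `lex.remainder()` would return inside the callback). The name `nested_comment_len` is hypothetical; it returns how many bytes the callback would bump in total. Searching for `-->` is enough to accept `--->` and `---->` as well, since each of them ends with `-->`.

```rust
/// Hypothetical helper (not a Logos API): given the text right after an
/// opening "<!--", returns how many bytes to consume to reach the end of
/// the matching "-->", or `None` if the comment is unterminated.
fn nested_comment_len(rem: &str) -> Option<usize> {
    let mut depth = 1usize; // one comment is already open
    let mut consumed = 0usize;
    while depth > 0 {
        let rest = &rem[consumed..];
        // The next closer; bail out if there is none left.
        let close = rest.find("-->")?;
        match rest[..close].find("<!--") {
            // A nested opener comes first: go one level deeper.
            Some(open) => {
                depth += 1;
                consumed += open + 4;
            }
            // No opener before the closer: close the current level.
            None => {
                depth -= 1;
                consumed += close + 3;
            }
        }
    }
    Some(consumed)
}

fn main() {
    let rem = " <!-- nested comment --> --> tail";
    match nested_comment_len(rem) {
        Some(len) => println!("comment body: {:?}", &rem[..len]),
        None => println!("unterminated comment"),
    }
}
```

Returning `None` instead of panicking leaves it to the caller to decide how an unterminated comment should be reported.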

@jeertmans
Collaborator

Thanks @romamik! Can you include the terminal output to show the debug prints?

@romamik
Contributor Author

romamik commented Sep 15, 2024

I've edited the message to include debug prints.

@jeertmans
Collaborator

Thank you! Closing now :-)

@romamik
Contributor Author

romamik commented Oct 24, 2024

I just discovered that it is possible to skip comments in the lexer itself instead of ignoring them later in code. It is stated in the documentation, but I missed it somehow.

use logos::{Lexer, Logos, Skip};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment,

    #[regex("<[a-zA-Z0-9-]+", |lex| &lex.slice()[1..])]
    TagOpen(&'src str),

    #[token("/>")]
    TagClose,
}

fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Skip {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").expect("unterminated comment");
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Skip
}

fn main() {
    let mut lex = Token::lexer(
        r#"   
			<!--- this is a comment -->
        <tag-name
			<!--- this is a comment inside tag (that's weird, but why not?) --->
        />
		"#,
    );

    assert_eq!(lex.next(), Some(Ok(Token::TagOpen("tag-name"))));
    assert_eq!(lex.next(), Some(Ok(Token::TagClose)));
    assert_eq!(lex.next(), None);
}
