How to lex XML comments with regex #421
Comments
Hello @romamik! I think your issue is generally related to the fact that Logos doesn't like "greedy patterns". I.e., you should usually use negation instead to find the end of a comment. However, you could probably also use callbacks and bumps. Here is some pseudocode:

fn callback<'source>(lex: &mut Lexer<'source, Token<'source>>) -> Result<&'source str, MyError> {
    let slice = lex.slice(); // e.g., '<!---'
    let pattern = format!("{}>", &slice[2..]); // e.g., '--->'
    lex.remainder()
        .find(&pattern)
        .map(|index| {
            // Consume everything up to and including the closing pattern.
            lex.bump(index + pattern.len());
            lex.slice()
        })
        .ok_or(MyError::UnterminatedComment)
}

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\f]+")]
enum Token<'source> {
    #[regex("<!-{2,}", callback)]
    Comment(&'source str),
    ...
}
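To make the effect of the mirrored closing pattern concrete, here is a minimal std-only sketch of the same search, without Logos. The function name `closing_offset` is hypothetical; its `opener` and `remainder` parameters stand in for `lex.slice()` and `lex.remainder()`, and the returned value is the amount that would be passed to `lex.bump`.

```rust
/// Hypothetical helper (not part of Logos): given the matched opener
/// (e.g. "<!---") and the text that follows it, return how many bytes
/// must be consumed so that the closing delimiter is included.
fn closing_offset(opener: &str, remainder: &str) -> Option<usize> {
    // Assumption: the opener was matched by `<!-{2,}`, so what follows
    // "<!" is the run of hyphens.
    let dashes = opener.strip_prefix("<!")?;
    // Reuse the opener's hyphens: "<!---" gives the closer "--->".
    let pattern = format!("{}>", dashes);
    remainder.find(&pattern).map(|index| index + pattern.len())
}
```

One consequence of mirroring the opener is that a comment opened with `<!---` is only closed by `--->` or a longer run of hyphens, so `<!--- text -->` would be reported as unterminated.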
I wish I knew about |
I am happy it solved your issue :-) @romamik, can you include a minimal code snippet with your solution? That way, other people coming across this issue will know how you managed it :D
use logos::{Lexer, Logos};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment(&'src str),
}

fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Result<&'src str, ()> {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").ok_or(())?;
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Ok(lex.slice())
}

fn main() {
    let mut lex = Token::lexer(
        r#"
<!--- this is a comment, it contains - -- --- ---- and -> and it is ok -->
<!--- this is a comment and it has more than two hypens in the end --->
<!-- <!-- nested comment --> -->
"#,
    );
    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
    dbg!(lex.next());
}

Console output:
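The debug output is not reproduced here. As a rough illustration of what the scan consumes, the following std-only sketch mirrors the loop in `skip_comment` on a plain `&str`. The function name `nested_comment_len` is hypothetical and Logos is not involved; the argument is assumed to be the text right after the initial `<!--`.

```rust
/// Hypothetical std-only mirror of the `skip_comment` loop.
/// `after_open` is the input right after the first "<!--".
/// Returns the number of bytes up to and including the "-->" that
/// closes the outermost comment, or None if it is never closed.
fn nested_comment_len(after_open: &str) -> Option<usize> {
    let mut depth = 1usize; // the opener that was already matched
    let mut pos = 0usize; // plays the role of the lexer position
    loop {
        let rem = &after_open[pos..];
        let close = rem.find("-->")?;
        // An opener before the next closer starts a nested comment.
        if let Some(open) = rem[..close].find("<!--") {
            depth += 1;
            pos += open + "<!--".len();
            continue;
        }
        pos += close + "-->".len();
        depth -= 1;
        if depth == 0 {
            return Some(pos);
        }
    }
}
```

Because any terminator of the form `--->` or `---->` ends in `-->`, searching for the fixed string `-->` also covers comments that close with extra hyphens. For example, `nested_comment_len(" a --->tail")` consumes 7 bytes, leaving `tail`.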
Thanks @romamik! Can you include the terminal output to show the debug prints?
I've edited the message to include debug prints.
Thank you! Closing now :-)
I just discovered that it is possible to skip comments in the lexer instead of ignoring them in code later. It is stated in the documentation, but I missed it somehow.

use logos::{Lexer, Logos, Skip};

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"\s*")] // Ignore this regex pattern between tokens
enum Token<'src> {
    #[token("<!--", |lex| skip_comment(lex))]
    XmlComment,
    #[regex("<[a-zA-Z0-9-]+", |lex| &lex.slice()[1..])]
    TagOpen(&'src str),
    #[token("/>")]
    TagClose,
}

// Consumes the whole (possibly nested) comment and tells the lexer to skip it.
fn skip_comment<'src>(lex: &mut Lexer<'src, Token<'src>>) -> Skip {
    let mut open_count = 1;
    loop {
        let rem = lex.remainder();
        let close_pos = rem.find("-->").expect("unterminated comment");
        let open_pos = rem[..close_pos].find("<!--");
        if let Some(open_pos) = open_pos {
            open_count += 1;
            lex.bump(open_pos + 4);
            continue;
        }
        lex.bump(close_pos + 3);
        open_count -= 1;
        if open_count == 0 {
            break;
        }
    }
    Skip
}

fn main() {
    let mut lex = Token::lexer(
        r#"
<!--- this is a comment -->
<tag-name
<!--- this is a comment inside tag (that's weird, but why not?) --->
/>
"#,
    );
    assert_eq!(lex.next(), Some(Ok(Token::TagOpen("tag-name"))));
    assert_eq!(lex.next(), Some(Ok(Token::TagClose)));
    assert_eq!(lex.next(), None);
}
I tried to create an XML comment regex. It works correctly with JS, for example, but fails with logos.
There are two problems, and I feel they are connected.
1. I had to use --[^->]|-{3,}[^->] instead of just |-{2,}[^->], as it did not work this way.
2. The regex has -{2,}> in the end, but it only accepts --> as the end, not ---> or ---->.
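For comparison, the intended terminator (two or more hyphens followed by `>`) can be sketched as a small hand-written scan that needs no regex. This is only an illustration under that reading of the issue; the function name `comment_end` is hypothetical, Logos is not used, and nested comments are not handled.

```rust
/// Hypothetical std-only scan: `after_open` is the text right after "<!--".
/// Returns the byte offset just past the first '>' that is directly
/// preceded by at least two hyphens, or None if there is none.
fn comment_end(after_open: &str) -> Option<usize> {
    let mut dashes = 0usize; // length of the current run of '-'
    for (i, byte) in after_open.bytes().enumerate() {
        match byte {
            b'-' => dashes += 1,
            // '>' only ends the comment after a run of two or more hyphens.
            b'>' if dashes >= 2 => return Some(i + 1),
            // Any other byte (including '>' after "-") breaks the run.
            _ => dashes = 0,
        }
    }
    None
}
```

Scanning bytes is safe here because `-` and `>` are ASCII, so the returned offset always falls on a character boundary.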