GCC preprocessor output generated in non-ASCII locales cannot be processed #72

Open
arsdragonfly opened this issue May 5, 2020 · 2 comments

Comments

@arsdragonfly

see this issue

@expipiplus1
Collaborator

Hopefully this should just be a simple change in the lexer. PRs welcome!

@mtolly

mtolly commented Jan 10, 2021

So, I looked into this and I think I found the fix, but Alex might need to release a bug fix first.

I saved the sample from the linked issue as a UTF-8 file:

# 1 "test.c"
# 1 "<built-in>"
# 1 "<命令行>"
# 31 "<命令行>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 32 "<命令行>" 2
# 1 "test.c"
int main()
{
 return 0;
}

And sure enough, I got Prelude.head: empty list. The error comes from the second usage of head at this location, and is caused by the first non-ASCII line, # 1 "<命令行>".

Basically, the problem is that Alex assumes the input bytestring is UTF-8, while the InputStream is a byte-by-byte abstraction (effectively Latin-1). In these lines:

\#$space*@digits$space*(\"($infname|@charesc)*\"$space*)?(@int$space*)*\r?$eol
  { \pos len str -> setPos (adjustLineDirective len (takeChars len str) pos) >> lexToken' False }

Alex is passing 12 for len, which is the correct Unicode code point length of # 1 "<命令行>" plus a newline at the end. But takeChars then takes 12 bytes off the bytestring, and the line is 18 bytes long in UTF-8 because each of the three CJK characters occupies 3 bytes. So adjustLineDirective receives a truncated string which does not include the double quote at the end.
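To make the mismatch concrete, here is a minimal standalone sketch of the arithmetic. It does not use Alex or language-c at all: encodeChar, byteView and the top-level names are hypothetical helpers written only for this illustration, and byteView merely imitates a byte-by-byte stream by mapping each UTF-8 byte to one Char.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Encode one code point as UTF-8 (hypothetical helper, no surrogate checks).
encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi :: Int -> Word8
    hi s = fromIntegral (n `shiftR` s)
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- One Char per UTF-8 byte, imitating a byte-oriented (Latin-1-like) stream.
byteView :: String -> String
byteView = map (chr . fromIntegral) . concatMap encodeChar

-- The failing line directive from the sample, including its newline.
directive :: String
directive = "# 1 \"<命令行>\"\n"

-- Length in code points: what a UTF-8-aware lexer reports.
codePointLen :: Int
codePointLen = length directive

-- Length in bytes: what a byte-oriented take actually needs.
byteLen :: Int
byteLen = length (byteView directive)

-- Taking the code point count from the byte view cuts the directive short.
truncated :: String
truncated = take codePointLen (byteView directive)

-- Number of double quotes that survive in a string.
quoteCount :: String -> Int
quoteCount = length . filter (== '"')

main :: IO ()
main = do
  print codePointLen                      -- 12
  print byteLen                           -- 18
  print (quoteCount truncated)            -- 1: the closing quote is gone
  print (quoteCount (byteView directive)) -- 2: both quotes with the full length
```

The 12 units taken cover only the six ASCII bytes of # 1 "< and the first two CJK characters, so the closing quote never reaches the consumer. When the length and the stream are measured in the same unit (18 bytes here), the whole directive comes through.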

The correct fix is to put Alex back into Latin-1 mode (my impression is that this was the default previously, but was then switched in Alex 3.0). This is done with the %encoding "latin1" directive (added in Alex 3.1.7). However, that alone still doesn't work, because there was a remaining bug in character counting that caused Alex to still pass the too-short length. This was fixed in haskell/alex#156, but even though that was merged a year ago, it appears not to have made it into the recent Alex 3.2.6. So, I'll ping that to see when it can be released.
