Exception: Encountered <C_ESC_STRING>, Expected <STRING> #136

Closed
ThomasChr opened this issue Sep 19, 2022 · 9 comments · Fixed by #139

@ThomasChr
Contributor

ThomasChr commented Sep 19, 2022

I've got a mail which throws the following error:

2022-09-19T12:07:32,092 ERROR (AutoMBox.java:43): Exception: net.sourceforge.MSGViewer.rtfparser.ParseException: Encountered " <C_ESC_STRING> "\\\'b7 "" at line 608, column 160.
Was expecting:
    <STRING> ...

net.sourceforge.MSGViewer.rtfparser.ParseException: Encountered " <C_ESC_STRING> "\\\'b7 "" at line 608, column 160.
Was expecting:
    <STRING> ...

        at net.sourceforge.MSGViewer.rtfparser.RTFParser.generateParseException(RTFParser.java:394) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.jj_consume_token(RTFParser.java:332) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.unicode_char(RTFParser.java:210) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:172) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.parse(RTFParser.java:34) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.HtmlFromRtf.extractHtml(HtmlFromRtf.java:49) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.HtmlFromRtf.<init>(HtmlFromRtf.java:13) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerHelper.extractHTMLFromRTF(ViewerHelper.java:88) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.lambda$bodyText$1(ViewerPanel.java:343) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.resultOrElse(AutoMBox.java:41) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.bodyText(ViewerPanel.java:344) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.updateBody(ViewerPanel.java:324) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.doParse(ViewerPanel.java:430) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.lambda$new$0(AutoMBox.java:18) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.resultOrElse(AutoMBox.java:41) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.run(AutoMBox.java:59) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.parse(ViewerPanel.java:420) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.view(ViewerPanel.java:106) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.SingleWin.openFile(SingleWin.java:33) ~[msgviewer.jar:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.invokeGui(ModuleLauncher.java:108) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.invoke(ModuleLauncher.java:64) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.main(ModuleLauncher.java:27) ~[msgviewer.jar:?]

I can't provide the mail because it's private. I just wanted to open the issue so that we can track it; maybe I'll find the cause myself.
Any ideas off the top of your head?

@ThomasChr
Contributor Author

ThomasChr commented Sep 19, 2022

The error comes from this character: https://www.codetable.net/decimal/8729
Here is one to test:

This is the RTF I can see in the MsgViewer:

{\rtf1\ansi\ansicpg1251\fromhtml1 \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset204 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss "Arial";}}
{\colortbl\red0\green0\blue0;\red5\green99\blue193;}
\uc1\pard\plain\deftab360 \f0\fs24 
{\*\htmltag2 \par }
{\*\htmltag18 <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">}
{\*\htmltag34 <head>}
{\*\htmltag161 <meta name=Generator content="Microsoft Word 15 (filtered medium)">}
{\*\htmltag241 <style>}
{\*\htmltag241 <!--\par /* Font Definitions */\par @font-face\par \tab \{font-family:"Cambria Math";\par \tab panose-1:2 4 5 3 5 4 6 3 2 4;\}\par @font-face\par \tab \{font-family:Calibri;\par \tab panose-1:2 15 5 2 2 2 4 3 2 4;\}\par /* Style Definitions */\par p.MsoNormal, li.MsoNormal, div.MsoNormal\par \tab \{margin:0cm;\par \tab margin-bottom:.0001pt;\par \tab font-size:11.0pt;\par \tab font-family:"Calibri",sans-serif;\par \tab mso-fareast-language:EN-US;\}\par span.E-MailFormatvorlage17\par \tab \{mso-style-type:personal-compose;\par \tab font-family:"Arial",sans-serif;\par \tab font-variant:normal !important;\par \tab color:#0D0D0D;\par \tab text-transform:none;\par \tab font-weight:normal;\par \tab font-style:normal;\par \tab text-decoration:none none;\par \tab vertical-align:baseline;\}\par .MsoChpDefault\par \tab \{mso-style-type:export-only;\par \tab font-family:"Calibri",sans-serif;\par \tab mso-fareast-language:EN-US;\}\par @page WordSection1\par \tab \{size:612.0pt 792.0pt;\par \tab margin:70.85pt 70.85pt 2.0cm 70.85pt;\}\par div.WordSection1\par \tab \{page:WordSection1;\}\par -->}
{\*\htmltag249 </style>}
{\*\htmltag241 <!--[if gte mso 9]><xml>\par <o:shapedefaults v:ext="edit" spidmax="1026" />\par </xml><![endif]-->}
{\*\htmltag241 <!--[if gte mso 9]><xml>\par <o:shapelayout v:ext="edit">\par <o:idmap v:ext="edit" data="1" />\par </o:shapelayout></xml><![endif]-->}
{\*\htmltag41 </head>}
{\*\htmltag50 <body lang=DE link="#0563C1" vlink="#954F72">}\htmlrtf \lang1031 \htmlrtf0 
{\*\htmltag96 <div class=WordSection1>}\htmlrtf {\htmlrtf0 
{\*\htmltag64 <p class=MsoNormal>}\htmlrtf {\htmlrtf0 
{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0 
{\*\htmltag148 <span style='font-family:"Arial",sans-serif;color:#0D0D0D'>}\htmlrtf {\f4 \htmlrtf0 
{\*\htmltag244 <o:p>}
{\*\htmltag252 </o:p>}
{\*\htmltag156 </span>}\htmlrtf }\htmlrtf0 \htmlrtf\par}\htmlrtf0
\htmlrtf \par
\htmlrtf0 
{\*\htmltag72 </p>}
{\*\htmltag104 </div>}\htmlrtf }\htmlrtf0 
{\*\htmltag58 </body>}
{\*\htmltag27 </html>}}

The problem should be somewhere around those three lines with the 8729 Bullet Operator:

{\*\htmltag64 <p class=MsoNormal>}\htmlrtf {\htmlrtf0 
{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0 
{\*\htmltag148 <span style='font-family:"Arial",sans-serif;color:#0D0D0D'>}\htmlrtf {\f4 \htmlrtf0 

I'm trying to get closer to the culprit, but my RTF knowledge only goes so far...

@ThomasChr
Contributor Author

ThomasChr commented Sep 19, 2022

Changing RTFParser.jj like this seems to do the trick:

Index: MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj b/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj
--- a/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj	(revision a177966d6fac70c7364a71679a551a038e680053)
+++ b/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj	(date 1663595334221)
@@ -131,7 +131,6 @@
 }
 {
     code = <C_UNICODE>
-    <STRING>
     {
         current_group.addUnicodeChar( code.image );
     }

(I just removed the <STRING> from the unicode_char() production.)

This changes the generated file RTFParser.java from this:

  final public void unicode_char() throws ParseException {Token code;
    code = jj_consume_token(C_UNICODE);
    jj_consume_token(STRING);
current_group.addUnicodeChar( code.image );
}

to this:

  final public void unicode_char() throws ParseException {Token code;
    code = jj_consume_token(C_UNICODE);
current_group.addUnicodeChar( code.image );
}

I think the jj_consume_token(STRING) isn't needed in this case - but I may be way off here!

Outcome: No exception and Unicode char is shown, but I'm not quite sure that I'm absolutely correct here.

@ThomasChr
Contributor Author

Okay, the error is gone, but it seems that the char has been duplicated now:
[image: screenshot of the duplicated character]

So that doesn't seem to be the perfect fix.

@ThomasChr
Contributor Author

Some more info from here: https://www.zopatista.com/python/2012/06/06/rtf-and-unicode/

{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0 

So this sequence is actually the same character twice:
once in the form \uUNICODE_CODEPOINT and once in the form \'WIN1252_CODEPOINT

I'm not quite sure why this is the case...
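
For context: in RTF, a `\uN` control word is followed by a fallback representation for readers that do not understand Unicode, and `\ucN` (here `\uc1`, set near the top of the sample) says how many fallback bytes follow. A Unicode-aware reader is supposed to emit the Unicode character and skip the fallback. The following is only a minimal sketch of that idea on raw text, assuming a one-byte fallback written as `\'hh`; the class and method names are made up for illustration and are not part of MsgViewer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnicodeFallbackSketch {
    // Matches the RTF Unicode control word (backslash, 'u', signed decimal),
    // an optional delimiter space, and exactly one \'hh fallback byte.
    private static final Pattern UNICODE_WITH_FALLBACK =
            Pattern.compile("\\\\u(-?\\d+) ?\\\\'[0-9a-fA-F]{2}");

    // Replaces each "Unicode control word + hex fallback" pair with the
    // single character it stands for, so the character is not emitted twice.
    static String decode(String rtf) {
        Matcher m = UNICODE_WITH_FALLBACK.matcher(rtf);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int value = Integer.parseInt(m.group(1));
            // Keep the low 16 bits so negative values map back into 0..65535.
            char c = (char) (value & 0xFFFF);
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // The fragment from the sample mail, surrounded by plain text.
        System.out.println(decode("a\\u8729\\'95b"));
    }
}
```

With this approach the fallback byte is dropped rather than rendered, which is why the character should appear only once.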

@ThomasChr
Contributor Author

Okay, so at the moment I have two different cases for the parser:

Number 1:

\u8729\'b7

This is Unicode 8729 (bullet operator) followed by Windows-1252 B7 (middle dot); both stand for the same bullet-like character.
So this should be parsed as
(src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj):
C_UNICODE (where we get the code point) followed by C_ESC_STRING, which we just consume and ignore.

BUT a few lines further down in my mail I found the following:

\u-3929 ?;

And THIS one seems to be F0A7 (65536 - 3929 in hex, because RTF wants Unicode values above 32767 to be written as negative numbers), which is in the Unicode Private Use Area. I have no clue what this character means or what it is doing there, but it seems we're not alone:
https://stackoverflow.com/questions/37166528/in-rtf-what-is-the-meaning-of-u-3913
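
The arithmetic above can be checked with a few lines of Java. This is a small sketch, not code from MsgViewer; the class and method names are hypothetical. It assumes the number after the Unicode control word is a signed 16-bit value, so negative values are shifted up by 65536.

```java
public class RtfCodePointSketch {
    // RTF writes the number as a signed 16-bit value, so anything above
    // 32767 shows up negative. Adding 65536 recovers the UTF-16 code unit.
    static int toCodeUnit(int rtfValue) {
        return rtfValue < 0 ? rtfValue + 65536 : rtfValue;
    }

    // The Private Use Area of the Basic Multilingual Plane is U+E000..U+F8FF.
    static boolean isPrivateUse(int codeUnit) {
        return codeUnit >= 0xE000 && codeUnit <= 0xF8FF;
    }

    public static void main(String[] args) {
        int unit = toCodeUnit(-3929);
        System.out.println(Integer.toHexString(unit).toUpperCase()); // F0A7
        System.out.println(isPrivateUse(unit));                      // true
        System.out.println(toCodeUnit(8729));                        // 8729
    }
}
```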

So this sequence seems to be some weird character (which we don't care about), followed by the replacement character '?', which is NOT a C_ESC_STRING; it's a normal STRING.

It seems our parser needs to handle both those cases, which it does not at the moment.
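
To make the two cases concrete, here is a rough token-level sketch in plain Java. It is not the grammar change from the pull request; the `Kind` enum, the `Token` class and `render` are hypothetical stand-ins for the token kinds named in this thread. It assumes that the image of a C_UNICODE token holds only the signed decimal number, and that `\uc1` is in effect, so exactly one fallback character is dropped from a plain STRING.

```java
import java.util.List;

public class UnicodeCharSketch {
    // Hypothetical stand-ins for the token kinds named in this thread.
    enum Kind { C_UNICODE, C_ESC_STRING, STRING }

    static final class Token {
        final Kind kind;
        final String image;
        Token(Kind kind, String image) { this.kind = kind; this.image = image; }
    }

    // Renders a token stream, emitting each Unicode character once and
    // dropping whichever kind of fallback follows it.
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (t.kind == Kind.STRING) {
                out.append(t.image);
                continue;
            }
            if (t.kind == Kind.C_ESC_STRING) {
                continue; // decoding standalone hex escapes is out of scope here
            }
            // C_UNICODE: the image is assumed to be the signed decimal number.
            out.append((char) (Integer.parseInt(t.image) & 0xFFFF));
            if (i + 1 < tokens.size()) {
                Token next = tokens.get(i + 1);
                if (next.kind == Kind.C_ESC_STRING) {
                    i++; // case 1: hex-escaped fallback, drop the whole token
                } else if (next.kind == Kind.STRING && !next.image.isEmpty()) {
                    // case 2: plain fallback, drop one character (uc1 assumed)
                    out.append(next.image.substring(1));
                    i++;
                }
            }
        }
        return out.toString();
    }
}
```

The design point is simply that the token after C_UNICODE is optional and may be of either kind. How the real grammar tokenizes the text after the Unicode control word is not shown in this thread, so the one-character trim in case 2 is an assumption.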

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 20, 2022
@ThomasChr
Contributor Author

The Pull Request will fix the problem.
I never ever wanted to go so deep into the (parser) rabbit hole, so be gentle with me :-)

@lolo101
Owner

lolo101 commented Sep 21, 2022

Hi @ThomasChr and thank you very much for your engagement 😃

That's right, the <STRING> you want to get rid of is the placeholder (the char that should be displayed when the parser does not support Unicode). However, it seems that in your case the placeholder is not a <STRING> but a <C_ESC_STRING>, which is unexpected 🙂

I'll take a look at the PR as soon as I have some free time.

@ThomasChr
Contributor Author

Take your time, no stress!

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 26, 2022
ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022
ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022
ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022
lolo101 pushed a commit that referenced this issue Sep 27, 2022