Exception: Encountered <C_ESC_STRING>, Expected <STRING> #136

ThomasChr · 2022-09-19T10:11:12Z

I've got a mail which throws the following error:

2022-09-19T12:07:32,092 ERROR (AutoMBox.java:43): Exception: net.sourceforge.MSGViewer.rtfparser.ParseException: Encountered " <C_ESC_STRING> "\\\'b7 "" at line 608, column 160.
Was expecting:
    <STRING> ...

net.sourceforge.MSGViewer.rtfparser.ParseException: Encountered " <C_ESC_STRING> "\\\'b7 "" at line 608, column 160.
Was expecting:
    <STRING> ...

        at net.sourceforge.MSGViewer.rtfparser.RTFParser.generateParseException(RTFParser.java:394) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.jj_consume_token(RTFParser.java:332) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.unicode_char(RTFParser.java:210) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:172) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.group(RTFParser.java:184) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.rtfparser.RTFParser.parse(RTFParser.java:34) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.HtmlFromRtf.extractHtml(HtmlFromRtf.java:49) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.HtmlFromRtf.<init>(HtmlFromRtf.java:13) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerHelper.extractHTMLFromRTF(ViewerHelper.java:88) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.lambda$bodyText$1(ViewerPanel.java:343) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.resultOrElse(AutoMBox.java:41) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.bodyText(ViewerPanel.java:344) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.updateBody(ViewerPanel.java:324) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.doParse(ViewerPanel.java:430) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.lambda$new$0(AutoMBox.java:18) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.resultOrElse(AutoMBox.java:41) ~[msgviewer.jar:?]
        at at.redeye.FrameWork.base.AutoMBox.run(AutoMBox.java:59) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.parse(ViewerPanel.java:420) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ViewerPanel.view(ViewerPanel.java:106) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.SingleWin.openFile(SingleWin.java:33) ~[msgviewer.jar:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183) ~[?:?]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:197) ~[?:?]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:179) ~[?:?]
        at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:?]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484) ~[?:?]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150) ~[?:?]
        at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173) ~[?:?]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:?]
        at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596) ~[?:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.invokeGui(ModuleLauncher.java:108) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.invoke(ModuleLauncher.java:64) ~[msgviewer.jar:?]
        at net.sourceforge.MSGViewer.ModuleLauncher.main(ModuleLauncher.java:27) ~[msgviewer.jar:?]

I can't provide the mail because it's private. I just wanted to open the issue so that we can track it. Maybe I'll find the cause myself.
Any ideas off the top of your head?


ThomasChr · 2022-09-19T11:20:34Z

The error comes from this character: https://www.codetable.net/decimal/8729
Here is one to test with:

∙

This is the RTF I can see in the MsgViewer:

{\rtf1\ansi\ansicpg1251\fromhtml1 \fbidis \deff0{\fonttbl
{\f0\fswiss\fcharset204 Arial;}
{\f1\fmodern Courier New;}
{\f2\fnil\fcharset2 Symbol;}
{\f3\fmodern\fcharset0 Courier New;}
{\f4\fswiss "Arial";}}
{\colortbl\red0\green0\blue0;\red5\green99\blue193;}
\uc1\pard\plain\deftab360 \f0\fs24 
{\*\htmltag2 \par }
{\*\htmltag18 <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">}
{\*\htmltag34 <head>}
{\*\htmltag161 <meta name=Generator content="Microsoft Word 15 (filtered medium)">}
{\*\htmltag241 <style>}
{\*\htmltag241 <!--\par /* Font Definitions */\par @font-face\par \tab \{font-family:"Cambria Math";\par \tab panose-1:2 4 5 3 5 4 6 3 2 4;\}\par @font-face\par \tab \{font-family:Calibri;\par \tab panose-1:2 15 5 2 2 2 4 3 2 4;\}\par /* Style Definitions */\par p.MsoNormal, li.MsoNormal, div.MsoNormal\par \tab \{margin:0cm;\par \tab margin-bottom:.0001pt;\par \tab font-size:11.0pt;\par \tab font-family:"Calibri",sans-serif;\par \tab mso-fareast-language:EN-US;\}\par span.E-MailFormatvorlage17\par \tab \{mso-style-type:personal-compose;\par \tab font-family:"Arial",sans-serif;\par \tab font-variant:normal !important;\par \tab color:#0D0D0D;\par \tab text-transform:none;\par \tab font-weight:normal;\par \tab font-style:normal;\par \tab text-decoration:none none;\par \tab vertical-align:baseline;\}\par .MsoChpDefault\par \tab \{mso-style-type:export-only;\par \tab font-family:"Calibri",sans-serif;\par \tab mso-fareast-language:EN-US;\}\par @page WordSection1\par \tab \{size:612.0pt 792.0pt;\par \tab margin:70.85pt 70.85pt 2.0cm 70.85pt;\}\par div.WordSection1\par \tab \{page:WordSection1;\}\par -->}
{\*\htmltag249 </style>}
{\*\htmltag241 <!--[if gte mso 9]><xml>\par <o:shapedefaults v:ext="edit" spidmax="1026" />\par </xml><![endif]-->}
{\*\htmltag241 <!--[if gte mso 9]><xml>\par <o:shapelayout v:ext="edit">\par <o:idmap v:ext="edit" data="1" />\par </o:shapelayout></xml><![endif]-->}
{\*\htmltag41 </head>}
{\*\htmltag50 <body lang=DE link="#0563C1" vlink="#954F72">}\htmlrtf \lang1031 \htmlrtf0 
{\*\htmltag96 <div class=WordSection1>}\htmlrtf {\htmlrtf0 
{\*\htmltag64 <p class=MsoNormal>}\htmlrtf {\htmlrtf0 
{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0 
{\*\htmltag148 <span style='font-family:"Arial",sans-serif;color:#0D0D0D'>}\htmlrtf {\f4 \htmlrtf0 
{\*\htmltag244 <o:p>}
{\*\htmltag252 </o:p>}
{\*\htmltag156 </span>}\htmlrtf }\htmlrtf0 \htmlrtf\par}\htmlrtf0
\htmlrtf \par
\htmlrtf0 
{\*\htmltag72 </p>}
{\*\htmltag104 </div>}\htmlrtf }\htmlrtf0 
{\*\htmltag58 </body>}
{\*\htmltag27 </html>}}

The problem should be somewhere around those three lines with the 8729 Bullet Operator:

{\*\htmltag64 <p class=MsoNormal>}\htmlrtf {\htmlrtf0 
{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0 
{\*\htmltag148 <span style='font-family:"Arial",sans-serif;color:#0D0D0D'>}\htmlrtf {\f4 \htmlrtf0

I'm trying to get closer to the culprit, but my RTF knowledge only goes so far...

ThomasChr · 2022-09-19T13:39:49Z

Changing RTFParser.jj like this seems to do the trick:

Index: MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj
IDEA additional info:
Subsystem: com.intellij.openapi.diff.impl.patch.CharsetEP
<+>UTF-8
===================================================================
diff --git a/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj b/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj
--- a/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj	(revision a177966d6fac70c7364a71679a551a038e680053)
+++ b/MSGViewer/src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj	(date 1663595334221)
@@ -131,7 +131,6 @@
 }
 {
     code = <C_UNICODE>
-    <STRING>
     {
         current_group.addUnicodeChar( code.image );
     }

(I just removed the <STRING> from the unicode_char() production.)

This changes the generated file RTFParser.java from this:

  final public void unicode_char() throws ParseException {Token code;
    code = jj_consume_token(C_UNICODE);
    jj_consume_token(STRING);
current_group.addUnicodeChar( code.image );
}

to this:

  final public void unicode_char() throws ParseException {Token code;
    code = jj_consume_token(C_UNICODE);
current_group.addUnicodeChar( code.image );
}

I think the jj_consume_token(STRING) isn't needed in this case, but I may be way off here!

Outcome: no exception and the Unicode char is shown, but I'm not sure this is actually correct.

ThomasChr · 2022-09-19T13:51:04Z

Okay, the error is gone, but it seems that the char is now shown twice.

So that doesn't seem to be the right fix.

ThomasChr · 2022-09-20T10:03:53Z

Some more info from here: https://www.zopatista.com/python/2012/06/06/rtf-and-unicode/

{\*\htmltag84 &#8729;}\htmlrtf \u8729\'95\htmlrtf0

So this sequence is actually the same char twice:
once in the form \uUNICODECODEPOINT and once in the form \'WIN1252_CODEPOINT.

I'm not quite sure why this is the case...
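
One likely explanation, offered as a hedged reading rather than a confirmed diagnosis: the `\uc1` in the RTF header above declares that each `\u` control word is followed by one fallback character for readers that cannot display Unicode. The sketch below decodes both halves of `\u8729\'95` to show that they are close relatives rather than the exact same code point. The class and method names are hypothetical, and Windows-1252 is assumed here only because the comment above names it (the sample RTF itself declares `\ansicpg1251`, where 0x95 and 0xB7 map to the same characters).

```java
import java.nio.charset.Charset;

class RtfFallbackByte {
    // Decode one RTF hex-escaped byte (the hh in \'hh) using the given code page.
    // Hypothetical helper for illustration; not part of MsgViewer.
    static int decodeAnsiByte(int value, String charsetName) {
        byte[] bytes = { (byte) value };
        return new String(bytes, Charset.forName(charsetName)).codePointAt(0);
    }

    public static void main(String[] args) {
        // The Unicode half of the sequence: decimal 8729 is the BULLET OPERATOR.
        System.out.printf("unicode  : U+%04X%n", 8729);
        // The fallback half: byte 0x95 in Windows-1252 is the BULLET.
        System.out.printf("fallback : U+%04X%n", decodeAnsiByte(0x95, "windows-1252"));
        // The original mail used byte 0xB7, the MIDDLE DOT.
        System.out.printf("fallback : U+%04X%n", decodeAnsiByte(0xB7, "windows-1252"));
    }
}
```

If this reading is right, a Unicode-aware parser should emit only the `\u` character and drop the fallback, which would also explain a doubled character when the fallback is not skipped.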

ThomasChr · 2022-09-20T11:10:49Z

Some background on how the parser works: https://stackoverflow.com/questions/17310377/what-does-consume-mean-in-javacc

ThomasChr · 2022-09-20T11:27:35Z

Okay, so at the moment I have two different cases for the parser:

Number 1:

\u8729\'b7

This is Unicode code point 8729 followed by the WIN-1252 byte B7. Both are a "bullet point" character.
So this should be parsed as follows
(see src/main/javacc/net/sourceforge/MSGViewer/rtfparser/RTFParser.jj):
C_UNICODE (where we get the code point), followed by a C_ESC_STRING, which we just consume and ignore.

BUT, a few lines further down in my mail I got the following:

\u-3929 ?;

And THIS one seems to be U+F0A7 (65536 - 3929 = 61607, which is F0A7 in hex, because RTF writes Unicode values above 32767 as negative numbers), which is in the Unicode Private Use Area. I have no clue what this char means or what it's doing there, but it seems we're not alone:
https://stackoverflow.com/questions/37166528/in-rtf-what-is-the-meaning-of-u-3913
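
The arithmetic above can be checked with a few lines of Java. This is a minimal sketch with hypothetical class and method names, not code from MsgViewer: it maps the signed value RTF writes back to a code point and asks the JDK whether that code point is in a private-use range.

```java
class RtfSignedUnicode {
    // RTF writes the Unicode parameter as a signed 16-bit number,
    // so values above 32767 appear as negative numbers in the file.
    static int toCodePoint(int rtfValue) {
        return rtfValue < 0 ? rtfValue + 65536 : rtfValue;
    }

    // True if the JDK classifies the code point as private use.
    static boolean isPrivateUse(int codePoint) {
        return Character.getType(codePoint) == Character.PRIVATE_USE;
    }

    public static void main(String[] args) {
        int codePoint = toCodePoint(-3929);
        System.out.println(Integer.toHexString(codePoint).toUpperCase()); // prints F0A7
        System.out.println(isPrivateUse(codePoint));                      // prints true
    }
}
```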

So this sequence seems to be some weird character (which we don't care about), followed by the replacement character '?', which is NOT a C_ESC_STRING. It's a normal STRING.

It seems our parser needs to handle both of these cases, which it does not do at the moment.
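
To make the two cases concrete, here is a standalone sketch that handles both outside of JavaCC. It is not the change made in the pull request, and the class name, regular expression, and `decode` method are all hypothetical. It assumes exactly one fallback per Unicode control word (the `\uc1` setting seen in the RTF above) and accepts either a `\'hh` escape or a single plain character as that fallback.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UnicodeFallbackSketch {
    // Matches the Unicode control word with its (possibly negative) number,
    // an optional delimiter space, and then at most ONE fallback: either a
    // hex escape such as \'b7 or a single plain character such as '?'.
    private static final Pattern UNICODE_WITH_FALLBACK =
            Pattern.compile("\\\\u(-?\\d+) ?(?:\\\\'[0-9a-fA-F]{2}|[^\\\\{}])?");

    // Replaces every match with the decoded character and drops the fallback.
    static String decode(String rtf) {
        Matcher m = UNICODE_WITH_FALLBACK.matcher(rtf);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(rtf, last, m.start());
            int value = Integer.parseInt(m.group(1));
            if (value < 0) {
                value += 65536; // undo RTF's signed 16-bit encoding
            }
            out.append((char) value);
            last = m.end();
        }
        out.append(rtf, last, rtf.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // Case 1: the fallback is a hex escape (C_ESC_STRING in the grammar).
        System.out.println(decode("\\u8729\\'b7"));
        // Case 2: the fallback is a plain '?' (STRING in the grammar); ';' is kept.
        System.out.println(decode("\\u-3929 ?;"));
    }
}
```

In the grammar itself, the equivalent idea would be for unicode_char() to accept either token kind after C_UNICODE; how the real fix expresses that is best read from the pull request.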

ThomasChr · 2022-09-20T12:25:51Z

The Pull Request will fix the problem.
I never wanted to go this deep into the (parser) rabbit hole, so be gentle with me :-)

lolo101 · 2022-09-21T20:14:38Z

Hi @ThomasChr and thank you very much for your engagement 😃

That's right, the <STRING> you want to get rid of is the placeholder (the char that should be displayed when the parser does not support Unicode). However, it seems that in your case the placeholder is not a <STRING> but a <C_ESC_STRING>, which is unexpected 🙂

I'll take a look at the PR as soon as I have some free time.

ThomasChr · 2022-09-21T20:24:56Z

Take your time, no stress!

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 20, 2022

fix unicode parsing. fixes lolo101#136

029243c

ThomasChr mentioned this issue Sep 20, 2022

Fix unicode parsing #139

Merged

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 26, 2022

fix unicode parsing. fixes lolo101#136

5d922d6

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022

fix unicode parsing. fixes lolo101#136

fa9c7bd

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022

fix unicode parsing. fixes lolo101#136

7809917

ThomasChr added a commit to ThomasChr/MsgViewer that referenced this issue Sep 27, 2022

fix unicode parsing. fixes lolo101#136

ef44f62

lolo101 closed this as completed in #139 Sep 27, 2022

lolo101 pushed a commit that referenced this issue Sep 27, 2022

fix unicode parsing. fixes #136

07526fb
