ParseException could not get message when xml with invalid characters #29

Closed
kewudu opened this issue Apr 24, 2020 · 3 comments · Fixed by #123

Comments

@kewudu

kewudu commented Apr 24, 2020

I get the following backtrace when I load an XML file:

incompatible character encodings: UTF-8 and ASCII-8BIT
/usr/local/rvm/rubies/ruby-2.6.3/lib/ruby/gems/2.6.0/gems/rexml-3.2.2/lib/rexml/parseexception.rb:32:in `to_s'

The XML is declared as UTF-8 but contains invalid characters. ParseException#to_s forces the unconsumed buffer to ASCII-8BIT, so to_s itself raises an encoding error and the user never sees the actual parse error for the XML.

@kou
Member

kou commented Apr 25, 2020

Could you show a Ruby script and XML that reproduce this problem?

@kewudu
Author

kewudu commented Apr 30, 2020

> Could you show a Ruby script and XML that reproduce this problem?

My XML file contains invalidly encoded characters. Part of the file:

<?xml version="1.0" encoding="UTF-8"?>

<environmentblock>
  <userconf>
     <!--LocalHLTConfig.jsonτݾһˇҘѫքìիɧڻզ՚LocalHLTConfig.jsonτݾìȒτݾŚɝא"jsonfirst": "true"ʱìܡԅЈʹԃL -->
    .... lots of content...
  </userconf>
</environmentblock>
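
For a file like this, one way to find where the bad bytes sit before handing it to REXML is to check each line's encoding directly. This is only a sketch using core Ruby; `invalid_utf8_lines` is a hypothetical helper name, not part of REXML.

```ruby
# Hypothetical helper (not part of REXML): return the 1-based numbers of
# the lines that contain bytes which are not valid UTF-8.
def invalid_utf8_lines(text)
  text.b.each_line.with_index(1)
      .reject { |line, _| line.force_encoding(Encoding::UTF_8).valid_encoding? }
      .map { |_, number| number }
end

# A small stand-in document: line 2 carries two bytes (\xFF\xFE) that can
# never appear in valid UTF-8.
sample = "<a>\n<!-- \xFF\xFE -->\n</a>\n".b
bad_lines = invalid_utf8_lines(sample)
```

For a real file, the input could be read with `File.binread(path)` so that no transcoding happens before the check.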

My Ruby script is simple:

require 'rexml/document'
include REXML
require 'json'

# $confing_path is the full path of the xml file which I want to load
xml_file = Document.new(File::open($confing_path))

The XML file is UTF-8 encoded, and I know it contains invalid characters. When I load it, Ruby raises <REXML::ParseException: #<ArgumentError: invalid byte sequence in UTF-8>, and I can't get the exact error information from the exception message. If I temporarily change line 32 of the ParseException to_s method to use UTF-8, like this:
err << @source.buffer[0..80].force_encoding("UTF-8")
then I get the exact error information for the XML file:
...
Exception parsing
Line: 5
Position: 270
Last 80 unconsumed characters:
^M
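
A variation on this workaround, sketched with core Ruby only: after tagging the context as UTF-8, `String#scrub` can replace any invalid bytes so the final message is also valid UTF-8 and safe to print. `utf8_context` is a hypothetical helper name, and the `buffer` and `err` values below are made-up stand-ins for the parser state, not taken from REXML.

```ruby
# Hypothetical helper: take the first bytes of the unconsumed input, tag
# them as UTF-8, replace invalid bytes with "?", and flatten newlines.
def utf8_context(buffer, limit = 80)
  buffer[0..limit].force_encoding(Encoding::UTF_8).scrub("?").gsub("\n", " ")
end

# Stand-in buffer: "café" in UTF-8 followed by one invalid byte (\xFF).
buffer = "caf\xC3\xA9 \xFF\n".b
context = utf8_context(buffer)

# Stand-in for the UTF-8 part of the message (contains U+200B).
err = +"Malformed XML: <a \u200B>\n"
err << context
```

The trade-off is that the scrubbed message no longer shows the exact offending bytes, only where they were.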

@iangreenleaf

Here's a very simple reproduction of this bug (the base64 stuff is just there to make sure the special characters in the string come through):

require 'rexml/document'
require 'base64'
include REXML

begin
  REXML::Document.new(Base64.decode64("YT08YSDigIs+4oCL\n"))
  # Equivalent to:
  # REXML::Document.new "a=<a ​>​"
rescue => e
  e.to_s
end

The input is invalid XML and rightly triggers a ParseException, but then reading the exception's attributes raises another error: incompatible character encodings: UTF-8 and ASCII-8BIT (Encoding::CompatibilityError).

It looks like this is a bug in the ParseException code. Line 21 initializes an empty string, which defaults to UTF-8 encoding. Line 32 then forces a string's encoding to ASCII-8BIT and tries to append it to the UTF-8 string, which triggers the encoding mismatch:

err << @source.buffer[0..80].force_encoding("ASCII-8BIT").gsub(/\n/, ' ')
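
The mismatch can be reproduced without REXML. The sketch below uses made-up strings in place of the real message parts. It also shows that Ruby only raises when both sides contain non-ASCII bytes: an ASCII-only receiver simply adopts the binary encoding.

```ruby
# Stand-in for the UTF-8 part of the message; it contains U+200B.
err = +"Malformed XML: <a \u200B>\n"

# Stand-in for the error context; String#b returns a copy tagged as
# ASCII-8BIT (binary), similar to force_encoding("ASCII-8BIT").
context = "\u200B>".b

# Both strings hold non-ASCII bytes in different encodings, so the append
# is rejected.
raised = begin
  err << context
  false
rescue Encoding::CompatibilityError
  true
end

# An ASCII-only receiver does not raise; the result becomes binary.
plain = +"Line: 5\n"
plain << context
```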

naitoh added a commit to naitoh/rexml that referenced this issue May 3, 2024
…etrieved if the error content contained Unicode characters.

## Why?
If the XML tag contains Unicode characters when the error occurs, an `Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT` exception is raised and the ParseException error message cannot be retrieved.

See: ruby#29
@kou kou closed this as completed in #123 May 3, 2024
kou pushed a commit that referenced this issue May 3, 2024
…alid encoding XML (#123)

## Why?

If the XML tag contains Unicode characters and an error occurs for
the tag, an incompatible encoding error is raised, because our parse
exception message has a UTF-8 part (that includes the target tag
information) and an ASCII-8BIT part (that includes error context input).

Fix GH-29

Reported by DuKewu. Thanks!!!