[bug] Nokogiri::XML::Reader.from_io.each misidentifies character encoding? #2882

Closed
koshigoe opened this issue May 18, 2023 · 6 comments · Fixed by #2883

@koshigoe

Please describe the bug

Nokogiri::XML::Reader.from_io.each raises Nokogiri::XML::SyntaxError when an XML node contains a long run of non-ASCII characters.
The node contains only valid UTF-8 characters, yet parsing fails with the error FATAL: Input is not proper UTF-8, indicate encoding !.

Help us reproduce what you're seeing

require 'nokogiri'
require 'stringio'

NON_ASCII = "\u{3042}"
XML_TEMPLATE = <<~XML
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <a>%<content>s</a>
</root>
XML

[325, 326].each do |length|
  io = StringIO.new(format(XML_TEMPLATE, content: NON_ASCII * length))
  Nokogiri::XML::Reader.from_io(io).tap { |x| pp x }.each {}
  puts "OK: #{length}"
rescue => e
  puts "NG: #{length} (#{e.inspect})"
end

__END__

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]

### nokogiri 1.14.4

OK: 325
OK: 326

### nokogiri 1.15.0

OK: 325
NG: 326 (#<Nokogiri::XML::SyntaxError: 3:332: FATAL: Input is not proper UTF-8, indicate encoding !>)
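
To back up the claim that the input itself is well-formed, the same document can be checked without involving Nokogiri at all. This is a minimal sketch that uses only Ruby's built-in String encoding methods; the constant name DOCUMENT is introduced here for illustration and is not part of the original script.

```ruby
# Rebuild the failing document from the report and inspect it with plain Ruby.
NON_ASCII = "\u{3042}" # HIRAGANA LETTER A, encoded as 3 bytes in UTF-8
XML_TEMPLATE = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <root>
    <a>%<content>s</a>
  </root>
XML

# DOCUMENT is a hypothetical name; 326 is the length that fails in the report.
DOCUMENT = format(XML_TEMPLATE, content: NON_ASCII * 326)

puts "encoding:        #{DOCUMENT.encoding}"
puts "valid encoding?  #{DOCUMENT.valid_encoding?}"
puts "bytes per char:  #{NON_ASCII.bytesize}"
```

If `valid_encoding?` reports true here, the "not proper UTF-8" error cannot be attributed to the bytes handed to the reader.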

Expected behavior

No error should be raised.

Environment

# Nokogiri (1.15.0)
    ---
    warnings: []
    nokogiri:
      version: 1.15.0
      cppflags:
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri"
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri/include"
      - "-I/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri/include/libxml2"
      ldflags: []
    ruby:
      version: 3.1.4
      platform: arm64-darwin22
      gem_platform: arm64-darwin-22
      description: ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]
      engine: ruby
    libxml:
      source: packaged
      precompiled: true
      patches:
      - 0001-Remove-script-macro-support.patch
      - 0002-Update-entities-to-remove-handling-of-ssi.patch
      - 0003-libxml2.la-is-in-top_builddir.patch
      - '0009-allow-wildcard-namespaces.patch'
      - 0010-update-config.guess-and-config.sub-for-libxml2.patch
      - 0011-rip-out-libxml2-s-libc_single_threaded-support.patch
      libxml2_path: "/Users/koshigoe/.rbenv/versions/3.1.4/lib/ruby/gems/3.1.0/gems/nokogiri-1.15.0-arm64-darwin/ext/nokogiri"
      memory_management: ruby
      iconv_enabled: true
      compiled: 2.11.3
      loaded: 2.11.3
    libxslt:
      source: packaged
      precompiled: true
      patches:
      - 0001-update-config.guess-and-config.sub-for-libxslt.patch
      datetime_enabled: true
      compiled: 1.1.38
      loaded: 1.1.38
    other_libraries:
      zlib: 1.2.13
      libiconv: '1.17'
      libgumbo: 1.0.0-nokogiri

Additional context

ruby 3.1.4p223 (2023-03-30 revision 957bb7cb81) [arm64-darwin22]
@koshigoe koshigoe added the state/needs-triage Inbox for non-installation-related bug reports or help requests label May 18, 2023
@flavorjones
Member

@koshigoe Thank you for reporting this! This error message is being generated by libxml2. I have reproduced the issue and will investigate.

@flavorjones
Member

Git bisect shows that this is the commit that introduced the new behavior:

https://gitlab.gnome.org/GNOME/libxml2/-/commit/3582b07bd24d438be7dd08ab57e3f9e635373e32

commit 3582b07bd24d438be7dd08ab57e3f9e635373e32
Author: Nick Wellnhofer <wellnhofer@aevum.de>
Date:   Sun Nov 13 22:57:32 2022 +0100

    parser: Fix content parser progress checks
    
    This is another attempt at fixing parser progress checks. Instead of
    relying on in->consumed, which could overflow, change some content
    parser functions to make guaranteed progress on certain byte sequences.

@flavorjones flavorjones added upstream/libxml2 and removed state/needs-triage Inbox for non-installation-related bug reports or help requests labels May 18, 2023
@flavorjones
Member

I've narrowed this down to specific changes in libxml2 chunk parsing that may be a bug. I'll open an issue upstream and link to it here.
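
The thread does not spell out the upstream defect, so what follows is only a general illustration of why chunk boundaries matter for multi-byte encodings, not a reconstruction of the libxml2 bug. It splits a valid UTF-8 string at a fixed byte offset and shows that individual chunks can look malformed even though the stream as a whole is fine. CHUNK_SIZE and the other constant names are hypothetical.

```ruby
# A valid UTF-8 string made of four 3-byte characters (12 bytes in total).
TEXT = "\u{3042}" * 4

# Hypothetical chunk size, chosen so that boundaries fall inside a character.
CHUNK_SIZE = 4

# Slice by byte offset, the way a reader pulling fixed-size blocks from an IO would.
# Array#pack("C*") yields binary (ASCII-8BIT) strings.
CHUNKS = TEXT.bytes.each_slice(CHUNK_SIZE).map { |slice| slice.pack("C*") }

CHUNKS.each_with_index do |chunk, index|
  as_utf8 = chunk.dup.force_encoding(Encoding::UTF_8)
  puts "chunk #{index}: #{chunk.bytesize} bytes, valid on its own? #{as_utf8.valid_encoding?}"
end

# Concatenating the chunks restores the original, well-formed byte sequence.
REJOINED = CHUNKS.join.force_encoding(Encoding::UTF_8)
puts "rejoined valid? #{REJOINED.valid_encoding?}"
```

A chunked parser therefore has to carry an incomplete trailing byte sequence over to the next chunk rather than validate each chunk in isolation.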

@flavorjones
Member

flavorjones commented May 18, 2023

Neat! This was already reported upstream at https://gitlab.gnome.org/GNOME/libxml2/-/issues/542 and was fixed about an hour ago in https://gitlab.gnome.org/GNOME/libxml2/-/commit/e0f3016f71297314502a3620a301d7e064cbb612

I expect it'll be fixed shortly in a libxml2 release. I'll leave this open until that happens and I can ship a new nokogiri release.

@flavorjones
Member

libxml2 v2.11.4 is out with the fix: https://gitlab.gnome.org/GNOME/libxml2/-/releases/v2.11.4

I'll try to get a release out in the next day.

@flavorjones
Member

Nokogiri v1.15.1 is out with this upstream fix. https://github.com/sparklemotion/nokogiri/releases/tag/v1.15.1
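
For anyone who wants to confirm that a given install is at or past the fixed release, a small version comparison is enough. This sketch uses RubyGems' own Gem::Version; the helper name at_least_fixed_release? is hypothetical, and the version strings are passed in by hand rather than read from an installed gem.

```ruby
require "rubygems"

# First Nokogiri release that ships the upstream fix, per this thread.
FIXED_IN = Gem::Version.new("1.15.1")

# Hypothetical helper: true when the given Nokogiri version is >= 1.15.1.
# Note: the report shows 1.14.4 was not affected, so a false result for
# pre-1.15 versions does not by itself mean the bug is present.
def at_least_fixed_release?(version_string)
  Gem::Version.new(version_string) >= FIXED_IN
end

%w[1.15.0 1.15.1].each do |version|
  puts "#{version}: at or past fixed release? #{at_least_fixed_release?(version)}"
end
```

In a real application the string would typically come from Nokogiri::VERSION.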
