Wrong text encoding assumption #212

Open
hongquan opened this issue Jan 7, 2022 · 7 comments

@hongquan

hongquan commented Jan 7, 2022

This QR code image contains the text "Il était une fois, un noël radieiux et un gros test. Manchmal sind wir über freundlich."

but ZBar returns "Il 矇tait une fois, un no禱l radieiux et un gros test. Manchmal sind wir 羹ber freundlich.".

@ldoolitt
Contributor

I was in this section of the code hacking on issue #237. I can see two things:

  1. False-positive in text_is_big5()
  2. The order of encodings in enc_list[] puts UTF-8 at the end, and somehow your string works OK as SJIS.

Disabling text_is_big5 and reordering enc_list to put utf8_cd at the front yields a version that gives your desired result.
Feel free to do that on your local machine.
This needs attention from someone who knows more about character encoding than me, before any code changes get pushed to zbar master.
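
To illustrate why the order matters, here is a minimal Python sketch. It is not ZBar's C code: the function name `first_decodable` and the encoding tuples are hypothetical stand-ins for walking a list like enc_list[] and accepting the first conversion that succeeds. It does not model the text_is_big5() heuristic.

```python
def first_decodable(data, encodings):
    """Return (text, encoding) for the first encoding that decodes data without error."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the payload")


# Hypothetical payload: the UTF-8 bytes a QR encoder would typically store.
payload = "Il était une fois".encode("utf-8")

# UTF-8 tried last: a legacy multi-byte encoding may accept the bytes first.
print(first_decodable(payload, ("big5", "shift_jis", "utf-8")))

# UTF-8 tried first: valid UTF-8 input comes back unchanged.
print(first_decodable(payload, ("utf-8", "big5", "shift_jis")))
```

Strict UTF-8 decoding rejects most byte sequences produced by legacy encodings, so trying it first rarely misfires, while the legacy multi-byte encodings accept many UTF-8 byte pairs as valid characters.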

@melolontha-melolontha

It looks like the same bug was already reported several years ago in https://sourceforge.net/p/zbar/bugs/73/ in the special case of accented characters.

I came across this bug after checking a QR code created from a vCard which used the German character ß twice. The QR code had been created this way:

cat "my_Vcard_with_GPG_fingerprint.vcf" | qrencode -s 3 -v 10 -o q.png

The created QR code was OK (according to my mobile phone's app Barcode Scanner version 4.7.8 and other code scanner apps).

zbarimg q.png

however displayed the German letter ß as the CJK characters テ歹. The OP already showed the bug for some French accented vowels and ligatures and for the German lowercase umlaut ü. I would not be surprised if many or even all country-specific characters, e.g.

  • the Danish Ø, the bolle-å from various Scandinavian languages,

  • characters with diaeresis or trema, macron, accent grave, accent aigu, cedille,

  • guillemets (French quotation marks),

  • Czech haceks, Polish ogoneks,

  • long vowels in the Hungarian language,

  • ñ, ¿ and ¡ for Spanish and Portuguese

all go wrong.

For some strange reason, zbarimg seems to use non-UTF-8 output for characters other than pure ASCII characters. zbarimg should output UTF-8 by default, or at least should offer an option to do so.

@unDocUMeantIt

i just ran into this as well, trying to verify QR codes i had created myself. they contain PGP signed meta data, and the non-UTF-8 decoding of umlaut characters now invalidates these signatures. a barcode reader app on my smartphone correctly decodes the QR codes, these signatures are valid as expected.

i've noticed that this issue also affects the GUI QtQR, as it relies on the python library.

i'm not sure autodetection of encodings can be done reliably at all. at least, UTF-8 should probably be the default, and there should be an encoding parameter to manually set the desired encoding (e.g., anything from iconv -l) to override autodetection in case it fails.
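
a rough sketch of what such a parameter could look like, in Python rather than ZBar's C. the function `decode_text` and its `encoding` argument are hypothetical and not part of any ZBar API. the assumption here is that UTF-8 is the default and that a caller-supplied name always wins.

```python
import codecs
from typing import Optional


def decode_text(data: bytes, encoding: Optional[str] = None) -> str:
    """Decode a raw payload; an explicit encoding overrides the UTF-8 default."""
    if encoding is not None:
        # Validate the name up front; raises LookupError for unknown encodings.
        codecs.lookup(encoding)
        return data.decode(encoding)
    # Assumed default: strict UTF-8, so a wrong guess fails loudly
    # instead of silently producing garbled text.
    return data.decode("utf-8")


print(decode_text("Zürich".encode("utf-8")))
print(decode_text("Zürich".encode("latin-1"), encoding="latin-1"))
```

failing loudly matters for the signature case: a decode error is easier to notice than text that looks plausible but no longer matches the signed bytes.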

@ldoolitt
Contributor

I'm glad someone is looking at this. Y'all are probably the "someone who knows more about character encoding than me" mentioned above :-)
If you actually try to modify the code in qrdectxt.c to fix encoding bugs, I humbly suggest you start with the version I made that's hanging out in MR #241. That copy fixes one easy-ish bug, and is much better formatted for maintenance.

This brings up a key point: are the project owners still around, so someone can actually accept merge requests into master?

@martinxyz

martinxyz commented Apr 3, 2023

This is (probably) the same issue as in GNOME Decoder. Summary:

  • Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)
  • Looks like the underlying encoding problem (see Stack Overflow) is pretty ugly. (Some guessing is required.)

This Python session reproduces the mistake:

In [5]: 'Zürich'.encode('UTF8').decode('BIG5')
Out[5]: 'Z羹rich'
In [6]: 'Il était une fois, un noël'.encode('UTF8').decode('BIG5')
Out[6]: 'Il 矇tait une fois, un no禱l'

So the issue seems to be that it prefers Big5 over UTF-8. (I haven't understood the logic in qrdectxt.c yet.) Not sure I like that assumption, but as per the link above, it's possible that it's the correct order in some places of the world. (Certainly not in Zürich, though.)
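
The same session can be run in reverse as a stopgap for text that has already been garbled. This is a minimal sketch, assuming the bytes were UTF-8 and were misread as Big5 as in the session above; the helper name `undo_misdecoding` is made up for illustration.

```python
def undo_misdecoding(garbled: str, wrong: str = "big5", right: str = "utf-8") -> str:
    """Re-encode with the wrongly assumed encoding, then decode with the intended one."""
    return garbled.encode(wrong).decode(right)


original = "Il était une fois, un noël"
# Reproduce the mistake from the session above.
garbled = original.encode("utf-8").decode("big5")

# Reverse the mistake to get the original text back.
print(undo_misdecoding(garbled))
```

This only works when the wrong decoding was lossless and you know which encoding was wrongly assumed, so it is not a substitute for fixing the detection order.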

@matheusmoreira
Contributor

  • Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)

It is a good workaround. I implemented the binary decoding option by bypassing the built-in character encoding conversion. It just returns the data as-is so it can be decoded separately.

Care must be taken to decode every QR code individually though. Otherwise, you won't be able to tell where each QR code begins or ends.
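
As a rough illustration of decoding each code separately, here is a minimal Python sketch. It assumes the caller has already obtained one raw `bytes` payload per QR code with binary mode enabled; the list `payloads` and the helper `decode_each` are hypothetical and not part of ZBar.

```python
def decode_each(payloads):
    """Decode every payload on its own; keep the raw bytes when UTF-8 decoding fails."""
    results = []
    for raw in payloads:
        try:
            results.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            # Not text (or not UTF-8): hand the bytes back untouched.
            results.append(raw)
    return results


# Hypothetical scan result: one text payload and one binary payload.
payloads = ["Zürich".encode("utf-8"), b"\xff\xfe\x00\x01"]
print(decode_each(payloads))

# Concatenating first would lose the boundary between the two codes.
print(b"".join(payloads))
```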

@martinxyz

My concern is that it may be worse for the case where the QR code actually has an encoding set. In that case the text could always be converted correctly, but only if the library does the conversion.
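
A minimal sketch of that idea in Python, assuming the decoder exposes the declared encoding as an ECI assignment number. The `eci` parameter, the helper `to_text`, and the UTF-8 default are assumptions for illustration; the table lists only a few ECI assignments as I understand them and should be checked against the ECI register.

```python
from typing import Optional

# Partial, assumed mapping from ECI assignment numbers to Python codec names.
ECI_TO_CODEC = {
    3: "iso-8859-1",
    20: "shift_jis",
    26: "utf-8",
    28: "big5",
}


def to_text(data: bytes, eci: Optional[int] = None) -> str:
    """Use the declared encoding when the code carries one; otherwise assume UTF-8."""
    if eci is not None:
        codec = ECI_TO_CODEC.get(eci)
        if codec is None:
            raise ValueError(f"unsupported ECI assignment: {eci}")
        # No guessing needed: the code itself says how to read the bytes.
        return data.decode(codec)
    # Nothing declared: fall back to the assumed default.
    return data.decode("utf-8")


print(to_text("über".encode("iso-8859-1"), eci=3))
print(to_text("über".encode("utf-8")))
```

With binary output the declared encoding would have to be passed along with the bytes, otherwise the caller is back to guessing.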
