Cuckoo utf-8 encoding error on filenames with Umlauts #136

Closed
michaelweiser opened this issue Feb 20, 2020 · 2 comments · Fixed by scVENUS/PeekabooAV-Installer#62

@michaelweiser
Contributor

When submitting a sample whose declared_name contains umlauts (or other non-ASCII characters), Cuckoo REST API requests for the task status fail with a utf-8 encoding error. The filename displayed in the backtrace looks suspiciously like it is latin1-encoded.

Actual error message to follow.

We need to find out whether we are submitting the filename parameter in the wrong encoding or whether it somehow reverts to latin1 on its round trip to and from analysis on Windows.

Related to our addition of submitting the original filename as per #81.

@michaelweiser
Contributor Author

TL;DR: We're running into something very similar but not identical to cuckoosandbox/cuckoo/issues/2473.

We're submitting the filename correctly encoded as utf-8. It arrives at the Cuckoo API as utf-8 and is entered into the database correctly. In the database and on the wire back from the database it is still utf-8. But by the time the database module hands it to SQLAlchemy, it has become latin-1. Adding ?charset=utf8 to the connection string works around the problem. The behaviour is specific to the mysqlclient module: after switching to PyMySQL or PostgreSQL, it works fine without an explicit encoding specification.
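
A minimal sketch of the workaround, assuming the parameter is spelled charset (matching the charset keyword of MySQLdb.connect) and that the connection string is an ordinary URL. The helper name with_charset and the example URL are hypothetical and not taken from Cuckoo or Peekaboo; the helper only appends the parameter if the URL does not already set one.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_charset(url, charset="utf8"):
    """Return url with a charset query parameter, keeping an existing one.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    query = dict(parse_qsl(parts.query))
    # Do not override a charset the administrator chose explicitly.
    query.setdefault("charset", charset)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Example connection string (assumed, credentials as in the reproducer).
print(with_charset("mysql://cuckoo:foo@localhost/cuckoo"))
```

Leaving an existing charset untouched keeps the helper safe to apply to connection strings that were already fixed by hand.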

@michaelweiser
Contributor Author

Reproducer:

$ cat encoding.py
import MySQLdb
db = MySQLdb.connect(db="cuckoo", user="cuckoo", passwd="foo")
print(db.character_set_name())
c = db.cursor()
c.execute("select * from tasks;")
print(c.fetchone()[1])

db = MySQLdb.connect(db="cuckoo", user="cuckoo", passwd="foo", charset='utf8')
print(db.character_set_name())
c = db.cursor()
c.execute("select * from tasks;")
print(c.fetchone()[1])

Output:

$ /opt/cuckoo/bin/python encoding.py
latin1
/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/F�b�r.py
utf8
/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/FÜbär.py

The default client encoding of libmysqlclient (the C library) seems to be latin1, even on an otherwise Unicode-only system:

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
[...]

python3 seems to handle it correctly even when the default encoding of latin1 is used:

$ /opt/peekaboo/bin/python3 encoding.py
latin1
/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/FÜbär.py
utf8
/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/FÜbär.py
$ /opt/peekaboo/bin/python3 encoding.py | od -c
0000000   l   a   t   i   n   1  \n   /   t   m   p   /   c   u   c   k
0000020   o   o   -   t   m   p   -   c   u   c   k   o   o   /   t   m
0000040   p   p   A   D   1   B   0   /   F 303 234   b 303 244   r   .
0000060   p   y  \n   u   t   f   8  \n   /   t   m   p   /   c   u   c
0000100   k   o   o   -   t   m   p   -   c   u   c   k   o   o   /   t
0000120   m   p   p   A   D   1   B   0   /   F 303 234   b 303 244   r
0000140   .   p   y  \n
0000144
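
For reading the od dumps, here is a small Python 3 sketch that prints the non-ASCII bytes of the file name in octal for both encodings. The helper name high_bytes_octal is made up for this illustration; the file name is the one from the outputs above.

```python
def high_bytes_octal(data):
    """Return the non-ASCII bytes of data as octal strings, as od -c shows them."""
    return [format(b, "o") for b in data if b >= 0x80]


name = "F\u00dcb\u00e4r.py"  # "FÜbär.py"

utf8_octal = high_bytes_octal(name.encode("utf-8"))
latin1_octal = high_bytes_octal(name.encode("latin-1"))

print("utf-8: ", utf8_octal)    # two bytes per umlaut
print("latin1:", latin1_octal)  # one byte per umlaut
```

A dump showing 303 234 and 303 244 therefore contains utf-8, while 334 and 344 on their own would be latin1.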

python2 breaks again when talking to a pipe:

$ /opt/cuckoo/bin/python encoding.py | od -c
Traceback (most recent call last):
  File "encoding.py", line 12, in <module>
    print(c.fetchone()[1])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 34: ordinal not in range(128)
0000000   l   a   t   i   n   1  \n   /   t   m   p   /   c   u   c   k
0000020   o   o   -   t   m   p   -   c   u   c   k   o   o   /   t   m
0000040   p   p   A   D   1   B   0   /   F 334   b 344   r   .   p   y
0000060  \n   u   t   f   8  \n
0000066
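
The traceback can be imitated in Python 3 by encoding the path to ascii explicitly. This is only a sketch, under the assumption that python2 falls back to the ascii codec when stdout is a pipe rather than a terminal; it is not the code path mysqlclient or Cuckoo takes.

```python
path = "/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/F\u00dcb\u00e4r.py"

try:
    path.encode("ascii")
except UnicodeEncodeError as exc:
    error = exc
    # exc.start is the index of the first character that cannot be encoded.
    print(exc.encoding, exc.start, repr(exc.object[exc.start]))
```

The reported position is an index into the string (characters, not bytes), which is why it is the same no matter how the umlaut was encoded on the wire.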

On the wire it seems to be utf8, but inside python it somehow seems to become latin1 in both cases, at least transiently (otherwise the ascii codec would talk of u'\dc' in the utf-8 case):

$ strace -fe recvfrom -s 65536 /opt/cuckoo/bin/python encoding.py | od -c
recvfrom(4, "\1\0\0\1\2[...]\00211+/tmp/cuckoo-tmp-cuckoo/tmppAD1B0/F\303\234b\303\244r.py\4file\0010\0011\0\0\0\0\0\0\0010\0010\0232020-02-26 17:30:57\0232020-02-26 17:30:57\373\373\7pending\0011\373\373\373\5\0\0\32\376\0\0!\0", 16384, 0, NULL, NULL) = 1404
0000000   l   a   t   i   n   1  \n   /   t   m   p   /   c   u   c   k
0000020   o   o   -   t   m   p   -   c   u   c   k   o   o   /   t   m
0000040   p   p   A   D   1   B   0   /   F 334   b 344   r   .   p   y
Traceback (most recent call last):
  File "encoding.py", line 12, in <module>
    print(c.fetchone()[1])
UnicodeEncodeError: 'ascii' codec can't encode character u'\xdc' in position 34: ordinal not in range(128)
0000060  \n   u   t   f   8  \n
0000066

@michaelweiser michaelweiser removed the bug label Feb 27, 2020
michaelweiser added a commit to michaelweiser/PeekabooAV-Installer that referenced this issue Feb 27, 2020
Problems like scVENUS/PeekabooAV#136 seem to be caused by a (still
somewhat mysterious) latin1 encoding default in the mysqlclient package
or rather libmysqlclient C library it uses. This seems to be mitigated
in python3. As a workaround for python2 we add ?charset=utf8 to the
connect string.

Closes scVENUS/PeekabooAV#136.