Output of german umlauts is wrong #338

Closed
dat-leth opened this issue Dec 24, 2013 · 2 comments

Comments

@dat-leth

When I scrape a German webpage containing special characters like Ä, ä, Ö, ö, Ü, ü or ß, the output is wrong (it displays � or other garbage).
It seems that cheerio doesn't support UTF-8 or ISO-8859-1.

EDIT: It's not an issue with cheerio itself.

@SkoricIT

SkoricIT commented Jan 6, 2015

So, what is the issue then? I'm having it right now.

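# Assumption: readFileAsync comes from promisifying fs with bluebird's promisifyAll.
fs = require('bluebird').promisifyAll require('fs')
cheerio = require 'cheerio'

fs.readFileAsync('survey_logic_file.html', {encoding: 'utf8'}).then (rawhtml) ->
  fs.writeFileSync 'raw.html', rawhtml          # input written back unchanged
  $ = cheerio.load rawhtml
  fs.writeFileSync 'cheerio.html', $.html()     # markup as serialized by cheerio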

With this code, raw.html is written correctly, while cheerio.html is broken.

raw.html:

<h3>Wofür werden die Ergebnisse der Umfrage benutzt?</h3>

cheerio.html:

<h3>Wof&#xFC;r werden die Ergebnisse der Umfrage benutzt?</h3>

If $.xml() is used, the problem disappears. Unfortunately, it seems that xml() cannot be used on single nodes/selections.
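
For context: `&#xFC;` is the hexadecimal numeric character reference for U+00FC (ü), so a browser should render the two files identically; only the serialization differs. Here is a minimal sketch in plain Node.js that illustrates this. `decodeNumericEntities` is a hypothetical helper written for this example; it is not part of cheerio, and it does not handle named entities such as `&uuml;`.

```javascript
// Hypothetical helper (not a cheerio API): turns numeric character
// references like &#xFC; (hex) or &#252; (decimal) back into characters.
// Named entities and out-of-range code points are not handled here.
function decodeNumericEntities(html) {
  return html.replace(/&#([xX][0-9a-fA-F]+|[0-9]+);/g, (match, ref) => {
    const isHex = ref[0] === 'x' || ref[0] === 'X';
    const codePoint = isHex ? parseInt(ref.slice(1), 16) : parseInt(ref, 10);
    return String.fromCodePoint(codePoint);
  });
}

// The two lines quoted above; \u00FC is the character "ü".
const raw = '<h3>Wof\u00FCr werden die Ergebnisse der Umfrage benutzt?</h3>';
const fromCheerio = '<h3>Wof&#xFC;r werden die Ergebnisse der Umfrage benutzt?</h3>';

// After decoding, the cheerio output matches the raw input.
console.log(decodeNumericEntities(fromCheerio) === raw);
```

If the goal is to keep the literal ü in the output file rather than an equivalent reference, the entity encoding has to be switched off in the serializer, which this sketch does not do.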

@SkoricIT

SkoricIT commented Jan 6, 2015

Well, for reference, if someone else finds this:
It seems that you need to set decodeEntities: false (in the options object passed to cheerio.load) to get the correct behavior.
