How to ignore Content-Transfer-Encoding when the text(mail) crawling #1449

Closed
anatomo opened this issue Jan 17, 2018 · 0 comments


anatomo commented Jan 17, 2018

Hello @marevol.

After asking #1442, I decoded the mail files to UTF-8 before crawling, and then crawled those files.
However, the crawler still seems to parse the mail headers (the header part specifies a different encoding type).
Could you advise how to ignore Content-Transfer-Encoding or the mail headers?
(I want to crawl these files as text/plain, in UTF-8.)

When I crawl this file, the result looks good:
test1.txt

But when I crawl this file, the message part is not shown (the digest field does not contain the message part):
test2.txt

thanks.
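
This thread does not say whether the crawler itself can be told to ignore the mail headers. One possible workaround is to strip the headers before crawling, so that each file contains only the decoded body. Below is a minimal sketch using Python's standard `email` package. The function name `extract_plain_body` is hypothetical, and it assumes each file holds a single RFC 822 style message with a text/plain part.

```python
from email import message_from_bytes, policy


def extract_plain_body(raw: bytes) -> str:
    """Return the decoded text/plain body of a raw mail message, without headers.

    Hypothetical helper: assumes `raw` is one complete mail message.
    """
    # policy.default yields an EmailMessage, which provides get_body()/get_content().
    msg = message_from_bytes(raw, policy=policy.default)

    # Pick the text/plain part (the message itself if it is not multipart).
    part = msg.get_body(preferencelist=("plain",))
    if part is None:
        return ""

    # get_content() undoes the Content-Transfer-Encoding (base64,
    # quoted-printable, ...) and decodes the bytes with the declared charset.
    return part.get_content()
```

Writing the returned string to a new file encoded as UTF-8 would give a header-free text file. Whether the crawler then treats such a file as text/plain is an assumption that would need to be checked.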

anatomo closed this as completed Jan 25, 2018