Fails on groups with adult content #14

Open
RuinCakeLie opened this issue Mar 20, 2016 · 5 comments

@RuinCakeLie

I'm trying to scrape a group (https://groups.google.com/forum/#!topic/3dprintertipstricksreviews/) that has the adult content flag turned on. Unfortunately, even with cookies, all the _escaped_fragment_ requests return only:

Adult Content Warning

The Group you selected has been identified by its owner as containing adult content.
@icy
Owner

icy commented Mar 20, 2016

Interesting. I will take a look. Thanks for reporting.

@icy
Owner

icy commented Mar 22, 2016

Google returns empty content when _escaped_fragment_ is specified, e.g.

https://groups.google.com/forum/?_escaped_fragment_=forum/3dprintertipstricksreviews

This seems to go against (?) the AJAX crawling standard. We need a different way to get data from Google. This is a real challenge!
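
For reference, the URL above follows the hashbang-to-`_escaped_fragment_` mapping of the AJAX crawling scheme. Below is a minimal sketch of that mapping, assuming the page URL uses a `#!` fragment; the function name `to_escaped_fragment` is made up for illustration, and the escaping is deliberately conservative (everything except `/` and unreserved characters is percent-encoded).

```python
from urllib.parse import urlsplit, urlunsplit, quote


def to_escaped_fragment(url: str) -> str:
    """Map a '#!' (hashbang) URL to its '_escaped_fragment_' form.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url  # not a hashbang URL; leave it unchanged
    fragment = parts.fragment[1:]  # drop the leading '!'
    # Keep '/' readable, as in the URL quoted in this thread;
    # percent-encode other special characters.
    param = "_escaped_fragment_=" + quote(fragment, safe="/")
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(to_escaped_fragment(
    "https://groups.google.com/forum/#!forum/3dprintertipstricksreviews"
))
```

For the forum URL of this group, the sketch reproduces the `_escaped_fragment_` URL quoted above, which is the request that comes back empty.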

@icy
Owner

icy commented Mar 22, 2016

Google hides most email headers from the raw message. A raw message isn't actually raw ;)

See also https://groups.google.com/forum/message/raw?msg=3dprintertipstricksreviews/LDFZVHeC8Uk/2D1YhGqGDQAJ

Date: Sun, 20 Mar 2016 06:28:20 -0700 (PDT)
From: Rich Webb <ml...@rawebb.net>
To: 
    "3D Printer Tips, Tricks and Reviews" <3dprintertips...@googlegroups.com>
Message-Id: <d7e58e48-c160-436e-8bdf-10d86a0dc170@googlegroups.com>
Subject: Direction-dependent extrusion volume / track width?
MIME-Version: 1.0
Content-Type: multipart/mixed; 
    boundary="----=_Part_4198_351838098.1458480500604"

------=_Part_4198_351838098.1458480500604
Content-Type: multipart/alternative; 
    boundary="----=_Part_4199_1172380407.1458480500604"

------=_Part_4199_1172380407.1458480500604
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 7bit
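
One way to check which headers survive is to parse the raw text with Python's standard `email` package. This is only a sketch: the sample is a trimmed copy of the message above with a placeholder body, and the list of expected transport headers (`Received`, `Return-Path`) is an assumption about what a delivered message would normally carry, not something Google documents.

```python
from email import message_from_string

# Trimmed copy of the raw message shown above; the body is a placeholder.
RAW = """\
Date: Sun, 20 Mar 2016 06:28:20 -0700 (PDT)
Message-Id: <d7e58e48-c160-436e-8bdf-10d86a0dc170@googlegroups.com>
Subject: Direction-dependent extrusion volume / track width?
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8

(body omitted)
"""

# Assumption: headers a delivered mail would usually include.
EXPECTED = ("Received", "Return-Path")


def missing_headers(raw: str, expected=EXPECTED) -> list:
    """Return the expected header names that are absent from the message."""
    msg = message_from_string(raw)
    # Header lookup on a Message object is case-insensitive.
    return [name for name in expected if msg.get(name) is None]


print(missing_headers(RAW))
```

Both assumed headers are reported as missing for this sample, which matches the observation that the "raw" view is filtered.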

@icy
Owner

icy commented Mar 26, 2016

It's impossible to fetch data from this group with the traditional method. We need a higher-level tool such as phantomjs.

Well, after days of trying the scrolling method, I've finally found a way to automate the process. There are two other challenges, but they're definitely solvable.

Stay tuned!
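
The thread does not describe the scrolling method itself, so the following is only a guess at its general shape: keep scrolling until the number of loaded items stops growing. The helper `scroll_until_stable` and its parameters are hypothetical. With a real browser driver, `scroll` and `count_items` would be thin wrappers around the driver (for example, Selenium's `driver.execute_script(...)` to scroll and `driver.find_elements(...)` to count); here a fake page stands in so the sketch runs without a browser.

```python
import time
from typing import Callable


def scroll_until_stable(scroll: Callable[[], None],
                        count_items: Callable[[], int],
                        max_rounds: int = 50,
                        stable_rounds: int = 3,
                        pause: float = 0.0) -> int:
    """Scroll until the item count stays unchanged for `stable_rounds` rounds."""
    last = count_items()
    unchanged = 0
    for _ in range(max_rounds):  # hard cap so the loop always terminates
        scroll()
        if pause:
            time.sleep(pause)  # give lazily loaded content time to arrive
        current = count_items()
        if current > last:
            last, unchanged = current, 0
        else:
            unchanged += 1
            if unchanged >= stable_rounds:
                break
    return last


class FakePage:
    """Stand-in for a browser page that loads `step` items per scroll."""

    def __init__(self, total: int, step: int):
        self.total, self.step = total, step
        self.loaded = min(step, total)

    def scroll(self) -> None:
        self.loaded = min(self.loaded + self.step, self.total)

    def count(self) -> int:
        return self.loaded


page = FakePage(total=95, step=30)
print(scroll_until_stable(page.scroll, page.count))
```

Larger `pause` and `stable_rounds` values make a run more repeatable at the cost of speed, so the two have to be tuned against each other.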

@icybin

icybin commented Mar 30, 2018

I have some initial work on this issue, but (1) it's slow and (2) it's non-deterministic. Maybe I am not good at Selenium.

I'm hoping someone can help. I can raise a small fund to support you.

Thanks a lot

icy added the wontfix label May 6, 2021