Add option to retrieve text content #128

Open
frederik-elwert opened this issue Nov 16, 2018 · 9 comments

Comments

@frederik-elwert

As a Scrapy user, I often want to extract the text content of an element. The default option in parsel is to use either the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this leads to unwanted results. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>

>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But working around it takes extra effort, and it often frustrates new users. There is a series of questions about it on Stack Overflow as well as on the Scrapy bug tracker.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

  • Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
  • Or add a parameter to .extract*()/.get(), similar to the proposal in Add format_as to extract() methods #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.

@kmike
Member

kmike commented Nov 17, 2018

Hey @frederik-elwert! This is being worked on here: #127 :)

@kamrankausar

Please consider this a basic feature and add it.

@joecabezas

+1

@bblanchon

Any progress on this issue?

@kmike
Member

kmike commented Feb 10, 2022

Not much, but I merged master into #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup: more docs and tests.

@celsofranssa

Any progress on this issue?

@mhillebrand

This still hasn't been addressed?

@GeorgeA92
Contributor

One working option is to chain css() calls, applying a *::text query to the selector that contains the text we want to scrape.
Applied to the example HTML from the issue description, it looks like this:

from parsel import Selector

text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''

sel = Selector(text=text)

# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']

print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']

print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']

It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases without digging deep into lxml internals.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...

I just realized that Selector.root is the lxml HTML element created by create_root_node. This means that if the parser type is html, the text_content method mentioned above can be called on it (as can any other lxml method):

print(sel.root.text_content())
'''

This is the new trend!
Published by newbieon Sept 17


'''

Cases where a Selector query returns a SelectorList are a bit more complicated:

print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']

print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']

Binding lxml's text_content into the Selector and SelectorList types looks like the most practical approach here.

As far as I understand, both options mentioned above were technically available in 2018, when this ticket was created.

@mhillebrand

Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.
