Add option to retrieve text content #128
Comments
Hey @frederik-elwert! This is being worked on here: #127 :) |
Please consider this a basic feature and add it. |
+1 |
Any progress on this issue? |
Not much, but I merged master into #127 yesterday, so the PR is up to date now. I think it is ready feature-wise, and I'm happy with the implementation. But it needs some cleanup: more docs and tests. |
Any progress on this issue? |
This still hasn't been addressed? |
One working option is to chain css calls:
from parsel import Selector
text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''
sel = Selector(text=text)
# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']
print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']
print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']
It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases (without digging deep into lxml internals).
I just realized that Selector.root is lxml's HTML object, so its text_content() method can be called directly:
print(sel.root.text_content())
'''
This is the new trend!
Published by newbieon Sept 17
'''
The same works for the results of a Selector query:
print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']
print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']
Applying bind to lxml's
As far as I understand, both options mentioned above were technically applicable in 2018, when this ticket was created. |
Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its |
As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to use either the ::text pseudo-element or the XPath text() node test. Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:
With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often frustrates new users. There is a series of questions on Stack Overflow as well as on the scrapy bug tracker:
lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:
1. .extract_text() / .get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
2. .extract*() / .get(), similar to the proposal in "Add format_as to extract() methods" (#101). This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.
Would such an addition be welcome? I could prepare a patch.