SelectorList.drop() removing elements doesn't work as expected #297

Open
dream2333 opened this issue Jun 8, 2024 · 5 comments · May be fixed by #298 or #299

Comments

@dream2333

dream2333 commented Jun 8, 2024

def parse_detail(self, response: HtmlResponse, item: DetailDataItem):
    selectors = response.jmespath("news.body")
    selectors.xpath(".//script|.//style").drop()
    item.content = selectors.xpath("string(.)").get().strip()
    yield item

I'm trying to remove the 'script' and 'style' elements from the selected content using selectors.xpath(".//script|.//style").drop(). However, even after this line runs, the 'style' element still exists in the DOM.

[Image: WeChat screenshot 20240609010908]

Here's the URL:
https://newsinfo.eastmoney.com/kuaixun/v2/api/content/getnews?newsid=202406083099747443&newstype=1

@dream2333
Author

Could someone help me understand why this is happening?

@dream2333
Author

I've figured out why this is happening. When a Selector is created from a JSON response in Scrapy, calling drop() on elements selected from it does not modify the DOM. If you extract the HTML text from the JSON and build a new Selector from it, the problem does not occur. This looks like a bug in Parsel's Selector implementation.

from parsel import Selector

# Extract the HTML string from the JSON, then parse it once as HTML
content = response.jmespath("news.body").get()
selector = Selector(text=content, type="html")
selector.xpath(".//script|.//style").drop()
item.content = selector.xpath("string(.)").get().strip()

@dream2333
Author

dream2333 commented Jun 9, 2024

When .xpath() is called on a text-type selector, the nodes it returns appear to be copies parsed from the text rather than nodes of the original root. As a result, calling .drop() on them does not affect the original HTML tree. This mostly happens when jmespath and xpath are used together.

This is quite subtle. For .drop() to take effect, we first need to call .xpath(".") to get a new HTML selector; only then does .drop() work as expected on nodes selected from it. This behavior is not intuitive and can easily cause confusion or unexpected results. I think it should either be changed or documented clearly.

selectors = json_selector.jmespath("news.body").xpath(".")
selectors.xpath(".//script|.//style").drop()
item.content = selectors.xpath("string(.)").get().strip()

@dream2333
Author

Refs #298
