Make sel.xpath('.') work the same for text elements #130

Open
Gallaecio opened this issue Dec 18, 2018 · 2 comments

@Gallaecio
Member

Gallaecio commented Dec 18, 2018

Given:

>>> from parsel import Selector
>>> sel = Selector(text=u"""<html>
...         <body>
...             <h1>Hello, Parsel!</h1>
...         </body>
...         </html>""")

For text, you get:

>>> subsel = sel.css('h1::text')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1/text()' data=u'Hello, Parsel!'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[]

However, regular elements work as you would expect:

>>> subsel = sel.css('h1')
>>> subsel
[<Selector xpath=u'descendant-or-self::h1' data=u'<h1>Hello, Parsel!</h1>'>]
>>> subsubsel = subsel.xpath('.')
>>> subsubsel
[<Selector xpath='.' data=u'<h1>Hello, Parsel!</h1>'>]

I believe text nodes should behave the same way: '.' should select the current node when that node is a text node.

@redapple
Contributor

redapple commented Dec 18, 2018

Hey @Gallaecio, I'd also like to see this.
That said, I believe the limitation comes from lxml rather than from libxml2 or parsel. lxml text results do not accept further XPath calls: the only navigation available is .getparent() on "smart string" results, and smart strings are disabled by default in parsel. libxml2 itself does allow XPath operations on text nodes:

>>> import libxml2
>>> doc = libxml2.htmlParseDoc('''<html>
... <head>
... <meta charset="UTF-8">
... <title>Title of the document</title>
... </head>
... 
... <body>
... Content of the document......
... </body>
... 
... </html>''', 'ascii')
>>> doc
<xmlDoc (None) object at 0x7ff070272680>
>>> ctxt = doc.xpathNewContext()
>>> res = ctxt.xpathEval("//text()")
>>> res
[<xmlNode (text) object at 0x7ff0702a2560>, <xmlNode (text) object at 0x7ff071d95320>]
>>> res[0].get_content()
'Title of the document'
>>> for t in res:
...     print(t.xpathEval("parent::*"))
... 
[<xmlNode (title) object at 0x7ff07025e7e8>]
[<xmlNode (body) object at 0x7ff07025e878>]

If you know Cython, supporting this could be a nice addition to lxml.
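
For comparison, here is a minimal sketch of the lxml side of this, assuming a reasonably recent lxml and using its public `xpath()` method with the `smart_strings` keyword. The variable names are illustrative only, and this is not how parsel is implemented internally; it just shows that a text result from lxml is a string (optionally with `getparent()`), not a node you can run XPath on.

```python
from lxml import etree

root = etree.fromstring("<html><body><h1>Hello, Parsel!</h1></body></html>")

# Default behaviour: text() results are "smart strings", i.e. str subclasses
# that remember which element they came from.
smart = root.xpath("//h1/text()")[0]
parent_tag = smart.getparent().tag          # the element that owns the text
smart_has_xpath = hasattr(smart, "xpath")   # the string itself has no XPath API

# With smart_strings=False (the setting the comment above attributes to
# parsel), the result is a plain string with no link back to the tree.
plain = root.xpath("//h1/text()", smart_strings=False)[0]
plain_has_getparent = hasattr(plain, "getparent")

print(parent_tag, smart_has_xpath, plain_has_getparent)
```

With smart strings enabled, going through `getparent()` and re-selecting `text()` might be one way to emulate '.', but that is only a possible workaround, not something the thread confirms.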
