Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

Add format_as to extract() methods #101

Closed
wants to merge 5 commits into from

Conversation

Granitosaurus
Copy link

No description provided.

@Granitosaurus
Copy link
Author

Also I've noticed self._tostring_method is redundant, in a way that self.type will always resolve to the same value.

@eliasdorneles
Copy link
Member

Hi @Granitosaurus ! 👋 :)

Been thinking about this, I don't like just passing along in this case, though I'm not sure if I can articulate precisely why, but let me try.
For starters, the methods receiving the arguments are different abstractions, so this leads me thinking that it will make maintenance quite more complex. Like, we don't have any control of which arguments may be added in the future to the tostring method. Their behavior may depend too much on the lxml version being used.
So I'd prefer not to make the same arguments that method take as part of the Parsel API simply passing them through.

I'd be in favor of either passing along explicitly an argument that we're interested (format=True for pretty printing) or adding a new dict argument lxml_format_options for the extra kwargs to be passed to tostring, which we can then document that are passed to lxml.etree.tostring.

This way, if in the future Parsel gets another backend or something (as suggested in #29), the API will be consistent.

Another reason a direct passthrough isn't a good idea is that it would allow to override the values being set by Parsel, which doesn't look that great -- is this needed in some use case?
If it's really needed, than we should make the argument name to be lxml_format_options_overrides to make explicit they're overriding what Parsel thinks it's true. :)

About the _tostring_method being redundant (as in having same value as self.type), however they are two different concepts and it makes sense to have them decoupled (for example, in the future we can ).

Does this make sense?

@Granitosaurus
Copy link
Author

Hey Elias,
Yeah your explanation makes sense to me 👌

Personally I just want to get pretty_print working, however sometimes it's necessary to change method from hmtl to xml for pretty formatting to work properly.

So I feel that these two arguments pretty_print and method would be great for passthrough, but wrapping it in some way to be api compliant as you recommended is perfectly fine by me.

I'd be in favor of either passing along explicitly an argument that we're interested (format=True for pretty printing)

I feel that together with format, method also needs to be a kwarg. Since if tostring method kwarg doesn't match Etree method it fails to pretty print:

>>> foo = """<div><div>wow much text<span>I'm inside!</span></div><div>bro</div></div>"""   
>>> print(etree.tostring(etree.fromstring(foo), pretty_print=True, method='xml').decode('
... utf8'))
<div>
  <div>wow much text<span>I'm inside!</span></div>
  <div>bro</div>
</div>


>>> print(etree.tostring(etree.fromstring(foo), pretty_print=True, method='html').decode(
... 'utf8'))
<div>
<div>wow much text<span>I'm inside!</span>
</div>
<div>bro</div>
</div>

Real case:

sel.xpath('//div').extract(format=True) 
# will default to html here eventhough etree thinks it's xml and will not pretty print

All in all I feel that one of these things would be a great addition to parsel.Selector.extract() method:

  1. expose method and format (format=True == pretty_print=True in this case)
  2. Handle method internally - If there is some way to check Etree's method it could be matched in tostring but I'm not sure whether Etree objects have it defined somewhere (couldn't find anything on it).
  3. Add lxml_format_options kwarg - I like this one but I don't know anything about other API's, if parsel were to get few more then it would get convoluted pretty quickly if.

@eliasdorneles
Copy link
Member

So maybe it could be a new keyword argument format=None, that also accepts format='xml' and format='html', what do you think?

@Granitosaurus
Copy link
Author

Uuu, that's clever!
I'll give it a go and see how it turns out.

@Granitosaurus Granitosaurus changed the title add kwargs passthrough to extract() methods WIP: add kwargs passthrough to extract() methods Oct 17, 2017
@Granitosaurus
Copy link
Author

I've updated the PR to reflect the discussed change:

>>> from parsel import Selector
>>> sel = Selector(text='<div><span>foo</span><span>bar</span></div>')
>>> print(sel.extract(format='xml'))
<html>
  <body>
    <div>
      <span>foo</span>
      <span>bar</span>
    </div>
  </body>
</html>
>>> print(sel.extract(format='html'))
<html><body><div>
<span>foo</span><span>bar</span>
</div></body></html>

>>> sel_list = SelectorList([sel])
>>> sel_list.extract()
['<html><body><div><span>foo</span><span>bar</span></div></body></html>']
>>> sel_list.extract(format='xml')
['<html>\n  <body>\n    <div>\n      <span>foo</span>\n      <span>bar</span>\n    </div>\n  </body>\n</html>\n']
>>> sel_list.extract(format='html')
['<html><body><div>\n<span>foo</span><span>bar</span>\n</div></body></html>\n']

However format is reserved by python, so maybe it would be more idiomatic to change it to some other keyword? Or fallback to format_ as per pep8 recommendation:

single_trailing_underscore_: used by convention to avoid conflicts with Python keyword,

@codecov
Copy link

codecov bot commented Oct 17, 2017

Codecov Report

Merging #101 into master will not change coverage.
The diff coverage is 100%.

Impacted file tree graph

@@          Coverage Diff          @@
##           master   #101   +/-   ##
=====================================
  Coverage     100%   100%           
=====================================
  Files           5      5           
  Lines         248    260   +12     
  Branches       46     49    +3     
=====================================
+ Hits          248    260   +12
Impacted Files Coverage Δ
parsel/selector.py 100% <100%> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 68d64db...bb4b38b. Read the comment docs.

@eliasdorneles
Copy link
Member

However format is reserved by python, so maybe it would be more idiomatic to change it to some other keyword? Or fallback to format_ as per pep8 recommendation:

oh yeah, I had forgotten, sorry about that! 😆

how about format_as ?

@Granitosaurus
Copy link
Author

updated the keyword to format_as :)

Copy link
Member

@eliasdorneles eliasdorneles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it's now in the right direction, but it needs some tests. :)

method=format_as or self._tostring_method,
encoding='unicode',
with_tail=False,
pretty_print=bool(format_as),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't look right, it's using the boolean value of format_as.
Can you please add tests for the full behavior? :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eliasdorneles the logic behind this is that if format_as is explicitly specified then it should pretty_print as well. Alternatively pretty_print argument should be added separately I guess

@ghost
Copy link

ghost commented Feb 27, 2018

Hey @Granitosaurus - this is a neat little feature! Are you still interested in implementing it? If not, perhaps we can still use your commits as part of an implementation?

@Granitosaurus
Copy link
Author

Hey @cathalgarvey,
I complete forgot about this PR, I'll continue the work today, sorry.

@eliasdorneles eliasdorneles changed the title WIP: add kwargs passthrough to extract() methods Add format_as to extract() methods Jun 22, 2018
@eliasdorneles
Copy link
Member

Sorry for the long time to re-review!
This looks mostly good now, but needs solving the conflicts.

@kmike how do you feel about this one? Usage will look like sel.get(format_as='xml') / sel.get(format_as='html')

@kmike
Copy link
Member

kmike commented Jun 22, 2018

Hey! I'm not sure I understand it properly - how should user decide between format_as='xml' and format_as='html'? Try different options until result is reasonable? Can we do it automatically somehow? From API point of view, it seems the intention is just .get(pretty=True).

@rodp63
Copy link

rodp63 commented Jan 15, 2019

It's been a long time since the last comment!
I consider that the fact of adding the option format_as is just for a better visualization of the text, either html or xml. I agree with @kmike to just add the option pretty=True/False. What do you think?

@Gallaecio
Copy link
Member

@Granitosaurus Any thoughts? I see there’s been quite some back and forward.

@Gallaecio Gallaecio closed this Sep 17, 2019
# for free to join this conversation on GitHub. Already have an account? # to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants