ENH: Expose date parsing arguments in read_html function #49553

Open
1 of 3 tasks
csala opened this issue Nov 6, 2022 · 3 comments
Labels
Datetime (Datetime data dtype), Enhancement, IO HTML (read_html, to_html, Styler.apply, Styler.applymap)

Comments


csala commented Nov 6, 2022

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

The current read_html function exposes the same parse_dates argument that read_csv has, but it does not expose the rest of the arguments that let the user control how dates are parsed (infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates).

Other arguments unrelated to date parsing may be in the same situation, so maybe this issue could be extended to cover them all.

Feature Description

These arguments, or at least some of them, could be exposed directly in read_html without much hassle, which would be very convenient for the user.

Alternative Solutions

Right now the only viable solution is to skip date parsing altogether during the data loading step and then manually implement the date parsing over the returned data frame.

The problem with this is that it breaks API uniformity with read_csv: integrations with different input data sources have to be implemented differently depending on the data format (a function call with arguments vs. a function call with arguments plus postprocessing). It may also skip any optimizations implemented in the read_csv workflow.
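For reference, the postprocessing workaround described above might look roughly like the sketch below. The helper name `parse_date_columns` and the sample data are hypothetical and only for illustration; the hand-built DataFrame stands in for one of the tables that read_html returns when date parsing is skipped.

```python
import pandas as pd


def parse_date_columns(df, columns, **to_datetime_kwargs):
    """Return a copy of df with the given columns converted to datetime.

    Hypothetical helper: extra keyword arguments (e.g. dayfirst) are
    forwarded unchanged to pd.to_datetime.
    """
    out = df.copy()
    for col in columns:
        out[col] = pd.to_datetime(out[col], **to_datetime_kwargs)
    return out


# Stand-in for a table loaded via pd.read_html(...) without parse_dates,
# so the date column is still plain strings (made-up sample values).
raw = pd.DataFrame({"when": ["06/11/2022", "27/12/2022"], "value": [1, 2]})

# Second step: parse the dates manually, here treating the first field as the day.
parsed = parse_date_columns(raw, ["when"], dayfirst=True)
```

This is the extra step that a read_csv call with the same arguments would not need, which is the asymmetry this issue is about.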

Additional Context

From what I could tell skimming over the code, the read_html function only adds a few layers of code on top of the underlying parser, which already supports all the mentioned arguments. parse_dates is simply pushed down to it untouched, letting the parser use the default values for all the other arguments in the list above.

csala added the Enhancement and Needs Triage (Issue that has not been reviewed by a pandas team member) labels Nov 6, 2022

csala commented Nov 6, 2022

I can actually confirm that the mentioned arguments work by simply adding them here:

def read_html(

And then passing them down here:

return _parse(

since they are automatically added to the generic **kwargs dict here:

def _parse(flavor, io, match, attrs, encoding, displayed_only, extract_links, **kwargs):

which is later pushed down until the corresponding parser reads them.
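The forwarding mechanism described above can be illustrated with a simplified, self-contained sketch. None of the functions below are the actual pandas code: the `*_sketch` names, their reduced signatures, and the dictionary return value are all assumptions made purely to show how a keyword argument added to the public function reaches the underlying parser through **kwargs.

```python
def _text_parser_sketch(data, parse_dates=False, dayfirst=False, date_parser=None):
    # Hypothetical stand-in for the underlying parser: it already accepts the
    # date-related options and here simply reports what it received.
    return {"parse_dates": parse_dates, "dayfirst": dayfirst, "date_parser": date_parser}


def _parse_sketch(flavor, io, match, **kwargs):
    # Hypothetical stand-in for the internal _parse: any keyword that is not
    # named explicitly is collected in kwargs and forwarded untouched.
    return _text_parser_sketch(io, **kwargs)


def read_html_sketch(io, match=".+", flavor=None, parse_dates=False, dayfirst=False):
    # Exposing a new argument only requires adding it to this signature and
    # passing it along in the call below.
    return _parse_sketch(flavor, io, match, parse_dates=parse_dates, dayfirst=dayfirst)


received = read_html_sketch("<table>...</table>", parse_dates=True, dayfirst=True)
```

In this sketch `dayfirst` reaches the parser without any change to the intermediate layer, which mirrors the observation in this comment; arguments that are not exposed (here `date_parser`) keep the parser's default.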

I'd be happy to make a PR if this is an acceptable change.

lithomas1 added the IO HTML and Datetime labels and removed the Needs Triage label Dec 27, 2022

MarcoGorelli (Member) commented

thanks for the request

date_format will be added as part of #51019, but for the others the advice will likely remain to parse as object and convert to datetime after that


csala commented Feb 7, 2023

@MarcoGorelli So after #51019, will the read_csv and read_html signatures be aligned? Or will they continue to behave differently?

To be honest, I am not in favor of forcing multiple steps (read the data first and then parse datetimes), but my concern was not so much about the date parsing step itself as about the fact that ingesting data via read_csv and read_html requires different steps.

In any case, if this is not going to be addressed, please feel free to close this issue.


3 participants