-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? # to your account
RFC: limited time-range #6
Comments
I totally agree that additional filtering which prevents unnecessary crawling is needed. If a lot of plugins support filtering by date, global arguments to specify a timerange could be considered and should cover all timerange specific filtering needs. Currently I'd prefer per plugin arguments. I don't think those would clutter the namespace, since it's a separated namespace local to the plugin. You could put anything in there. Of course it would be nice from a user's perspective, if arguments that do the same thing would be consistently named across all plugins that use it. Codewise this would merely be a convention. Considering your smartbroker example, I'd go with the one option that covers all other cases: startdate and enddate as this covers the "last x days" option already. For start- and enddate argument names, I'd propose: @click.option(
"-f",
"from",
type=docdl.types.ClickParsedDateTime(),
envvar="DOCDL_SMARTBROKER_FROM",
show_envvar=True,
default=None,
help="only process documents newer than this"
)
@click.option(
"-u",
"until",
type=docdl.types.ClickParsedDateTime(),
envvar="DOCDL_SMARTBROKER_UNTIL",
show_envvar=True,
default=None,
help="only process documents older than this"
) I'd create a custom global click.ParamType that parses all sorts of date/time formats to a datetime object. Then plugins could just use this. This "convention" for temporal arguments could even apply to the amazon plugin when just taking the year from any of the two --from/--until datetime objects. Do you think that could cover your (future) use cases? Am I missing something or overthinking it? I also think varying plugin argument names like "--year", "--start-date", "--new", etc. (Implementing "--search-keyword" for plugins that support filtering by keyword search would also be nice, btw.) |
My main prospective use case is regularly downloading documents. Like a few times per month, always fetching documents from within the last 30 days, so the less overhead the better. Thinking further, I'd like rsync-like behaviour: Only download new documents, don't overwrite existing ones with the same name. Possibly even with a configuration file to determine to which directory files from a certain plugin should be saved and how to obtain login data (keyring) for that job. The parameters -f and -u are both used at top level already. I'd like to avoid reusing them since from user perspective it's confusing when an option -x at different positions does different stuff. How about "-l" / "--limit-to" or "--limit-time" which parses e.g.:
For parsing dt and deltas, khal does a lot of that: https://github.com/pimutils/khal/blob/master/khal/parse_datetime.py Regarding conventions: When not "enforcing" them (which in my experience leads to the cleanest results), I'd propose to clearly document them either in the Readme or some other file. We've already got the cases "call report", and temporal parameters. With Commerzbank I've got a lot of different document types, i.e. categories. => https://github.com/sdx23/document-dl/blob/conventions/CONVENTIONS.md ? |
I do the same and I feel the additional crawling time is neglectable since the process runs unattended, only once a month and selenium is very slow by itself. Only launching selenium makes at least about 50% of the runtime in my case.
That's the beauty of having a main context and a plugin context.
I don't think so. There are separate help texts and there is no way to confuse them. Other programs do that a lot (e.g. "git -c" vs. "git commit -c" vs. "git branch -c")
I don't like the idea of adding a ton of complexity just to save one commandline option. Parsing timedeltas is hugely complex and although docdl.util.dateparser is naive on it's own, it's based on dateutil() and strptime() and never meant to parse timeranges. Just use two options and assume sane defaults if one is missing. Instead of a timedelta, just calculate start & enddate (i'm doing this using the GNU "date" command which was solely made for this purpose.)
Documentation is scarce at the moment and there still might be a lot of volatility in the code. Besides from that, I see no reason against additional documentation.
It'd be nice if plugins could report their document categories but I didn't implement it because the plugin doesn't necessarily know them until after the scrape. Plugins are free to use the strings provided by the webportal service provider (like it's done in the o2.de plugin). The web portal changing the category string would imply additional maintenance overhead for those plugins. I'd suggest to start with a list of mandatory and optional attributes that plugins should (or can) add to documents.
Traditionally a "Plugins" section in some DEVELOP.md could be used for that. It could be linked from the README.md and the "Writing a plugin" section could be moved there. |
Let me first say that I totally agree with the statement in #4 that we should take care not to clutter the namespace [1]. After all, I see unification/standardization as one big aspect of this project, that distinguishes it from me hacking a short standalone selenium-script for just the websites/services I need.
As well, I do agree that adding a special "--year" switch or the like is redundant with the ability of jq querying (and that also put me at unease with my suggestion in #4). But I also see the point of dates being something special since they are (possibly the main) restriction on which documents to process.
That is important on the one hand for speed / not doing a lot of useless work (see #4). This is relevant for regular downloading as mentioned in #1: when the script runs periodically once a month, it is surely fine to only (try) to download documents from within the last 60 days.
On the other hand, the scraped website itself may raise that question (and that's why I bring up this topic again). I'm currently developing a plugin for smartbroker [2], which displays the postbox by default as a search form [3]. It allows selecting predefined ranges (the last x days with x in [10,30,...360] or alternatively specifying your own range. So I could either -- quite arbitrarily -- select "last 360 days" or do something (possibly stupid?) like setting the range to 1970-01-01 til today.
Now this is specific to the plugin in question, and in principle I'd just leave it as is (360 days) and possibly specify an additional option making all documents available by forcing "1970-01-01 til today". But from a user perspective it might get confusing, what one must do for which plugin to work as expected from the experience with others.
Not sure I'm overthinking this; this can still be changed at some later point. Nevertheless I wanted to bring this to attention and ask for comments.
[1] sidenote, that is offtopic here: raises the question whether short cli options should be discouraged in plugins
![2022-01-07-180639_624x359_scrot](https://user-images.githubusercontent.com/1408276/148580134-c65eae7f-23e7-485c-80f2-3a5fa53607d5.png)
[2] https://github.com/sdx23/document-dl/tree/smartbroker
[3]
The text was updated successfully, but these errors were encountered: