Context
This is an alternative to using common_crawler to find a batch of URLs. Related to #54.
In addition to agency homepages (though likely more complicated), we could use data sources with record_type="List of Data Sources" to generate candidate URLs.
Requirements
- use our API to get data sources with record_type="List of Data Sources"
- select the most promising-seeming ones, or create an all-purpose crawler that is good at taking a website like that and collecting all the URLs on the page that might be data sources (see the sketch after this list)
  - it can be aggressive and produce false positives, because the end goal is to run everything through the identification pipeline
- generate a batch in Hugging Face
- run this crawler periodically, maybe monthly
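A minimal sketch of how the first three requirements could fit together. The API base URL, the /search endpoint, the record_type query parameter, and the source_url field are assumptions about how "List of Data Sources" records might be fetched, not the actual API; the crawler is deliberately aggressive (every outbound link on the page counts), and the batch is written to a CSV that could then be uploaded as a Hugging Face dataset.

```python
# Sketch only, assuming a hypothetical search endpoint and response shape.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

API_BASE = "https://data-sources.example.org/api"  # placeholder base URL


def fetch_list_pages():
    """Fetch data sources whose record_type is "List of Data Sources"."""
    resp = requests.get(
        f"{API_BASE}/search",  # hypothetical endpoint name
        params={"record_type": "List of Data Sources"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes each result carries a source_url field pointing at the list page.
    return [row["source_url"] for row in resp.json().get("data", []) if row.get("source_url")]


def extract_candidate_urls(page_url):
    """Aggressively collect every outbound link on a list page.

    False positives are fine: everything gets filtered later by the
    identification pipeline.
    """
    try:
        resp = requests.get(page_url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(resp.text, "html.parser")
    candidates = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])
        if urlparse(absolute).scheme in ("http", "https"):
            candidates.add(absolute)
    return candidates


if __name__ == "__main__":
    urls = set()
    for page in fetch_list_pages():
        urls |= extract_candidate_urls(page)
    # Write the batch to a CSV that can be uploaded to Hugging Face
    # (the dataset name and upload step are left out here).
    with open("candidate_urls.csv", "w") as f:
        f.write("url\n")
        f.writelines(f"{u}\n" for u in sorted(urls))
```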
Docs
What docs should be updated? Link to related docs changes in the PR.
Open questions
We should be mindful of duplicates! After the first time this runs, we're going to get some. In general, when using things like Common Crawl, we should avoid running duplicate URLs through the identification pipeline; one possible dedup approach is sketched below.
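A minimal dedup sketch, assuming previously submitted URLs are tracked in a plain text file (one URL per line); the real pipeline might instead track them in the database or in the Hugging Face dataset itself. The normalization rules here are just one reasonable choice.

```python
# Sketch only: filter out URLs already run through the identification pipeline.
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Normalize a URL so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",  # drop fragments
    ))


def new_urls_only(candidates, seen_file="seen_urls.txt"):
    """Return only candidates that have not been seen in a previous run."""
    try:
        with open(seen_file) as f:
            seen = {normalize(line) for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()
    fresh = {normalize(u) for u in candidates} - seen
    # Record the fresh URLs so the next monthly run skips them.
    with open(seen_file, "a") as f:
        f.writelines(f"{u}\n" for u in sorted(fresh))
    return fresh
```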