Context
This is an alternative to using common_crawler to find a batch of URLs. Related to #54.
In addition to agency homepages (though likely more complicated), we could use data sources with record_type="List of Data Sources" to generate candidate URLs.
Requirements
- use our API to get data sources with record_type="List of Data Sources"
- select the most promising-seeming ones, or create an all-purpose crawler that is good at taking a website like that and collecting all the URLs on the page that might be data sources (see the sketch after this list)
  - it can be aggressive and produce false positives, because the end goal is to run everything through the identification pipeline
- generate a batch in Hugging Face
- run this crawler periodically, maybe monthly
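A minimal sketch of how the first three requirements could fit together. The API base URL, the /search endpoint, the record_type query parameter, and the source_url field are assumptions about how "List of Data Sources" records might be fetched, not the actual API; the crawler is deliberately aggressive (every outbound link on the page counts), and the batch is written to a CSV that could then be uploaded as a Hugging Face dataset.

```python
# Sketch only, assuming a hypothetical search endpoint and response shape.
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

API_BASE = "https://data-sources.example.org/api"  # placeholder base URL


def fetch_list_pages():
    """Fetch data sources whose record_type is "List of Data Sources"."""
    resp = requests.get(
        f"{API_BASE}/search",  # hypothetical endpoint name
        params={"record_type": "List of Data Sources"},
        timeout=30,
    )
    resp.raise_for_status()
    # Assumes each result carries a source_url field pointing at the list page.
    return [row["source_url"] for row in resp.json().get("data", []) if row.get("source_url")]


def extract_candidate_urls(page_url):
    """Aggressively collect every outbound link on a list page.

    False positives are fine: everything gets filtered later by the
    identification pipeline.
    """
    try:
        resp = requests.get(page_url, timeout=30)
        resp.raise_for_status()
    except requests.RequestException:
        return set()
    soup = BeautifulSoup(resp.text, "html.parser")
    candidates = set()
    for a in soup.find_all("a", href=True):
        absolute = urljoin(page_url, a["href"])
        if urlparse(absolute).scheme in ("http", "https"):
            candidates.add(absolute)
    return candidates


if __name__ == "__main__":
    urls = set()
    for page in fetch_list_pages():
        urls |= extract_candidate_urls(page)
    # Write the batch to a CSV that can be uploaded to Hugging Face
    # (the dataset name and upload step are left out here).
    with open("candidate_urls.csv", "w") as f:
        f.write("url\n")
        f.writelines(f"{u}\n" for u in sorted(urls))
```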
Docs
What docs should be updated? Link to related docs changes in the PR.
Open questions
We should be mindful of duplicates! After the first time this runs, we're going to get some. In general, when using things like Common Crawl, we should avoid running duplicate URLs through the identification pipeline; one possible dedup approach is sketched below.
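A minimal dedup sketch, assuming previously submitted URLs are tracked in a plain text file (one URL per line); the real pipeline might instead track them in the database or in the Hugging Face dataset itself. The normalization rules here are just one reasonable choice.

```python
# Sketch only: filter out URLs already run through the identification pipeline.
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Normalize a URL so trivial variants compare equal."""
    parts = urlsplit(url.strip())
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.rstrip("/") or "/",
        parts.query,
        "",  # drop fragments
    ))


def new_urls_only(candidates, seen_file="seen_urls.txt"):
    """Return only candidates that have not been seen in a previous run."""
    try:
        with open(seen_file) as f:
            seen = {normalize(line) for line in f if line.strip()}
    except FileNotFoundError:
        seen = set()
    fresh = {normalize(u) for u in candidates} - seen
    # Record the fresh URLs so the next monthly run skips them.
    with open(seen_file, "a") as f:
        f.writelines(f"{u}\n" for u in sorted(fresh))
    return fresh
```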