
Integrate Agency Labeling Logic #139

Open
maxachis opened this issue Jan 20, 2025 · 5 comments

Comments

maxachis (Collaborator)

Set up endpoint to pull URLs that did not have an agency identified by a collector, for manual labeling.

maxachis (Collaborator, Author) commented Feb 1, 2025

So this task will have several components:

  1. We'll need to determine if a particular URL has information that allows us to easily connect it with an existing agency. This is very much dependent on the collector metadata that URL has, which is itself dependent on what batch it comes from (for example, Common Crawler provides no collector metadata, while CKAN provides data on agency names).
  2. We'll need to call the existing /api/match/agencies endpoint of the Data Sources app and see if we receive an agency suggestion.
  3. We'll need to determine how to handle receiving an agency suggestion -- should we consider that an automatic match, or require manual confirmation?
  4. Similarly, if we receive multiple agency suggestions, we will absolutely need some means to determine relevancy. While a machine learning solution might make sense at some point in the future, at the moment it's probably easier to just do manual labeling.
  5. We'll need to have logic for when there appears to be a new agency that does not currently exist in our database. That's probably an issue on its own, and unsure if that's a Data Sources App issue, a Source Collector App issue, or both.
  6. We'll need another Retool component for annotating based on URLs. This one will be a little more complex, as we'll need to be able to look up existing agencies but also potentially mark certain URLs as containing novel agencies, which we may need to submit.

maxachis (Collaborator, Author) commented Feb 1, 2025

So I'll start off looking at step 1 of this process, for each of our collectors:

  • Common Crawler: No collector metadata. Must be marked for manual identification.
  • CKAN: Agency name included. Can be piped to /api/match/agencies.
  • AutoGoogler: Query can contain information, but queries can theoretically be for multiple things, so they would need to be contextualized.
  • Muckrock: Includes an agency ID which will need to be separately looked up at https://www.muckrock.com/api_v1/agency/{agency_id}, as seen here.

CKAN and MuckRock are candidates for automatic labeling (even as just a pre-annotation), while AutoGoogler and Common Crawler are not.
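That split can be sketched as a small routing helper. The collector name strings below are assumptions for illustration, not necessarily the identifiers used in the codebase:

```python
# Route each collector to an agency-labeling strategy, per the breakdown above.
# "auto" means the collector's metadata can be piped to /api/match/agencies
# (possibly after a lookup, as with MuckRock); "manual" means human labeling.

AUTO_LABEL_COLLECTORS = {"ckan", "muckrock"}          # agency info in metadata
MANUAL_LABEL_COLLECTORS = {"common_crawler", "auto_googler"}

def labeling_route(collector: str) -> str:
    """Return 'auto' or 'manual' for a given collector name."""
    name = collector.lower()
    if name in AUTO_LABEL_COLLECTORS:
        return "auto"
    # Unknown collectors default to manual review rather than a bad auto-label.
    return "manual"
```

Defaulting unknown collectors to manual review keeps a new collector from silently producing bad auto-labels.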

josh-chamberlain (Contributor) commented Feb 5, 2025

@maxachis

AutoGoogler: Query can contain information, but queries can theoretically be for multiple things, so they would need to be contextualized.

we could limit things so that it's one agency per query, and multi-agency queries are just multiple queries, if that helps reduce complexity. seems like a fair enough tradeoff

CKAN and MuckRock are candidates for automatic labeling (even as just a pre-annotation), while AutoGoogler and Common Crawler are not.

Again, limitations might be helpful here—if AutoGoogler's search terms are focused enough, those could come through as pre-annotations. i.e. a query looking like arrest records in pittsburgh, pa could create a pre-annotation. Unless there's some other reason AutoGoogler is not a candidate that I'm not expecting!
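If queries were constrained like that, the pre-annotation extraction could be nearly mechanical. A rough sketch, assuming queries follow a "topic in locality, state" shape (an assumption; real AutoGoogler queries may not be this regular):

```python
import re

# Assumed query shape: "<record type> in <locality>, <two-letter state>".
QUERY_PATTERN = re.compile(
    r"^(?P<topic>.+?) in (?P<locality>[^,]+), (?P<state>\w{2})$",
    re.IGNORECASE,
)

def parse_query(query: str):
    """Return (locality, STATE) if the query matches the assumed shape,
    else None; the location pair could then seed /api/match/agencies."""
    m = QUERY_PATTERN.match(query.strip())
    if m is None:
        return None
    return m.group("locality").strip(), m.group("state").upper()
```

Queries that don't match the shape would simply fall back to manual labeling.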

maxachis (Collaborator, Author) commented Feb 7, 2025

we could limit things so that it's one agency per query, and multi-agency queries are just multiple queries, if that helps reduce complexity. seems like a fair enough tradeoff
Again, limitations might be helpful here—if AutoGoogler's search terms are focused enough, those could come through as pre-annotations. i.e. a query looking like arrest records in pittsburgh, pa could create pre-annotation. Unless there's some other reason AutoGoogler is not a candidate that I'm not expecting!

@josh-chamberlain So the way I'm envisioning it is already one agency per query, so we've got that covered. When I said "queries can theoretically be for multiple things", I meant that we might use the AutoGoogler for reasons beyond searching for agencies. Although on reflection that doesn't in theory preclude what you're talking about.

The main difficulty with AutoGoogler is that AutoGoogler can help us find a URL for an agency, but there's not a reliable way to determine the name or location data from the URL or its HTML content. At minimum, we'd need to do some Natural Language Processing with a wide latitude for the different ways the data could be presented. That could be a fruitful direction to pursue, but it wouldn't be low-hanging fruit.

Narrowing the search queries could help, but would still present uncertainty: If I search for "Loremtown, Ipsum County, Massachusetts Police Department", and I look at the first result, how do I know that it is a homepage for the relevant police department? How do I determine what the name is so I can look it up in our database? It's not an impossible task, but like I said, not low-hanging fruit.

maxachis (Collaborator, Author) commented Feb 7, 2025

Design Thinking

Higher Level Thinking


Because we're using multiple databases, and content from Data Sources DB (users, agencies) is used in Source Collector DB (in this use case alone, for annotations), I had to figure out how to handle that. I considered a few options:

  1. Use PostgreSQL's Foreign Data Wrapper (FDW) to query DS from SC. Rejected because this would be hell to test in isolation.
  2. Query for agency suggestions via /match and store each suggestion individually for a given URL. Rejected because the same agency could be used for multiple URLs, so this would be wasteful.
  3. Just copy our agency and user data completely over to SC from DS. Rejected because most users and many agencies will never be involved, so this would be wasteful.
  4. Query agencies initially via /match and store agencies we don't yet have in a secondary table in SC, which are then used when referencing agency suggestions. Effectively like a cache.

I opted for 4. This does introduce some data redundancy and the risk of data becoming stale. But agencies and users are likely to remain relatively static, and we can figure out ways to deal with staleness later, once we've refined the prototype.
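A minimal sketch of option 4's cache-on-miss behavior, with a plain dict standing in for the SC agencies table and `match_agencies` standing in for the real /api/match/agencies call (both are stand-ins, not the actual implementation):

```python
from datetime import datetime, timezone

def cache_suggestions(url: str, local_agencies: dict, match_agencies) -> list:
    """Fetch agency suggestions for a URL, inserting any agencies not yet
    present in the local SC cache, and return the suggested agency IDs.

    match_agencies is assumed to return a list of dicts, each with at
    least an "agency_id" key, mirroring the /match response shape.
    """
    suggestions = match_agencies(url)
    for agency in suggestions:
        if agency["agency_id"] not in local_agencies:
            # Cache miss: store the agency locally, stamped for staleness checks.
            local_agencies[agency["agency_id"]] = {
                **agency,
                "updated_at": datetime.now(timezone.utc),
            }
    return [a["agency_id"] for a in suggestions]
```

Because the same agency can be suggested for many URLs, each agency row is written once and referenced thereafter, which is the space saving over option 2.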

Handling New Agencies

If an agency cannot be determined, we have a candidate for submitting a new agency. This part is not well-defined, but my sense is that we would need to simply submit a new agency, have that be approved, and then pull up that agency in the match endpoint.

Table Design

agencies

A partial mirror/cache of agency data.

  • agency_id: Deliberately not named id to emphasize that it is not an autogenerated value but a reference to the agency in DS. Primary key with uniqueness enforced.
  • name
  • state: if applicable
  • county: if applicable
  • locality: if applicable
  • updated_at: Indicates when the agency was retrieved/populated.

confirmed_url_agency

Used when an agency has been confirmed as associated with a URL.

  • id: Auto-generated link ID
  • agency_id: Foreign key to local agencies table
  • url_id: Foreign key to url table
  • Unique constraint on (agency_id, url_id)

automated_url_agency_suggestions

Suggestions automatically retrieved for a URL

  • id: Auto-generated ID
  • agency_id: Foreign key to local agencies table, denoting the suggested agency
  • url_id: Foreign key to url
  • is_unknown: True if auto-identifier cannot determine prospective agencies, null otherwise
  • Unique constraint on (agency_id, url_id)

user_url_agency_suggestions

User suggestions for an agency for a URL

  • id: Auto-generated ID
  • agency_id: Foreign key to local agencies table, denoting the suggested agency. Nullable if none found.
  • url_id: Foreign key to url
  • user_id: Foreign key referencing DS Database's user table (not a local user table, which does not exist)
  • is_new: True if user believes the url is for a new agency, null otherwise
  • Unique constraint on (agency_id, url_id, user_id)
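The four tables above, sketched as DDL. This runs against an in-memory SQLite database purely for illustration; the real schema would be PostgreSQL, and the column types and the url/user foreign-key targets are assumptions from the descriptions:

```python
import sqlite3

DDL = """
CREATE TABLE agencies (
    agency_id INTEGER PRIMARY KEY,   -- DS agency ID, deliberately not autogenerated
    name TEXT NOT NULL,
    state TEXT,                      -- if applicable
    county TEXT,                     -- if applicable
    locality TEXT,                   -- if applicable
    updated_at TEXT                  -- when the agency was retrieved/populated
);
CREATE TABLE confirmed_url_agency (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),
    url_id INTEGER,                  -- FK to the url table (not shown here)
    UNIQUE (agency_id, url_id)
);
CREATE TABLE automated_url_agency_suggestions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),
    url_id INTEGER,
    is_unknown BOOLEAN,              -- true if no prospective agency found
    UNIQUE (agency_id, url_id)
);
CREATE TABLE user_url_agency_suggestions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),  -- nullable
    url_id INTEGER,
    user_id INTEGER,                 -- references DS user table; no local copy
    is_new BOOLEAN,                  -- true if user believes the agency is new
    UNIQUE (agency_id, url_id, user_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```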

Workflow

  1. The Agency Identification Task identifies URLs without automated agency suggestions and retrieves potential agencies.
    A. If an agency is 100% determined through this process, the agency-url link is confirmed (via its presence in the confirmed_url_agency table).
    B. Otherwise, a suite of agency auto-suggestions (up to 10) are provided for the URL, or it is listed as unknown.
  2. Users will receive URLs they have not yet provided a suggested agency for.
    A. Users will either select one of the provided suggestions, provide their own suggestion, or mark the URL as belonging to a NEW agency.
    B. Agencies marked NEW will need to be handled in a separate workflow -- that is beyond the scope of this issue.
  3. Later parts of the workflow will be done in a future issue
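Step 1's triage logic might look roughly like this, with plain dicts and a set standing in for the tables, and a confidence score assumed to come back from the matcher:

```python
MAX_SUGGESTIONS = 10  # cap from step 1.B above

def triage_url(url_id, suggestions, confirmed, auto_suggestions, unknown):
    """Route one URL through step 1 of the workflow.

    suggestions: list of (agency_id, confidence) pairs, confidence in [0, 1];
    the score is an assumed output of the matching process, not a real field.
    """
    certain = [a for a, conf in suggestions if conf >= 1.0]
    if len(certain) == 1:
        # 100% determined: confirm the agency-url link directly.
        confirmed[url_id] = certain[0]
    elif suggestions:
        # Otherwise store up to 10 auto-suggestions, best first.
        top = sorted(suggestions, key=lambda s: s[1], reverse=True)
        auto_suggestions[url_id] = [a for a, _ in top[:MAX_SUGGESTIONS]]
    else:
        # No candidates at all: list the URL as unknown.
        unknown.add(url_id)
```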

Future Developments

  • Will need to make an issue for handling regular database synchronization to attend to stale data. Possibly a scheduled background job.
  • Determine, in coordination with possible changes to the DS App, how to create new agencies and then assign them to URLs given a NEW suggestion.
