
Integrate Agency Labeling Logic #139

Open
maxachis opened this issue Jan 20, 2025 · 5 comments

Comments

maxachis (Collaborator)

Set up endpoint to pull URLs that did not have an agency identified by a collector, for manual labeling.

maxachis (Collaborator, Author) commented Feb 1, 2025

So this task will have several components:

  1. We'll need to determine if a particular URL has information that allows us to easily connect it with an existing agency. This is very much dependent on the collector metadata that URL has, which is itself dependent on what batch it comes from (for example, Common Crawler provides no collector metadata, while CKAN provides data on agency names).
  2. We'll need to call the existing /api/match/agencies endpoint of the Data Sources app and see if we receive an agency suggestion.
  3. We'll need to determine how to handle receiving an agency suggestion -- should we consider that an automatic match, or require manual confirmation?
  4. Similarly, if we receive multiple agency suggestions, we will absolutely need some means to determine relevancy. While a machine learning solution might make sense at some point in the future, at the moment it's probably easier to just do manual labeling.
  5. We'll need to have logic for when there appears to be a new agency that does not currently exist in our database. That's probably an issue on its own, and unsure if that's a Data Sources App issue, a Source Collector App issue, or both.
  6. We'll need another Retool component for annotating based on URLs. This one will be a little more complex, as we'll need to be able to look up existing agencies but also potentially mark certain URLs as containing novel agencies, which we may need to submit.

maxachis (Collaborator, Author) commented Feb 1, 2025

So I'll start off looking at step 1 of this process, for each of our collectors:

  • Common Crawler: No collector metadata. Must be marked for manual identification.
  • CKAN: Agency name included. Can be piped to /api/match/agencies.
  • AutoGoogler: Query can contain information, but queries can theoretically be for multiple things, so they would need to be contextualized.
  • Muckrock: Includes an agency ID which will need to be separately looked up at https://www.muckrock.com/api_v1/agency/{agency_id}, as seen here.

CKAN and MuckRock are candidates for automatic labeling (even as just a pre-annotation), while AutoGoogler and Common Crawler are not.
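That split can be sketched as a small routing helper. The collector name strings below are assumptions for illustration, not necessarily the identifiers used in the codebase:

```python
# Route each collector to an agency-labeling strategy, per the breakdown above.
# "auto" means the collector's metadata can be piped to /api/match/agencies
# (possibly after a lookup, as with MuckRock); "manual" means human labeling.

AUTO_LABEL_COLLECTORS = {"ckan", "muckrock"}          # agency info in metadata
MANUAL_LABEL_COLLECTORS = {"common_crawler", "auto_googler"}

def labeling_route(collector: str) -> str:
    """Return 'auto' or 'manual' for a given collector name."""
    name = collector.lower()
    if name in AUTO_LABEL_COLLECTORS:
        return "auto"
    # Unknown collectors default to manual review rather than a bad auto-label.
    return "manual"
```

Defaulting unknown collectors to manual review keeps a new collector from silently producing bad auto-labels.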

josh-chamberlain (Contributor) commented Feb 5, 2025

@maxachis

AutoGoogler: Query can contain information, but queries can theoretically be for multiple things, so they would need to be contextualized.

we could limit things so that it's one agency per query, and multi-agency queries are just multiple queries, if that helps reduce complexity. seems like a fair enough tradeoff

CKAN and MuckRock are candidates for automatic labeling (even as just a pre-annotation), while AutoGoogler and Common Crawler are not.

Again, limitations might be helpful here—if AutoGoogler's search terms are focused enough, those could come through as pre-annotations. i.e. a query looking like arrest records in pittsburgh, pa could create a pre-annotation. Unless there's some other reason AutoGoogler is not a candidate that I'm not expecting!
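If queries were constrained like that, the pre-annotation extraction could be nearly mechanical. A rough sketch, assuming queries follow a "topic in locality, state" shape (an assumption; real AutoGoogler queries may not be this regular):

```python
import re

# Assumed query shape: "<record type> in <locality>, <two-letter state>".
QUERY_PATTERN = re.compile(
    r"^(?P<topic>.+?) in (?P<locality>[^,]+), (?P<state>\w{2})$",
    re.IGNORECASE,
)

def parse_query(query: str):
    """Return (locality, STATE) if the query matches the assumed shape,
    else None; the location pair could then seed /api/match/agencies."""
    m = QUERY_PATTERN.match(query.strip())
    if m is None:
        return None
    return m.group("locality").strip(), m.group("state").upper()
```

Queries that don't match the shape would simply fall back to manual labeling.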

maxachis (Collaborator, Author) commented Feb 7, 2025

we could limit things so that it's one agency per query, and multi-agency queries are just multiple queries, if that helps reduce complexity. seems like a fair enough tradeoff
Again, limitations might be helpful here—if AutoGoogler's search terms are focused enough, those could come through as pre-annotations. i.e. a query looking like arrest records in pittsburgh, pa could create pre-annotation. Unless there's some other reason AutoGoogler is not a candidate that I'm not expecting!

@josh-chamberlain So the way I'm envisioning it is already one agency per query, so we've got that covered. When I said "queries can theoretically be for multiple things", I meant that we might use the AutoGoogler for reasons beyond searching for agencies. Although on reflection that doesn't in theory preclude what you're talking about.

The main difficulty with AutoGoogler is that AutoGoogler can help us find a URL for an agency, but there's not a reliable way to determine the name or location data from the URL or its HTML content. At minimum, we'd need to do some Natural Language Processing with a wide latitude for the different ways the data could be presented. That could be a fruitful direction to pursue, but it wouldn't be low-hanging fruit.

Narrowing the search queries could help, but would still present uncertainty: If I search for "Loremtown, Ipsum County, Massachusetts Police Department", and I look at the first result, how do I know that it is a homepage for the relevant police department? How do I determine what the name is so I can look it up in our database? It's not an impossible task, but like I said, not low-hanging fruit.

maxachis (Collaborator, Author) commented Feb 7, 2025

Design Thinking

Higher Level Thinking


Because we're using multiple databases, and content from Data Sources DB (users, agencies) is used in Source Collector DB (in this use case alone, for annotations), I had to figure out how to handle that. I considered a few options:

  1. Use PostgreSQL's Foreign Data Wrapper (FDW) to query DS from SC. Rejected because this would be hell to test in isolation.
  2. Query for agency suggestions via /match and store each suggestion individually for a given URL. Rejected because the same agency could be used for multiple URLs, so this would be wasteful.
  3. Just copy our agency and user data completely over to SC from DS. Rejected because most users and many agencies will never be involved, so this would be wasteful.
  4. Query agencies initially via /match and store agencies we don't yet have in a secondary table in SC, which are then used when referencing agency suggestions. Effectively like a cache.

I opted for 4. This does introduce some data redundancy and the risk of data becoming stale. But agencies and users are likely to remain relatively static, and we can figure out ways to deal with staleness later, once we've refined the prototype.
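A minimal sketch of option 4's cache-on-miss behavior, with a plain dict standing in for the SC agencies table and `match_agencies` standing in for the real /api/match/agencies call (both are stand-ins, not the actual implementation):

```python
from datetime import datetime, timezone

def cache_suggestions(url: str, local_agencies: dict, match_agencies) -> list:
    """Fetch agency suggestions for a URL, inserting any agencies not yet
    present in the local SC cache, and return the suggested agency IDs.

    match_agencies is assumed to return a list of dicts, each with at
    least an "agency_id" key, mirroring the /match response shape.
    """
    suggestions = match_agencies(url)
    for agency in suggestions:
        if agency["agency_id"] not in local_agencies:
            # Cache miss: store the agency locally, stamped for staleness checks.
            local_agencies[agency["agency_id"]] = {
                **agency,
                "updated_at": datetime.now(timezone.utc),
            }
    return [a["agency_id"] for a in suggestions]
```

Because the same agency can be suggested for many URLs, each agency row is written once and referenced thereafter, which is the space saving over option 2.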

Handling New Agencies

If an agency cannot be determined, we have a candidate for submitting a new agency. This part is not well-defined, but my sense is that we would need to simply submit a new agency, have that be approved, and then pull up that agency in the match endpoint.

Table Design

agencies

A partial mirror/cache of agency data.

  • agency_id: Deliberately not named id to emphasize that it is not an autogenerated value but a reference to the agency in DS. Primary key with uniqueness enforced.
  • name
  • state: if applicable
  • county: if applicable
  • locality: if applicable
  • updated_at: Indicates when the agency was retrieved/populated.

confirmed_url_agency

Used when an agency has been confirmed as associated with a URL.

  • id: Auto-generated link ID
  • agency_id: Foreign key to local agencies table
  • url_id: Foreign key to url table
  • Unique constraint on (agency_id, url_id)

automated_url_agency_suggestions

Suggestions automatically retrieved for a URL

  • id: Auto-generated ID
  • agency_id: Foreign key to local agencies table, denoting the suggested agency
  • url_id: Foreign key to url
  • is_unknown: True if auto-identifier cannot determine prospective agencies, null otherwise
  • Unique constraint on (agency_id, url_id)

user_url_agency_suggestions

User suggestions for an agency for a URL

  • id: Auto-generated ID
  • agency_id: Foreign key to local agencies table, denoting the suggested agency. Nullable if none found.
  • url_id: Foreign key to url
  • user_id: Foreign key referencing DS Database's user table (not a local user table, which does not exist)
  • is_new: True if user believes the url is for a new agency, null otherwise
  • Unique constraint on (agency_id, url_id, user_id)
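The four tables above, sketched as DDL. This runs against an in-memory SQLite database purely for illustration; the real schema would be PostgreSQL, and the column types and the url/user foreign-key targets are assumptions from the descriptions:

```python
import sqlite3

DDL = """
CREATE TABLE agencies (
    agency_id INTEGER PRIMARY KEY,   -- DS agency ID, deliberately not autogenerated
    name TEXT NOT NULL,
    state TEXT,                      -- if applicable
    county TEXT,                     -- if applicable
    locality TEXT,                   -- if applicable
    updated_at TEXT                  -- when the agency was retrieved/populated
);
CREATE TABLE confirmed_url_agency (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),
    url_id INTEGER,                  -- FK to the url table (not shown here)
    UNIQUE (agency_id, url_id)
);
CREATE TABLE automated_url_agency_suggestions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),
    url_id INTEGER,
    is_unknown BOOLEAN,              -- true if no prospective agency found
    UNIQUE (agency_id, url_id)
);
CREATE TABLE user_url_agency_suggestions (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    agency_id INTEGER REFERENCES agencies (agency_id),  -- nullable
    url_id INTEGER,
    user_id INTEGER,                 -- references DS user table; no local copy
    is_new BOOLEAN,                  -- true if user believes the agency is new
    UNIQUE (agency_id, url_id, user_id)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```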

Workflow

  1. The Agency Identification Task identifies URLs without automated agency suggestions and retrieves potential agencies.
    A. If an agency is 100% determined through this process, the agency-url link is confirmed (via its presence in the confirmed_url_agency table).
    B. Otherwise, a suite of agency auto-suggestions (up to 10) are provided for the URL, or it is listed as unknown.
  2. Users will receive URLs they have not yet provided a suggested agency for.
    A. Users will either select one of the provided suggestions, provide their own suggestion, or mark the URL as belonging to a NEW agency.
    B. Agencies marked NEW will need to be handled in a separate workflow -- that is beyond the scope of this issue.
  3. Later parts of the workflow will be done in a future issue
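Step 1's triage logic might look roughly like this, with plain dicts and a set standing in for the tables, and a confidence score assumed to come back from the matcher:

```python
MAX_SUGGESTIONS = 10  # cap from step 1.B above

def triage_url(url_id, suggestions, confirmed, auto_suggestions, unknown):
    """Route one URL through step 1 of the workflow.

    suggestions: list of (agency_id, confidence) pairs, confidence in [0, 1];
    the score is an assumed output of the matching process, not a real field.
    """
    certain = [a for a, conf in suggestions if conf >= 1.0]
    if len(certain) == 1:
        # 100% determined: confirm the agency-url link directly.
        confirmed[url_id] = certain[0]
    elif suggestions:
        # Otherwise store up to 10 auto-suggestions, best first.
        top = sorted(suggestions, key=lambda s: s[1], reverse=True)
        auto_suggestions[url_id] = [a for a, _ in top[:MAX_SUGGESTIONS]]
    else:
        # No candidates at all: list the URL as unknown.
        unknown.add(url_id)
```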

Future Developments

  • Will need to make an issue for handling regular database synchronization to attend to stale data. Possibly a scheduled background job.
  • Determine, in coordination with possible changes to the DS App, how to create new agencies and then assign them to URLs given a NEW suggestion.
