Integrate Agency Labeling Logic #139
Comments
So this task will have several components:
So I'll start off looking at step 1 of this process, for each of our collectors:
CKAN and MuckRock are candidates for automatic labeling (even as just a pre-annotation), while AutoGoogler and Common Crawler are not.
We could limit things so that it's one agency per query, with multi-agency queries handled as multiple queries, if that helps reduce complexity. Seems like a fair enough tradeoff.
Again, limitations might be helpful here—if AutoGoogler's search terms are focused enough, those could come through as pre-annotations. i.e. a query looking like |
@josh-chamberlain So the way I'm envisioning it is already one agency per query, so we've got that covered. When I said "queries can theoretically be for multiple things", I meant that we might use the AutoGoogler for reasons beyond searching for agencies. Although on reflection, that doesn't in theory preclude what you're talking about.
The main difficulty with AutoGoogler is that it can help us find a URL for an agency, but there's no reliable way to determine the name or location data from the URL or its HTML content. At minimum, we'd need to do some Natural Language Processing with a wide latitude for the different ways the data could be presented. That could be a fruitful direction to pursue, but it wouldn't be low-hanging fruit.
Narrowing the search queries could help, but would still present uncertainty: if I search for "Loremtown, Ipsum County, Massachusetts Police Department" and look at the first result, how do I know that it is a homepage for the relevant police department? How do I determine what the name is so I can look it up in our database? It's not an impossible task, but like I said, not low-hanging fruit.
Design Thinking
Higher Level Thinking
Table Design
Because we're using multiple databases, and content from Data Sources DB (users, agencies) is used in Source Collector DB (in this use case alone, for annotations), I had to figure out how to handle that. I considered a few options:
I opted for 4. This does introduce some data redundancy and the risk of data becoming stale. But agencies and users are likely to remain relatively static, and we can figure out ways to deal with staleness later, once we've refined the prototype.
Handling New Agencies
If an agency cannot be determined, we have a candidate for submitting a new agency. This part is not well-defined, but my sense is that we would simply need to submit a new agency, have that be approved, and then pull up that agency in the match endpoint.
Table Design
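To make the redundancy and staleness trade-off a bit more concrete, here is a minimal sketch. It assumes the chosen option amounts to keeping a local copy of agency records inside the Source Collector DB; that reading is an inference from the comment above, not something the thread spells out. The table name, columns, and both helper functions are hypothetical and are not the actual table design. SQLite is used only so the example runs on its own.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical local copy of agencies (assumption: the source of truth
# lives in the Data Sources DB and rows here are refreshed periodically).
AGENCY_SCHEMA = """
CREATE TABLE agencies (
    agency_id INTEGER PRIMARY KEY,  -- ID as assigned upstream
    name      TEXT NOT NULL,
    synced_at INTEGER NOT NULL      -- Unix seconds of the last refresh
);
"""


def upsert_agency(conn, agency_id, name, now=None):
    """Insert or refresh one local agency row and stamp the sync time."""
    now = now or datetime.now(timezone.utc)
    conn.execute(
        "INSERT OR REPLACE INTO agencies (agency_id, name, synced_at) "
        "VALUES (?, ?, ?)",
        (agency_id, name, int(now.timestamp())),
    )


def stale_agency_ids(conn, max_age, now=None):
    """Return IDs of local rows last refreshed more than max_age ago."""
    now = now or datetime.now(timezone.utc)
    cutoff = int((now - max_age).timestamp())
    rows = conn.execute(
        "SELECT agency_id FROM agencies WHERE synced_at < ? ORDER BY agency_id",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

Stamping each copied row with its sync time is one cheap way to defer the staleness problem: nothing has to act on it now, but a later job could re-fetch whatever `stale_agency_ids` returns.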
Set up endpoint to pull URLs that did not have an agency identified by a collector, for manual labeling.
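The query behind such an endpoint could look roughly like the sketch below. Everything here is an assumption for illustration: the `urls` and `agency_suggestions` tables, their columns, and the `urls_needing_manual_label` function are hypothetical names, and SQLite stands in for whatever database the Source Collector actually uses.

```python
import sqlite3

# Hypothetical tables: collected URLs, plus any agency a collector
# managed to attach automatically (e.g. as a pre-annotation).
URL_SCHEMA = """
CREATE TABLE urls (
    id        INTEGER PRIMARY KEY,
    url       TEXT NOT NULL,
    collector TEXT NOT NULL
);
CREATE TABLE agency_suggestions (
    url_id    INTEGER NOT NULL REFERENCES urls(id),
    agency_id INTEGER NOT NULL
);
"""


def urls_needing_manual_label(conn, limit=50):
    """Return URLs that have no collector-identified agency at all."""
    rows = conn.execute(
        """
        SELECT u.id, u.url, u.collector
        FROM urls AS u
        LEFT JOIN agency_suggestions AS s ON s.url_id = u.id
        WHERE s.url_id IS NULL   -- keep only URLs with no suggestion row
        ORDER BY u.id
        LIMIT ?
        """,
        (limit,),
    )
    return [{"id": r[0], "url": r[1], "collector": r[2]} for r in rows]
```

An endpoint handler would then only need to call this function and serialize the result; the anti-join keeps URLs from collectors such as AutoGoogler, which cannot label automatically, flowing into the manual queue.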