Mc agency homepage searcher #54

maxachis
Collaborator

@maxachis maxachis commented Mar 20, 2024

Fixes

#53 - "Alternative way of getting more urls: Automated Search Engine Calls"

Description

This performs several related actions:

  • Obtains all agencies from the PUBLIC.AGENCIES table in the PDAP Digital Ocean DB that do not have a homepage_url and have not already been searched for (as determined by whether their unique identifier is present in the PUBLIC.AGENCY_URL_SEARCH_CACHE table).
  • Generates a search query for each agency based on information present in the database, and performs an automated Google Search API call. At the free tier, up to 100 such queries can be made each day.
  • Saves 10 of the results for each agency to a CSV.
  • Once all searches have been completed (either after 100 searches or whenever the search quota is reached), these entries are put in a CSV, along with identifying information, and uploaded to the HuggingFace PDAP/possible_homepage_urls dataset.
  • Following this, the agencies that have been searched for are added to PUBLIC.AGENCY_URL_SEARCH_CACHE.
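
The first step above (agencies with no homepage_url that are not yet in the search cache) could be expressed roughly as the query below. This is only a sketch: the column names `agency_uid` and `name` are hypothetical, the query body is an assumption about what SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS might look like, and an in-memory SQLite database stands in for the PDAP Postgres DB (so the PUBLIC schema prefix is omitted).

```python
import sqlite3

# Assumed shape of the query; column names are hypothetical placeholders.
SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS = """
    SELECT a.agency_uid, a.name
    FROM agencies AS a
    WHERE a.homepage_url IS NULL
      AND NOT EXISTS (
          SELECT 1
          FROM agency_url_search_cache AS c
          WHERE c.agency_uid = a.agency_uid
      )
    ORDER BY a.agency_uid
    LIMIT ?
"""


def build_demo_db() -> sqlite3.Connection:
    """Create a tiny in-memory database with made-up rows for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE agencies (
            agency_uid TEXT PRIMARY KEY, name TEXT, homepage_url TEXT
        );
        CREATE TABLE agency_url_search_cache (agency_uid TEXT PRIMARY KEY);
        INSERT INTO agencies VALUES
            ('a1', 'Springfield Police Department', NULL),
            ('a2', 'Shelbyville Sheriff', 'https://example.org'),
            ('a3', 'Capital City Police', NULL);
        INSERT INTO agency_url_search_cache VALUES ('a3');
        """
    )
    return conn


def get_agencies_without_homepage_urls(conn: sqlite3.Connection, limit: int = 100):
    """Return (agency_uid, name) rows that still need a homepage search."""
    return conn.execute(SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS, (limit,)).fetchall()
```

In this demo data, `a2` is excluded because it already has a homepage and `a3` because it is already in the cache, so only `a1` is returned. The default limit of 100 mirrors the free-tier daily quota mentioned above.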

Testing

[UNDER CONSTRUCTION]

maxachis added 29 commits March 20, 2024 08:37
- Add psycopg2-binary
- Add huggingface-hub
- Add condition for when there is nothing to return, such as in an INSERT statement
- correct column name in SQL_UPDATE_CACHE
- Update SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS to not return entries which already exist in the cache.
- Bug was causing two newlines to appear on Windows.
- This should help prevent an entire set of searches from being lost if an error occurs in one
@josh-chamberlain
Contributor

This is a nice way to do maintenance on the agencies table, too—especially when combined with the duplicate checker.

In addition to agency homepages, but maybe more complicated, we could use data sources with record_type="List of Data Sources" to generate possible URLs

@maxachis
Collaborator Author

This is a nice way to do maintenance on the agencies table, too—especially when combined with the duplicate checker.

In addition to agency homepages, but maybe more complicated, we could use data sources with record_type="List of Data Sources" to generate possible URLs

I'm down for this! I recommend making it as a separate issue with more details, and I can eventually get to that!

For now, I can begin developing tests as well as a GitHub Action for this so that it can be run once per day.

maxachis added 14 commits March 30, 2024 20:21
The CSV temporary file in the 'write_to_temporary_csv' function now has utf-8 encoding for better compatibility. Exception handling is incorporated to prevent crashes while writing rows to CSV. Print statements were added to log the number of search results obtained.
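
A minimal sketch of what a UTF-8 temporary CSV writer with per-row exception handling might look like. Only the function name `write_to_temporary_csv` comes from the commit message; the signature, the use of `csv.DictWriter`, and the log wording are assumptions. Opening the file with `newline=""` is the csv module's documented way to avoid doubled line endings on Windows.

```python
import csv
import tempfile
from pathlib import Path


def write_to_temporary_csv(rows, fieldnames):
    """Write dict rows to a temporary UTF-8 CSV file and return its path."""
    written = 0
    # newline="" lets the csv module control line endings itself.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", encoding="utf-8", newline="", delete=False
    ) as tmp:
        writer = csv.DictWriter(tmp, fieldnames=fieldnames)
        writer.writeheader()
        for row in rows:
            try:
                writer.writerow(row)
                written += 1
            except Exception as e:
                # One bad row should not discard the rest of the results.
                print(f"Skipping row: {type(e).__name__}: {e}")
    print(f"Wrote {written} search results")
    return Path(tmp.name)
```

With the default `DictWriter` settings, a row containing a key that is not in `fieldnames` raises `ValueError`, which this sketch logs and skips rather than letting it abort the whole write.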
The main function of the application was moved from homepage_searcher.py to main.py to improve code organization. As part of the changes, the if __name__ == "__main__" clause was transferred to main.py. Also, environment variables were transferred to the GoogleSearcher, DBManager and HuggingFaceAPIManager constructors in main.py.
A new method `get_search_string` has been added to the AgencyInfo class in the agency_info.py file. This method helps to construct the search string for search engines, improving the mechanism of searching agency information. Additionally, unnecessary blank lines at the beginning of the file have been removed for cleaner code formatting.
The GoogleSearcher class in the google_searcher.py file has been refactored to improve clarity and functionality. Detailed explanations for methods and attributes have been added, and the daily quota restriction is now handled by raising a new QuotaExceededError. Additionally, the "Quota exceeded" check was removed during HTTP error handling.
The code in homepage_searcher.py has been streamlined to improve readability and efficiency. The `search_until_quota_exceeded` method was renamed to `search_until_limit_reached` to more accurately describe its behavior, and the creation of `AgencyInfo` objects was moved to its own method for better abstraction. Thorough comments were added for each method to provide clear explanations. Additionally, error handling was enhanced to include the new QuotaExceededError.
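
The interaction between `search_until_limit_reached` and `QuotaExceededError` might look something like the sketch below. The two names come from the commit messages; the signature, the `search` callable (a stand-in for the GoogleSearcher search method), and the `max_searches` parameter are assumptions made for illustration.

```python
class QuotaExceededError(Exception):
    """Raised when the search API reports that the daily quota is used up."""


def search_until_limit_reached(agencies, search, max_searches=100):
    """Search agencies in order until max_searches or the quota is hit.

    `search` is any callable that takes one agency and returns a list of
    results, raising QuotaExceededError once the quota is exhausted.
    """
    results = []
    for agency in agencies[:max_searches]:
        try:
            results.append((agency, search(agency)))
        except QuotaExceededError:
            # Stop early but keep everything gathered so far.
            print("Quota exceeded; stopping early")
            break
    return results
```

Returning the partial results instead of re-raising means a quota error partway through a run does not throw away the searches that already succeeded.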
The error message raised when either the 'api_key' or the 'cse_id' variable is 'None' has been clarified in 'google_searcher.py'. Previously, the error message stated, "Custom search API key and CSE ID required", but this has been changed to "Custom search API key and CSE ID cannot be None." to be more precise.
Unit tests have been added to test the functionality of the GoogleSearcher class in the agency_homepage_searcher module. These tests cover initialization, search functionality, and error handling, including specific tests for exceeding API quota and runtime errors.
Revised the success message after search to include the unique dataset URL on HuggingFace. Instead of just stating the local file path, it will now give the exact URL where the dataset has been uploaded on HuggingFace's platform for a more straightforward navigation to the uploaded datasets.
…ngFace

Refactored the exception handling in the csv writer process to use the standard Exception class. Altered 'get_agencies_without_homepage_urls' to return a list and reflect this change in variable naming. Added an explicit upload to HuggingFace function and a success message to print the HuggingFace dataset URL after upload. This enhances user experience by providing direct access to the uploaded datasets.
Expanded the unit tests for `homepage_searcher` and `google_searcher` modules, now covering more scenarios and conditions. These include testing the `search_and_upload`, `upload_to_huggingface` and multiple `search` methods and the handling of exceptions. In addition to that, variables have been checked to ensure they were called with the expected arguments, enhancing the reliability and robustness of the codebase.
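
As an illustration of checking that collaborators were called with the expected arguments, a test in this style could use `unittest.mock`. The `search_and_upload` function below is a simplified, hypothetical stand-in (the real method lives on a class and its signature is not shown in this PR), and the `search` and `upload` method names on the mocks are assumptions.

```python
from unittest.mock import MagicMock


def search_and_upload(searcher, uploader, query):
    """Hypothetical stand-in for the method under test."""
    results = searcher.search(query)
    uploader.upload(results)
    return results


def test_search_and_upload_calls_dependencies():
    # Replace the real Google and HuggingFace clients with mocks.
    searcher = MagicMock()
    searcher.search.return_value = ["https://example.org"]
    uploader = MagicMock()

    results = search_and_upload(searcher, uploader, "Springfield Police")

    assert results == ["https://example.org"]
    # Verify each dependency was called exactly once with the expected input.
    searcher.search.assert_called_once_with("Springfield Police")
    uploader.upload.assert_called_once_with(["https://example.org"])
```

Because both dependencies are mocked, a test like this runs without network access or API credentials.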
Modified the error message in homepage_searcher.py to include both error type and error message for more specific debugging. This change works to better identify the nature of the issues when they occur during the runtime and helps in diagnosing and rectifying problems more efficiently.
Removed unnecessary imports in the file test_agency_homepage_searcher_unit.py. The cleanup adds to the readability of the file and supports more efficient debugging by avoiding unnecessary complexity.
This commit introduces a new test file, test_agency_homepage_searcher_integration.py. The file contains an integration test for the HomepageSearcher class in the agency_homepage_searcher module, validating expected interactions with the Google API, database manager, and HuggingFace API manager.
Added a few lines of code to set the working directory to the root of the repository. This modification aims to fix the persistent import issues occurring in the 'agency_homepage_searcher' script.
maxachis added 9 commits April 2, 2024 17:45
Added regular expressions library and a cleanup step in the search string generation method. This enhancement is made to the 'agency_homepage_searcher' script to remove unwanted characters like brackets, parentheses, and quotes from the search strings.
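
A rough sketch of a regex-based cleanup step inside `get_search_string`. The class and method names come from the commit messages; the fields (`agency_name`, `city`, `state`), the exact character set, and the way the parts are joined are assumptions.

```python
import re
from dataclasses import dataclass

# Square brackets, parentheses, and single or double quotes are removed.
DISALLOWED_CHARS = re.compile(r"[\[\]()\"']")


@dataclass
class AgencyInfo:
    # Hypothetical fields; the real class may carry different attributes.
    agency_name: str
    city: str = ""
    state: str = ""

    def get_search_string(self) -> str:
        """Build a search-engine query with unwanted characters stripped."""
        parts = [self.agency_name, self.city, self.state]
        raw = " ".join(p for p in parts if p)
        cleaned = DISALLOWED_CHARS.sub("", raw)
        # Collapse any runs of whitespace left behind by the removal.
        return re.sub(r"\s+", " ", cleaned).strip()
```

For example, an agency named `Springfield (City) "Police" [Dept]` in Springfield, IL would yield the query `Springfield City Police Dept Springfield IL`.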
Upgraded the pytest-postgresql plugin version and switched to the recommended psycopg[binary] package. These updates improve the testing process and facilitate proper PostgreSQL integration.
Introduced a new enumeration SearchResultEnum for better search responses handling. Modified the SQL_UPDATE_CACHE query to include the new 'search_result' column, allowing for better tracking of search results in the database.
Switched from psycopg2 to psycopg library in the database manager. This change affects the connection establishment, fetching data from the database and handling database programming errors within the DBManager class.
Added a new requirements file to define the dependencies for the agency_homepage_searcher action. This includes specific versions for python-dotenv, google-api-python-client, psycopg2-binary, and huggingface-hub libraries.
Added pytest_postgresql import to test_agency_homepage_searcher_integration.py for implementing integration testing. Readjusted code formatting to adhere to style guidelines, made changes to enhance readability. Included instructions and example for a PostgreSQL docker setup for testing, which needs to be moved to a README file in the future.
Removed unnecessary blank lines to clean up the unit testing code for test agency homepage searcher. Extended the unit tests by adding a test case to verify that disallowed characters are being stripped correctly from the agency name search string.
Introduces documentation for the Agency Homepage Searcher module, its functionality, environment setup, and execution. The README details the procedure of filling missing agency homepage data, requirements for execution, and gives a short guide on running the script.
This file currently exists as a stand-in for a GitHub Action YAML file for automatically running the agency_homepage_searcher.
@maxachis
Collaborator Author

Closing this in favor of #74

@maxachis maxachis closed this Apr 10, 2024
@maxachis maxachis deleted the mc_agency_homepage_searcher branch April 10, 2024 21:03