Mc agency homepage searcher #54
…raise runtime error
- Add psycopg2-binary
- Add huggingface-hub
- Add condition for when there is nothing to return, such as in an INSERT statement
- Correct column name in SQL_UPDATE_CACHE
- Update SQL_GET_AGENCIES_WITHOUT_HOMEPAGE_URLS to not return entries which already exist in the cache.
- Bug was causing two newlines to appear in windows.
- This should help prevent an entire set of searches from being lost if an error occurs in one
…archer # Conflicts: # requirements.txt
This is a nice way to do maintenance on the agencies table, too, especially when combined with the duplicate checker. In addition to agency homepages, but maybe more complicated, we could use data sources with
I'm down for this! I recommend making it as a separate issue with more details, and I can eventually get to that! For now, I can begin working on developing tests as well as a GitHub Action for this so that it can be run once per day.
The CSV temporary file in the 'write_to_temporary_csv' function now has utf-8 encoding for better compatibility. Exception handling is incorporated to prevent crashes while writing rows to CSV. Print statements were added to log the number of search results obtained.
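A minimal sketch of what such a writer could look like, assuming a hypothetical signature that takes a header and an iterable of rows (the actual `write_to_temporary_csv` in this PR may differ). It opens the file with `newline=""`, which the `csv` module documentation recommends and which avoids doubled line endings on Windows:

```python
import csv
import tempfile


def write_to_temporary_csv(header, rows):
    """Write rows to a temporary UTF-8 CSV file and return its path.

    Hypothetical signature for illustration only.
    """
    # newline="" lets the csv module control line endings itself.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=".csv", delete=False, encoding="utf-8", newline=""
    ) as tmp:
        writer = csv.writer(tmp)
        writer.writerow(header)
        written = 0
        for row in rows:
            try:
                writer.writerow(row)
                written += 1
            except Exception as e:
                # A bad row is reported and skipped instead of aborting the run.
                print(f"Skipping row {row!r}: {e}")
        print(f"Wrote {written} search results")
        return tmp.name
```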
The main function of the application was moved from homepage_searcher.py to main.py to improve code organization. As part of the changes, the if __name__ == "__main__" clause was transferred to main.py. Also, environment variables were transferred to the GoogleSearcher, DBManager and HuggingFaceAPIManager constructors in main.py.
A new method `get_search_string` has been added to the AgencyInfo class in the agency_info.py file. This method helps to construct the search string for search engines, improving the mechanism of searching agency information. Additionally, unnecessary blank lines at the beginning of the file have been removed for cleaner code formatting.
The GoogleSearcher class in the google_searcher.py file has been refactored to improve clarity and functionality. Detailed explanations for methods and attributes have been added, and the daily quota restriction is now handled through a new QuotaExceededError. Additionally, the "Quota exceeded" check was removed from HTTP error handling.
The code in homepage_searcher.py has been streamlined to improve readability and efficiency. The `search_until_quota_exceeded` method was renamed to `search_until_limit_reached` to more accurately describe its behavior, and the creation of `AgencyInfo` objects was moved to its own method for better abstraction. Thorough comments were added for each method to provide clear explanations. Additionally, error handling was enhanced to include the new QuotaExceededError.
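The quota handling described above can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: the function here is a free function rather than a method, and the `search` callable and `max_searches` parameter are hypothetical. The point it shows is that results gathered before the quota runs out are kept.

```python
class QuotaExceededError(Exception):
    """Raised when the daily search quota has been used up."""


def search_until_limit_reached(queries, search, max_searches):
    """Run searches until the limit or the quota is reached.

    `search` is any callable taking a query and returning a list of
    results (hypothetical stand-in for the real searcher).
    """
    results = []
    for query in queries[:max_searches]:
        try:
            results.append((query, search(query)))
        except QuotaExceededError:
            # Stop early, but keep everything collected so far.
            print("Quota exceeded; stopping early.")
            break
    return results
```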
The error message that is raised when either 'api_key' or 'cse_id' variables is 'None' has been clarified in 'google_searcher.py'. Previously, the error message stated, "Custom search API key and CSE ID required", but this has been changed to "Custom search API key and CSE ID cannot be None." to provide additional precision.
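As a rough illustration of this check, the sketch below uses a hypothetical helper named `check_credentials`; the exception type (`RuntimeError`) is also an assumption, since the commit message only gives the message text.

```python
def check_credentials(api_key, cse_id):
    """Reject missing credentials before any search is attempted.

    Hypothetical helper; in the PR this check lives in google_searcher.py.
    """
    if api_key is None or cse_id is None:
        # Exception type is assumed; only the message comes from the commit.
        raise RuntimeError("Custom search API key and CSE ID cannot be None.")
    return api_key, cse_id
```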
Unit tests have been added to test the functionality of the GoogleSearcher class in the agency_homepage_searcher module. These tests cover initialization, search functionality, and error handling, including specific tests for exceeding API quota and runtime errors.
Revised the success message after search to include the unique dataset URL on HuggingFace. Instead of just stating the local file path, it will now give the exact URL where the dataset has been uploaded on HuggingFace's platform for a more straightforward navigation to the uploaded datasets.
…ngFace
Refactored the exception handling in the csv writer process to use the standard Exception class. Altered 'get_agencies_without_homepage_urls' to return a list and reflected this change in variable naming. Added an explicit upload to HuggingFace function and a success message to print the HuggingFace dataset URL after upload. This enhances user experience by providing direct access to the uploaded datasets.
Expanded the unit tests for the `homepage_searcher` and `google_searcher` modules, now covering more scenarios and conditions. These include testing `search_and_upload`, `upload_to_huggingface`, and multiple `search` methods, as well as exception handling. In addition, calls are checked to ensure they were made with the expected arguments, enhancing the reliability and robustness of the codebase.
Modified the error message in homepage_searcher.py to include both error type and error message for more specific debugging. This change works to better identify the nature of the issues when they occur during the runtime and helps in diagnosing and rectifying problems more efficiently.
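One common way to build such a message in Python is shown below. The helper name `format_error` is hypothetical and the exact wording used in homepage_searcher.py is not given in the commit message.

```python
def format_error(e):
    """Combine the exception's type name and its message in one string."""
    return f"{type(e).__name__}: {e}"


try:
    int("not a number")
except ValueError as e:
    # The message now starts with "ValueError: " followed by the detail text.
    message = format_error(e)
    print(message)
```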
Removed unnecessary imports in the file test_agency_homepage_searcher_unit.py. The cleanup adds to the readability of the file and supports more efficient debugging by avoiding unnecessary complexity.
This commit introduces a new test file, test_agency_homepage_searcher_integration.py. The file contains an integration test for the HomepageSearcher class in the agency_homepage_searcher module, validating expected interactions with the Google API, database manager, and HuggingFace API manager.
Added a few lines of code to set the working directory to the root of the repository. This modification aims to fix the persistent import issues occurring in the 'agency_homepage_searcher' script.
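The commit message does not say how the repository root is located. One possible approach, offered only as a sketch, is to walk up from the script's location until a marker file is found; the helper names and the choice of `requirements.txt` as the marker are assumptions.

```python
import os
from pathlib import Path


def find_repo_root(start, marker="requirements.txt"):
    """Return the nearest ancestor of `start` that contains `marker`.

    Hypothetical helper; the marker file is an assumption.
    """
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker} above {start}")


def chdir_to_repo_root(start):
    """Change the working directory to the repository root."""
    root = find_repo_root(start)
    os.chdir(root)
    return root
```

A script could call `chdir_to_repo_root(__file__)` before its project imports run.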
Added regular expressions library and a cleanup step in the search string generation method. This enhancement is made to the 'agency_homepage_searcher' script to remove unwanted characters like brackets, parentheses, and quotes from the search strings.
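A cleanup step of this kind might look like the sketch below. The function name `clean_search_string` is hypothetical, the exact character set removed in the PR is not shown, and the whitespace collapsing is an addition made here for tidiness.

```python
import re

# Square brackets, parentheses, and single or double quotes.
DISALLOWED = re.compile(r"[\[\]()\"']")


def clean_search_string(text):
    """Strip disallowed characters and collapse leftover whitespace."""
    cleaned = DISALLOWED.sub("", text)
    # Collapsing whitespace is an assumption, not something the commit states.
    return re.sub(r"\s+", " ", cleaned).strip()
```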
Upgraded the pytest-postgresql plugin version and switched to the recommended psycopg[binary] package. These updates improve the testing process and facilitate proper PostgreSQL integration.
Introduced a new enumeration SearchResultEnum for better search responses handling. Modified the SQL_UPDATE_CACHE query to include the new 'search_result' column, allowing for better tracking of search results in the database.
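For illustration, an enumeration of this sort could be defined as below. The member names and values are hypothetical (the commit message does not list them), as is the `classify_results` helper.

```python
from enum import Enum


class SearchResultEnum(Enum):
    # Member names and values are placeholders, not taken from the PR.
    FOUND_RESULTS = "found_results"
    NO_RESULTS_FOUND = "no_results_found"


def classify_results(results):
    """Map a list of search results to an enum member."""
    if results:
        return SearchResultEnum.FOUND_RESULTS
    return SearchResultEnum.NO_RESULTS_FOUND
```

Storing the member's `.value` in a 'search_result' column keeps the database value readable while the code works with the enum.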
Switched from psycopg2 to psycopg library in the database manager. This change affects the connection establishment, fetching data from the database and handling database programming errors within the DBManager class.
Added a new requirements file to define the dependencies for the agency_homepage_searcher action. This includes specific versions for python-dotenv, google-api-python-client, psycopg2-binary, and huggingface-hub libraries.
Added pytest_postgresql import to test_agency_homepage_searcher_integration.py for implementing integration testing. Readjusted code formatting to adhere to style guidelines and enhance readability. Included instructions and an example for a PostgreSQL docker setup for testing, which needs to be moved to a README file in the future.
Removed unnecessary blank lines to clean up the unit testing code for test agency homepage searcher. Extended the unit tests by adding a test case to verify that disallowed characters are being stripped correctly from the agency name search string.
Introduces documentation for the Agency Homepage Searcher module, its functionality, environment setup, and execution. The README details the procedure of filling missing agency homepage data, requirements for execution, and gives a short guide on running the script.
This file currently exists as a stand-in for a GitHub Actions YAML file for automatically running the agency_homepage_searcher.
Closing this in favor of #74
Fixes
#53 - "Alternative way of getting more urls: Automated Search Engine Calls"
Description
This performs several related actions:
PDAP/possible_homepage_urls dataset
Testing
[UNDER CONSTRUCTION]