Enhance ATS keyword extraction and analysis #9
Refactor and Enhance ATS Keyword Optimizer Script
This PR refactors and improves the ATS Keyword Optimizer script, with a focus on performance, reliability, and configurability. The changes span the core script, the configuration file, and the test suite.
Summary of Changes:
- Code Refactoring (`keywords4cv.py`):
- Expanded Configuration (`config.yaml`):
  - `section_headings`: Define custom section headings to improve keyword extraction context.
  - `spacy_model`: Allow users to specify different spaCy language models for customized NLP processing.
  - `cache_size`: Configure the size of the text preprocessing cache for performance tuning.
  - `whitelist_ngram_range`: Set the n-gram range used specifically for whitelist matching.
  - `timeout`: Set a timeout for long-running analyses.
  - `model_download_retries`: Configure retries for spaCy model downloads to handle network issues.
  - `auto_chunk_threshold`, `memory_threshold`, `max_memory_percent`, `max_workers`, `min_chunk_size`, `max_chunk_size`: Fine-grained control over chunking behavior for memory management and performance optimization.
  - `max_retries`: Set the maximum number of retries for the entire analysis process in case of transient errors.
  - `strict_mode`: Enable strict mode to halt analysis on any exception, or disable it for more lenient error handling.
  - `semantic_validation`: Toggle semantic validation of extracted keywords for improved accuracy.
  - `similarity_threshold`: Adjust the similarity threshold for semantic categorization.
  - `text_encoding`: Specify the text encoding for input job descriptions.
- Comprehensive Unit Tests (`test_keywords4cv.py`):
  - Uses `pytest` to ensure code reliability and prevent regressions.
  - Functions (`cosine_similarity`, `ensure_nltk_resources`, `load_job_data`, `parse_arguments`, `save_results`).
  - `EnhancedTextPreprocessor`: Testing preprocessing steps, caching, and batch processing.
  - `AdvancedKeywordExtractor`: Testing keyword extraction, synonym generation, n-gram handling, section extraction, and semantic filtering.
  - `ATSOptimizer`: Testing core analysis workflow, configuration loading, input validation, chunking logic, and error handling.
- Dependency Updates (`requirements.txt`):

Benefits of these changes:

Testing:
- The tests in `test_keywords4cv.py` have been executed and passed successfully, ensuring the stability and correctness of the changes.

Please review and merge this PR to incorporate these improvements into the main branch.
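
For reviewers who want a feel for how the expanded configuration might be consumed, here is a minimal sketch of merging user settings over defaults and sanity-checking a few of the options named in this PR. It is not the code in `keywords4cv.py`: the helper name `merge_config`, the `EXAMPLE_DEFAULTS` dictionary, every default value, and the validation rules (for example, treating `similarity_threshold` as a value between 0 and 1) are assumptions made for illustration.

```python
from typing import Any, Dict

# Hypothetical defaults: the option names come from the PR description,
# but every value below is an illustrative assumption.
EXAMPLE_DEFAULTS: Dict[str, Any] = {
    "spacy_model": "en_core_web_sm",
    "cache_size": 1000,
    "timeout": 600,
    "max_retries": 2,
    "strict_mode": True,
    "semantic_validation": False,
    "similarity_threshold": 0.6,
    "min_chunk_size": 1,
    "max_chunk_size": 1000,
    "text_encoding": "utf-8",
}


def merge_config(user_config: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay user settings on the example defaults and sanity-check them."""
    # Reject keys that are not in the example defaults (catches typos).
    unknown = set(user_config) - set(EXAMPLE_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")

    # User-supplied values take precedence over the defaults.
    config = {**EXAMPLE_DEFAULTS, **user_config}

    # Assumed constraints; the real script may validate differently.
    if not 0.0 <= config["similarity_threshold"] <= 1.0:
        raise ValueError("similarity_threshold must be between 0 and 1")
    if config["min_chunk_size"] > config["max_chunk_size"]:
        raise ValueError("min_chunk_size must not exceed max_chunk_size")
    if config["max_retries"] < 0:
        raise ValueError("max_retries must be non-negative")
    return config


if __name__ == "__main__":
    print(merge_config({"similarity_threshold": 0.75, "strict_mode": False}))
```

In practice the `user_config` dictionary would presumably come from parsing `config.yaml`; validating at load time means a bad value fails fast rather than partway through a long analysis.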