Enhance ATS Optimizer with Improved Keyword Extraction and Semantic Analysis #19

DavidOsipov · 2025-02-21T09:18:45Z

This pull request enhances the ATS optimizer by refining keyword extraction, improving semantic analysis, and updating the configuration for better performance. Key changes include:

Config Updates:
- Enabled semantic_validation by default and raised similarity_threshold to 0.65 for more accurate categorization.
- Limited ngram_range and whitelist_ngram_range to [1, 2] for focused, concise keyword extraction.
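
As a rough illustration of what the stricter threshold does, here is a minimal sketch. It assumes cosine similarity between a keyword vector and a category vector; the PR does not state which metric the script uses. The `CONFIG` dict layout and both function names are hypothetical and only mirror the setting names above.

```python
import math

# Hypothetical config mirroring the setting names in this PR;
# the real config file layout is an assumption.
CONFIG = {
    "semantic_validation": True,
    "similarity_threshold": 0.65,
    "ngram_range": [1, 2],
    "whitelist_ngram_range": [1, 2],
}


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as having no similarity
    return dot / (norm_a * norm_b)


def passes_semantic_validation(keyword_vec, category_vec, config=CONFIG):
    """Keep a keyword only if it is close enough to the category vector."""
    if not config["semantic_validation"]:
        return True  # filtering disabled: every keyword passes
    score = cosine_similarity(keyword_vec, category_vec)
    return score >= config["similarity_threshold"]
```

A keyword whose similarity is about 0.64 would have passed at the old 0.6 threshold but is rejected at 0.65, which is the intended tightening.
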
Script Enhancements:
- Integrated skills_whitelist into the spaCy entity ruler as SKILL entities, ensuring multi-word skills (e.g., "machine learning") are preserved intact.
- Improved tokenization and n-gram generation to filter out noise (single-letter tokens, stop words), enhancing TF-IDF reliability.
- Refined extract_keywords to combine SKILL entities with tokenized keywords, boosting accuracy.
- Strengthened _create_tfidf_matrix with pre-vectorization validation and debugging for better traceability.
- Added sentencizer to spaCy pipeline for consistent sentence segmentation.
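
The noise filtering described in the list above can be sketched in plain Python. This is not the script's `_generate_ngrams`; the function name, the tiny stop-word set, and the choice to filter before building n-grams are all assumptions made for illustration.

```python
# Tiny illustrative stop-word set; the real script presumably relies on
# spaCy's stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "in", "of", "the", "with"}


def generate_ngrams(tokens, ngram_range=(1, 2), stop_words=STOP_WORDS):
    """Build n-grams after dropping single-letter tokens and stop words."""
    min_n, max_n = ngram_range
    cleaned = [
        t.lower()
        for t in tokens
        if len(t) > 1 and t.lower() not in stop_words
    ]
    ngrams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of size n over the cleaned tokens.
        for i in range(len(cleaned) - n + 1):
            ngrams.append(" ".join(cleaned[i:i + n]))
    return ngrams
```

For the tokens `["Experience", "in", "a", "Python", "R", "stack"]` this yields `["experience", "python", "stack", "experience python", "python stack"]`. Note that a length-based filter also drops legitimate single-letter skills such as "R", which is one reason to route such terms through the whitelist instead.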

Benefits:

- More precise keyword extraction, especially for technical skills.
- Enhanced semantic filtering reduces irrelevant keywords.
- Improved robustness and debuggability of the analysis process.

Testing:

Verified with sample job descriptions; confirmed whitelisted skills are preserved and TF-IDF scores align with expectations.
Recommend testing with diverse job data to ensure compatibility.
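
A check along these lines could back up the "whitelisted skills are preserved" claim. The sketch below is a plain-Python stand-in that uses regular expressions rather than the spaCy entity ruler, so it only approximates the real behavior; `extract_keywords` here is a hypothetical simplified function, not the script's method.

```python
import re


def extract_keywords(text, skills_whitelist):
    """Return whitelisted phrases intact, followed by the remaining tokens."""
    lowered = text.lower()
    skills = []
    # Match longer phrases first so "machine learning" wins over "learning".
    for phrase in sorted(skills_whitelist, key=len, reverse=True):
        # Lookarounds instead of \b so phrases ending in symbols still match.
        pattern = r"(?<!\w)" + re.escape(phrase.lower()) + r"(?!\w)"
        if re.search(pattern, lowered):
            skills.append(phrase.lower())
            # Blank out the phrase so its words are not tokenized again.
            lowered = re.sub(pattern, " ", lowered)
    remaining = [t for t in re.findall(r"[a-z0-9+#]+", lowered) if len(t) > 1]
    return skills + remaining


keywords = extract_keywords(
    "Machine learning and data analysis with Python",
    ["machine learning", "data analysis"],
)
assert "machine learning" in keywords   # multi-word skill kept whole
assert "machine" not in keywords        # and not split into its parts
```

Running the same assertions over a wider set of job descriptions would cover the compatibility concern raised above.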

Reviewer Notes:

Please review the entity ruler integration and keyword filtering logic.
Suggest any additional skills to add to the whitelist if needed.

Closes #<issue_number> (if applicable).

This commit introduces several enhancements to the ATS optimizer script and configuration:

1. **Configuration Updates**:
   - Added `semantic_validation: True` to enable semantic filtering of keywords by default.
   - Increased `similarity_threshold` from 0.6 to 0.65 for stricter semantic categorization.
   - Adjusted `ngram_range` and `whitelist_ngram_range` from `[1, 3]` to `[1, 2]` to focus on shorter, more precise phrases.
2. **Script Improvements**:
   - **Entity Ruler Enhancement**: Added whitelisted phrases from `skills_whitelist` as `SKILL` entities in the spaCy pipeline, preserving multi-word skills during tokenization.
   - **Keyword Extraction**:
     - Updated `_process_doc_tokens` to prioritize `SKILL` entities, ensuring they are preserved as whole phrases before tokenizing remaining text.
     - Improved `_generate_ngrams` to filter out single-letter tokens and stop words, enhancing TF-IDF accuracy.
     - Refined `extract_keywords` to integrate `SKILL` entity extraction with regular tokenization, combining both for robust keyword lists.
   - **TF-IDF Matrix**:
     - Enhanced `_create_tfidf_matrix` to validate keyword sets, removing invalid tokens (e.g., single-letter words) before vectorization.
     - Added debug logging for validated keyword sets.
   - **Model Loading**: Added `sentencizer` to the spaCy pipeline in `_try_load_model` for consistent sentence boundary detection.
   - **Minor Fixes**: Removed unused `Pool` import from multiprocessing, streamlined imports.

These changes improve keyword precision, preserve critical multi-word phrases, and enhance semantic analysis, making the optimizer more effective for ATS systems.
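
The pre-vectorization validation mentioned for `_create_tfidf_matrix` might look roughly like the sketch below. The function name, logger name, and exact validity rule (non-empty strings longer than one character) are assumptions; the commit message only says that invalid tokens such as single-letter words are removed and that validated sets are logged at debug level.

```python
import logging

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("ats_optimizer_sketch")


def validate_keyword_sets(keyword_sets):
    """Drop non-string, blank, and single-letter tokens from each set."""
    validated = []
    for index, keywords in enumerate(keyword_sets):
        kept = [
            kw.strip()
            for kw in keywords
            if isinstance(kw, str) and len(kw.strip()) > 1
        ]
        # Debug output makes it easy to trace what reaches the vectorizer.
        logger.debug(
            "Keyword set %d: kept %d of %d tokens", index, len(kept), len(keywords)
        )
        validated.append(kept)
    return validated
```

A set can end up empty after validation, so the caller would probably need to skip or specially handle such documents before vectorizing; how the script does this is not stated here.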

DavidOsipov self-assigned this Feb 21, 2025

DavidOsipov merged commit eb87090 into main Feb 21, 2025
6 of 9 checks passed

DavidOsipov deleted the DavidOsipov-patch-1 branch February 21, 2025 09:22
