Enhance ATS Optimizer with Improved Keyword Extraction and Semantic Analysis #19

Merged 1 commit on Feb 21, 2025

DavidOsipov
Owner

This pull request enhances the ATS optimizer by refining keyword extraction, improving semantic analysis, and tightening the default configuration. Key changes include:

  • Config Updates:

    • Enabled semantic_validation by default and raised similarity_threshold from 0.6 to 0.65 for stricter semantic categorization.
    • Narrowed ngram_range and whitelist_ngram_range from [1, 3] to [1, 2] for shorter, more focused keyword phrases.
  • Script Enhancements:

    • Integrated skills_whitelist into the spaCy entity ruler as SKILL entities, ensuring multi-word skills (e.g., "machine learning") are preserved intact.
    • Improved tokenization and n-gram generation to filter out noise (single-letter tokens, stop words), enhancing TF-IDF reliability.
    • Refined extract_keywords to combine SKILL entities with tokenized keywords, boosting accuracy.
    • Strengthened _create_tfidf_matrix with pre-vectorization validation and debugging for better traceability.
    • Added sentencizer to spaCy pipeline for consistent sentence segmentation.
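The entity ruler and sentencizer changes listed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the whitelist contents and the sample sentence are made up, and a blank English pipeline is used so no model download is needed. The actual script builds its pipeline in `_try_load_model` and may order or configure the components differently.

```python
import spacy

# Hypothetical whitelist for illustration; the real skills_whitelist comes from the config.
skills_whitelist = ["machine learning", "data analysis", "python"]

nlp = spacy.blank("en")               # blank pipeline, tokenizer only
nlp.add_pipe("sentencizer")           # rule-based sentence boundary detection
ruler = nlp.add_pipe("entity_ruler")  # pattern-based entity matcher

# One token pattern per whitelisted phrase, matched case-insensitively via LOWER,
# so a multi-word skill is recognized as a single SKILL entity.
ruler.add_patterns(
    [
        {"label": "SKILL", "pattern": [{"LOWER": word} for word in phrase.split()]}
        for phrase in skills_whitelist
    ]
)

doc = nlp("We need Machine Learning experience. Python is a plus.")
skills = [ent.text.lower() for ent in doc.ents if ent.label_ == "SKILL"]
sentences = [sent.text for sent in doc.sents]
```

Because "Machine Learning" is captured as one entity span, it can be added to the keyword list whole instead of being split into "machine" and "learning" by later tokenization.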

Benefits:

  • More precise keyword extraction, especially for technical skills.
  • Enhanced semantic filtering reduces irrelevant keywords.
  • Improved robustness and debuggability of the analysis process.

Testing:

  • Verified with sample job descriptions; confirmed whitelisted skills are preserved and TF-IDF scores align with expectations.
  • Recommend testing with diverse job data to ensure compatibility.

Reviewer Notes:

  • Please review the entity ruler integration and keyword filtering logic.
  • Suggest any additional skills to add to the whitelist if needed.

Closes #<issue_number> (if applicable).

This commit introduces several enhancements to the ATS optimizer script and configuration:

1. **Configuration Updates**:
   - Added `semantic_validation: True` to enable semantic filtering of keywords by default.
   - Increased `similarity_threshold` from 0.6 to 0.65 for stricter semantic categorization.
   - Adjusted `ngram_range` and `whitelist_ngram_range` from `[1, 3]` to `[1, 2]` to focus on shorter, more precise phrases.

2. **Script Improvements**:
   - **Entity Ruler Enhancement**: Added whitelisted phrases from `skills_whitelist` as `SKILL` entities in the spaCy pipeline, preserving multi-word skills during tokenization.
   - **Keyword Extraction**:
     - Updated `_process_doc_tokens` to prioritize `SKILL` entities, ensuring they are preserved as whole phrases before tokenizing remaining text.
     - Improved `_generate_ngrams` to filter out single-letter tokens and stop words, enhancing TF-IDF accuracy.
     - Refined `extract_keywords` to integrate `SKILL` entity extraction with regular tokenization, combining both for robust keyword lists.
   - **TF-IDF Matrix**:
     - Enhanced `_create_tfidf_matrix` to validate keyword sets, removing invalid tokens (e.g., single-letter words) before vectorization.
     - Added debug logging for validated keyword sets.
   - **Model Loading**: Added `sentencizer` to the spaCy pipeline in `_try_load_model` for consistent sentence boundary detection.
   - **Minor Fixes**: Removed unused `Pool` import from multiprocessing, streamlined imports.
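The n-gram filtering and pre-vectorization validation described in the list above might look something like the sketch below. The function names, the tiny stop-word set, and the choice to filter tokens before building n-grams are all assumptions made for illustration; the real `_generate_ngrams` and `_create_tfidf_matrix` are not shown in this description and may behave differently.

```python
from typing import Iterable, List, Set

# Tiny illustrative stop-word set (assumption); the script likely relies on spaCy's list.
STOP_WORDS: Set[str] = {"a", "an", "and", "in", "of", "the", "with"}


def is_valid_token(token: str) -> bool:
    """Reject single-letter tokens and stop words."""
    return len(token) > 1 and token not in STOP_WORDS


def generate_ngrams(tokens: Iterable[str], max_n: int = 2) -> List[str]:
    """Build 1..max_n-grams from the lowercased tokens that survive filtering."""
    kept = [t.lower() for t in tokens if is_valid_token(t.lower())]
    ngrams: List[str] = []
    for n in range(1, max_n + 1):
        for i in range(len(kept) - n + 1):
            ngrams.append(" ".join(kept[i : i + n]))
    return ngrams


def validate_keyword_sets(keyword_sets: List[List[str]]) -> List[List[str]]:
    """Drop empty keywords and keywords containing a single-letter word."""

    def is_valid_keyword(keyword: str) -> bool:
        words = keyword.split()
        return bool(words) and all(len(word) > 1 for word in words)

    return [[kw for kw in kws if is_valid_keyword(kw)] for kws in keyword_sets]


ngrams = generate_ngrams(["Experience", "in", "R", "and", "Python", "scripting"])
validated = validate_keyword_sets([["python", "r", ""], ["machine learning", "a b"]])
```

One consequence of filtering first is that words separated only by dropped tokens become adjacent, so the example yields "experience python" as a bigram. Whether that is desirable depends on the data; filtering after n-gram construction is the stricter alternative.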

These changes improve keyword precision, preserve critical multi-word phrases, and tighten semantic filtering, making the optimizer more effective against applicant tracking systems (ATS).
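To illustrate what the stricter `similarity_threshold` means in practice, here is a small sketch that filters keywords by cosine similarity against a context vector. The config values are the ones named in this commit message, but the dict layout, the function names, and the toy 2-D vectors are assumptions; how the optimizer actually computes similarity is not shown here.

```python
import math
from typing import Dict, List, Sequence

# Values from the commit message; the dict layout itself is an assumption.
CONFIG = {
    "semantic_validation": True,
    "similarity_threshold": 0.65,
    "ngram_range": [1, 2],
    "whitelist_ngram_range": [1, 2],
}


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filter_by_similarity(
    keyword_vectors: Dict[str, Sequence[float]],
    context_vector: Sequence[float],
    config: dict = CONFIG,
) -> List[str]:
    """Keep keywords whose similarity to the context meets the threshold."""
    if not config["semantic_validation"]:
        return list(keyword_vectors)
    threshold = config["similarity_threshold"]
    return [
        keyword
        for keyword, vector in keyword_vectors.items()
        if cosine_similarity(vector, context_vector) >= threshold
    ]


# Toy vectors (hypothetical); real ones would come from a language model.
context = [1.0, 0.0]
vectors = {"python": [0.9, 0.1], "synergy": [0.62, 0.78], "cooking": [0.0, 1.0]}

kept_strict = filter_by_similarity(vectors, context)
kept_loose = filter_by_similarity(vectors, context, {**CONFIG, "similarity_threshold": 0.6})
```

In this toy data "synergy" has a similarity of roughly 0.62, so it passes the previous 0.6 threshold but is dropped at 0.65, which is the kind of borderline keyword the stricter setting is meant to exclude.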
@DavidOsipov DavidOsipov self-assigned this Feb 21, 2025
@DavidOsipov DavidOsipov merged commit eb87090 into main Feb 21, 2025
6 of 9 checks passed
@DavidOsipov DavidOsipov deleted the DavidOsipov-patch-1 branch February 21, 2025 09:22