
Releases: DavidOsipov/Keywords4CV

0.26 · 03 Mar 15:39 · 1abc860 · Pre-release

Version 0.26.0 (03/03/2025) - Alpha Version

This release represents a major overhaul of Keywords4CV, focusing on robustness, performance, extensibility, and maintainability. It introduces significant architectural changes, improved error handling, advanced caching, and a comprehensive metrics reporting system. This version also includes several bug fixes and new features. Users upgrading from v0.24 or earlier should carefully review the updated config.yaml and documentation, as there are breaking changes.

Highlights:

  • Modular Architecture: The codebase has been reorganized into several modules for improved clarity and maintainability.
  • Enhanced Configuration: Configuration is now validated using both schema (for YAML structure) and pydantic (for runtime validation and type coercion). This prevents many common configuration errors.
  • Robust Error Handling: Custom exception classes (ConfigError, InputValidationError, DataIntegrityError, APIError, NetworkError, AuthenticationError) are used throughout, providing more informative error messages and improved program stability.
  • Advanced Caching: A flexible caching system (CacheManager, MemoryCacheBackend) is implemented, significantly improving performance for repeated operations. Caching is used for:
    • Text preprocessing
    • Term vectorization
    • Fuzzy matching (using an enhanced BK-tree)
    • Trigram optimization
    • API calls (with Time-To-Live support; see the sketch after this list)
    • Semantic validation
  • Keyword Canonicalization: A new KeywordCanonicalizer class handles deduplication, abbreviation expansion, and embedding-based clustering of keywords, reducing redundancy and improving accuracy.
  • Improved Fuzzy Matching: An EnhancedBKTree implementation provides optimized fuzzy matching with adaptive caching.
  • Semantic Validation: A SemanticValidator class performs POS tagging, semantic similarity checks (using spaCy embeddings), and negative keyword filtering. The context window for semantic validation is now configurable.
  • Optimized Multiprocessing: spaCy model loading is optimized for multiprocessing environments, avoiding redundant model loading in worker processes. The number of worker processes is dynamically adjusted based on system resources.
  • Trigram Optimization: A TrigramOptimizer class pre-computes and caches trigrams to speed up keyword extraction.
  • Adaptive Chunking: A SmartChunker class uses a Q-learning approach to dynamically adjust the chunk size based on dataset statistics and system resource usage.
  • Automatic Parameter Tuning: An AutoTuner class adjusts processing parameters (e.g., chunk size, POS processing mode) based on performance metrics.
  • Comprehensive Metrics Reporting: A new metrics reporting system (metrics_evaluation.py, metrics_reporter.py) generates detailed reports, including:
    • Precision, recall, and F1-score (against both original and expanded skill sets).
    • Category coverage.
    • Mean Average Precision (mAP).
    • Visualizations of keyword distributions, category distributions, and skill coverage.
    • HTML reports summarizing metrics and visualizations.
  • Intermediate Saving and Loading: Results are saved to disk at configurable intervals, allowing for recovery from interruptions and analysis of large datasets. Checksum verification ensures data integrity. Supports multiple output formats (Feather, JSONL, JSON).
  • API Integration: Synonym generation can now use an external API (with caching, retries, timeouts, and a circuit breaker).
  • Sentence Extraction: Added custom rules for more robust sentence splitting (e.g., handling bullet points and numbered lists).
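
A minimal sketch of the TTL-backed API caching described above, built on cachetools. The CacheManager name comes from this release, but the internals shown here and the fetch_synonyms_from_api helper are illustrative, not the project's actual implementation:

```python
import time
from cachetools import TTLCache


class CacheManager:
    """Wraps a dict-like backend; a TTLCache lets API results expire."""

    def __init__(self, maxsize: int = 1024, ttl: int = 3600):
        self._backend = TTLCache(maxsize=maxsize, ttl=ttl)

    def get_or_compute(self, key, compute):
        if key in self._backend:
            return self._backend[key]  # cache hit: skip the API round-trip
        value = compute()
        self._backend[key] = value
        return value


def fetch_synonyms_from_api(term: str) -> list[str]:
    # Placeholder for a real HTTP call (e.g., requests with retries and a
    # circuit breaker, as described above).
    time.sleep(0.1)
    return [f"{term} (synonym)"]


cache = CacheManager(maxsize=512, ttl=1800)
synonyms = cache.get_or_compute("python", lambda: fetch_synonyms_from_api("python"))
```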

New Features:

  • Keyword Canonicalization: Deduplication, abbreviation expansion, and clustering of similar keywords.
  • API-based Synonym Generation: Option to fetch synonyms from an external API.
  • Configurable Context Window: Control the size of the context window used for semantic validation.
  • Fuzzy Matching Before/After Semantic Validation: Option to perform fuzzy matching before or after semantic filtering.
  • Enhanced BK-Tree: Optimized fuzzy matching with caching.
  • Comprehensive Metrics Reporting: Detailed reports with visualizations.
  • Adaptive Chunking: Dynamic adjustment of chunk size based on data and resources.
  • Automatic Parameter Tuning: Automatic adjustment of processing parameters.
  • Intermediate Saving/Loading: Support for saving and resuming analysis.
  • Checksum Verification: Ensures data integrity for intermediate files.
  • Configurable Text Encoding: Specify the text encoding to be used.
  • Section-Based Analysis: Extract keywords from specific sections of text (using configurable section headings).
  • Negative Keywords: Define a list of keywords to always exclude.
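
As a rough illustration of the last two features, the sketch below splits text on configurable section headings and filters a negative-keyword list. The heading list, function names, and regex are hypothetical, not the actual Keywords4CV configuration keys:

```python
import re

NEGATIVE_KEYWORDS = {"synergy", "rockstar"}
SECTION_HEADINGS = ("Responsibilities", "Requirements", "Qualifications")


def split_sections(text: str) -> dict[str, str]:
    """Split a job description into sections keyed by configured headings."""
    pattern = rf"^({'|'.join(SECTION_HEADINGS)})\s*:?\s*$"
    sections, current = {}, "preamble"
    for line in text.splitlines():
        match = re.match(pattern, line.strip(), flags=re.IGNORECASE)
        if match:
            current = match.group(1).lower()  # start a new section
        else:
            sections.setdefault(current, []).append(line)
    return {name: "\n".join(lines) for name, lines in sections.items()}


def drop_negative(keywords: list[str]) -> list[str]:
    """Always exclude keywords found in the negative list."""
    return [kw for kw in keywords if kw.lower() not in NEGATIVE_KEYWORDS]
```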

Improvements:

  • Error Handling: Extensive use of custom exceptions and more informative error messages.
  • Memory Management: Significant improvements to reduce memory usage, including:
    • Using generators where possible.
    • Explicitly deleting objects.
    • Using HashingVectorizer for TF-IDF (see the sketch after this list).
    • Adaptive cache sizing.
    • Chunking and processing data in smaller batches.
  • Performance: Numerous optimizations, including caching, multiprocessing, and trigram optimization.
  • Correctness: Improved handling of edge cases and potential errors.
  • Extensibility: Modular design makes it easier to add new features.
  • Maintainability: Code is better organized, documented, and easier to understand.
  • Configuration: Pydantic models provide strong validation and type coercion.
  • Input Sanitization: Handles numeric titles and empty descriptions based on configuration.
  • Stop Words Handling: Improved handling of stop words, including adding and excluding words via configuration.
  • Logging: More detailed and informative logging using structlog.
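
A short sketch of the HashingVectorizer point above: hashing terms into a fixed-size feature space keeps memory constant because no vocabulary is stored. Parameter values here are illustrative, not the project's defaults:

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer

docs = [
    "Senior data scientist with Python and machine learning experience",
    "Product manager with agile and stakeholder management skills",
]

# HashingVectorizer is stateless: no fit step and no vocabulary_ attribute,
# so memory use stays flat regardless of corpus size.
hasher = HashingVectorizer(n_features=2**18, alternate_sign=False, ngram_range=(1, 2))
counts = hasher.transform(docs)

# TfidfTransformer converts the hashed term counts into TF-IDF weights.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.shape)  # (2, 262144) sparse matrix
```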

Bug Fixes:

  • Fixed several issues related to index out-of-bounds errors.
  • Fixed issues with incorrect handling of empty chunks.
  • Fixed issues with inconsistent case handling.
  • Fixed issues with TF-IDF matrix creation.
  • Fixed issues with intermediate file saving and loading.
  • Fixed order of operations in keyword extraction (adding original skills before synonym generation).
  • Fixed various other minor bugs and inconsistencies.

Breaking Changes:

  • Configuration File: The structure of the config.yaml file has changed significantly. You will need to update your configuration file to match the new structure. See the updated documentation for details.
  • API: The OptimizedATS class initialization and the analyze_jobs method signature have changed.
  • Output: The structure of the output data may have changed slightly.
  • Dependencies: New dependencies were added (structlog, tenacity, xxhash, pybktree, rapidfuzz, cachetools, pyarrow).

Upgrade Instructions:

  1. Update Dependencies: Install the new dependencies:
    pip install structlog tenacity xxhash pybktree rapidfuzz cachetools pyarrow
  2. Update Configuration: Carefully review the new config.yaml file and update your existing configuration accordingly. Pay close attention to the new sections and options.
  3. Update Code: If you have custom code that interacts with the OptimizedATS class, you may need to update it to reflect the changes in the API.

Version 0.24 (Alpha) · 28 Feb 19:13 · 1619c8b · Pre-release

Major Enhancements and New Features:

  • Comprehensive Configuration Validation: Implemented a robust two-stage configuration validation system. This uses the schema library for initial YAML structure validation (ensuring correct keys, data types, and relationships) and Pydantic for runtime validation and type coercion. This significantly improves the reliability and user-friendliness of the script by catching configuration errors early and providing informative error messages. The config_validation.py module encapsulates this logic. Pydantic models are used extensively throughout, ensuring type safety and data integrity.
  • Advanced Keyword Extraction and Filtering:
    • Fuzzy Matching Integration: Integrated rapidfuzz for fuzzy matching of keywords against a whitelist (or the expanded set of skills). This allows for variations in spelling and phrasing, improving recall. Configuration options include the matching algorithm (ratio, partial_ratio, token_sort_ratio, token_set_ratio, WRatio), minimum similarity score, and allowed POS tags.
    • Configurable Processing Order: Added the fuzzy_before_semantic option (text_processing section in config.yaml). This allows users to choose whether fuzzy matching is applied before or after semantic validation, providing greater flexibility in the keyword extraction pipeline.
    • Phrase-Level Synonym Handling: Introduced support for phrase-level synonyms (e.g., "product management" synonyms: ["product leadership", "product ownership"]). Synonyms can be loaded from a static JSON file (phrase_synonyms_path) or fetched from an API (api_endpoint, api_key). This significantly expands the ability to capture relevant skills expressed in different ways. The SynonymEntry Pydantic model enforces data integrity for static synonyms.
    • Improved Contextual Validation: Enhanced semantic validation using a configurable context window (context_window_size). The script now considers the surrounding sentences (respecting paragraph breaks) to determine if a keyword is used in the relevant context. This reduces false positives. The sentence splitting logic now handles bullet points and numbered lists more robustly.
    • POS Tag Filtering: Added more granular control over POS tag filtering with the pos_filter and allowed_pos options. This allows users to specify which parts of speech are considered for keyword extraction and fuzzy matching.
    • Trigram Optimization: Implemented a TrigramOptimizer to improve the efficiency of n-gram generation and candidate selection. This uses an LRU cache to store frequently used trigrams, reducing redundant computations.
    • Dynamic N-gram Generation: The _generate_ngrams function is now cached and handles edge cases more robustly (e.g., invalid input n).
  • Adaptive Chunking and Parameter Tuning:
    • Smart Chunker: Introduced a SmartChunker class that uses a Q-learning algorithm to dynamically adjust the chunk size based on dataset statistics (average job description length, number of texts) and system resource usage (memory). This helps to optimize performance and prevent out-of-memory errors.
    • Auto Tuner: Added an AutoTuner class that automatically adjusts parameters (e.g., chunk_size, pos_processing) based on metrics (recall, memory usage, processing time) and the trigram cache hit rate. This allows the script to adapt to different datasets and hardware configurations.
  • Intermediate Result Saving and Checkpointing:
    • Configurable Intermediate Saving: Implemented robust intermediate saving of results (summary and detailed scores) to disk. This allows for resuming processing after interruptions and prevents data loss in case of errors. The intermediate_save section in config.yaml controls the format (feather, jsonl, json), save interval, working directory, and cleanup behavior.
    • Data Integrity Checks: Added checksum verification (using xxhash) for intermediate files. A checksum manifest file (checksums.jsonl) is created and used to verify the integrity of the saved data (see the sketch after this list).
    • Streaming Data Aggregation: Implemented a streaming data aggregation approach for combining intermediate results. This allows the script to handle very large datasets that don't fit in memory. The _aggregate_results function handles both lists and generators of DataFrames.
    • Schema Validation and Appending: The code now validates the schema of intermediate files (especially for feather and jsonl) and can append new chunks to existing files.
  • Enhanced Error Handling and Logging:
    • Custom Exceptions: Defined custom exceptions (ConfigError, InputValidationError, CriticalFailureError, AggregationError, DataIntegrityError) for more specific error handling and reporting.
    • Comprehensive Error Handling: Added extensive error handling throughout the script, including checks for invalid input, file I/O errors, API errors, memory errors, and data integrity issues.
    • Improved Logging: Enhanced logging to provide more informative messages about the script's progress, warnings, and errors. This includes logging of configuration parameters, dataset statistics, processing times, memory usage, and cache hit rates.
    • Strict Mode: Added a strict_mode option (in the config.yaml) that, when enabled, causes the script to raise exceptions on certain errors (e.g., invalid input, empty descriptions) instead of logging warnings and continuing.
  • Code Refactoring and Optimization:
    • Modular Design: Refactored the code into smaller, more manageable classes and functions (e.g., ParallelProcessor, TrigramOptimizer, SmartChunker, AutoTuner).
    • Type Hinting: Added type hints throughout the code to improve readability and maintainability.
    • Memory Management: Implemented various memory management techniques, including explicit garbage collection (gc.collect()), releasing spaCy Doc objects after processing, and using generators for streaming data processing.
    • Caching: Used lru_cache and LRUCache to cache frequently used computations (e.g., term vectorization, n-gram generation, fuzzy matching).
    • Parallel Processing: Leveraged concurrent.futures.ProcessPoolExecutor for parallel processing of job descriptions, significantly improving performance on multi-core systems.
    • Dynamic Batch Size: The batch size for spaCy processing is now dynamically calculated, considering available memory and the configured memory_scaling_factor.
    • GPU Memory Check: Added an optional check for available GPU memory (if use_gpu and check_gpu_memory are enabled). If GPU memory is low, it can either disable GPU usage or reduce the number of workers.
  • Refactored TF-IDF Matrix Creation: The TF-IDF matrix creation is now more efficient and robust. The vectorizer is fitted only once (with optional sampling for large datasets), and keyword sets are pre-validated.
  • Consistent Hashing: The caching system now uses a cache_salt to ensure that cache keys are unique across different runs and configurations. The salt can be set via an environment variable (K4CV_CACHE_SALT) or in the config.yaml file.
  • Improved Keyword Categorization: Keyword categorization logic is enhanced, and a configurable default_category is used for terms that cannot be categorized. The categorization_cache_size allows controlling the cache size for term categorization.
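
The checksum manifest described under "Data Integrity Checks" might look roughly like the following. Field names and the manifest layout are illustrative; only the use of xxhash's XXH64 digest is taken from this release:

```python
import json
from pathlib import Path

import xxhash


def file_checksum(path: Path) -> str:
    """Hash a file incrementally so large chunks never sit fully in memory."""
    h = xxhash.xxh64()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(1 << 20), b""):  # 1 MiB blocks
            h.update(block)
    return h.hexdigest()


def record_checksum(manifest: Path, chunk_file: Path) -> None:
    """Append one JSONL entry per saved intermediate file."""
    entry = {"file": chunk_file.name, "xxh64": file_checksum(chunk_file)}
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")


def verify_chunk(manifest: Path, chunk_file: Path) -> bool:
    """Return True only if the stored checksum matches the file on disk."""
    for line in manifest.read_text(encoding="utf-8").splitlines():
        entry = json.loads(line)
        if entry["file"] == chunk_file.name:
            return entry["xxh64"] == file_checksum(chunk_file)
    return False  # no manifest entry: treat as an integrity failure
```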

Bug Fixes:

  • Fixed several issues related to data loading, validation, and processing.
  • Improved error handling and logging in various parts of the script.
  • Addressed potential memory leaks and improved overall memory management.
  • Corrected issues with chunk size calculation and Q-table updates.
  • Fixed inconsistencies in the application of the whitelist boost.
  • Resolved issues with intermediate file saving and loading.
  • Addressed errors during vectorization and score calculations.

Known Issues:

  • NOTE: At this point in time, the script does not work. This release introduces critical architectural changes.

Dependencies:

  • nltk
  • pandas
  • spacy (>=3.0.0 recommended)
  • scikit-learn
  • pyyaml
  • psutil
  • hashlib (replaced with xxhash)
  • requests
  • rapidfuzz
  • srsly
  • xxhash
  • cachetools
  • pydantic (>=2.0 recommended, but v1 is supported)
  • schema
  • pyarrow
  • numpy
  • itertools (standard library)

Future Improvements:

  • Explore the use of Dask for distributed processing.
  • Continue to refine the reinforcement learning algorithms for adaptive parameter tuning.
  • Add more comprehensive unit tests.
  • Improve documentation and user guide.
  • Consider adding support for other input formats (e.g., CSV, text files).
  • Explore the use of more advanced NLP techniques (e.g., transformer-based models).

How to Upgrade:

  1. Backup your existing config.yaml and synonyms.json files.
  2. Replace the old script files (keywords4cv_*.py.txt, exceptions.py.txt, config_validation.py.txt) with the new versions.
  3. Carefully review the updated config.yaml.truncated.txt file. There are many new configuration options and changes to existing ones. You will need to merge your existing configuration with the new template. Pay close attention to the following sections:
    • validation
    • text_processing (especially phrase_synonym_source, phrase_synonyms_path, api_endpoint, api_key, fuzzy_before_semantic)
    • whitelist (especially fuzzy_matching)
    • hardware_limits
    • optimization
    • caching (especially cache_salt)
      ...

Version 0.09 (Alpha) - 21/02/2025 · 263d089 · Pre-release

This release builds on the foundation laid in 0.05 and 0.051, introducing significant enhancements to keyword extraction, semantic analysis, and overall robustness of the ATS Optimizer. While progress has been made on precision and functionality, some known issues remain unresolved, and new challenges have emerged.

Major Changes

Enhanced Keyword Extraction

  • Entity Ruler Integration: Added whitelisted phrases from skills_whitelist as SKILL entities in the spaCy pipeline, preserving multi-word skills (e.g., "machine learning") during tokenization.
  • Improved N-Gram Generation: Updated _generate_ngrams to filter out single-letter tokens and stop words, ensuring cleaner and more relevant keyword sets for TF-IDF analysis.
  • Refined Keyword Extraction: Enhanced extract_keywords to combine preserved SKILL entities with tokenized keywords, improving accuracy for technical terms.
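
A minimal sketch of the EntityRuler integration described above, assuming a small illustrative whitelist; the actual pipeline configuration in Keywords4CV may differ:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
skills_whitelist = ["machine learning", "product management"]

# Insert the ruler before "ner" so whitelist matches take precedence and
# multi-word skills survive tokenization as single SKILL entities.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([{"label": "SKILL", "pattern": skill} for skill in skills_whitelist])

doc = nlp("We need product management experience and machine learning skills.")
print([(ent.text, ent.label_) for ent in doc.ents if ent.label_ == "SKILL"])
# [('product management', 'SKILL'), ('machine learning', 'SKILL')]
```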

Semantic Analysis Improvements

  • Enabled by Default: Set semantic_validation: True in the config to filter keywords based on context, reducing irrelevant terms.
  • Stricter Similarity Threshold: Increased similarity_threshold from 0.6 to 0.65 for more precise semantic categorization.
  • N-Gram Range Adjustment: Reduced ngram_range and whitelist_ngram_range from [1, 3] to [1, 2] to focus on shorter, actionable phrases.
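
The semantic check behind these settings reduces, in essence, to a spaCy similarity comparison against a threshold. A sketch, assuming a model with real word vectors and using the new 0.65 default:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # needs real vectors; *_sm models lack them
SIMILARITY_THRESHOLD = 0.65

keyword = nlp("data analysis")
context = nlp("You will analyze datasets and build statistical reports.")

score = keyword.similarity(context)
if score >= SIMILARITY_THRESHOLD:
    print(f"keep: {score:.2f}")  # keyword fits the surrounding context
else:
    print(f"drop: {score:.2f}")  # likely an irrelevant term
```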

Robustness and Debugging

  • TF-IDF Matrix Validation: Improved _create_tfidf_matrix with pre-vectorization filtering of invalid tokens (e.g., single-letter words) and added debug logging for better traceability.
  • spaCy Pipeline: Added a sentencizer to the model loading process for consistent sentence boundary detection.
  • Code Cleanup: Removed unused Pool import from multiprocessing and streamlined imports for better maintainability.

New Features

  • Preservation of Whitelisted Skills: Multi-word skills from skills_whitelist are now recognized as entities, ensuring they remain intact in the output (e.g., "product management" instead of split terms).
  • Dynamic N-Gram Filtering: Automatically excludes noisy n-grams containing stop words or single characters.
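
The n-gram filtering can be pictured as follows; this simplified version drops single-letter tokens and stop words before forming n-grams, and is not the project's actual _generate_ngrams:

```python
# One-time setup: nltk.download("stopwords")
from nltk.corpus import stopwords

STOP_WORDS = set(stopwords.words("english"))


def generate_ngrams(tokens: list[str], n: int) -> list[str]:
    """Form n-grams from tokens, skipping noisy single letters and stop words."""
    clean = [t for t in tokens if len(t) > 1 and t.lower() not in STOP_WORDS]
    return [" ".join(clean[i : i + n]) for i in range(len(clean) - n + 1)]


tokens = "experience with a machine learning pipeline".split()
print(generate_ngrams(tokens, 2))
# ['experience machine', 'machine learning', 'learning pipeline']
```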

Resolved Issues

  • None from previous versions fully resolved yet (see Known Issues).

Known Issues

  • [Critical, Unresolved from 0.05] Incorrectly Displayed Keywords in Excel: The Summary sheet shows single-word keywords (e.g., "science", "cross") with suspiciously uniform scores (e.g., 1.42192903 for multiple terms), indicating a potential scoring or aggregation issue. Multi-word phrases from skills_whitelist are not consistently appearing as expected.
  • [Critical, Unresolved from 0.05] Unreliable Unit Tests: The test suite remains untested and fails consistently, lacking coverage for critical components like keyword preservation and scoring.
  • [New, High Priority] Inconsistent Whitelist Application: The Detailed Scores sheet shows many keywords marked as In Whitelist: FALSE despite being in skills_whitelist (e.g., "product owner"), suggesting an issue with whitelist matching or entity recognition.
  • [New, Medium Priority] Low TF-IDF Variance: TF-IDF scores in the Detailed Scores sheet are often identical (e.g., 0.049574662), indicating potential issues with document differentiation or scoring normalization.

Sample Output Analysis

  • Summary sheet: Keywords like "science", "cross", and "technical" have identical Total_Score (1.42192903) and Avg_Score (0.473976343) across 3 jobs, suggesting a possible bug in score calculation or keyword weighting.
  • Detailed Scores: Multi-word phrases (e.g., "product owner", "leveraging llms") appear, but their In Whitelist status is inconsistent, and scores are low (0.034702263), possibly due to TF-IDF dilution or whitelist boost not applying correctly.

Dependencies

  • nltk
  • numpy
  • pandas
  • spacy
  • scikit-learn
  • pyyaml
  • psutil
  • hashlib

Future Improvements

  • [High Priority] Fix Keyword Display and Scoring: Address the uniform scoring and missing multi-word keywords in the Excel output.
  • [High Priority] Overhaul Unit Tests: Develop comprehensive tests to validate entity recognition, whitelist application, and scoring accuracy.
  • [Medium Priority] Enhance Whitelist Boost: Ensure whitelist_boost (1.5) is consistently applied to whitelisted terms.
  • [Medium Priority] Optimize TF-IDF: Investigate low variance in TF-IDF scores and improve differentiation across documents.
  • Further refine semantic filtering for domain-specific terms.
  • Enhance logging to pinpoint scoring and categorization issues.

Full Changelog: 0.051...0.09

Updated dependencies (0.051) · 20 Feb 16:13 · 8f34c93 · Pre-release

Full Changelog: 0.05...0.051

0.05 (Alpha) · 20 Feb 16:01 · cfa5d23 · Pre-release

Changelog

Version 0.05 (Alpha) - 20/02/2025

This release includes significant improvements to the keyword extraction and analysis pipeline, focusing on enhanced memory management, improved configuration handling, and more robust error handling. However, please note that Known Issue #1 and #2 from the previous version persist in this release and have not yet been resolved.

Major Changes:

  • Enhanced Memory Management:

    • Implemented chunking of job descriptions to process large datasets without exceeding memory limits. The chunk size is dynamically calculated based on available memory (see the sketch after this list).
    • Added memory usage checks to proactively clear caches when memory usage is high.
    • Introduced options for managing and configuring memory usage to prevent out-of-memory errors.
  • Improved Configuration:

    • Added section extraction capabilities: Extracts sections from job descriptions based on section headings (e.g., "Responsibilities," "Requirements").
    • Introduced fallback to a basic spaCy model with a sentencizer if the specified model cannot be loaded or downloaded.
    • Enhanced validation of input job descriptions, including length checks and invalid character removal.
    • Added a text_encoding configuration option to handle different character encodings gracefully.
  • More Robust Error Handling:

    • Implemented a retry mechanism for job analysis to handle transient errors.
    • Added strict mode to control whether exceptions are raised or gracefully handled.
    • Improved error handling during spaCy model loading and downloading.
  • Keyword Extraction Improvements:

    • Enhanced keyword extraction with semantic filtering based on context to improve accuracy.
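
The memory-aware chunking can be sketched as below; the 25% memory budget and the fallback chunk size are illustrative choices, not the script's actual formula:

```python
import psutil


def dynamic_chunk_size(avg_doc_bytes: int, min_chunk: int = 10, budget: float = 0.25) -> int:
    """Estimate how many job descriptions fit in a fraction of free RAM."""
    available = psutil.virtual_memory().available  # bytes currently free
    if avg_doc_bytes <= 0:
        return min_chunk
    return max(min_chunk, int(available * budget) // avg_doc_bytes)


def chunked(items: list, size: int):
    for i in range(0, len(items), size):
        yield items[i : i + size]


jobs = ["..."] * 10_000  # job description texts
size = dynamic_chunk_size(avg_doc_bytes=8_192)
for chunk in chunked(jobs, size):
    pass  # process each chunk, then let it be garbage-collected
```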

New Features:

  • Dynamic Chunking: The system now automatically chunks large job description sets into smaller, manageable pieces.
  • Fallback spaCy Model: Added a fallback mechanism that loads a simpler spaCy model if the configured model fails to load.
  • Configuration Option: Added a text_encoding option to handle job descriptions in non-UTF-8 encodings.

Code Quality:

  • Improved code structure and documentation.
  • Added a comprehensive test suite to ensure code reliability.
  • Added type hints, making the code easier to read and maintain.

Known Issues:

  1. [Critical, Unresolved] The final Excel output contains incorrectly displayed keywords.

  2. [Critical, Unresolved] The unit test suite hasn't been thoroughly tested and contains various mistakes; it fails every time.

Dependencies:

  • nltk
  • numpy
  • pandas
  • spacy
  • scikit-learn
  • pyyaml
  • psutil
  • hashlib

Future Improvements:

  • [High Priority] Fix the incorrectly displayed keywords in the Excel output.
  • [High Priority] Fix the unit tests to cover all critical functionality, including the Excel output issue above.
  • Further optimization of keyword extraction and scoring.
  • Improved handling of rare or domain-specific keywords.
  • Enhanced user interface.
  • Integration with other ATS systems.
  • More sophisticated synonym generation techniques.

Full Changelog: 0.01...0.05


Alpha version (0.01) · 19 Feb 18:14 · 0521f98 · Pre-release

Changelog

Version 0.01 (Alpha) - 19/02/2025

This is the initial Alpha release of the Job Keyword Analyzer. It provides a robust and optimized solution for extracting keywords from job descriptions, calculating TF-IDF scores, and categorizing keywords.

Major Features:

  • Keyword Extraction:
    • Extracts keywords from job descriptions using TF-IDF.
    • Supports configurable n-gram extraction (unigrams, bigrams, trigrams, etc.).
    • Includes a user-managed whitelist of skills and synonyms for improved accuracy.
    • Performs Named Entity Recognition (NER) to identify relevant entities (configurable).
    • Generates synonyms using WordNet and spaCy's lemmatization.
  • Keyword Categorization:
    • Categorizes keywords based on semantic similarity using spaCy's word embeddings.
    • Falls back to direct keyword matching for improved accuracy.
    • Uses a configurable similarity threshold.
  • Analysis and Output:
    • Calculates adjusted TF-IDF scores, incorporating whitelist boosts and term frequency (see the sketch after this list).
    • Generates two output files:
      • A summary table with total TF-IDF, job count, and average TF-IDF for each keyword.
      • A detailed pivot table showing keyword scores for each job title.
    • Saves results to an Excel file.
  • Configuration and Customization:
    • Highly configurable via a config.yaml file:
      • skills_whitelist: List of important skills.
      • stop_words, stop_words_add, stop_words_exclude: Control over stop word removal.
      • ngram_range: Configurable range for n-gram extraction.
      • whitelist_ngram_range: Configurable range for n-gram extraction for the whitelist.
      • allowed_entity_types: Specify which NER entity types to extract.
      • keyword_categories: Define custom categories and associated keywords.
      • weighting: Adjust the weights for TF-IDF, frequency, and whitelist boosts.
      • spacy_model: Specify the spaCy model to use.
      • cache_size: Configure the size of the preprocessing cache.
      • max_desc_length: Maximum length of job descriptions.
      • min_desc_length: Minimum length of job descriptions.
      • min_jobs: Minimum number of job descriptions required.
      • similarity_threshold: Threshold for semantic similarity categorization.
      • timeout: Timeout for analysis.
  • Usability:
    • Command-line interface (using argparse) for easy execution.
    • Interactive whitelist management (add/remove skills and synonyms).
    • Comprehensive logging for debugging and monitoring.
  • Robustness and Optimization:
    • Extensive error handling with custom exception classes.
    • Input validation to prevent common errors.
    • Memory safety checks to avoid crashes on large inputs.
    • Analysis timeout to prevent indefinite execution.
    • Optimized text preprocessing with caching and batch processing.
    • Efficient synonym generation and n-gram extraction.
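
The adjusted scoring can be pictured roughly as follows; the 1.5 boost matches the whitelist_boost value cited in the 0.09 release notes, and the weighting details here are deliberately simplified:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

skills_whitelist = {"machine learning", "python"}
docs = [
    "Python developer with machine learning background",
    "Project coordinator with scheduling experience",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
matrix = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()

scores = {}
for col, term in enumerate(terms):
    base = matrix[:, col].sum()  # total TF-IDF across all jobs
    boost = 1.5 if term in skills_whitelist else 1.0  # whitelist boost
    scores[term] = base * boost

for term, score in sorted(scores.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{term}: {score:.3f}")
```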

Dependencies:

  • Python 3.8+
  • nltk
  • numpy
  • pandas
  • spacy
  • scikit-learn
  • PyYAML
  • psutil
  • hashlib

Known Issues:

  • This is an Alpha release, so there may be undiscovered bugs.
  • Performance may vary depending on the size and complexity of the job descriptions.
  • The semantic similarity categorization relies on the quality of the spaCy word embeddings.

Future Improvements:

  • Further optimization of keyword extraction and scoring.
  • Improved handling of rare or domain-specific keywords.
  • Enhanced user interface.
  • Integration with other ATS systems.
  • More sophisticated synonym generation techniques.

How to Run:

  1. Install the required dependencies: pip install nltk numpy pandas spacy scikit-learn pyyaml psutil
  2. Create a config.yaml file to customize the analysis (see the documentation for details).
  3. Create a JSON file containing your job descriptions (see examples in the documentation).
  4. Run the script from the command line: python your_script_name.py -i input.json -o output.xlsx -c config.yaml

We welcome feedback and contributions!

Full Changelog: https://github.com/DavidOsipov/Keywords4Cv/commits/0.01