0.26

Pre-release
@DavidOsipov DavidOsipov released this 03 Mar 15:39
· 8 commits to main since this release
1abc860

Version 0.26.0 (03/03/2025) - Alpha Version

This release represents a major overhaul of Keywords4CV, focusing on robustness, performance, extensibility, and maintainability. It introduces significant architectural changes, improved error handling, advanced caching, and a comprehensive metrics reporting system. This version also includes several bug fixes and new features. Users upgrading from v0.24 or earlier should carefully review the updated config.yaml and documentation, as there are breaking changes.

Highlights:

  • Modular Architecture: The codebase has been reorganized into several modules for improved clarity and maintainability.
  • Enhanced Configuration: Configuration is now validated using both schema (for YAML structure) and pydantic (for runtime validation and type coercion). This prevents many common configuration errors.
  • Robust Error Handling: Custom exception classes (ConfigError, InputValidationError, DataIntegrityError, APIError, NetworkError, AuthenticationError) are used throughout, providing more informative error messages and improved program stability.
  • Advanced Caching: A flexible caching system (CacheManager, MemoryCacheBackend) is implemented, significantly improving performance for repeated operations. Caching is used for:
    • Text preprocessing
    • Term vectorization
    • Fuzzy matching (using an enhanced BK-tree)
    • Trigram optimization
    • API calls (with Time-To-Live support)
    • Semantic validation
  • Keyword Canonicalization: A new KeywordCanonicalizer class handles deduplication, abbreviation expansion, and embedding-based clustering of keywords, reducing redundancy and improving accuracy.
  • Improved Fuzzy Matching: An EnhancedBKTree implementation provides optimized fuzzy matching with adaptive caching.
  • Semantic Validation: A SemanticValidator class performs POS tagging, semantic similarity checks (using spaCy embeddings), and negative keyword filtering. The context window for semantic validation is now configurable.
  • Optimized Multiprocessing: spaCy model loading is optimized for multiprocessing environments, avoiding redundant model loading in worker processes. The number of worker processes is dynamically adjusted based on system resources.
  • Trigram Optimization: A TrigramOptimizer class pre-computes and caches trigrams to speed up keyword extraction.
  • Adaptive Chunking: A SmartChunker class uses a Q-learning approach to dynamically adjust the chunk size based on dataset statistics and system resource usage.
  • Automatic Parameter Tuning: An AutoTuner class adjusts processing parameters (e.g., chunk size, POS processing mode) based on performance metrics.
  • Comprehensive Metrics Reporting: A new metrics reporting system (metrics_evaluation.py, metrics_reporter.py) generates detailed reports, including:
    • Precision, recall, and F1-score (against both original and expanded skill sets).
    • Category coverage.
    • Mean Average Precision (mAP).
    • Visualizations of keyword distributions, category distributions, and skill coverage.
    • HTML reports summarizing metrics and visualizations.
  • Intermediate Saving and Loading: Results are saved to disk at configurable intervals, allowing recovery from interruptions and analysis of large datasets. Checksum verification ensures data integrity. Multiple output formats are supported (Feather, JSONL, JSON).
  • API Integration: Synonym generation can now use an external API (with caching, retries, timeouts, and a circuit breaker).
  • Sentence Extraction: Custom rules are now used to split sentences.
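
To make the Time-To-Live caching mentioned for API calls more concrete, here is a minimal sketch of an in-memory cache whose entries expire. It is an illustration only: TTLMemoryCache and cached_call are hypothetical names and do not reflect the actual CacheManager or MemoryCacheBackend interfaces, which these notes do not describe.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class TTLMemoryCache:
    """Hypothetical in-memory cache with per-entry expiry (not the project's API)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            # The entry is stale: drop it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key: str, value: Any) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def cached_call(cache: TTLMemoryCache, key: str, fetch: Callable[[], Any]) -> Any:
    """Return the cached value for key, calling fetch() only on a miss.

    A value of None is treated as a miss, so None results are never cached.
    """
    value = cache.get(key)
    if value is None:
        value = fetch()
        cache.set(key, value)
    return value
```

With a cache like this, a repeated synonym lookup within the TTL window would be served from memory, and the external call would only be repeated once the entry has expired.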

New Features:

  • Keyword Canonicalization: Deduplication, abbreviation expansion, and clustering of similar keywords.
  • API-based Synonym Generation: Option to fetch synonyms from an external API.
  • Configurable Context Window: Control the size of the context window used for semantic validation.
  • Fuzzy Matching Before/After Semantic Validation: Option to perform fuzzy matching before or after semantic filtering.
  • Enhanced BK-Tree: Optimized fuzzy matching with caching.
  • Comprehensive Metrics Reporting: Detailed reports with visualizations.
  • Adaptive Chunking: Dynamic adjustment of chunk size based on data and resources.
  • Automatic Parameter Tuning: Automatic adjustment of processing parameters.
  • Intermediate Saving/Loading: Support for saving and resuming analysis.
  • Checksum Verification: Ensures data integrity for intermediate files.
  • Configurable Text Encoding: Specify the text encoding to be used.
  • Section-Based Analysis: Extract keywords from specific sections of text (using configurable section headings).
  • Negative Keywords: Define a list of keywords to always exclude.
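
For readers unfamiliar with BK-trees, the following sketch shows the basic idea behind fuzzy matching with one. It is a plain, uncached BK-tree over Levenshtein distance written with the standard library only. The project's EnhancedBKTree adds caching and, judging by the dependency list, likely builds on pybktree and rapidfuzz; the BKTree class and levenshtein function below are illustrative assumptions, not the project's code.

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


class BKTree:
    """Illustrative BK-tree; each node is (term, {distance: child_node})."""

    def __init__(self) -> None:
        self._root: Optional[Tuple[str, Dict]] = None

    def add(self, term: str) -> None:
        if self._root is None:
            self._root = (term, {})
            return
        node = self._root
        while True:
            d = levenshtein(term, node[0])
            if d == 0:
                return  # term is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (term, {})
                return
            node = child

    def search(self, query: str, max_distance: int) -> List[Tuple[int, str]]:
        results: List[Tuple[int, str]] = []
        stack = [self._root] if self._root is not None else []
        while stack:
            term, children = stack.pop()
            d = levenshtein(query, term)
            if d <= max_distance:
                results.append((d, term))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_distance, d + max_distance] can contain matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(results)
```

A tree built from a skill list can then answer queries such as "which known skills are within two edits of this token?" without comparing the token against every skill.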

Improvements:

  • Error Handling: Extensive use of custom exceptions and more informative error messages.
  • Memory Management: Significant improvements to reduce memory usage, including:
    • Using generators where possible.
    • Explicitly deleting objects.
    • Using HashingVectorizer for TF-IDF.
    • Adaptive cache sizing.
    • Chunking and processing data in smaller batches.
  • Performance: Numerous optimizations, including caching, multiprocessing, and trigram optimization.
  • Correctness: Improved handling of edge cases and potential errors.
  • Extensibility: Modular design makes it easier to add new features.
  • Maintainability: Code is better organized, documented, and easier to understand.
  • Configuration: Pydantic models provide strong validation and type coercion.
  • Input Sanitization: Handles numeric titles and empty descriptions based on configuration.
  • Stop Words Handling: Improved handling of stop words, including adding and excluding words via configuration.
  • Logging: More detailed and informative logging using structlog.
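
The generator and batching points under Memory Management can be sketched as follows. This is a generic pattern under assumed names (iter_chunks and count_tokens_per_job are hypothetical), not the project's chunking code, and it uses a fixed chunk size rather than the adaptive sizing described elsewhere in these notes.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Tuple

Job = Tuple[str, str]  # (title, description); an assumed shape for illustration


def iter_chunks(items: Iterable[Job], chunk_size: int) -> Iterator[List[Job]]:
    """Yield lists of at most chunk_size items without materializing the input."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return  # input exhausted; an empty chunk is never yielded
        yield chunk


def count_tokens_per_job(jobs: Iterable[Job], chunk_size: int = 2) -> Dict[str, int]:
    """Process jobs one chunk at a time, keeping only small per-job results."""
    counts: Dict[str, int] = {}
    for chunk in iter_chunks(jobs, chunk_size):
        for title, description in chunk:
            counts[title] = len(description.split())
    return counts
```

Because iter_chunks accepts any iterable, the job descriptions can come from a generator that reads from disk, so only one chunk needs to be held in memory at a time.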

Bug Fixes:

  • Fixed several issues related to index out-of-bounds errors.
  • Fixed issues with incorrect handling of empty chunks.
  • Fixed issues with inconsistent case handling.
  • Fixed issues with TF-IDF matrix creation.
  • Fixed issues with intermediate file saving and loading.
  • Fixed order of operations in keyword extraction (adding original skills before synonym generation).
  • Fixed various other minor bugs and inconsistencies.

Breaking Changes:

  • Configuration File: The structure of the config.yaml file has changed significantly. You will need to update your configuration file to match the new structure. See the updated documentation for details.
  • API: The OptimizedATS class initialization and the analyze_jobs method signature have changed.
  • Output: The structure of the output data may have changed slightly.
  • Dependencies: New dependencies were added (structlog, tenacity, xxhash, pybktree, rapidfuzz, cachetools, pyarrow).

Upgrade Instructions:

  1. Update Dependencies: Install the new dependencies:
    pip install structlog tenacity xxhash pybktree rapidfuzz cachetools pyarrow
  2. Update Configuration: Carefully review the new config.yaml file and update your existing configuration accordingly. Pay close attention to the new sections and options.
  3. Update Code: If you have custom code that interacts with the OptimizedATS class, you may need to update it to reflect the changes in the API.
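
After step 1, a quick way to confirm that the new dependencies are importable is a small check like the one below. It is a convenience sketch, not part of Keywords4CV; the missing_modules helper is hypothetical, and it assumes each package's import name matches its pip name, which holds for the seven packages listed above.

```python
from importlib.util import find_spec
from typing import Iterable, List

# New dependencies named in the Breaking Changes section.
REQUIRED = ("structlog", "tenacity", "xxhash", "pybktree",
            "rapidfuzz", "cachetools", "pyarrow")


def missing_modules(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("All new dependencies are importable.")
```

Run it with the same interpreter (or virtual environment) that you use for Keywords4CV; any names it reports can be installed with the pip command from step 1.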