0.26
Pre-release
Version 0.26.0 (03/03/2025) - Alpha Version
This release represents a major overhaul of Keywords4CV, focusing on robustness, performance, extensibility, and maintainability. It introduces significant architectural changes, improved error handling, advanced caching, and a comprehensive metrics reporting system. This version also includes several bug fixes and new features. Users upgrading from v0.24 or earlier should carefully review the updated `config.yaml` and documentation, as there are breaking changes.
Highlights:
- Modular Architecture: The codebase has been reorganized into several modules for improved clarity and maintainability.
- Enhanced Configuration: Configuration is now validated using both `schema` (for YAML structure) and `pydantic` (for runtime validation and type coercion). This prevents many common configuration errors.
- Robust Error Handling: Custom exception classes (`ConfigError`, `InputValidationError`, `DataIntegrityError`, `APIError`, `NetworkError`, `AuthenticationError`) are used throughout, providing more informative error messages and improved program stability.
- Advanced Caching: A flexible caching system (`CacheManager`, `MemoryCacheBackend`) is implemented, significantly improving performance for repeated operations. Caching is used for:
  - Text preprocessing
  - Term vectorization
  - Fuzzy matching (using an enhanced BK-tree)
  - Trigram optimization
  - API calls (with Time-To-Live support)
  - Semantic validation
- Keyword Canonicalization: A new `KeywordCanonicalizer` class handles deduplication, abbreviation expansion, and embedding-based clustering of keywords, reducing redundancy and improving accuracy.
- Improved Fuzzy Matching: An `EnhancedBKTree` implementation provides optimized fuzzy matching with adaptive caching.
- Semantic Validation: A `SemanticValidator` class performs POS tagging, semantic similarity checks (using spaCy embeddings), and negative keyword filtering. The context window for semantic validation is now configurable.
- Optimized Multiprocessing: spaCy model loading is optimized for multiprocessing environments, avoiding redundant model loading in worker processes. The number of worker processes is dynamically adjusted based on system resources.
- Trigram Optimization: A `TrigramOptimizer` class pre-computes and caches trigrams to speed up keyword extraction.
- Adaptive Chunking: A `SmartChunker` class uses a Q-learning approach to dynamically adjust the chunk size based on dataset statistics and system resource usage.
- Automatic Parameter Tuning: An `AutoTuner` class adjusts processing parameters (e.g., chunk size, POS processing mode) based on performance metrics.
- Comprehensive Metrics Reporting: A new metrics reporting system (`metrics_evaluation.py`, `metrics_reporter.py`) generates detailed reports, including:
  - Precision, recall, and F1-score (against both original and expanded skill sets).
  - Category coverage.
  - Mean Average Precision (mAP).
  - Visualizations of keyword distributions, category distributions, and skill coverage.
  - HTML reports summarizing metrics and visualizations.
- Intermediate Saving and Loading: Results are saved to disk at configurable intervals, allowing for recovery from interruptions and analysis of large datasets. Checksum verification ensures data integrity. Supports multiple output formats (Feather, JSONL, JSON).
- API Integration: Synonym generation can now use an external API (with caching, retries, timeouts, and a circuit breaker).
- Sentence Extraction: Added custom rules for splitting text into sentences.
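The release does not show the `EnhancedBKTree` internals, but the underlying data structure can be illustrated with a minimal, generic BK-tree sketch. The `BKTree` class and plain Levenshtein metric below are illustrative only, not the project's implementation (which adds adaptive caching on top):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Burkhard-Keller tree: a metric-space index for fuzzy lookup."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})  # node = (word, {distance: child_node})
        for word in it:
            self._add(word)

    def _add(self, word):
        node = self.root
        while True:
            w, children = node
            d = levenshtein(word, w)
            if d == 0:
                return  # duplicate word
            if d in children:
                node = children[d]
            else:
                children[d] = (word, {})
                return

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        results, stack = [], [self.root]
        while stack:
            w, children = stack.pop()
            d = levenshtein(query, w)
            if d <= max_dist:
                results.append((d, w))
            # Triangle inequality: only subtrees keyed by k with
            # |k - d| <= max_dist can contain a match.
            for k in range(d - max_dist, d + max_dist + 1):
                if k in children:
                    stack.append(children[k])
        return sorted(results)
```

The triangle-inequality pruning is what makes a BK-tree faster than scanning every keyword: most subtrees are skipped without computing any distances inside them.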
New Features:
- Keyword Canonicalization: Deduplication, abbreviation expansion, and clustering of similar keywords.
- API-based Synonym Generation: Option to fetch synonyms from an external API.
- Configurable Context Window: Control the size of the context window used for semantic validation.
- Fuzzy Matching Before/After Semantic Validation: Option to perform fuzzy matching before or after semantic filtering.
- Enhanced BK-Tree: Optimized fuzzy matching with caching.
- Comprehensive Metrics Reporting: Detailed reports with visualizations.
- Adaptive Chunking: Dynamic adjustment of chunk size based on data and resources.
- Automatic Parameter Tuning: Automatic adjustment of processing parameters.
- Intermediate Saving/Loading: Support for saving and resuming analysis.
- Checksum Verification: Ensures data integrity for intermediate files.
- Configurable Text Encoding: Specify the text encoding to be used.
- Section-Based Analysis: Extract keywords from specific sections of text (using configurable section headings).
- Negative Keywords: Define a list of keywords to always exclude.
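The checksum verification feature for intermediate files can be sketched in a few lines. This is a generic illustration, not Keywords4CV's actual code: the release lists `xxhash` as a dependency, but stdlib `hashlib` is used here to keep the sketch self-contained, and the sidecar-file layout is an assumption:

```python
import hashlib
import tempfile
from pathlib import Path


def write_with_checksum(path: Path, payload: bytes) -> None:
    """Save an intermediate chunk alongside a sidecar checksum file."""
    path.write_bytes(payload)
    digest = hashlib.sha256(payload).hexdigest()
    path.with_suffix(path.suffix + ".sha256").write_text(digest)


def verify_checksum(path: Path) -> bool:
    """Recompute the digest and compare it to the recorded one."""
    recorded = path.with_suffix(path.suffix + ".sha256").read_text()
    return hashlib.sha256(path.read_bytes()).hexdigest() == recorded
```

On resume, any chunk whose digest no longer matches can be discarded and reprocessed rather than silently corrupting the analysis.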
Improvements:
- Error Handling: Extensive use of custom exceptions and more informative error messages.
- Memory Management: Significant improvements to reduce memory usage, including:
  - Using generators where possible.
  - Explicitly deleting objects.
  - Using `HashingVectorizer` for TF-IDF.
  - Adaptive cache sizing.
  - Chunking and processing data in smaller batches.
- Performance: Numerous optimizations, including caching, multiprocessing, and trigram optimization.
- Correctness: Improved handling of edge cases and potential errors.
- Extensibility: Modular design makes it easier to add new features.
- Maintainability: Code is better organized, documented, and easier to understand.
- Configuration: Pydantic models provide strong validation and type coercion.
- Input Sanitization: Handles numeric titles and empty descriptions based on configuration.
- Stop Words Handling: Improved handling of stop words, including adding and excluding words via configuration.
- Logging: More detailed and informative logging using `structlog`.
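The "chunking and processing data in smaller batches" point above is the standard lazy-generator pattern. A minimal sketch (not the project's `SmartChunker`, which additionally tunes the chunk size at runtime):

```python
from itertools import islice
from typing import Iterable, Iterator, List


def chunked(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield fixed-size batches lazily, so the full dataset is never
    materialized in memory at once."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch
```

Each batch can be processed and released (or explicitly `del`eted) before the next one is pulled from the source, which keeps peak memory proportional to the batch size rather than the dataset size.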
Bug Fixes:
- Fixed several issues related to index out-of-bounds errors.
- Fixed issues with incorrect handling of empty chunks.
- Fixed issues with inconsistent case handling.
- Fixed issues with TF-IDF matrix creation.
- Fixed issues with intermediate file saving and loading.
- Fixed order of operations in keyword extraction (adding original skills before synonym generation).
- Fixed various other minor bugs and inconsistencies.
Breaking Changes:
- Configuration File: The structure of the `config.yaml` file has changed significantly. You will need to update your configuration file to match the new structure. See the updated documentation for details.
- API: The `OptimizedATS` class initialization and the `analyze_jobs` method signature have changed.
- Output: The structure of the output data may have changed slightly.
- Dependencies: New dependencies were added (`structlog`, `tenacity`, `xxhash`, `pybktree`, `rapidfuzz`, `cachetools`, `pyarrow`).
Upgrade Instructions:
- Update Dependencies: Install the new dependencies: `pip install structlog tenacity xxhash pybktree rapidfuzz cachetools pyarrow`
- Update Configuration: Carefully review the new `config.yaml` file and update your existing configuration accordingly. Pay close attention to the new sections and options.
- Update Code: If you have custom code that interacts with the `OptimizedATS` class, you may need to update it to reflect the changes in the API.
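To give a feel for the kind of new sections and options the configuration step refers to, a hypothetical fragment is shown below. Every key name here is invented for illustration; consult the shipped `config.yaml` and the documentation for the real structure:

```yaml
# Hypothetical fragment -- key names are illustrative only.
validation:
  negative_keywords:
    - "references available"
  semantic_context_window: 5   # configurable context window size
caching:
  backend: memory
  api_ttl_seconds: 3600        # TTL for cached API responses
intermediate:
  save_interval: 100
  format: feather              # feather | jsonl | json
  verify_checksums: true
```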