Add SLoC (Source Lines of Code) metric to versions #11453

Open: Turbo87 wants to merge 5 commits into base: main

@Turbo87 (Member) commented Jun 27, 2025

This PR introduces basic source code analysis for newly published versions. A new `crates_io_linecount` workspace crate uses the `tokei` crate to analyze source files during the publish process. The system collects language breakdowns and line count statistics, storing them as JSON in a new `linecounts` column on the `versions` table.

The analysis runs during tarball processing and excludes test and example directories as well as non-programming files. Existing functionality is unchanged, and the new column is optional for backward compatibility.

Note that this is only the first step in a series of pull requests. The follow-up PRs will:

  • implement a background job to backfill the existing versions
  • adjust the API responses to expose the data
  • show the total SLoC count in the crate sidebar of the website

Turbo87 added 4 commits June 27, 2025 15:13
This commit adds a new `JSONB` column called `linecounts` to the versions table to store Source Lines of Code statistics for each crate version. The column stores language breakdown and totals as structured `JSON` data, enabling flexible schema evolution without requiring additional migrations.

The database schema and test snapshots are updated to reflect the new column.

This introduces a new workspace crate that provides line counting functionality using `tokei`. The crate includes `LinecountStats` and `LanguageStats` data structures for storing results, along with core analysis functions for processing file contents.

The implementation includes language filtering to exclude non-programming files and path filtering to skip test and example directories. The crate is covered by `insta` snapshot tests.
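The two filters described above could look roughly like the following. This is a minimal sketch using only the standard library, not the code from this PR: the directory names in `SKIPPED_DIRS`, the extension list, and the function names `is_in_skipped_dir`, `is_programming_file` and `should_count` are all assumptions made for illustration, and the extension check is only a stand-in for the language detection that `tokei` provides.

```rust
use std::path::{Component, Path};

// Hypothetical list of directory names to skip. The PR only states that test
// and example directories are excluded, not which names are matched.
const SKIPPED_DIRS: &[&str] = &["tests", "test", "examples"];

// Hypothetical list of extensions treated as "programming" files. This stands
// in for real language detection.
const PROGRAMMING_EXTENSIONS: &[&str] = &["rs", "c", "py"];

/// Returns `true` if any parent directory of `path` is in `SKIPPED_DIRS`.
fn is_in_skipped_dir(path: &Path) -> bool {
    // Only inspect the parent directories, so a file named `tests.rs`
    // inside `src/` is still counted.
    let Some(parent) = path.parent() else {
        return false;
    };
    parent.components().any(|component| match component {
        Component::Normal(name) => name
            .to_str()
            .is_some_and(|name| SKIPPED_DIRS.contains(&name)),
        _ => false,
    })
}

/// Returns `true` if the file extension is in `PROGRAMMING_EXTENSIONS`.
fn is_programming_file(path: &Path) -> bool {
    path.extension()
        .and_then(|ext| ext.to_str())
        .is_some_and(|ext| PROGRAMMING_EXTENSIONS.contains(&ext))
}

/// Combines both filters: count only programming files outside skipped dirs.
fn should_count(path: &Path) -> bool {
    is_programming_file(path) && !is_in_skipped_dir(path)
}
```

With these assumptions, `src/lib.rs` would be counted, while `tests/integration.rs`, `examples/demo.rs` and `README.md` would be skipped.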

This crate provides the foundation for adding SLoC metrics to crates.io by offering a clean, testable interface for analyzing source code statistics.

This adds the `linecounts` field to both the `Version` struct and `NewVersion` builder. The field stores linecount data as `JSON`, following the established pattern for flexible schema evolution without requiring additional migrations.

The `linecounts` field is optional to handle existing versions that don't have this data, and will be populated for new versions during the publish process. This design ensures backward compatibility while enabling source code metrics for future crate versions.

This enhances the tarball processing pipeline to include SLoC analysis by adding the `crates_io_linecount` dependency to the tarball processing crate and extending the `TarballInfo` struct with a `linecount_stats` field.

The analysis runs during tarball file processing: each qualifying source file is analyzed and its statistics are accumulated. All tarball processing test snapshots are updated to include linecount data, covering a range of crate structures.

The integration preserves existing functionality while adding minimal overhead to the tarball validation and processing pipeline.
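The per-file accumulation described in this commit can be sketched as follows. The type names `LinecountStats` and `LanguageStats` come from the commit messages above, but every field name, the `add_file` method and the naive `count_lines` helper are assumptions for illustration; the actual crate relies on `tokei` for counting, and its data layout may differ.

```rust
use std::collections::BTreeMap;

/// Per-language totals. The field names are assumptions, not the PR's schema.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct LanguageStats {
    code_lines: u64,
    comment_lines: u64,
    files: u64,
}

/// Aggregated statistics for one crate version (hypothetical layout).
#[derive(Debug, Default, PartialEq, Eq)]
struct LinecountStats {
    languages: BTreeMap<String, LanguageStats>,
    total_code_lines: u64,
    total_comment_lines: u64,
}

impl LinecountStats {
    /// Adds the counts of a single file to the per-language and overall totals.
    fn add_file(&mut self, language: &str, code_lines: u64, comment_lines: u64) {
        let entry = self.languages.entry(language.to_string()).or_default();
        entry.code_lines += code_lines;
        entry.comment_lines += comment_lines;
        entry.files += 1;
        self.total_code_lines += code_lines;
        self.total_comment_lines += comment_lines;
    }
}

/// Naive stand-in for a real line counter: blank lines are ignored, lines
/// starting with `//` count as comments, and everything else counts as code.
fn count_lines(source: &str) -> (u64, u64) {
    let mut code = 0;
    let mut comments = 0;
    for line in source.lines().map(str::trim) {
        if line.is_empty() {
            continue;
        }
        if line.starts_with("//") {
            comments += 1;
        } else {
            code += 1;
        }
    }
    (code, comments)
}
```

A tarball loop would call `count_lines` (or a real counter) for each qualifying file and feed the result into `add_file`, so a single `LinecountStats` value holds the totals once the last entry has been read.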
@Turbo87 added the C-enhancement ✨ (Category: Adding new behavior or a change to the way an existing feature works) and A-backend ⚙️ labels Jun 27, 2025
@Turbo87 moved this to For next meeting in crates.io team meetings Jun 27, 2025

This modifies the publish endpoint to store linecount statistics: it takes the linecount data from the tarball processing results, serializes the stats to `JSON`, and passes them to the `NewVersion` builder for persistence.

All publish-related test snapshots are updated to include linecount data, covering a range of publish scenarios. The implementation stays backward compatible: the linecount value remains null in edge cases.
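To illustrate how an optional value can flow through a builder and end up as null, here is a small standard-library sketch. `NewVersionSketch` and `NewVersionSketchBuilder` are hypothetical, simplified stand-ins and not the `NewVersion` builder from the crates.io codebase; storing the JSON as a `String` is also an assumption, since the real column is `JSONB` and real code would likely use a JSON value type.

```rust
/// Hypothetical, simplified stand-in for a new version record.
#[derive(Debug, PartialEq)]
struct NewVersionSketch {
    num: String,
    // `None` maps to a null column value; `Some` holds serialized JSON text.
    linecounts: Option<String>,
}

/// Hypothetical builder: `linecounts` defaults to `None` unless it is set.
struct NewVersionSketchBuilder {
    num: String,
    linecounts: Option<String>,
}

impl NewVersionSketchBuilder {
    fn new(num: &str) -> Self {
        Self {
            num: num.to_string(),
            linecounts: None,
        }
    }

    /// Attaches already serialized linecount JSON to the version.
    fn linecounts(mut self, json: impl Into<String>) -> Self {
        self.linecounts = Some(json.into());
        self
    }

    fn build(self) -> NewVersionSketch {
        NewVersionSketch {
            num: self.num,
            linecounts: self.linecounts,
        }
    }
}
```

In this sketch a publish that produced statistics calls `.linecounts(...)` before `.build()`, while any path that skips the call leaves the field as `None`.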