Only include rules with diagnostics in SARIF metadata #13268

RussellLuo · 2024-09-06T12:02:38Z

This PR can be considered as a follow-up of #13180.

Motivation

The current Ruff's SARIF output contains all the 800+ rules, the vast majority of which have no relevance to the errors in results. IMO, this results in a large output and a lot of noise.

Proposal

Per the SARIF example, other linters such as ESLint only emits rules relevant to the errors in results. I think this could be an approach worth considering.

{
  "version": "2.1.0",
  "$schema": "http://json.schemastore.org/sarif-2.1.0-rtm.4",
  "runs": [
    {
      "tool": {
        "driver": {
          "name": "ESLint",
          "informationUri": "https://eslint.org",
          "rules": [
            {
              "id": "no-unused-vars",
              "shortDescription": {
                "text": "disallow unused variables"
              },
              "helpUri": "https://eslint.org/docs/rules/no-unused-vars",
              "properties": {
                "category": "Variables"
              }
            }
          ]
        }
      },
      "artifacts": [
        {
          "location": {
            "uri": "file:///C:/dev/sarif/sarif-tutorials/samples/Introduction/simple-example.js"
          }
        }
      ],
      "results": [
        {
          "level": "error",
          "message": {
            "text": "'x' is assigned a value but never used."
          },
          "locations": [
            {
              "physicalLocation": {
                "artifactLocation": {
                  "uri": "file:///C:/dev/sarif/sarif-tutorials/samples/Introduction/simple-example.js",
                  "index": 0
                },
                "region": {
                  "startLine": 1,
                  "startColumn": 5
                }
              }
            }
          ],
          "ruleId": "no-unused-vars",
          "ruleIndex": 0
        }
      ]
    }
  ]
}

github-actions · 2024-09-06T12:22:49Z

`ruff-ecosystem` results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

MichaReiser

Thanks for the follow up.

I'm not sure if reducing the rules to the reported tools is the right behavior. According to the spec

A toolComponent object MAY contain a property named rules whose value is an array of zero or more unique (§3.7.3) reportingDescriptor objects (§3.49) each of which provides information about an analysis rule supported by the tool component.

The way I understand this is that the rules table should list all rules supported by the tool and not just rules seen in this analysis run. Do you have a resource that states if it is okay for the rules table to only include the reported rules?

RussellLuo · 2024-09-09T00:00:56Z

@MichaReiser Thanks for the reply!

I have found two resources that might be relevant:

Rule metadata

Rule metadata is optional. An analysis tool can choose not to include it at all, to include metadata for only those rules that are relevant to the results, or to include metadata for all rules known to the tool.13
Appendix E. (Informative) Locating rule and notification metadata

The SARIF format allows rule and notification metadata to be included in a SARIF log file (see §3.19.23 and §3.19.24). A SARIF log file does not need to include any metadata. This raises the questions of when metadata should be included in a log file, and how to locate the metadata if it is not included in the log file.

Metadata should be included in a log file in the following circumstances:

· The log file is intended to be viewed in a tool such as a log file viewer that needs to display metadata related to each result or notification even when the tool is not connected to a network.

· The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

· Neither of the above applies, but the increased log file size due to the metadata is not considered significant.

To summarize, including only the rules relevant to the results or including all rules is okay, but IMO the latter would result in a large output.

MichaReiser

Thanks. These two sections are extremely helpful.

From the Appendix E:

Metadata should be included in a log file in the following circumstances:

Our challenge is that we can't make assumptions about the use case and I don't fully understand the following scenario:

The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

What do you make of this scenario?

Either way. I'm slightly leaning towards moving forward with this until someone comes up with a use case where they need all metadata.

MichaReiser · 2024-09-09T11:26:43Z

crates/ruff_linter/src/message/sarif.rs

@@ -27,6 +28,13 @@ impl Emitter for SarifEmitter {
            .map(SarifResult::from_message)
            .collect::<Result<Vec<_>>>()?;

+        let mut rule_ids = HashSet::new();


We can simplify this to

Suggested change

let mut rule_ids = HashSet::new();

let unique_rules: HashSet<_> = results.iter().filter_map(|result| result.rule).collect();

let mut rules: Vec<SarifRule> = unique_rules.into_iter().map(SarifRule::from).collect();

rules.sort_by(|a, b| a.code.cmp(&b.code));

and then use rules on line 46. This avoids allocating unnecessary Strings

Wow, this is fantastic!

(This PR is my second time writing Rust code. Learned a lot!)

RussellLuo · 2024-09-09T12:42:53Z

What do you make of this scenario?

To answer this question, let me analyze the second section (from spec) step by step.

The log file is intended to be viewed in a tool such as a log file viewer that needs to display metadata related to each result or notification even when the tool is not connected to a network.

Regarding "display metadata related to each result", my understanding is that rules are primarily prepared for the results.

The log file is intended to be uploaded to a result management system which requires information about every rule specified by every result, and which might not have prior knowledge of the rules specified by the results in this log file.

From "requires information about every rule specified by every result", again, I think rules are mainly used as additional details about the corresponding results.

Neither of the above applies, but the increased log file size due to the metadata is not considered significant.

I'm currently using Ruff through MegaLinter, which provides a webhook mechanism to send the SARIF output. Therefore, I'm concerned about the size of the HTTP request body (of the webhook).

MichaReiser · 2024-09-09T13:39:29Z

I'm currently using Ruff through MegaLinter, which provides a webhook mechanism to send the SARIF output. Therefore, I'm concerned about the size of the HTTP request body (of the webhook).

I wouldn't be too concerned about that. I would suspect that the JSON file, even with all rule metadata, is still less than 1 MB, and HTTP compression should bring that down to 100 KB.

RussellLuo · 2024-09-09T14:13:49Z

Given this code snippet:

# test.py
import os

def greet():
    print('hello {name}'.format())

I just performed a quick test:

# Using the official Ruff
$ ruff check --output-format=sarif test.py > official_result.json
$ du -sh official_result.json 
2.0M	official_result.json
# Using my local Ruff
$ ./target/debug/ruff check --output-format=sarif test.py > local_result.json
$ du -sh local_result.json   
8.0K	local_result.json

Overall, I agree with you that the size of the request body is not a big deal (although smaller is perhaps better). To my understanding of the spec, as described above, I think the rules are primarily used in conjunction with results. I'm not sure if there are scenarios where all rules are needed.

MichaReiser · 2024-09-09T21:24:28Z

Thanks for making this contribution. This does sound reasonable to me and we can always revert back to the previous behavior if there's an actual use case for it.

## Summary Follow-up from #13268, this PR updates the test case to use `assert_snapshot` now that the output is limited to only include the rules with diagnostics. ## Test Plan `cargo insta test`

RussellLuo added 2 commits September 6, 2024 19:40

Remove unnecessary SARIF rules

9e8e6b9

Tweaks

325b534

MichaReiser reviewed Sep 8, 2024

View reviewed changes

MichaReiser added the cli Related to the command-line interface label Sep 8, 2024

MichaReiser approved these changes Sep 9, 2024

View reviewed changes

Simplify code as suggested by Micha

c5d7f1d

MichaReiser changed the title ~~Remove unnecessary SARIF rules~~ Only include rules with diagnostics in SARIF metadata Sep 9, 2024

MichaReiser merged commit 5ef6979 into astral-sh:main Sep 9, 2024
20 checks passed

RussellLuo deleted the remove-unnecessary-sarif-rules branch September 10, 2024 00:36

dhruvmanila mentioned this pull request Sep 13, 2024

Create insta snapshot for SARIF output #13345

Merged

BrewTestBot mentioned this pull request Sep 13, 2024

ruff 0.6.5 Homebrew/homebrew-core#189287

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Only include rules with diagnostics in SARIF metadata #13268

Only include rules with diagnostics in SARIF metadata #13268

RussellLuo commented Sep 6, 2024

github-actions bot commented Sep 6, 2024 •

edited

Loading

MichaReiser left a comment •

edited

Loading

RussellLuo commented Sep 9, 2024

MichaReiser left a comment

MichaReiser Sep 9, 2024

RussellLuo Sep 9, 2024

RussellLuo commented Sep 9, 2024

MichaReiser commented Sep 9, 2024

RussellLuo commented Sep 9, 2024

MichaReiser commented Sep 9, 2024

-        let mut rule_ids = HashSet::new();
+				let unique_rules: HashSet<_> = results.iter().filter_map(|result| result.rule).collect();
+        let mut rules: Vec<SarifRule> = unique_rules.into_iter().map(SarifRule::from).collect();
+        rules.sort_by(|a, b| a.code.cmp(&b.code));

Only include rules with diagnostics in SARIF metadata #13268

Only include rules with diagnostics in SARIF metadata #13268

Conversation

RussellLuo commented Sep 6, 2024

Motivation

Proposal

github-actions bot commented Sep 6, 2024 • edited Loading

ruff-ecosystem results

Linter (stable)

Linter (preview)

MichaReiser left a comment • edited Loading

Choose a reason for hiding this comment

RussellLuo commented Sep 9, 2024

MichaReiser left a comment

Choose a reason for hiding this comment

MichaReiser Sep 9, 2024

Choose a reason for hiding this comment

RussellLuo Sep 9, 2024

Choose a reason for hiding this comment

RussellLuo commented Sep 9, 2024

MichaReiser commented Sep 9, 2024

RussellLuo commented Sep 9, 2024

MichaReiser commented Sep 9, 2024

github-actions bot commented Sep 6, 2024 •

edited

Loading

`ruff-ecosystem` results

MichaReiser left a comment •

edited

Loading