fix(cli): Allow diff scans to happen in dirty repos #7973

ajbt200128 · 2023-06-06T22:50:02Z

Users can now run semgrep <command> --baseline-commit COMMIT in an unclean working tree. Semgrep will now scan staged files in baseline mode, in addition to its normal functionality.

We probably want to add to the documentation somewhere that users should be able to put
ci --baseline-commit HEAD as args for pre-commit, to scan block based on CI rules and if they are in the blocking column. This will also report findings to the app, so --dry-run can be added to not do this.

This PR also adds a new pre-commit hook semgrep-ci which runs semgrep w/ the above args

PR checklist:

Purpose of the code is evident to future readers
Tests included or PR comment includes a reproducible test plan
Documentation is up-to-date
A changelog entry was added to changelog.d for any user-facing change
Change has no security implications (otherwise, ping security team)

If you're unsure about any of this, please see:

linear · 2023-06-06T22:50:04Z

PA-2795 semgrep ci scans staged changes

During a diff scan with a pre-commit hook - semgrep ci reports the following error:
[ERROR] Found pending changes in tracked files. Baseline scans runs require a clean git state.

This will always be the case for "pre-commit" scans.

Can we include the staged changes in the scan?

github-actions · 2023-06-06T23:04:09Z

🚫 The whole benchmark suite is too slow: 10.2% ( 1.102 s)

14 benchmarks, 10.2% slower on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

emjin · 2023-06-12T18:21:38Z

There was a thread earlier where someone posted about a tool they made to do something similar. Would be great to respond in that thread once this is out to update them! https://semgrep.slack.com/archives/C01LNF69F9B/p1682841222100279

emjin

Thanks for the tests! I'm approving this from my knowledge, but it might be worth checking with @underyx and @brendongo if there are any potentially bad states that we should be testing. For instance, I don't know if there are any places where making a temporary directory could be problematic.

I would also appreciate some profiling of baseline scanning on large repos. You can clone https://github.com/returntocorp/examples-annotations or https://github.com/returntocorp/stress-test-monorepo for the purpose. If you do the former, examples-annotations/repos/java/SQLi/cve-repos/OpenClinica or the same path with dotCMS are both good examples. I'm concerned that this will slow down developers who work on monorepos. Depending on the results of that profiling, we might want to think about maintaining a faster alternative

.pre-commit-hooks.yaml

cli/tests/e2e/snapshots/test_baseline/test_staged_changes/error.txt

cli/src/semgrep/git.py

ajbt200128 · 2023-06-13T18:44:28Z

I would also appreciate some profiling of baseline scanning on large repos.

Tried examples-annotations

examples-annotations/repos/java/SQLi/cve-repos/OpenClinica
- v1.25.0 1.30s user 0.21s system 43% cpu 3.495 total
- this branch 1.18s user 0.46s system 40% cpu 3.986 total
.../dotCMS
- v1.25.0 1.28s user 0.22s system 48% cpu 3.083 total
- this branch 1.18s user 0.45s system 52% cpu 3.121 total

This makes sense to what I expected. We're making a new directory copying over everything, so the extra overhead is coming from the system calls for those. My guess is that the warning about the benchmarks above is probably safe to ignore, as for small repos (the tests above) the time spent scanning <<< time spent in system calls, but for large repos it's the opposite

emjin · 2023-06-13T20:46:06Z

The benchmark message is happening in a bunch of repos :( We must have introduced a slight slowdown earlier

emjin · 2023-06-13T20:50:42Z

Hm, ok, that's not awful, though definitely not great. I'm not a fan of waiting 3 or 4 seconds for pre-commit to complete. And it's quite common for our customers to have monorepos. I don't see an obvious solution, but it's definitely something I expect we'll have to revisit. Maybe with osemgrep we can copy only the files that are affected, for example.

ajbt200128 · 2023-06-13T21:15:10Z

Yea, I think with osemgrep there'll be a lot more options for speeding this up, but for now this feels fine to me

aryx · 2023-06-14T07:47:38Z

yes I think git worktree is a pretty expensive operation on a big monorepo that we want to avoid.
What we could do instead is just iterate over all the files modified between the current state and the baseline, get their old content (there is a command in git to show the content of a file at a certain SHA), and run semgrep on that file.

Users can now run `semgrep <command> --baseline-commit COMMIT` in an unclean working tree. Semgrep will now scan staged files in baseline mode, in addition to its normal functionality. We probably want to add to the documentation somewhere that users should be able to put `ci --baseline-commit HEAD` as args for pre-commit, to scan block based on CI rules and if they are in the blocking column. This will also report findings to the app, so `--dry-run` can be added to not do this. This PR also adds a new pre-commit hook `semgrep-ci` which runs semgrep w/ the above args PR checklist: - [X] Purpose of the code is [evident to future readers](https://semgrep.dev/docs/contributing/contributing-code/#explaining-code) - [X] Tests included or PR comment includes a reproducible test plan - [ ] Documentation is up-to-date - [X] A changelog entry was [added to changelog.d](https://semgrep.dev/docs/contributing/contributing-code/#adding-a-changelog-entry) for any user-facing change - [X] Change has no security implications (otherwise, ping security team) If you're unsure about any of this, please see: - [Contribution guidelines](https://semgrep.dev/docs/contributing/contributing-code)! - [One of the more specific guides located here](https://semgrep.dev/docs/contributing/contributing/)

aryx · 2023-06-28T11:30:03Z

@ajbt200128 @milanwilliams you might want to comment more on
navhits/semgrep-precommit#1

ajbt200128 added 2 commits June 6, 2023 18:36

Allow diff scans to happen in dirty repos

bf3a91d

Fix changelog

9756832

ajbt200128 requested review from aryx and underyx June 6, 2023 22:50

ajbt200128 changed the title ~~Allow diff scans to happen in dirty repos~~ fix(cli): Allow diff scans to happen in dirty repos Jun 6, 2023

Add new pre-commit hook

8e386af

linear bot mentioned this pull request Jun 7, 2023

Add semgrep-ci example for pre-commit semgrep/semgrep-docs#1048

Merged

5 tasks

ajbt200128 requested review from a team, IagoAbal and emjin and removed request for a team June 7, 2023 17:43

emjin approved these changes Jun 12, 2023

View reviewed changes

Modify comments log output

73655bc

ajbt200128 merged commit a6ae72f into develop Jun 13, 2023

ajbt200128 deleted the austin/pa-2795 branch June 13, 2023 21:15

aryx mentioned this pull request Jun 28, 2023

Still needed after latest semgrep? navhits/semgrep-precommit#1

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(cli): Allow diff scans to happen in dirty repos #7973

fix(cli): Allow diff scans to happen in dirty repos #7973

ajbt200128 commented Jun 6, 2023 •

edited

Loading

linear bot commented Jun 6, 2023

github-actions bot commented Jun 6, 2023

emjin commented Jun 12, 2023

emjin left a comment

ajbt200128 commented Jun 13, 2023

emjin commented Jun 13, 2023 •

edited

Loading

emjin commented Jun 13, 2023

ajbt200128 commented Jun 13, 2023

aryx commented Jun 14, 2023

aryx commented Jun 28, 2023

fix(cli): Allow diff scans to happen in dirty repos #7973

fix(cli): Allow diff scans to happen in dirty repos #7973

Conversation

ajbt200128 commented Jun 6, 2023 • edited Loading

linear bot commented Jun 6, 2023

github-actions bot commented Jun 6, 2023

emjin commented Jun 12, 2023

emjin left a comment

Choose a reason for hiding this comment

ajbt200128 commented Jun 13, 2023

emjin commented Jun 13, 2023 • edited Loading

emjin commented Jun 13, 2023

ajbt200128 commented Jun 13, 2023

aryx commented Jun 14, 2023

aryx commented Jun 28, 2023

ajbt200128 commented Jun 6, 2023 •

edited

Loading

emjin commented Jun 13, 2023 •

edited

Loading