-
Notifications
You must be signed in to change notification settings - Fork 624
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(cli): Allow diff scans to happen in dirty repos #7973
Conversation
PA-2795 semgrep ci scans staged changes
During a diff scan with a pre-commit hook - semgrep ci reports the following error: This will always be the case for "pre-commit" scans. Can we include the staged changes in the scan? |
🚫 The whole benchmark suite is too slow: 10.2% ( 1.102 s) 14 benchmarks, 10.2% slower on average. Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details'). |
There was a thread earlier where someone posted about a tool they made to do something similar. Would be great to respond in that thread once this is out to update them! https://semgrep.slack.com/archives/C01LNF69F9B/p1682841222100279 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the tests! I'm approving this from my knowledge, but it might be worth checking with @underyx and @brendongo if there are any potentially bad states that we should be testing. For instance, I don't know if there are any places where making a temporary directory could be problematic.
I would also appreciate some profiling of baseline scanning on large repos. You can clone https://github.com/returntocorp/examples-annotations or https://github.com/returntocorp/stress-test-monorepo for the purpose. If you do the former, examples-annotations/repos/java/SQLi/cve-repos/OpenClinica
or the same path with dotCMS
are both good examples. I'm concerned that this will slow down developers who work on monorepos. Depending on the results of that profiling, we might want to think about maintaining a faster alternative
cli/tests/e2e/snapshots/test_baseline/test_staged_changes/error.txt
Outdated
Show resolved
Hide resolved
cli/tests/e2e/snapshots/test_baseline/test_staged_changes/error.txt
Outdated
Show resolved
Hide resolved
Tried examples-annotations
This makes sense to what I expected. We're making a new directory copying over everything, so the extra overhead is coming from the system calls for those. My guess is that the warning about the benchmarks above is probably safe to ignore, as for small repos (the tests above) the time spent scanning <<< time spent in system calls, but for large repos it's the opposite |
The benchmark message is happening in a bunch of repos :( We must have introduced a slight slowdown earlier |
Hm, ok, that's not awful, though definitely not great. I'm not a fan of waiting 3 or 4 seconds for pre-commit to complete. And it's quite common for our customers to have monorepos. I don't see an obvious solution, but it's definitely something I expect we'll have to revisit. Maybe with osemgrep we can copy only the files that are affected, for example. |
Yea, I think with osemgrep there'll be a lot more options for speeding this up, but for now this feels fine to me |
yes I think git worktree is a pretty expensive operation on a big monorepo that we want to avoid. |
Users can now run `semgrep <command> --baseline-commit COMMIT` in an unclean working tree. Semgrep will now scan staged files in baseline mode, in addition to its normal functionality. We probably want to add to the documentation somewhere that users should be able to put `ci --baseline-commit HEAD` as args for pre-commit, to scan block based on CI rules and if they are in the blocking column. This will also report findings to the app, so `--dry-run` can be added to not do this. This PR also adds a new pre-commit hook `semgrep-ci` which runs semgrep w/ the above args PR checklist: - [X] Purpose of the code is [evident to future readers](https://semgrep.dev/docs/contributing/contributing-code/#explaining-code) - [X] Tests included or PR comment includes a reproducible test plan - [ ] Documentation is up-to-date - [X] A changelog entry was [added to changelog.d](https://semgrep.dev/docs/contributing/contributing-code/#adding-a-changelog-entry) for any user-facing change - [X] Change has no security implications (otherwise, ping security team) If you're unsure about any of this, please see: - [Contribution guidelines](https://semgrep.dev/docs/contributing/contributing-code)! - [One of the more specific guides located here](https://semgrep.dev/docs/contributing/contributing/)
@ajbt200128 @milanwilliams you might want to comment more on |
Users can now run
semgrep <command> --baseline-commit COMMIT
in an unclean working tree. Semgrep will now scan staged files in baseline mode, in addition to its normal functionality.We probably want to add to the documentation somewhere that users should be able to put
ci --baseline-commit HEAD
as args for pre-commit, to scan block based on CI rules and if they are in the blocking column. This will also report findings to the app, so--dry-run
can be added to not do this.This PR also adds a new pre-commit hook
semgrep-ci
which runs semgrep w/ the above argsPR checklist:
If you're unsure about any of this, please see: