Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(cli): Allow diff scans to happen in dirty repos #7973

Merged
merged 4 commits into from
Jun 13, 2023
Merged

Conversation

ajbt200128
Copy link
Collaborator

@ajbt200128 ajbt200128 commented Jun 6, 2023

Users can now run semgrep <command> --baseline-commit COMMIT in an unclean working tree. Semgrep will now scan staged files in baseline mode, in addition to its normal functionality.

We probably want to add to the documentation somewhere that users should be able to put
ci --baseline-commit HEAD as args for pre-commit, to scan block based on CI rules and if they are in the blocking column. This will also report findings to the app, so --dry-run can be added to not do this.

This PR also adds a new pre-commit hook semgrep-ci which runs semgrep w/ the above args

PR checklist:

  • Purpose of the code is evident to future readers
  • Tests included or PR comment includes a reproducible test plan
  • Documentation is up-to-date
  • A changelog entry was added to changelog.d for any user-facing change
  • Change has no security implications (otherwise, ping security team)

If you're unsure about any of this, please see:

@ajbt200128 ajbt200128 requested review from aryx and underyx June 6, 2023 22:50
@linear
Copy link

linear bot commented Jun 6, 2023

PA-2795 semgrep ci scans staged changes

During a diff scan with a pre-commit hook - semgrep ci reports the following error:
[ERROR] Found pending changes in tracked files. Baseline scans runs require a clean git state.

This will always be the case for "pre-commit" scans.

Can we include the staged changes in the scan?

@ajbt200128 ajbt200128 changed the title Allow diff scans to happen in dirty repos fix(cli): Allow diff scans to happen in dirty repos Jun 6, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Jun 6, 2023

🚫 The whole benchmark suite is too slow: 10.2% ( 1.102 s)

14 benchmarks, 10.2% slower on average.

Individual deviations greater than 20% from the baseline are reported. An individual performance degradation of over 30% or a global degradation of over 7% is an error and will block the pull request. See run output for full results ('Show all checks' > 'Tests / semgrep benchmark tests' 'Details').

@ajbt200128 ajbt200128 requested review from a team, IagoAbal and emjin and removed request for a team June 7, 2023 17:43
@emjin
Copy link
Collaborator

emjin commented Jun 12, 2023

There was a thread earlier where someone posted about a tool they made to do something similar. Would be great to respond in that thread once this is out to update them! https://semgrep.slack.com/archives/C01LNF69F9B/p1682841222100279

Copy link
Collaborator

@emjin emjin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the tests! I'm approving this from my knowledge, but it might be worth checking with @underyx and @brendongo if there are any potentially bad states that we should be testing. For instance, I don't know if there are any places where making a temporary directory could be problematic.

I would also appreciate some profiling of baseline scanning on large repos. You can clone https://github.com/returntocorp/examples-annotations or https://github.com/returntocorp/stress-test-monorepo for the purpose. If you do the former, examples-annotations/repos/java/SQLi/cve-repos/OpenClinica or the same path with dotCMS are both good examples. I'm concerned that this will slow down developers who work on monorepos. Depending on the results of that profiling, we might want to think about maintaining a faster alternative

.pre-commit-hooks.yaml Outdated Show resolved Hide resolved
cli/src/semgrep/git.py Show resolved Hide resolved
cli/src/semgrep/git.py Show resolved Hide resolved
@ajbt200128
Copy link
Collaborator Author

I would also appreciate some profiling of baseline scanning on large repos.

Tried examples-annotations

  • examples-annotations/repos/java/SQLi/cve-repos/OpenClinica
    • v1.25.0 1.30s user 0.21s system 43% cpu 3.495 total
    • this branch 1.18s user 0.46s system 40% cpu 3.986 total
  • .../dotCMS
    • v1.25.0 1.28s user 0.22s system 48% cpu 3.083 total
    • this branch 1.18s user 0.45s system 52% cpu 3.121 total

This makes sense to what I expected. We're making a new directory copying over everything, so the extra overhead is coming from the system calls for those. My guess is that the warning about the benchmarks above is probably safe to ignore, as for small repos (the tests above) the time spent scanning <<< time spent in system calls, but for large repos it's the opposite

@emjin
Copy link
Collaborator

emjin commented Jun 13, 2023

The benchmark message is happening in a bunch of repos :( We must have introduced a slight slowdown earlier

@emjin
Copy link
Collaborator

emjin commented Jun 13, 2023

Hm, ok, that's not awful, though definitely not great. I'm not a fan of waiting 3 or 4 seconds for pre-commit to complete. And it's quite common for our customers to have monorepos. I don't see an obvious solution, but it's definitely something I expect we'll have to revisit. Maybe with osemgrep we can copy only the files that are affected, for example.

@ajbt200128
Copy link
Collaborator Author

Yea, I think with osemgrep there'll be a lot more options for speeding this up, but for now this feels fine to me

@ajbt200128 ajbt200128 merged commit a6ae72f into develop Jun 13, 2023
@ajbt200128 ajbt200128 deleted the austin/pa-2795 branch June 13, 2023 21:15
@aryx
Copy link
Collaborator

aryx commented Jun 14, 2023

yes I think git worktree is a pretty expensive operation on a big monorepo that we want to avoid.
What we could do instead is just iterate over all the files modified between the current state and the baseline, get their old content (there is a command in git to show the content of a file at a certain SHA), and run semgrep on that file.

yosefAlsuhaibani pushed a commit that referenced this pull request Jun 21, 2023
Users can now run `semgrep <command> --baseline-commit COMMIT` in an
unclean working tree. Semgrep will now scan staged files in baseline
mode, in addition to its normal functionality.

We probably want to add to the documentation somewhere that users should
be able to put
`ci --baseline-commit HEAD` as args for pre-commit, to scan   block
based on CI rules and if they are in the blocking column. This will also
report findings to the app, so `--dry-run` can be added to not do this.

This PR also adds a new pre-commit hook `semgrep-ci` which runs semgrep
w/ the above args

PR checklist:

- [X] Purpose of the code is [evident to future
readers](https://semgrep.dev/docs/contributing/contributing-code/#explaining-code)
- [X] Tests included or PR comment includes a reproducible test plan
- [ ] Documentation is up-to-date
- [X] A changelog entry was [added to
changelog.d](https://semgrep.dev/docs/contributing/contributing-code/#adding-a-changelog-entry)
for any user-facing change
- [X] Change has no security implications (otherwise, ping security
team)

If you're unsure about any of this, please see:

- [Contribution
guidelines](https://semgrep.dev/docs/contributing/contributing-code)!
- [One of the more specific guides located
here](https://semgrep.dev/docs/contributing/contributing/)
@aryx
Copy link
Collaborator

aryx commented Jun 28, 2023

@ajbt200128 @milanwilliams you might want to comment more on
navhits/semgrep-precommit#1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants