FEAT Single turn crescendo #388

Open
romanlutz opened this issue Sep 20, 2024 · 2 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@romanlutz
Contributor

Is your feature request related to a problem? Please describe.

We don't support single-turn Crescendo yet. This should be added.

Paper: https://arxiv.org/pdf/2409.03131v1

GitHub repo (results only): https://github.com/alanaqrawi/STCA

Describe the solution you'd like

The tricky part is that the conversation looks very different for every goal/objective (e.g., "how to create a molotov cocktail"). We'll need to generate the entire conversation up to the n-th step and then put it into a single prompt. The assumption has to be that the attack target is a single-turn target (otherwise we can just use "normal" Crescendo), so the red teaming LLM has to generate both sides of the conversation. An alternative (mentioned by Alan, the author of the paper) is to run full Crescendo, keep the questions and responses, and then put them in a single prompt. That may or may not be possible in actual operations (and definitely not with single-turn targets).

Importantly, n should be configurable. The paper has some discussion of that, and we probably want to be flexible.
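A minimal sketch of the "put that into a single prompt" step with a configurable n, assuming the conversation is already available as a list of question/response pairs. The names `Turn` and `flatten_conversation`, the `User:`/`Assistant:` labels, and the idea of appending a final unanswered question are all hypothetical; none of them come from PyRIT or the paper, and the real prompt layout would have to follow whatever works in practice.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One simulated exchange (hypothetical type, not a PyRIT class)."""
    user: str
    assistant: str


def flatten_conversation(turns: list[Turn], final_question: str, n: int) -> str:
    """Pack the first n simulated turns plus a final question into one prompt.

    The role labels and the trailing final question are assumptions made
    for illustration, not a format taken from the paper.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    lines = []
    # n is configurable; if n exceeds len(turns), all available turns are used.
    for turn in turns[:n]:
        lines.append(f"User: {turn.user}")
        lines.append(f"Assistant: {turn.assistant}")
    lines.append(f"User: {final_question}")
    return "\n".join(lines)
```

Keeping n as a plain argument would make it easy to sweep over several values in tests or in the notebook.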

The final solution needs to have tests and a simple notebook (like all orchestrators). There's some freedom in how to do this, depending on how the conversation generation works best:

  • custom orchestrator: first generate the conversation (which may take a single step or multiple), then send it to the target
  • converter: the converter generates the conversation from just the goal [It's not a typical converter, though...]
  • another way? If this is chosen, please discuss here with the dev team first.
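To make the first option concrete, here is a rough sketch of the two-step flow (generate, then send once). It does not use PyRIT's orchestrator, converter, or target classes; `run_single_turn`, `generate_conversation`, and `send_to_target` are hypothetical names, and both callables are stand-ins for the red teaming LLM and the single-turn target.

```python
from typing import Callable, List, Tuple

# Hypothetical signatures, used only for this sketch.
ConversationGenerator = Callable[[str, int], List[Tuple[str, str]]]
SingleTurnTarget = Callable[[str], str]


def run_single_turn(
    objective: str,
    n: int,
    generate_conversation: ConversationGenerator,
    send_to_target: SingleTurnTarget,
) -> str:
    """Generate an n-step conversation, pack it into one prompt, send it once."""
    # Step 1: the red teaming LLM produces both sides of the conversation.
    pairs = generate_conversation(objective, n)
    # Step 2: flatten the (question, response) pairs into a single prompt.
    # The "User:"/"Assistant:" labels are an assumption, not a fixed format.
    prompt = "\n".join(
        f"User: {question}\nAssistant: {response}" for question, response in pairs
    )
    # Step 3: exactly one request goes to the single-turn target.
    return send_to_target(prompt)
```

Passing the generator and the target in as callables keeps the flow testable with fakes, which may help with the test requirement above.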

Describe alternatives you've considered, if relevant

Alternatively, one could pre-generate such single-turn Crescendo templates for hundreds of goals, but that will never be comprehensive.

Additional context

One tricky aspect is that the generated responses need to be somewhat similar to how the target model actually responds. Otherwise, it may get "suspicious" (not trying to anthropomorphize here, but it's the simplest way to explain what I mean) and refuse to comply.

@romanlutz romanlutz added enhancement New feature or request help wanted Extra attention is needed labels Sep 20, 2024
@roeybc
Contributor

roeybc commented Sep 20, 2024

Hey! I'm up for it!

@alanaqrawi

Let me know if you have questions, folks.


3 participants