FEAT Single turn crescendo #388

romanlutz · 2024-09-20T21:11:26Z

Is your feature request related to a problem? Please describe.

We don't support single-turn Crescendo yet. It should be added.

Paper: https://arxiv.org/pdf/2409.03131v1

GitHub repo (results only): https://github.com/alanaqrawi/STCA

Describe the solution you'd like

The tricky part is that the conversation looks very different for every goal/objective (e.g., "how to create a molotov cocktail"). We need to generate the entire conversation up to the n-th step and then put it into a single prompt. The assumption has to be that the attack target is a single-turn target; otherwise we can just use "normal" Crescendo. So the red teaming LLM has to generate both sides of the conversation.

An alternative (mentioned by Alan, the author of the paper) is to run full Crescendo, keep the questions and responses, and then put them into a single prompt. That may or may not be possible in actual operations, and it is definitely not possible with single-turn targets.

Importantly, n (the number of generated conversation steps) should be configurable. The paper discusses the choice of n, and we probably want to stay flexible.
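To make the "put the conversation into a single prompt" step concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the function name `build_single_turn_prompt`, the `Turn` alias, and the "User:"/"Assistant:" labels are hypothetical and are not taken from the paper or from any existing orchestrator. It only shows how a configurable n could cap the number of simulated turns before the final request is appended. The example content is deliberately benign.

```python
from typing import List, Tuple

# One (user_message, assistant_message) pair per simulated turn.
# Hypothetical type alias, used only in this sketch.
Turn = Tuple[str, str]


def build_single_turn_prompt(turns: List[Turn], final_request: str, n: int) -> str:
    """Flatten the first n simulated turns plus a final request into one prompt."""
    if n < 0:
        raise ValueError("n must be non-negative")
    lines = []
    for user_msg, assistant_msg in turns[:n]:
        # The role labels are an assumed format; the real template may differ.
        lines.append(f"User: {user_msg}")
        lines.append(f"Assistant: {assistant_msg}")
    # The last user message is left unanswered so the target writes the reply.
    lines.append(f"User: {final_request}")
    return "\n".join(lines)


turns = [
    ("What is sourdough?", "Sourdough is bread leavened with a fermented starter."),
    ("How old is the technique?", "It dates back thousands of years."),
    ("Who popularized it?", "It became well known during the California Gold Rush."),
]
prompt = build_single_turn_prompt(turns, "Summarize what we covered.", n=2)
print(prompt)
```

In a real implementation the `turns` list would come from the red teaming LLM (or from a recorded full Crescendo run), and the rendering format would probably need tuning per target.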

The final solution needs to have tests and a simple notebook (like all orchestrators). There is some freedom in how to do this, depending on how the conversation generation works best:

- custom orchestrator: first generate the conversation (which may take a single step or multiple steps), then send it to the target
- converter: the converter generates the conversation from just the goal (it's not a typical converter, though)
- another way? If this is chosen, please discuss it here with the dev team first.

Describe alternatives you've considered, if relevant

Alternatively, one could pre-generate single-turn Crescendo templates for hundreds of goals, but that will never be comprehensive.

Additional context

One tricky aspect is that the generated responses need to resemble how the target model actually responds. Otherwise, the target may get "suspicious" (not trying to anthropomorphize here, but it's the simplest way to explain what I mean) and refuse to comply.

roeybc · 2024-09-20T21:24:26Z

Hey! I'm up for it!

alanaqrawi · 2024-09-20T22:14:34Z

Let me know if you have questions, folks.

romanlutz added the "enhancement" (New feature or request) and "help wanted" (Extra attention is needed) labels Sep 20, 2024

romanlutz assigned roeybc Sep 20, 2024
