The official code to reproduce the results in the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
The code is divided into sub-packages:
1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization-based attacks such as HotFlip (see the sketch after this list)
3. ./Toxicity Classifier - a classifier labeling sentences as toxic or non-toxic
4. ./Data - data handling
5. ./Resources - resources used by the other packages
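
To make the HotFlip attack concrete, here is a minimal sketch of a single HotFlip step (Ebrahimi et al., 2018), assuming a character-level classifier that accepts one-hot inputs. `model` and the encoding scheme are hypothetical stand-ins for illustration, not the repository's actual interfaces:

```python
# Hedged sketch: one HotFlip step for a character-level model.
# Assumes `model` maps a (1, seq_len, vocab_size) one-hot tensor to logits.
import torch

def hotflip_step(model, one_hot, label):
    """Pick the single character flip that most increases the loss.

    one_hot: (seq_len, vocab_size) float tensor encoding the sentence.
    label:   the current (true) class index.
    Returns (position, new_char_index) of the best estimated flip.
    """
    one_hot = one_hot.clone().detach().requires_grad_(True)
    logits = model(one_hot.unsqueeze(0))                       # (1, num_classes)
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([label]))
    loss.backward()

    grad = one_hot.grad                                        # (seq_len, vocab_size)
    with torch.no_grad():
        # First-order estimate of the loss change for flipping the char at
        # position i from a to b: grad[i, b] - grad[i, a].
        current = (one_hot * grad).sum(dim=1, keepdim=True)    # grad[i, a_i]
        gain = grad - current
        gain[one_hot.bool()] = float("-inf")                   # forbid "flipping" to the same char
        pos = int(gain.max(dim=1).values.argmax())
        new_char = int(gain[pos].argmax())
    return pos, new_char
```

Repeating this step until the classifier's prediction changes yields the flip sequences that the white-box attack produces.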
As shown in the figure below, we train a classifier to predict whether sentences are toxic or non-toxic. We attack this model using a white-box algorithm called HotFlip and distill the knowledge into a second model, DistFlip, which is able to generate attacks in a black-box manner. These attacks generalize well to the Google Perspective API (tested January 2019).
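
The distillation itself can be pictured as supervised imitation: the flips chosen by HotFlip become training targets for a student network, which at inference time needs no gradients from the attacked model. The sketch below illustrates this idea under that assumption; `DistFlipStudent` and its shapes are a hypothetical toy stand-in, not the paper's exact architecture:

```python
# Hedged sketch: training a student to imitate HotFlip's flip decisions.
import torch
import torch.nn as nn

class DistFlipStudent(nn.Module):
    """Toy stand-in: predicts, per position, which character to flip to."""
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(vocab_size, hidden,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, one_hot):                    # (B, seq_len, vocab_size)
        out, _ = self.encoder(one_hot)
        return self.head(out)                      # (B, seq_len, vocab_size)

def train_step(student, optimizer, one_hot, flip_targets):
    """flip_targets: (B, seq_len) char indices chosen by HotFlip,
    with -100 at positions HotFlip left unchanged (ignored by the loss)."""
    logits = student(one_hot)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), flip_targets.view(-1),
        ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student only ever sees (sentence, flip) pairs, the gradient access required by HotFlip is confined to training; the distilled attacker runs fully black-box.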
We used the data from this Kaggle challenge by Jigsaw.
To run data flipping with HotFlip, download the data from Google Drive and unzip it into: ./toxic_fool/resources/data
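
If you prefer to do the extraction in Python, a small helper such as the following works; the archive name `data.zip` is an assumption, as the Google Drive file may be named differently:

```python
# Hedged sketch: extract the downloaded archive into the expected folder.
# "data.zip" is a placeholder for whatever the Google Drive download is called.
import zipfile

with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("./toxic_fool/resources/data")
```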
The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):