The official code to reproduce the results in the NAACL 2019 paper: White-to-Black: Efficient Distillation of Black-Box Adversarial Attacks
The code is divided into sub-packages:
1. ./Agents - learned adversarial attack generators
2. ./Attacks - optimization-based attacks such as HotFlip (see the sketch after this list)
3. ./Toxicity Classifier - a classifier labeling sentences as toxic or non-toxic
4. ./Data - data handling
5. ./Resources - resources used by the other packages
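
To make the HotFlip attack concrete, here is a minimal sketch of a single HotFlip step (Ebrahimi et al., 2018), assuming a character-level classifier that accepts one-hot inputs. `model` and the encoding scheme are hypothetical stand-ins for illustration, not the repository's actual interfaces:

```python
# Hedged sketch: one HotFlip step for a character-level model.
# Assumes `model` maps a (1, seq_len, vocab_size) one-hot tensor to logits.
import torch

def hotflip_step(model, one_hot, label):
    """Pick the single character flip that most increases the loss.

    one_hot: (seq_len, vocab_size) float tensor encoding the sentence.
    label:   the current (true) class index.
    Returns (position, new_char_index) of the best estimated flip.
    """
    one_hot = one_hot.clone().detach().requires_grad_(True)
    logits = model(one_hot.unsqueeze(0))                       # (1, num_classes)
    loss = torch.nn.functional.cross_entropy(logits, torch.tensor([label]))
    loss.backward()

    grad = one_hot.grad                                        # (seq_len, vocab_size)
    with torch.no_grad():
        # First-order estimate of the loss change for flipping the char at
        # position i from a to b: grad[i, b] - grad[i, a].
        current = (one_hot * grad).sum(dim=1, keepdim=True)    # grad[i, a_i]
        gain = grad - current
        gain[one_hot.bool()] = float("-inf")                   # forbid "flipping" to the same char
        pos = int(gain.max(dim=1).values.argmax())
        new_char = int(gain[pos].argmax())
    return pos, new_char
```

Repeating this step until the classifier's prediction changes yields the flip sequences that the white-box attack produces.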
As shown in the figure below, we train a classifier to predict whether sentences are toxic or non-toxic. We attack this model using a white-box algorithm called HotFlip and distill the knowledge into a second model, DistFlip, which is able to generate attacks in a black-box manner. These attacks generalize well to the Google Perspective API (tested January 2019).
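
The distillation itself can be pictured as supervised imitation: the flips chosen by HotFlip become training targets for a student network, which at inference time needs no gradients from the attacked model. The sketch below illustrates this idea under that assumption; `DistFlipStudent` and its shapes are a hypothetical toy stand-in, not the paper's exact architecture:

```python
# Hedged sketch: training a student to imitate HotFlip's flip decisions.
import torch
import torch.nn as nn

class DistFlipStudent(nn.Module):
    """Toy stand-in: predicts, per position, which character to flip to."""
    def __init__(self, vocab_size, hidden=128):
        super().__init__()
        self.encoder = nn.LSTM(vocab_size, hidden,
                               batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, vocab_size)

    def forward(self, one_hot):                    # (B, seq_len, vocab_size)
        out, _ = self.encoder(one_hot)
        return self.head(out)                      # (B, seq_len, vocab_size)

def train_step(student, optimizer, one_hot, flip_targets):
    """flip_targets: (B, seq_len) char indices chosen by HotFlip,
    with -100 at positions HotFlip left unchanged (ignored by the loss)."""
    logits = student(one_hot)
    loss = nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), flip_targets.view(-1),
        ignore_index=-100)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the student only ever sees (sentence, flip) pairs, the gradient access required by HotFlip is confined to training; the distilled attacker runs fully black-box.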
We used the data from this Kaggle challenge by Jigsaw.
To run data flipping with HotFlip, download the data from Google Drive and unzip it into: ./toxic_fool/resources/data
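
If you prefer to do the extraction in Python, a small helper such as the following works; the archive name `data.zip` is an assumption, as the Google Drive file may be named differently:

```python
# Hedged sketch: extract the downloaded archive into the expected folder.
# "data.zip" is a placeholder for whatever the Google Drive download is called.
import zipfile

with zipfile.ZipFile("data.zip") as zf:
    zf.extractall("./toxic_fool/resources/data")
```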
The number of flips needed to change the label of a sentence using the original white-box algorithm and ours (green):