For Continuous Data with binary Targets using the Differentially Private Wasserstein GAN
- DP-WGAN Synthetic Data for "Health care: Heart attack possibility" Kaggle Dataset --> view Notebook
- DP-WGAN Synthetic Data for "BankNote Authentication UCI" Kaggle Dataset --> view Notebook
*After multiple attempts using normalized input data; epsilon ≈ 3.4 and delta = 1e-5
- The data must be in CSV format and must be partitioned into train and test sets before it is fed to the models.
- Missing values are not supported and need to be replaced appropriately by the user before use.
- If the data has both continuous and categorical attributes, it needs to be pre-processed (discretization for continuous values, encoding for categorical attributes).
- The generative GAN-based ML models are trained using the training dataset.
- The generative model is used to create a synthetic version of the train dataset
- To compensate for irregularities, multiple GAN generator models are trained.
- To compensate for irregularities, multiple synthetic datasets are generated, and the best-performing dataset (the one that yields the maximum AUC) is selected.
- Logistic regression classifiers are trained on the real data as well as on the synthetically generated dataset.
- Both classifiers are evaluated on the held-out real test dataset (preserved for evaluation).
- Relevant Metrics (mainly AUC) and visualizations of correlation-matrices of synthetic datasets were generated
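
The selection step in the list above (score each synthetic dataset by the AUC its classifier reaches on the real test set, then keep the best one) can be sketched in plain Python. This is only an illustration, not the code used in the original notebook: the helper names `roc_auc` and `select_best`, the run names, and the scores are all hypothetical, and the scores stand in for the predicted probabilities of logistic regression classifiers trained on each synthetic dataset.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive example is scored
    higher than a random negative one (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def select_best(candidates, y_test):
    """Return (name, auc) of the candidate with the highest AUC on the real test labels."""
    aucs = {name: roc_auc(y_test, scores) for name, scores in candidates.items()}
    best = max(aucs, key=aucs.get)
    return best, aucs[best]


# Hypothetical data: real test labels and, per synthetic dataset, the scores
# predicted on the real test set by a classifier trained on that dataset.
y_test = [0, 0, 1, 1]
candidates = {
    "synthetic_run_1": [0.1, 0.4, 0.35, 0.8],
    "synthetic_run_2": [0.2, 0.3, 0.6, 0.9],
}

best_name, best_auc = select_best(candidates, y_test)
print(best_name, best_auc)
```

The pairwise AUC is quadratic in the number of test rows, which is fine for small datasets; for larger ones a library routine such as scikit-learn's `roc_auc_score` would be the usual choice.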
Major parts of this summary notebook were extracted from this BOREALIS Private Data Generation GitHub repository by BorealisAI. Note that this Jupyter notebook covers only one (DP-WGAN) of the various possible datasets and generative models for differentially private synthetic data generation. The aforementioned analysis approaches yielded the following results, as extracted from the original notebook. For more information regarding the differential-privacy parameters delta and epsilon, please refer to this info page by Microsoft.
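
As a brief reminder of what the two parameters mean, the standard definition of (ε, δ)-differential privacy is stated below. The symbols M (the randomized training mechanism), D and D' (two datasets differing in a single record), and S (any set of possible outputs) are notation introduced here for illustration and do not appear in the original notebook.

$$
\Pr[M(D) \in S] \le e^{\varepsilon} \, \Pr[M(D') \in S] + \delta
$$

Smaller values of ε and δ give a stronger guarantee. With ε ≈ 3.4, the multiplicative factor e^ε is roughly 30, and δ = 1e-5 bounds the probability mass that may fall outside that multiplicative bound.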