Generative models are interesting topic in ML. Generative models are a subset of unsupervised learning that generate new sample/data by using given some training data. There are different types of ways of modelling same distribution of training data: Auto-Regressive models, Auto-Encoders and GANs. In this tutorial, we are focusing theory of generative models, demonstration of generative models, important papers, courses related generative models. It will continue to be updated over time.
Keywords: Bayesian Classifier Sampling, Variational Auto Encoder (VAE), Generative Adversial Networks (GANs), Popular GANs Architectures, Auto-Regressive Models, Important Generative Model Papers, Courses, etc..
NOTE: This tutorial is only for education purpose. It is not academic study/paper. All related references are listed at the end of the file.
- What is Generative Models?
- Preliminary (Recall)
- Sampling from Bayesian Classifier
- Unsupervised Deep Learning vs Supervised Deep Learning (Recall)
- Gaussian Mixture Model (GMM)
- AutoEncoders
- Variational Inference (VI)
- Variational AutoEncoder (VAE)
- Generative Adversial Networks (GANs)
- Auto-Regressive Models
- Generative Model in Reinforcement Learning
- Important Papers
- Courses
- References
- Notes
-
Generative models are a subset of unsupervised learning that generate new sample/data by using given some training data (from same distribution). In our daily life, there are huge data generated from electronic devices, computers, cameras, iot, etc. (e.g. millions of images, sentences, or sounds, etc.)
-
In generative models, a large amount of data in some domain firstly is collected and then model is trained with this large amount of data to generate data like it.
-
There are lots of applications for generative models:
- Image Denoising,
- Image Enhancement,
- Image Inpainting,
- Super-resolution (upsampling): SRGAN,
- Generate 3D objects: 3DGAN, Video
- Creating Art:
- Pose Guided Person Image Generation
- Creating clothing images and styles from an image: PixelDTGAN
- Face Synthesis: TP-GAN
- Image-to-Image Translation: Pix2Pix
- Labels to Street Scene
- Aerial to Map
- Sketch to Realistic Image
- High-resolution Image Synthesis
- Text to image: StackGAN, Paper
- Text to Image Synthesis
- Learn Joint Distribution: CoGAN
- Transfering style (or patterns) from one domain (handbag) to another (shoe): DiscoGAN
- Texture Synthesis: MGAN
- Image Editing: IcGAN
- Face Aging: Age-cGAN
- Neural Photo Editor
- Medical (Anomaly Detection): AnoGAN
- Music Generation: MidiNet
- Video Generation
- Image Blending: GP-GAN
- Object Detection: PerceptualGAN
- Natural Language:
- Artical Spinning
-
Generative models are also promising in the long term future because it has a potential power to learn the natural features of a dataset automatically.
-
Generative models are mostly used to generate images (vision area). In the recent studies, it will also used to generate sentences (natural language processing area).
-
It can be said that Generative models begins with sampling. Generative models are told in this tutorial according to the development steps of generative models: Sampling, Gaussian Mixture Models, Variational AutoEncoder, Generative Adversial Networks.
- Bayesian Rule: p(z|x)= p(x|z) p(z) /p(x)
- Prior Distribution: "often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account." (e.g. p(z))
- Posterior Distribution: is a probability distribution that represents your updated beliefs about the parameter after having seen the data. (e.g. p(z|x))
- Posterior probability = prior probability new evidence (called likelihood)
- Probability Density Function (PDF): the set of possible values taken by the random variable
- Gaussian (Normal) Distribution: A symmetrical data distribution, where most of the results lie near the mean.
- Bayesian Analysis:
- Prior distribution: p(z)
- Gather data
- "Update your prior distribution with the data using Bayes' theorem to obtain a posterior distribution."
- "Analyze the posterior distribution and summarize it (mean, median, etc.)"
- It is expected that you have knowledge of neural network concept (gradient descent, cost function, activation functions, regression, classification)
- Typically used for regression or classification
- Basically: fit(X,Y) and predict(X)
- We use sampling data to generate new samples (using distribution of the training data).
- If we know the probability distribution of the training data , we can sample from it.
- Two ways of sampling:
- First Method: Sample from a given digit/class
- Pick a class, e.g. y=2
- If it is known p(x|y=2) is Gaussian
- Sample from this Gaussian using Scipy (mvn.rvs(mean,cov))
- Second Method: Sample from p(y)p(x|y) => p(x,y)
- If there is graphical model (e.g. y-> x), and they have different distribution,
- If p(y) and p(x|y) are known and y has its own distribution (e.g. categorical or discrete distribution)
- Sample
- First Method: Sample from a given digit/class
- Unsupervised Deep Learning: trying to learn structure of inputs
- Example1: If the structure of poetry/text can be learned, it is possible to generate text/poetry that resembles the given text/poetry.
- Example2: If the structure of art can be learned, it is possible to make new art/drawings that resembles the given art/drawings.
- Example3: If the structure of music can be learned, it is possible to create new music that resembles the given music.
- Supervised Deep Learning: trying to map inputs to targets
- Single gaussian model learns blurry images if there are more than one gaussian distribution (e.g. different types of writing digits in handwriting).
- To get best result, GMM have to used to model more than one gaussian distribution.
- GMM is latent variable model. With GMM, multi-modal distribution can be modelled at the same time.
- Multiple gaussians in different proportions are fitted into the GMM.
- 2 clusters: p(x)=p(z=1) p(x|z=1) p(z=2) p(x|z=2). In figure, there are 2 different proportions gaussian distributions.
- GMM is trained using Expectation-Maximization (EM)
- EM is iterative algorithm that let the likelihood improves at each step.
- The aim of EM is to reach maximum likelihood.
- A neural network that predicts (reconstructs) its own input.
- It is a feed forward network.
- W: weight, b:bias, x:input, f() and g():activation functions, z: latent variable, x_hat= output (reconstructed input)
- Instead of fit(X,Y) like neural networks, autoencoders fit(X,X).
- It has 1 input, 1 hidden, 1 output layers (hidden layer size < input layer size; input layer size = output layer size)
- It forces neural network to learn compact/efficient representation (e.g. dimension reduction/ compression)
- Variational inference (VI) is the significant component of Variational AutoEncoders.
- VI ~ Bayesian extension of EM.
- In GMM/K-Means Clustering, you have choose the number of clusters.
- VI-GMM (Variational inference-Gaussian Mixture Model) automatically finds the number of cluster.
- VAE is a neural network that learns to generate its input.
- It can map data to latent space, then generate samples using latent space.
- VAE is combination of autoencoders and variational inference.
- It doesn't work like tradional autoencoders.
- Its output is the parameters of a distribution: mean and variance, which represent a Gaussian-PDF of Z (instead only one value)
- VAE consists of two units: Encoder, Decoder.
- The output of encoder represents Gaussian distributions.(e.g. In 2-D Gaussian, encoder gives 2 mean and 2 variance/stddev)
- The output of decoder represents Bernoulli distributions.
- From a probability distribution, new samples can be generated.
- For example: let's say input x is a 28 by 28-pixel photo
- The encoder ‘encodes’ the data which is 784-dimensional into a latent (hidden) representation space z.
- The encoder outputs parameters to q(z∣x), which is a Gaussian probability density.
- The decoder gets as input the latent representation of a digit z and outputs 784 Bernoulli parameters, one for each of the 784 pixels in the image.
- The decoder ‘decodes’ the real-valued numbers in z into 784 real-valued numbers between 0 and 1.
- Encoder: x-> q(z) {q(z): latent space, coded version of x}
- Decoder: q(z)~z -> x_hat
- Encoder takes the input of image "8" and gives output q(z|x).
- Sample from q(z|x) to get z
- Get p(x_hat|x), sample from it (this is called posterior predictive sample)
- Evidence Lower Bound (ELBO) is our objective function that has to be maximized.
- ELBO consists of two terms: Expected Log-Likelihood of the data and KL divergence between q(z|x) and p(z).
- Expected Log-Likelihood is negative cross-entropy between original data and recontructed data.
- "Expected Log-Likelihood encourages the decoder to learn to reconstruct the data. If the decoder’s output does not reconstruct the data well, it will incur a large cost in this loss function".
- If the input and output have Bernoulli distribution, Expected Log-Likelihood can be calculated like this:
- "KL divergence measures how much information is lost (in units of nats) when using q to represent p. It is one measure of how close q is to p".
- KL divergence provides to compare 2 probability distributions.
- If the two probability distributions are exactly same (q=p), KL divergence equals to 0.
- If the two probability distributions are not same (q!=p), KL divergence > 0 .
- Cost function consists of two part: How the model's output is close to target and regularization.
- COST = (TARGET-OUTPUT) PENALTY-REGULARIZATION PENALTY == RECONSTRUCTION PENALTY - REGULARIZATION PENALTY
- GANs are interesting because it generates samples exceptionally good.
- GANs are used in different applications (details are summarized following sections).
- GANs are different form other generative models (Bayesian Classifier, Variational Autoencoders, Restricted Boltzmann Machines). GANs are not dealing with explicit probabilities, instead, its aim is to reach Nash Equilibrium of a game.
- RBM generate samples with Monte Carlo Sampling (thousands of iterations are needed to generate, and how many iterations are needed is not known). GANs generate samples with in single pass.
- There are 2 different networks in GANs: generator and discriminator, compete against each other.
- Generator Network tries to fool the discriminator.
- Disciminator Network tries not to be fooled.
- Paper: Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks
- Generator and Discriminator try to optimize the opposite cost functions.
- Discriminator classifies images as a real or fake images with binary classification.
- t: target; y: output probability of the discriminator.
- Real image: t=1; fake image: t=0; y= p(image is real | image) between (0,1)
- Binary cost function evaluates discriminator cost function.
- x: real images only, x_hat: fake images only
- G(z): fake image from generator, x: real image
- Total cost: summing to individual negative log-likelihoods (batches of data in training)
- Estimate of the expected value over all possible data
- Discrimator Cost Function:
- Relation between Generator and Discriminator Cost Functions:
- In game theory, this situation is called "zero-sum game". Sum of all players in the game is 0.
- Theta_G: parameters of generator; Theta_D: parameters of discriminator
- Pseudocode of GANs:
- In the first step: Generator generates some of the real samples and fake samples.
- In the second step: Gradient descent of the discriminator is run one iteration.
- In the third step: Gradient descent of the generator is run one iteration.
- Repeat this until seeing the good samples.
- Some of the researches run third step twice to get better results.
- Paper: Radford, A., Metz, L., and Chintala, S., Unsupervised representation learning with deep convolutional generative adversarial networks
- DCGAN architecture produces high quality and high resolution images in a single pass.
- DCGANs contain batch normalization (batch norm: z=(x-mean)/std, batch norm is used between layers).
- DCGANs also contain only all-convolutional layers instead of contaning convolution, pooling, linear layers together.
- It optimizes using ADAM optimizer (adaptive gradient desdent algorithm)
- Discriminator uses Leaky-ReLU (Rectified Linear Unit), generator uses normal ReLU.
- Typical Convolution: input size is bigger or equal than output size (Stride>1).
- Deconvolution: input size is smaller than output size (Stride<1).
- Stride size is smaller than 1.
- If stride=2 is used while convolution operation, output image size is 1/2 original image size.
- If stride=1/2 is used while convolution operation, output image size is 2x original image size.
- With fractionally-strided convolution (deconvolution), output image size is bigger than input image size.
- Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
- Their algorithm translate an image from one to another:
- Transfer from Monet paintings to landscape photos from Flickr, and vice versa.
- Transfer from zebras to horses, and vice versa.
- Transfer from summer to winter photos, and vice versa.
- Image-to-Image Translation with Conditional Adversarial Networks
- Pix2Pix is an image-to-image translation algorithm: aerials to map, labels to street scene, labels to facade, day to night, edges to photo.
- Pixel-Level Domain Transfer
- PixelDTGAN generates clothing images from an image.
- "The model transfers an input domain to a target domain in semantic level, and generates the target image in pixel level."
- "They verify their model through a challenging task of generating a piece of clothing from an input image of a dressed person"
- Pose Guided Person Image Generation
- "This paper proposes the novel Pose Guided Person Generation Network (PG2 that allows to synthesize person images in arbitrary poses, based on an image of that person and a novel pose"
- Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network
- SRGAN: "a generative adversarial network (GAN) for image super-resolution (SR)"
- They generate super-resolution images from the lower resolution images.
- StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
- Stacked Generative Adversarial Networks (Stack-GAN): "to generate photo-realistic images conditioned on text descriptions".
- Input: sentence, Output: multiple images fitting the description.
- Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
- "Synthesis faces in different poses: With a single input image, they create faces in different viewing angles."
- Towards the Automatic Anime Characters Creation with Generative Adversarial Networks
- "They explore the training of GAN models specialized on an anime facial image dataset."
- They build website for their implementation (https://make.girls.moe)
- Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling
- This paper proposed creating 3D objects with GAN.
- "They demonstrated that their models are able to generate novel objects and to reconstruct 3D objects from images"
- FACE AGING WITH CONDITIONAL GENERATIVE ADVERSARIAL NETWORKS
- They proposed the GAN-based method for automatic face aging.
- Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery
- "A deep convolutional generative adversarial network to learn a manifold of normal anatomical variability".
- Learning to Discover Cross-Domain Relations with Generative Adversarial Networks
- Proposed method transfers style from one domain to another (e.g handbag -> shoes)
- "DiscoGAN learns cross domain relationship without labels or pairing".
- Unsupervised Cross-Domain Image Generation
- Proposed method is to create emoji from pictures.
- "They can synthesize an SVHN image that resembles a given MNIST image, or synthesize a face that matches an emoji."
- Invertible Conditional GANs for image editing
- "They evaluate encoders to inverse the mapping of a cGAN, i.e., mapping a real image into a latent space and a conditional representation".
- Proposed method is to reconstruct or edit images with specific attribute.
- Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks
- "Markovian Generative Adversarial Networks (MGANs), a method for training generative neural networks for efficient texture synthesis."
- "They apply this idea to texture synthesis, style transfer, and video stylization."
- MIDINET: A CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK FOR SYMBOLIC-DOMAIN MUSIC GENERATION
- "They propose a novel conditional mechanism to exploit available prior knowledge, so that the model can generate melodies either from scratch, by following a chord sequence, or by conditioning on the melody of previous bars"
- "MidiNet can be expanded to generate music with multiple MIDI channels"
- Perceptual Generative Adversarial Networks for Small Object Detection
- "Proposed method improves small object detection through narrowing representation difference of small objects from the large ones"
- "The basic difference between Generative Adversarial Networks (GANs) and Auto-regressive models is that GANs learn implicit data distribution whereas the latter learns an explicit distribution governed by a prior imposed by model structure" [Sharma].
- Some of the advantages of Auto-Regressive Models over GANs:
- Provides a way to calculate likelihood: They have advantage of returning explicit probability densities. Hence it can be applied in the application areas related compression and probabilistic planning and exploration.
- The training is more stable than GANs:"Training a GAN requires finding the Nash equilibrium". Training of PixelRNN, PixelCNN are more stable than GANs.
- It works for both discrete and continuous data:"It’s hard to learn to generate discrete data for GAN, like text" [Sharma].
Paper: Oord et al., Pixel Recurrent Neural Networks (proposed from Google DeepMind)
- It is used for image completion applications.
- "It uses probabilistic density models (like Gaussian or Normal distribution) to quantify the pixels of an image as a product of conditional distributions."
- "This approach turns the modeling problem into a sequence problem wherein the next pixel value is determined by all the previously generated pixel values".
- There are four different methods to implement PixelRNN:
- Row LSTM
- Diagonal BiLSTM
- Fully Convolutional Network
- Multi Scale Network.
- Cost function: "Negative log likelihood (NLL) is used as the loss and evaluation metric as the network predicts(classifies) the values of pixel from values 0–255."
- Papers: Oord et al., Pixel Recurrent Neural Networks; Oord et al., Conditional Image Generation with PixelCNN Decoders (proposed from Google DeepMind).
- "The main drawback of PixelRNN is that training is very slow as each state needs to be computed sequentially. This can be overcome by using convolutional layers and increasing the receptive field."
- "PixelCNN lowers the training time considerably as compared to PixelRNN."
- "The major drawback of PixelCNN is that it’s performance is worse than PixelRNN. Another drawback is the presence of a Blind Spot in the receptive field"
- Paper: Salimans et al.,PIXELCNN : IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS
- PixelCNN improves the performance of PixelCNN (proposed from OpenAI)
- Modifications:
- Discretized logistic mixture likelihood
- Conditioning on whole pixels
- Downsampling
- Short-cut connections
- Dropout
- "PixelCNN outperforms both PixelRNN and PixelCNN by a margin. When trained on CIFAR-10 the best test log-likelihood is 2.92 bits/pixel as compared to 3.0 of PixelRNN and 3.03 of gated PixelCNN."
- Details are in the paper PIXELCNN : IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS.
Paper: Jonathan Ho, Stefano Ermon, Generative Adversarial Imitation Learning
- "The standard reinforcement learning setting usually requires one to design a reward function that describes the desired behavior of the agent. However, in practice this can sometimes involve expensive trial-and-error process to get the details right. In contrast, in imitation learning the agent learns from example demonstrations (for example provided by teleoperation in robotics), eliminating the need to design a reward function. This approach can be used to learn policies from expert demonstrations (without rewards) on hard OpenAI Gym environments, such as Ant and Humanoid." [Blog Open-AI].
- "Popular imitation approaches involve a two-stage pipeline: first learning a reward function, then running RL on that reward. Such a pipeline can be slow, and because it’s indirect, it is hard to guarantee that the resulting policy works well. This work shows how one can directly extract policies from data via a connection to GANs" [Blog Open-AI].
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio, Generative Adversarial Networks
- Radford, A., Metz, L., and Chintala, S., Unsupervised representation learning with deep convolutional generative adversarial networks
- Jonathan Ho, Stefano Ermon, Generative Adversarial Imitation Learning
- Ledig et al., Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,
- Jiajun Wu et al., Learning a Probabilistic Latent Space of Object Shapes via 3D Generative-Adversarial Modeling
- Jin et al., Towards the Automatic Anime Characters Creation with Generative Adversarial Networks
- Zhu et al., Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks
- Taigman et al., UNSUPERVISED CROSS-DOMAIN IMAGE GENERATION
- MA et al., Pose Guided Person Image Generation
- Yoo et al., Pixel-Level Domain Transfer
- Huang aet al., Beyond Face Rotation: Global and Local Perception GAN for Photorealistic and Identity Preserving Frontal View Synthesis
- Isola et al., Image-to-Image Translation with Conditional Adversarial Networks
- Wang et al., High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
- Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
- Reed et al., Generative Adversarial Text to Image Synthesis
- Liu et al., Coupled Generative Adversarial Networks
- Kim et al., Learning to Discover Cross-Domain Relations with Generative Adversarial Networks
- Li et al., Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks
- Perarnau et al., Invertible Conditional GANs for image editing
- Antipov et al., FACE AGING WITH CONDITIONAL GENERATIVE ADVERSARIAL NETWORKS
- Schlegl et al., Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery
- Yang et al., MIDINET: A CONVOLUTIONAL GENERATIVE ADVERSARIAL NETWORK FOR SYMBOLIC-DOMAIN MUSIC GENERATION
- Wu et al., GP-GAN: Towards Realistic High-Resolution Image Blending
- Li et al., Perceptual Generative Adversarial Networks for Small Object Detection
- Oord et al., Pixel Recurrent Neural Networks
- Oord et al., Conditional Image Generation with PixelCNN Decoders
- Salimans et al.,PIXELCNN : IMPROVING THE PIXEL CNN WITH DISCRETIZED LOGISTIC MIXTURE LIKELIHOOD AND OTHER MODIFICATIONS
- Springenberg et al., Striving for Simplicity: The All Convolutional Net