Machine learning algorithms implemented in Scala on Spark
Currently 4 models are included:
- Gaussian Naive Bayes – Naive Bayes classifier for continuous features. Assumes the likelihood of each feature follows a Gaussian distribution: P(x_i | y) = (1 / sqrt(2 * pi * sigma_y^2)) * exp(-(x_i - mu_y)^2 / (2 * sigma_y^2)). The posterior for each class is estimated, up to a normalizing constant, by combining the likelihoods of all features for that class with the class prior probability; in log space this is the sum of the log-likelihoods plus the log prior.
- K Means – Performs k-means clustering on data samples labeled by class. The distance function `distMeasure` may be specified as either `euclidean` (default) or `cosine`. Distance functions are passed internally as partially applied functions for extensibility. Both the means and standard deviations are calculated and recorded for each cluster, which is useful for generating radial basis functions based on distance from clusters.
- Logistic Regression – Binary logistic regression classifier with L2 regularization. The loss function is minimized with gradient descent.
- Softmax Logistic Regression – Multi-class logistic regression with optional regularization: L1, L1 (with clipping), L2, or none (default). Regularization gradient update functions are specified and passed as partially applied functions for extensibility.
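
To make the Gaussian Naive Bayes scoring concrete, here is a minimal sketch in plain Scala collections rather than Spark RDDs. The object and method names (`GaussianNBSketch`, `logGaussian`, `logPosterior`, `predict`) are hypothetical and are not taken from this repository. It works in log space, which is a common way to avoid underflow when many small likelihoods are multiplied.

```scala
// Illustrative sketch only: names and signatures are assumptions, not this project's API.
object GaussianNBSketch {
  // Log of the Gaussian density with mean mu and standard deviation sigma.
  def logGaussian(x: Double, mu: Double, sigma: Double): Double = {
    val variance = sigma * sigma
    -0.5 * math.log(2 * math.Pi * variance) - (x - mu) * (x - mu) / (2 * variance)
  }

  // Unnormalized log-posterior: log prior plus the sum of per-feature log-likelihoods.
  def logPosterior(x: Seq[Double], prior: Double, mus: Seq[Double], sigmas: Seq[Double]): Double =
    math.log(prior) + x.indices.map(i => logGaussian(x(i), mus(i), sigmas(i))).sum

  // params maps a class label to (prior, per-feature means, per-feature standard deviations).
  // Returns the label with the largest log-posterior.
  def predict(x: Seq[Double], params: Map[String, (Double, Seq[Double], Seq[Double])]): String =
    params.maxBy { case (_, (prior, mus, sigmas)) => logPosterior(x, prior, mus, sigmas) }._1
}
```

With two equally likely classes centred at (0, 0) and (5, 5) and unit standard deviations, `predict(Seq(4.5, 5.2), params)` selects the class centred at (5, 5).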
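
The pluggable distance functions described for K Means could look roughly like the following. This is a sketch under assumptions: only the `distMeasure` values `euclidean` and `cosine` come from the text above, while `DistanceSketch`, `byName`, and `closest` are hypothetical names, and plain arrays stand in for whatever vector type the project uses.

```scala
// Illustrative sketch only: names and signatures are assumptions, not this project's API.
object DistanceSketch {
  type Vec = Array[Double]
  type Distance = (Vec, Vec) => Double

  def euclidean(a: Vec, b: Vec): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Cosine distance = 1 - cosine similarity; assumes neither vector is all zeros.
  def cosine(a: Vec, b: Vec): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = (v: Vec) => math.sqrt(v.map(x => x * x).sum)
    1.0 - dot / (norm(a) * norm(b))
  }

  // Resolve a distMeasure string to a distance function.
  def byName(distMeasure: String): Distance = distMeasure match {
    case "euclidean" => euclidean
    case "cosine"    => cosine
    case other       => throw new IllegalArgumentException(s"unknown distance: $other")
  }

  // Curried: fixing the distance and the centres yields a point => cluster-index function.
  def closest(dist: Distance)(centres: Seq[Vec])(point: Vec): Int =
    centres.indices.minBy(i => dist(centres(i), point))
}
```

Because `closest` takes its arguments in separate parameter lists, a new distance can be added without touching the assignment step: supplying the distance and centres gives a function of type `Vec => Int` that can be mapped over the data.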
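
As a rough illustration of how regularization gradients can be passed in as partially applied functions, the sketch below trains a binary logistic regression with batch gradient descent. All names (`LogisticSketch`, `l1`, `l2`, `noReg`, `train`) are hypothetical, the data is held in local collections instead of an RDD, and the clipped L1 variant mentioned above is not shown.

```scala
// Illustrative sketch only: names and signatures are assumptions, not this project's API.
object LogisticSketch {
  type Vec = Array[Double]
  // Maps the current weights to the gradient of the penalty term.
  type RegGradient = Vec => Vec

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  def dot(a: Vec, b: Vec): Double = a.zip(b).map { case (x, y) => x * y }.sum

  // Curried: fixing lambda yields a RegGradient.
  def l2(lambda: Double)(w: Vec): Vec = w.map(wi => lambda * wi)
  def l1(lambda: Double)(w: Vec): Vec = w.map(wi => lambda * math.signum(wi))
  val noReg: RegGradient = w => Array.fill(w.length)(0.0)

  // Batch gradient descent on the mean logistic loss plus the chosen penalty.
  // ys holds labels 0.0 or 1.0; include a constant 1.0 feature in xs for a bias term.
  def train(xs: Seq[Vec], ys: Seq[Double], reg: RegGradient, rate: Double, iters: Int): Vec = {
    var w = Array.fill(xs.head.length)(0.0)
    for (_ <- 0 until iters) {
      val grad = Array.fill(w.length)(0.0)
      for ((x, y) <- xs.zip(ys)) {
        val err = sigmoid(dot(w, x)) - y
        for (j <- w.indices) grad(j) += err * x(j) / xs.length
      }
      val penalty = reg(w)
      w = w.indices.map(j => w(j) - rate * (grad(j) + penalty(j))).toArray
    }
    w
  }
}
```

A call such as `train(xs, ys, l2(0.1), 0.5, 500)` fixes the penalty strength once and leaves the training loop unaware of which regularizer is in use; swapping in `l1(0.1)` or `noReg` requires no other change.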