mlpack is a free, open-source, header-only software library for machine learning and artificial intelligence written in C++, built on top of the Armadillo library and the ensmallen numerical optimization library.[3] mlpack emphasizes scalability, speed, and ease of use. Its aim is to make machine learning possible for novice users by means of a simple, consistent API, while simultaneously exploiting C++ language features to provide maximum performance and maximum flexibility for expert users.[4] mlpack also has a light deployment infrastructure with minimal dependencies, making it well suited to embedded systems and low-resource devices. Its intended target users are scientists and engineers.
Initial release | February 1, 2008[1] |
---|---|
Stable release | 4.5.0[2] / 18 September 2024 |
Repository | |
Written in | C++, Python, Julia, Go |
Operating system | Cross-platform |
Available in | English |
Type | Software library, Machine learning |
License | Open source (BSD) |
Website | mlpack |
It is open-source software distributed under the BSD license, making it useful for developing both open source and proprietary software. Releases 1.0.11 and before were released under the LGPL license. The project is supported by the Georgia Institute of Technology and contributions from around the world.
Features
Classical machine learning algorithms
mlpack contains a wide range of algorithms used to solve real problems, from classification and regression in the supervised learning paradigm to clustering and dimensionality reduction. The following is a non-exhaustive list of the algorithms and models that mlpack supports:
- Collaborative Filtering
- Decision stumps (one-level decision trees)
- Density Estimation Trees
- Euclidean minimum spanning trees
- Gaussian Mixture Models (GMMs)
- Hidden Markov Models (HMMs)
- Kernel density estimation (KDE)
- Kernel Principal Component Analysis (KPCA)
- K-Means Clustering
- Least-Angle Regression (LARS/LASSO)
- Linear Regression
- Bayesian Linear Regression
- Local Coordinate Coding
- Locality-Sensitive Hashing (LSH)
- Logistic regression
- Max-Kernel Search
- Naive Bayes Classifier
- Nearest neighbor search with dual-tree algorithms
- Neighbourhood Components Analysis (NCA)
- Non-negative Matrix Factorization (NMF)
- Principal Components Analysis (PCA)
- Independent component analysis (ICA)
- Rank-Approximate Nearest Neighbor (RANN)
- Simple Least-Squares Linear Regression (and Ridge Regression)
- Sparse Coding, Sparse dictionary learning
- Tree-based Neighbor Search (all-k-nearest-neighbors, all-k-furthest-neighbors), using either kd-trees or cover trees
- Tree-based Range Search
Class templates for GRU and LSTM structures are available, so the library also supports recurrent neural networks.
Bindings
There are bindings to R, Go, Julia,[5] and Python, as well as a command-line interface (CLI) for use from the terminal. The binding system is extensible to other languages.
Reinforcement learning
mlpack contains several reinforcement learning (RL) algorithms implemented in C++, along with a set of examples. These algorithms can be tuned per example and combined with external simulators. Currently mlpack supports the following:
- Q-learning
- Deep Deterministic Policy Gradient
- Soft Actor-Critic
- Twin Delayed DDPG (TD3)
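To illustrate the idea behind the first entry, the following is a sketch of the classic tabular Q-learning update written in plain standard C++. It deliberately does not use mlpack's reinforcement learning classes, whose API is not shown in this article; the names `QTable`, `QLearningUpdate`, and `GreedyAction` are hypothetical and chosen for this example only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Q-table: one row per state, one column per action.
using QTable = std::vector<std::vector<double>>;

// Apply one tabular Q-learning update for the observed transition
// (state, action) -> (reward, nextState):
//   Q[s][a] += alpha * (reward + gamma * max_a' Q[s'][a'] - Q[s][a])
// alpha is the learning rate and gamma the discount factor.
void QLearningUpdate(QTable& q, std::size_t state, std::size_t action,
                     double reward, std::size_t nextState,
                     double alpha, double gamma)
{
  const std::vector<double>& next = q[nextState];
  const double bestNext = *std::max_element(next.begin(), next.end());
  const double target = reward + gamma * bestNext;
  q[state][action] += alpha * (target - q[state][action]);
}

// Pick the action with the highest estimated value in a state.
std::size_t GreedyAction(const QTable& q, std::size_t state)
{
  const std::vector<double>& row = q[state];
  return static_cast<std::size_t>(
      std::max_element(row.begin(), row.end()) - row.begin());
}
```

Deep variants of this update replace the table with a neural network that estimates the action values.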
Design features
mlpack includes a range of design features that make it particularly well suited to specialized applications, especially in the Edge AI and IoT domains. Its C++ codebase allows for seamless integration with sensors, facilitating direct data extraction and on-device preprocessing at the edge. The following design features highlight mlpack's capabilities in these environments:
Low number of dependencies
mlpack is a low-dependency library, which makes software built on it easy to deploy. mlpack binaries can be linked statically and deployed to any system with minimal effort. Using a Docker container is not necessary and is even discouraged. This makes mlpack suitable for low-resource devices, as it requires only ensmallen and either Armadillo or Bandicoot, depending on the target hardware. mlpack uses the cereal library for serialization of models. Its other dependencies are also header-only and are bundled with the library itself.
Low binary footprint
In terms of binary size, mlpack methods have a significantly smaller footprint than other popular libraries. The table below compares deployable binary sizes for mlpack, PyTorch, and scikit-learn. To ensure consistency, the same application, along with all its dependencies, was packaged within a single Docker container for this comparison.
Library | MNIST digit recognizer (CNN) | Language detection (Softmax regression) | Forest covertype classifier (decision tree) |
---|---|---|---|
scikit-learn | N/A | 327 MB | 348 MB |
PyTorch | 1.04 GB | 1.03 GB | N/A |
mlpack | 1.23 MB | 1.03 MB | 1.62 MB |
Other libraries exist, such as TensorFlow Lite; however, these libraries are usually specific to one method, such as neural network inference or training.
Example
The following is a simple example of how to train a decision tree model using mlpack and use it for classification. A real application would ingest its own dataset using the Load function; here random data is used to show the API:
// Train a decision tree on random numeric data and predict labels on test data:
// All data and labels are uniform random; 10 dimensional data, 5 classes.
// Replace with a data::Load() call or similar for a real application.
arma::mat dataset(10, 1000, arma::fill::randu); // 1000 points.
arma::Row<size_t> labels =
arma::randi<arma::Row<size_t>>(1000, arma::distr_param(0, 4));
arma::mat testDataset(10, 500, arma::fill::randu); // 500 test points.
mlpack::DecisionTree tree; // Step 1: create model.
tree.Train(dataset, labels, 5); // Step 2: train model.
arma::Row<size_t> predictions;
tree.Classify(testDataset, predictions); // Step 3: classify points.
// Print some information about the test predictions.
std::cout << arma::accu(predictions == 2) << " test points classified as class "
<< "2." << std::endl;
The above example demonstrates the simplicity of the API design, which makes it similar to the popular Python-based machine learning kit scikit-learn. The objective is to simplify the API and the main machine learning functions, such as Classify and Predict, for the user. More complex examples, including documentation for the methods, are located in the examples repository.
Backend
Armadillo is the default linear algebra library used by mlpack; it provides the matrix manipulation and operations necessary for machine learning algorithms. Armadillo is known for its efficiency and simplicity. It can also be used in header-only mode, in which case the only library that needs to be linked is OpenBLAS, Intel MKL, or LAPACK.
Bandicoot
Bandicoot[6] is a C++ linear algebra library designed for scientific computing. It has an API identical to Armadillo's, with the objective of executing computation on a graphics processing unit (GPU). The purpose of the library is to ease the transition between CPU and GPU by requiring only minor changes to the source code (e.g. changing the namespace and the linked library). mlpack currently has partial support for Bandicoot, with the objective of providing neural network training on the GPU. The following example shows two code blocks executing an identical operation. The first is Armadillo code running on the CPU, while the second can run on an OpenCL-supported GPU or an NVIDIA GPU (with the CUDA backend):
// Armadillo version: runs on the CPU.
using namespace arma;
mat X, Y;
X.randu(10, 15);
Y.randu(10, 10);
mat Z = 2 * norm(Y) * (X * X.t() - Y);

// Bandicoot version: the same operations, executed on the GPU.
using namespace coot;
mat X, Y;
X.randu(10, 15);
Y.randu(10, 10);
mat Z = 2 * norm(Y) * (X * X.t() - Y);
ensmallen
ensmallen[7] is a high-quality C++ library for nonlinear numerical optimization. It uses Armadillo or Bandicoot for linear algebra, and mlpack uses it to provide the optimizers for training machine learning algorithms. Like mlpack, ensmallen is a header-only library, and it supports custom behavior through callback functions, allowing users to extend the functionality of any optimizer. ensmallen is also published under the BSD license.
ensmallen contains a diverse range of optimizers, classified by the type of function they handle (differentiable, partially differentiable, categorical, constrained, etc.). The following is a small subset of the optimizers available in ensmallen; the full list is in the ensmallen documentation.
- Limited memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS)
- GradientDescent
- FrankWolfe
- Covariance matrix adaptation evolution strategy (CMA-ES)
- AdaBelief
- AdaBound
- AdaDelta
- AdaGrad
- AdaSqrt
- Adam
- AdaMax
- AMSBound
- AMSGrad
- Big Batch SGD
- Eve
- FTML
- IQN
- Katyusha
- Lookahead
- Momentum SGD
- Nadam
- NadaMax
- NesterovMomentumSGD
- OptimisticAdam
- QHAdam
- QHSGD
- RMSProp
- SARAH/SARAH+
- Stochastic Gradient Descent (SGD)
- Stochastic Gradient Descent with Restarts (SGDR)
- Snapshot SGDR
- SMORMS3
- SPALeRA
- SWATS
- SVRG
- WNGrad
Support
mlpack is fiscally sponsored and supported by NumFOCUS, through which tax-deductible donations can be made to help the developers of the project. In addition, the mlpack team participates each year in the Google Summer of Code program and mentors several students.
References
- [1] "Initial checkin of the regression package to be released · mlpack/mlpack". February 8, 2008. Retrieved May 24, 2020.
- [2] "Release 4.5.0". 18 September 2024. Retrieved 22 September 2024.
- [3] Ryan Curtin; et al. (2021). "The ensmallen library for flexible numerical optimization". Journal of Machine Learning Research. 22 (166): 1–6. arXiv:2108.12981. Bibcode:2021arXiv210812981C.
- [4] Ryan Curtin; et al. (2023). "mlpack 4: a fast, header-only C++ machine learning library". Journal of Open Source Software. 8 (82): 5026. arXiv:2302.00820.
- [5] "Mlpack/Mlpack.jl". 10 June 2021.
- [6] "C++ library for GPU accelerated linear algebra". coot.sourceforge.io. Retrieved 2024-08-12.
- [7] "Home". ensmallen.org. Retrieved 2024-08-12.