owahltinez/bananas: Machine learning framework for machine learning libraries
Bananas

This library is a collection of tools designed to serve as the starting point for a machine learning (ML) framework. It provides the basic building blocks that every ML framework should have.

Why another framework?

Why not use or extend an existing project such as Scikit-Learn, TensorFlow, PyTorch, or Keras? All of those projects have their merits and a wealth of related utilities. However, each one has its own definition of fundamental methods such as fit. For example, in Scikit-Learn, fit is expected to receive all of the data to be learned at once, which is impractical given the size of today's datasets.

The goal of this project is to provide a starting point for the next iteration of ML frameworks, so that they can be compatible with one another. Imagine a world where we can set up a data pipeline using a data transformer from Scikit-Learn, followed by an embedding from TensorFlow, followed by a classifier from PyTorch. While technically possible, doing that today would require a tremendous amount of work -- which is why most frameworks reimplement the same common utilities over and over, so developers can use them with ease.

Additionally, nothing prevents us from writing wrappers around current frameworks so they become compliant with the interfaces defined by this library. For example, take a look at TODO, a wrapper around PyTorch that implements compliant estimators and can take advantage of all the tools built around this library.

Design

Input data format

Input features are expected to be in column-first format. In other words, rather than passing a list of samples, ML frameworks compliant with this library are expected to handle a list of features. This is at odds with most (if not all) other ML frameworks, but it allows features to change at runtime as described in the input change design principle. See the core module or change map documentation for more details.

Note: this design also helps ensure that high-dimensional inputs such as images have a consistent shape; see the documentation for high-dimensional data.
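To make the layout concrete, here is a minimal sketch of the same data in row-first and column-first form. The sample values are invented for illustration; only the layout matters.

```python
# Row-first (sample-first) layout: one row per sample, as most frameworks expect.
rows = [
    [5.1, 3.5, "setosa"],
    [4.9, 3.0, "setosa"],
    [6.3, 3.3, "virginica"],
]

# Column-first layout: one sequence per feature, as this library expects.
columns = [
    [5.1, 4.9, 6.3],                     # first numeric feature
    [3.5, 3.0, 3.3],                     # second numeric feature
    ["setosa", "setosa", "virginica"],   # categorical feature
]

# Converting between the two layouts is a simple transpose.
assert [list(col) for col in zip(*rows)] == columns
```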

Input data batches

This library has made some design choices that set it apart from other ML tooling and frameworks. Fundamentally, it expects users to feed data to estimators in batches: either by repeatedly calling fit with each batch or, preferably, by calling train with an input function that draws batches on demand. We rarely have to write our own input functions; that is what the samplers in the sampling module provide.
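The batch-wise flavor of fit can be sketched as below. The RunningMean estimator and the batches helper are hypothetical stand-ins written for illustration, not this library's actual API; the key point is that fit receives one column-first batch at a time and accumulates state across calls.

```python
def batches(features, batch_size):
    """Yield column-first batches drawn from column-first features."""
    n_samples = len(features[0])
    for start in range(0, n_samples, batch_size):
        yield [col[start:start + batch_size] for col in features]

class RunningMean:
    """Toy estimator that learns the per-feature mean incrementally."""

    def __init__(self):
        self.sums = None
        self.count = 0

    def fit(self, features):
        # Lazily size internal state from the first batch seen.
        if self.sums is None:
            self.sums = [0.0] * len(features)
        for i, col in enumerate(features):
            self.sums[i] += sum(col)
        self.count += len(features[0])
        return self

features = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
model = RunningMean()
for batch in batches(features, batch_size=2):
    model.fit(batch)                       # repeated fit(), one batch at a time
means = [s / model.count for s in model.sums]  # [2.5, 25.0]
```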

Input change

Because data comes in incrementally, not all possible values are seen in each batch. For example, a continuous feature may fall within the range [0, .7] in the first few batches but eventually span [-1, 1], or a classifier may encounter a label never seen before after the first handful of batches. Some frameworks, like Scikit-Learn, work around this by implementing a separate training function, partial_fit, which expects users to pass all possible values, including unseen ones, as a parameter. This library follows the principle of making no assumptions about data and instead handling changes at runtime.

In some cases, changes result in outputs of different dimensions, and hence outputs incompatible with components downstream in the pipeline (for example, performing one-hot encoding and then encountering a label never seen before). This library handles this by requiring all transformers and estimators to implement "input change" event handling, which allows features to be added or removed at runtime while losing as little training work as possible.
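A toy sketch of the one-hot example above: an encoder that widens its output when an unseen label arrives. The on_input_change hook is an illustrative name, not necessarily this library's actual event interface.

```python
class GrowingOneHotEncoder:
    """Toy one-hot encoder that handles unseen labels at runtime."""

    def __init__(self):
        self.labels = []  # labels seen so far, in order of first appearance

    def on_input_change(self, label):
        """React to a never-before-seen label by widening the output."""
        self.labels.append(label)

    def transform(self, batch):
        for label in batch:
            if label not in self.labels:
                self.on_input_change(label)
        width = len(self.labels)
        return [[1 if self.labels[i] == label else 0 for i in range(width)]
                for label in batch]

enc = GrowingOneHotEncoder()
first = enc.transform(["cat", "dog", "cat"])  # two labels seen, width 2
second = enc.transform(["bird", "dog"])       # new label, width grows to 3
```

A downstream estimator would receive the same event, letting it resize its weights instead of discarding everything learned so far.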

Modularity

Even though this library is designed to be an ML framework, the tools included with it should be useful for purposes beyond that. For example, a lot of the functionality contained within the sampling and transformers modules should be reusable in projects with different goals.

Dependencies

This project aims to have as few dependencies as possible, keeping them optional wherever feasible. For example, many image utility functions require the Pillow module, but it is not a global requirement. The largest dependency of global scope is currently the numpy numerical library; making it optional as well is under evaluation.
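The optional-dependency pattern described above is commonly implemented by importing lazily and failing with an actionable message only when the feature is actually used. The helper and function names below are illustrative, not this library's actual API.

```python
def optional_import(module_name, hint):
    """Import an optional dependency, failing with an actionable message."""
    try:
        return __import__(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name} is required for this feature; {hint}") from exc

# An image utility would then fetch Pillow only when actually called:
def image_size(path):
    PIL = optional_import("PIL.Image", "install it with `pip install Pillow`")
    return PIL.Image.open(path).size
```

This way, users who never touch the image utilities never need Pillow installed.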
