About the Project

The Datumbox Machine Learning Framework is an open-source framework written in Java which allows the rapid development of Machine Learning and Statistical applications. The main focus of the framework is to include a large number of machine learning algorithms & statistical tests and be able to handle large sized datasets. The code of the Machine Learning Framework can be found on Github.

Technical Details

The core part of the project is about 30000 lines of code, uses Java 8 features and uses Maven Project Structure. The code is licensed under the Apache License, Version 2.0 so feel free to clone the repository and experiment with it. If you find a bug or decide to document particular parts of the code, please consider contributing your changes by sending a pull request.

What methods/algorithms are supported?

The Framework currently supports performing multiple Parametric & non-parametric Statistical tests, calculating descriptive statistics on censored & uncensored data, performing ANOVA, Cluster Analysis, Dimension Reduction, Regression Analysis, Timeseries Analysis, Sampling and calculation of probabilities from the most common discrete and continues Distributions. In addition it provides several implemented algorithms including Max Entropy, Naïve Bayes, SVM, Bootstrap Aggregating, Adaboost, Kmeans, Hierarchical Clustering, Dirichlet Process Mixture Models, Softmax Regression, Ordinal Regression, Linear Regression, Stepwise Regression, PCA and several other techniques that can be used for feature selection, ensemble learning, linear programming solving and recommender systems.