Introduction to ML-Flex

ML-Flex uses machine-learning algorithms to derive models from independent variables, with the purpose of predicting the values of a dependent (class) variable. For example, machine-learning algorithms have long been applied to the Iris data set, introduced by Sir Ronald Fisher in 1936, which contains four independent variables (sepal length, sepal width, petal length, and petal width) and one dependent variable (the species of Iris flower: setosa, versicolor, or virginica). Using prediction models derived from the four independent variables, machine-learning algorithms can often differentiate between the species with near-perfect accuracy.

One important aspect to consider in performing a machine-learning experiment is the validation strategy. With the wrong kind of validation approach, biases can be introduced, and it may appear that an algorithm has more predictive ability than it actually has. Cross validation is a commonly used validation strategy that can help avoid such biases. In cross validation, the data instances are partitioned into "k" groups (folds); in turn, each group is held out as the "test" instances, the algorithm derives a model using the remaining "training" instances, and the model is applied to the test instances. The algorithm's performance is evaluated by how well the predictions for the test instances coincide with the actual values being predicted. A common value for "k" is 10 (ten-fold cross validation).

Other variations include 1) leave-one-out cross validation, in which "k" equals the number of data instances, and 2) a simple training / testing split, in which the data are partitioned only once and only part of the data are ever used for testing. Another variation is to use "nested" cross validation within each training set. In this approach, some form of cross validation is used to optimize the model before it is applied to the "outer" test set. Going a step further, many studies also repeat cross validation multiple times on the same data set. This allows them to assess how robust their findings are when data instances are assigned differently (at random) to folds.

While such validation strategies are extremely useful, they can also be computationally intensive, especially for large data sets. ML-Flex addresses this cost by enabling analyses to be split across multiple "threads" on a single computer and/or across multiple computers, which can substantially shorten execution times.
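The fold partitioning described above can be sketched in a few lines of Python. This is only an illustration of the general technique, not ML-Flex's own implementation; the function name k_fold_splits and its parameters are hypothetical, and the 150-instance example simply mirrors the size of the Iris data set.

```python
import random


def k_fold_splits(n_instances, k, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation.

    Hypothetical helper for illustration; not part of ML-Flex.
    """
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    # Assign instances to folds at random; a different seed gives a
    # different assignment, which is what repeated cross validation varies.
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for i, test in enumerate(folds):
        # Every instance outside the held-out fold is a training instance.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield sorted(train), sorted(test)


# Ten-fold cross validation over 150 instances.
splits = list(k_fold_splits(n_instances=150, k=10))
for train, test in splits:
    assert set(train).isdisjoint(test)  # training and test never overlap
```

Setting k equal to the number of instances turns the same function into leave-one-out cross validation, and calling it with several seeds gives the repeated variant.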

Cross validation must be applied carefully, or biases can be introduced. Training sets must never overlap with test sets (which can be especially tricky in nested cross validation). It is also important that model optimization (for example, applying feature-selection algorithms to identify the most relevant features) be applied only to training sets. Some algorithms are powerful enough to pick up on subtle variations in a data set and (falsely) attain perfect accuracy if the model is trained or optimized on the full data set, test instances included. This phenomenon is a particularly harmful case of overfitting the data. The architecture of ML-Flex prevents such biases.
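As a minimal sketch of the rule that model optimization must see only training instances, the Python below runs feature selection inside each fold. It assumes binary 0/1 class labels, and both the selector (ranking features by the gap between class means) and the nearest-centroid classifier are deliberately simple, hypothetical stand-ins chosen for illustration; they are not algorithms taken from ML-Flex.

```python
import random


def select_features(rows, labels, n_keep):
    """Keep the n_keep features with the largest gap between class means."""
    def mean(values):
        return sum(values) / len(values) if values else 0.0

    gaps = []
    for f in range(len(rows[0])):
        pos = [r[f] for r, lab in zip(rows, labels) if lab == 1]
        neg = [r[f] for r, lab in zip(rows, labels) if lab == 0]
        gaps.append((abs(mean(pos) - mean(neg)), f))
    return [f for _, f in sorted(gaps, reverse=True)[:n_keep]]


def nearest_centroid_predict(train_rows, train_labels, test_rows, features):
    """Predict the class whose training centroid is closest, using `features`."""
    centroids = {}
    for lab in (0, 1):
        members = [r for r, l in zip(train_rows, train_labels) if l == lab]
        centroids[lab] = [sum(r[f] for r in members) / len(members) for f in features]
    predictions = []
    for row in test_rows:
        def distance(lab):
            return sum((row[f] - c) ** 2 for f, c in zip(features, centroids[lab]))
        predictions.append(min((0, 1), key=distance))
    return predictions


def cross_validated_accuracy(rows, labels, k, n_keep, seed=0):
    """k-fold cross validation with feature selection repeated in every fold."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    correct = 0
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        train_rows = [rows[t] for t in train_idx]
        train_labels = [labels[t] for t in train_idx]
        # Selection sees training instances only; the test fold stays untouched.
        features = select_features(train_rows, train_labels, n_keep)
        test_rows = [rows[t] for t in test_idx]
        predictions = nearest_centroid_predict(train_rows, train_labels, test_rows, features)
        correct += sum(p == labels[t] for p, t in zip(predictions, test_idx))
    return correct / len(rows)


# Pure-noise data: 40 instances, 50 random features, balanced labels.
rng = random.Random(1)
noise_labels = [i % 2 for i in range(40)]
noise_rows = [[rng.random() for _ in range(50)] for _ in noise_labels]
noise_accuracy = cross_validated_accuracy(noise_rows, noise_labels, k=5, n_keep=5)
print(noise_accuracy)
```

Because the features here are random noise, no systematic advantage over chance is expected when selection is confined to the training folds. Moving the select_features call outside the loop, so that it sees every instance, is the kind of leak the paragraph above warns against.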

Machine-learning algorithms have been developed in a wide variety of programming languages and offer many incompatible ways of interfacing with them. ML-Flex makes it possible to interface with any algorithm that provides a command-line interface. This flexibility enables users to perform machine-learning experiments with ML-Flex as a harness while applying algorithms that may have been developed in different programming languages or that may provide different interfaces.
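To illustrate the general idea of driving an algorithm through its command-line interface, the Python sketch below launches an external program, passes it data instances on standard input, and reads predictions from standard output. It does not show how ML-Flex itself is configured or how it invokes algorithms; the stand-in learner, the comma-separated input format, and the function predict_with_external_tool are all assumptions made for this example.

```python
import subprocess
import sys

# Stand-in for an external learner (an assumption for this sketch): it reads
# comma-separated feature rows on stdin and prints one prediction per line.
# A real tool could be any executable, written in any language.
STAND_IN_LEARNER = [
    sys.executable,
    "-c",
    "import sys\n"
    "for line in sys.stdin:\n"
    "    values = [float(v) for v in line.split(',')]\n"
    "    print('setosa' if values[2] < 2.5 else 'other')\n",
]


def predict_with_external_tool(command, rows):
    """Send rows to an external command and return its predictions."""
    payload = "\n".join(",".join(str(v) for v in row) for row in rows) + "\n"
    result = subprocess.run(
        command,
        input=payload,
        capture_output=True,
        text=True,
        check=True,  # raise an error if the external tool exits with a failure
    )
    return result.stdout.split()


# Two illustrative rows: sepal length, sepal width, petal length, petal width.
predictions = predict_with_external_tool(
    STAND_IN_LEARNER,
    [[5.1, 3.5, 1.4, 0.2], [6.3, 3.3, 6.0, 2.5]],
)
print(predictions)  # ['setosa', 'other']
```

Because the harness only depends on the command, its input, and its output, swapping in a learner written in another language means changing the command list and nothing else.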

The various tutorials in this directory explain how to configure and run ML-Flex experiments. Please review them and notify the mailing list if you have any questions or problems.


Table of Contents

Introduction to ML-Flex

Prerequisites

Configuring Algorithms

Creating an Experiment File

List of Experiment Settings

Running an Experiment

List of Command-line Arguments

Executing Experiments Across Multiple Computers

Modifying Java Source Code

Creating a New Data Processor

Third-party Machine Learning Software

Integrating with Third-party Machine Learning Software

About Ensemble Learners