One of the main features of ML-Flex is that multiple cores/processors on a given computer and multiple computers on a given network can be employed simultaneously to execute a single experiment in parallel. This feature can be particularly useful when working with large data sets and/or when applying a large number of algorithms.
No ML-Flex configuration changes are necessary to execute an experiment in parallel. The sole hardware requirement is that the ML-Flex execution files be located on a file system that can be accessed by all computers executing the experiment. This arrangement is common in cluster-computing environments. The rest of this tutorial focuses on a use case in which ML-Flex is installed on a cluster-computing environment and execution is shared across multiple computing nodes. Details may vary from one environment to another, but the concepts should be similar regardless of the specific network configuration, as long as ML-Flex is installed on a file system that is shared by all computing nodes.
Many cluster-computing environments have "interactive" computing nodes that provide an interface for users to interact with the environment. The user logs into an interactive node (for example, via the ssh protocol), installs software, creates scripts to specify job execution tasks, submits jobs to a batch/queue system, and monitors jobs. After a job has been submitted, the batch/queue system distributes the job to one or more "compute" nodes as they become available. The compute nodes execute the job and typically store the results on the file system, where the user can later access them via an interactive node. Interactive nodes typically have a configuration similar to that of the compute nodes and have access to the same file system(s).
One software product that is often used as a batch/queue system is Portable Batch System (PBS). (A user guide for this software can be downloaded here.) To create a processing job, users write a command-line script, save it to a file, and submit the script to PBS via the "qsub" command. The command-line script must contain certain information in a file header, which helps PBS determine, among other things, the resources that will be required to execute the job. (Please see Example PBS Job Script to get an idea of what a PBS script needs to look like.) Once a job has been submitted, users can monitor it using the "qstat" command to see whether it is waiting in the queue or executing.
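As a rough illustration of the header described above, the following Python sketch assembles the text of a minimal PBS job script. It is not taken from the ML-Flex distribution. The function name build_pbs_script and its parameters are hypothetical, the "-l nodes=1:ppn=N" resource syntax assumes a TORQUE-style PBS installation (other PBS variants may use a different resource syntax), and the command passed in is a placeholder that should be replaced with the actual ML-Flex invocation used at your site.

```python
def build_pbs_script(job_name, command, walltime="01:00:00", cores_per_node=1):
    """Return the text of a PBS job script that runs `command` on one node.

    All names and defaults here are illustrative assumptions.
    """
    lines = [
        "#!/bin/bash",
        # Header directives that PBS reads to decide how to schedule the job.
        f"#PBS -N {job_name}",
        f"#PBS -l nodes=1:ppn={cores_per_node}",
        f"#PBS -l walltime={walltime}",
        "",
        "# Change to the directory from which the job was submitted.",
        'cd "$PBS_O_WORKDIR"',
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Placeholder command: substitute the real ML-Flex command line here.
    print(build_pbs_script("mlflex_job",
                           "echo 'replace with the ML-Flex command'",
                           cores_per_node=4))
```

The generated text can be saved to a file and passed to "qsub". The header requests a single node, which matches the one-node-per-job approach this tutorial recommends.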
To execute ML-Flex in parallel on a cluster-computing environment that uses PBS, various steps would be followed:
Step #5 can be repeated as many times as the user desires. As nodes become available, they will begin executing the specified ML-Flex experiment. Note that although a PBS script can typically request that multiple compute nodes execute a single job, that option is not appropriate for ML-Flex. Instead, request a single node for each job and submit multiple jobs.
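The pattern of submitting the same single-node job several times can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not part of ML-Flex: the function name submit_jobs and the script file name mlflex_job.pbs are hypothetical, and the sketch assumes that "qsub" is on the PATH of the interactive node. By default it performs a dry run that only returns the commands it would execute.

```python
import subprocess


def submit_jobs(script_path, n_jobs, dry_run=True):
    """Submit the same PBS script n_jobs times, one single-node job each.

    With dry_run=True, return the command lines without running them.
    Otherwise, return whatever qsub prints for each submission.
    """
    commands = [["qsub", script_path] for _ in range(n_jobs)]
    if dry_run:
        return [" ".join(cmd) for cmd in commands]

    outputs = []
    for cmd in commands:
        # check=True raises CalledProcessError if qsub rejects the job.
        result = subprocess.run(cmd, check=True, capture_output=True, text=True)
        outputs.append(result.stdout.strip())
    return outputs


if __name__ == "__main__":
    # Hypothetical script name; a dry run never contacts the queue.
    for line in submit_jobs("mlflex_job.pbs", 3):
        print(line)
```

Calling submit_jobs with dry_run=False on an interactive node would queue the jobs, after which "qstat" can be used to monitor them.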