Accelerate the End-to-End Machine Learning Training Pipeline by Optimizing I/O

This article is the first in a series introducing the architecture and solution to accelerate machine learning model training. The next article compares traditional solutions and explains how this new approach differs. 

Background: The Unique Requirements of AI/ML Model Training

With artificial intelligence (AI) and machine learning (ML) becoming more pervasive and business-critical, organizations are advancing their AI/ML capabilities and broadening the use and scalability of AI/ML applications. These AI/ML applications require data platforms to meet the following specific requirements:

I/O Efficiency for Frequent Data Access to Massive Number of Small Files

AI/ML pipelines are composed of not only model training and inference but also include data loading and preprocessing steps as a precursor, which have a major impact. In the data loading and preprocessing phase, AI/ML workloads often make more frequent I/O requests to a larger number of smaller files than traditional data analytics applications. Having a better I/O efficiency can dramatically increase the speed of the entire pipeline.

Higher GPU Utilization to Reduce Costs and Increase ROI

Model training is compute-intensive and requires GPUs to enable data to be processed quickly and accurately. Because GPUs are expensive, optimal utilization is critical. However, when utilizing GPUs, I/O becomes the bottleneck – workloads are bound by how fast data can be made available to the GPUs and not how fast the GPUs can perform training calculations. Data platforms need high throughput and low latency to fully saturate the GPU clusters to reduce the cost.

Natively Connect to Diverse Storage System

As the volume of data keeps growing, it gets more and more challenging for organizations to only have a single storage system. A variety of storage options are being used across business units, including on-premises distributed storage systems (HDFS, Ceph) and cloud storage (AWS S3, Azure Blob Store, Google Cloud Storage). Having access to all the training data spanning in different environments is necessary to make models more effective.

Support for the Cloud, Hybrid and Multi-Cloud infrastructure

In addition to supporting different storage systems, data platforms also need to support different deployment models. As the volume of data grows, cloud storage has become a popular choice with high scalability, reduced cost, and ease of use. Organizations want flexibility and openness to training models by leveraging available cloud, hybrid, and multi-cloud infrastructure. Also, the growing trend of separating compute resources from storage resources necessitates using remote storage systems. However, when storage systems are remote, data must be fetched over the network, bringing performance challenges. Data platforms need to achieve high performance while accessing data across heterogeneous environments.

In summary, today’s AI/ML workloads demand fast access to expansive amounts of data at a low cost in heterogeneous environments. Organizations need to modernize the data platform to enable those workloads to effectively access data, maintain high throughput and utilize the GPU resources used to train models.

Solution and Architecture Overview

Alluxio is an open-source data orchestration platform for analytics and machine learning applications. Alluxio not only provides a distributed caching layer between training jobs and underlying storage but also is responsible for connecting to the under storage, fetching data proactively or on-demand, caching data based on user-specified policy, and feeding data to the training frameworks.

The typical architecture is shown below:

Solution and architecture overview

You can use Alluxio to accelerate machine learning and deep learning training includes the following three steps:

  1. Deploy Alluxio on the training cluster
  2. Mount Alluxio as a local folder to training jobs
  3. Load data from local folders (backed by Alluxio) using a training script

Here, we discuss the basic features and benefits of this architecture.

Faster Access to Shared Data Using Caching

By caching data locally or closer to the training jobs, Alluxio provides high I/O throughput, which prevents unnecessarily low GPU utilization while waiting for fetching data.

Instead of duplicating the entire dataset into every single machine, Alluxio implements a shared distributed caching service, where data can be evenly distributed across the cluster. This can greatly improve storage utilization especially when the training dataset is much larger than the storage capacity of a single node. This is shown in the following figure:

In this figure, the whole dataset is stored in an object-store. To simplify the scenario, two files (/path1/file1 and /path2/file2) are chosen to represent this dataset. Small file e.g. /path1/file1 is stored as a single block in Alluxio while big file e.g. /path2/file2 is stored as multiple blocks in Alluxio. Instead of storing all the file blocks on each of the training machines, blocks will be stored distributedly in multiple machines. In this figure, Block A and B are stored in server one while block C is stored in server two. Each block can have multiple copies stored in multiple servers to prevent data loss and improve read concurrency. Block A is stored in both server A and server C.

To load data into cache in a distributed manner, users can choose to preload data manually, or dynamically cache data by applications on demand. 

Approach 1: Distributed Preload

Users can use the following command lines to mount an s3 bucket to Alluxio and load all the data in this s3 bucket to the Alluxio cache. Preloading will be done by multiple nodes and multiple threads in each node distributedly and concurrently to speed up the loading.

$ bin/alluxio fs mount /s3 s3://<bucket_name>/

  –option aws.accessKeyId=<access_key>

  –option aws.secretKey=<secret_key>

$ bin/alluxio fs distributedLoad /s3

The mount operation mounts an S3 bucket s3://<bucket_name> to the path /s3 in Alluxio namespace with AWS credentials. After this step, users can interact with data stored in the S3 bucket via Alluxio APIs targeting the Alluxio path /s3. For example, users can run bin/alluxio fs ls /s3 to list all the files or directories inside the S3 bucket.

The distributedLoad operation loads all the files stored in s3://<bucket_name> from under-storage system S3 into Alluxio storage distributed across Alluxio workers. All the following read operations targeting the S3 bucket via Alluxio will read from the Alluxio cache instead of reading from S3.

Approach 2: Dynamically Cache Data During Training

Alternatively, instead of waiting for the data preload before running the training, Alluxio allows training to start immediately while preloading data in the background. 

Training can start without loading any data into Alluxio and relies on Alluxio to auto-cache those required data. Preloading data in the background during training is optional and may help speed up the data access.

Dynamically caching data during training

By dynamically caching data into Alluxio during training, the I/O request to under storage will be dispersed to a longer time period to reduce the chance of exceeding the storage system’s request limit. In addition, unlike the distributedLoad command which caches the whole dataset, only needed hot data will be loaded and cached using the dynamic read and cache approach. Together with preloading and precaching data, GPU idle time will be largely reduced and GPU utilization rate during training will be increased.

Faster Metadata Operations With Consistency

Today, training models on millions or tens of millions of small files (e.g., images or video clips) is common. In this case, directly retrieving the metadata of a massive amount of files or directories becomes the performance bottleneck, and increases the cost greatly. 

Alluxio provides its own metadata service by caching and acting as the proxy of the cloud storage for metadata operations requested by the training. In case the data is modified out of the band at the cloud storage, Alluxio has the option to periodically synchronize and make sure the metadata is eventually consistent with the data source.

Advanced-Data Management Policies

Alluxio provides a set of built-in data management policies to help manage the storage resource for data cache efficiently.

Pluggable and transparent data replacement policies. Data cached in Alluxio storage space can be evicted and replaced by hotter data automatically when the cache is full. As a result, the cache capacity can be reused across different training runs and no explicit purge or curation is required.

Pin and free working set in cache. Alluxio cache may not be enough to host the whole training set. When the training job needs to load the same large data set multiple times, it is meaningless to keep freeing the dataset and re-caching. In this case, pin command can help prevent freeing the cache.

For example, a user who has 200TB training data wants to use Alluxio to cache 60% of the whole training dataset 120TB, and the remaining 40% 80TB can always go to the UFS directly. By pinning the dataset, the 120TB cached dataset will not be freed during training.

$ alluxio fs pin /not/evict/data

If a dataset is not used anymore, instead of waiting for the automatic data replacement policies, free command can help delete all the file cache for this dataset in Alluxio storage.

$ alluxio fs free /unused/data

Setting TTL (time-to-live) of data in cache. One can set the time-to-live of a file or a directory.

$ alluxio fs setTtl /data/good-for-one-day 86400000

After 1 day, the file will be deleted from Alluxio.

Data Replication. More data replications can provide more concurrency of reading this data. Alluxio supports explicitly setting data replication numbers by users and dynamically adding data replications when high read requests target the data.

Support Multiple APIs and Frameworks

Alluxio supports multiple APIs like the HDFS interface that is popular for big data applications e.g. Spark, Presto. The new POSIX interface is popular for training applications like PyTorch, Tensorflow, and Caffe. 

Multiple APIs like the HDFS interface

Users can use big data frameworks to preprocess data, store in the Alluxio, asynchronously persist to under storage, and at the same time use training frameworks to train models with preprocessed data in Alluxio. 

The Evolution and Real-World Use Cases

Alluxio POSIX interface was an experimental feature first introduced in Alluxio v1.0 in February 2016, contributed by researchers from IBM. In the early days, this feature was mostly used by legacy applications that did not support the HDFS interface, evolving slowly in functionalities and performance. In early 2020, the surge of machine learning training workloads started to guide and drive the production readiness of this feature. 

Particularly, the Alluxio core team has been working closely with our open-source community including engineers from Alibaba, Tencent, Microsoft, and researchers from Nanjing University. With the strong and efficient collaboration, hundreds of issues are addressed and major improvements introduced (e.g., JNI-based FUSE connector, Container Storage Interface, performance optimization on many small files).

Here we demonstrate some real-world use cases and the architecture of Alluxio in production at the most data-intensive companies in the world.

Boss Zhipin

Boss Zhipin (NASDAQ: BZ) is the largest online recruitment platform in China. At Boss Zhipin, Alluxio is used as the caching storage for underlying Ceph and HDFS. Spark and Flink read data from Alluxio, preprocess the data, and then write back to Alluxio cache. In the backend, Alluxio persists the preprocessed data back to Ceph under storage and HDFS. Without waiting for writing to Ceph or HDFS to finish, training applications like Tensorflow, PyTorch, and other Python applications can read preprocessed data from Alluxio for training.


Tencent deployed Alluxio in a cluster of 1000 nodes to accelerate the data preprocessing of model training on the game AI platform. The introduction of Alluxio significantly increased the concurrency of AI workloads at a low cost. There is no change to the applications since they are using POSIX API to access Alluxio. 

Microsoft Bing

The Microsoft Bing team used Alluxio to speed up large-scale ML/DL offline inference. By implementing Alluxio, they are able to speed up the inference job, reduce I/O stall, and improve performance by about 18%. 


Unisound is an AI company focusing on IoT services. Unisound used Alluxio as a cache layer between computing and storage in its GPU/CPU heterogeneous computing and distributed file system. With Alluxio, users can enjoy fast data access because Alluxio brings the underlying storage to memory or local hard drives on each computing node. The entire platform utilizes the resources of both the distributed file system and the local hard disk. 


Momo (NASDAQ: MOMO) is the company that develops the popular mobile application that connects people and facilitates social interactions. Momo has multiple Alluxio clusters in production with thousands of Alluxio nodes. Alluxio has helped Momo to accelerate compute/training tasks and reduce the metadata and data overhead of under storage of billions of image training.

How to Get Started

In this article, we outlined Alluxio’s architecture and solution to improve training performance and simplify data management. In summary, it provides the following benefits:

  • Distributed caching to enable data sharing between tasks and nodes. 
  • Unified data access across multiple data sources. 
  • End-to-end pipeline acceleration spanning ingestion, data preprocessing, and training.
  • Advanced data management policies including data preloading and data pinning to further simplify data management and improve performance.

To get started with model training with Alluxio, visit this documentation for more details.


Leave a Comment