Case Study

OPPO

How CubeFS Accelerates AI Training 3x in OPPO’s Hybrid Cloud Platform

Challenge

In 2021, Oppo built a one-stop service for a large-scale machine learning platform based on the distributed file system CubeFS. In 2022, to optimize costs, it began building a hybrid GPU cloud on top of the machine learning platform. While the majority of GPU computing power remained in Oppo’s private cloud, tasks would be scheduled to the public GPU cloud for training whenever private resources were insufficient. This hybrid GPU-cloud architecture presented several challenges for CubeFS storage, including degraded storage performance, reduced elasticity and capacity, security issues, and high costs.

Solution

To tackle these challenges, Oppo exploited the access pattern of AI training – a process of repeated reads of the same dataset over multiple epochs. By enabling the caching function of CubeFS’s client, training data stored on CubeFS is cached on the GPU computing node side. This reduces CubeFS’s metadata latency to the microsecond level, and data latency to the hundred-microsecond level – accelerating AI training.

Impact

This approach has helped the machine learning platform realize unified storage across hybrid cloud GPU resources. Not only does this give business users transparency and consistent performance; it also frees them from maintaining separate copies of datasets.

Plus, with cache acceleration enabled, training with the ResNet-18 model (a classic, widely used algorithm in the field of artificial intelligence) improves performance by up to 360%, while training with the AlexNet model improves performance by up to 130%.

Published:
March 28, 2023

Projects used

CubeFS

By the numbers

360% improvement

In performance with cache acceleration

2 minutes

To sync 10 million files

100 billion training models

Stored in machine learning platform

Oppo is China’s largest smartphone manufacturer, holding the fifth-largest market share among smartphone manufacturers globally, with devices sold in more than 200,000 retail outlets.

In 2021, AI pioneer Oppo built a large-scale, end-to-end machine learning platform, leveraging CubeFS, to support AI training for more than 100 Oppo businesses – running tens of thousands of training tasks every day across business fields including voice assistant, commercial advertisement, OPPO game, ctag multi-mode, NLP, and more.

However, due to the 2 ms network latency between the public cloud and the private cloud, the storage performance of public GPU cloud hosts accessing the remote CubeFS cluster dropped sharply – for metadata operations, data IO, latency-sensitive small-file training datasets, and large-file training datasets in LMDB format.

Comparing training tasks on the public GPU cloud with those on the private GPU cloud, Oppo found that the number of image samples processed by a single GPU card per epoch differed by a factor of 2-4. At the same time, the IO bottleneck led to low utilization of GPU cards, wasting training time and GPU resources.
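
To see why that 2 ms link is so punishing for small-file workloads, a back-of-the-envelope sketch helps (our illustration, not a figure from the case study): if each small-file read pays at least one metadata round trip, latency alone caps a single reader’s throughput.

```go
package main

import "fmt"

func main() {
	// Back-of-the-envelope illustration (assumed numbers, not Oppo's benchmarks):
	// each small-file read is assumed to pay at least one metadata round trip.
	const remoteRTT = 2e-3   // ~2 ms across the public/private cloud link
	const cachedRTT = 100e-6 // ~100 µs with node-local caching

	fmt.Printf("remote ceiling: %6.0f ops/s per thread\n", 1/remoteRTT) // 500
	fmt.Printf("cached ceiling: %6.0f ops/s per thread\n", 1/cachedRTT) // 10000
}
```

Real training pipelines overlap IO with computation and issue requests in parallel, which is why the observed gap was 2-4x rather than the raw latency ratio.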

Diagram showing Oppo's private cloud architecture

Why CubeFS?

CubeFS is a cloud-native storage platform hosted by the Cloud Native Computing Foundation as an incubating project. It is commonly used as the storage infrastructure for online applications, databases, data processing services, and machine learning jobs orchestrated by Kubernetes. One of its advantages is that it separates storage from compute, so each can scale up or down independently based on workload, providing total flexibility in matching resources to the actual storage and compute capacity required at any given time.

Diagram showing CubeFS architecture

CubeFS consists of a metadata subsystem, a data subsystem, and a resource manager, which can be accessed by different clients (as a set of application processes) hosted in containers through different file system instances called volumes.
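
Because a client exposes a volume as a POSIX file system (the FUSE client mounts it like a local directory), training code needs no special SDK. A minimal sketch, assuming a volume mounted at the hypothetical path /mnt/cubefs:

```go
package main

import (
	"fmt"
	"log"
	"os"
)

func main() {
	// A mounted CubeFS volume behaves like any local directory,
	// so standard file IO works unchanged. The path is hypothetical.
	const sample = "/mnt/cubefs/dataset/train/img_000001.jpg"

	data, err := os.ReadFile(sample)
	if err != nil {
		log.Fatalf("read from CubeFS mount failed: %v", err)
	}
	fmt.Printf("read %d bytes from the mounted volume\n", len(data))
}
```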

Elastic capacity and security

In the initial stages, when AI training tasks were scheduled to the public cloud to work around the hybrid cloud storage performance problem, users had to migrate their CubeFS data from Oppo’s private cloud to public cloud storage. This raised security compliance issues for datasets, made for a poor user experience, and incurred high costs to address each security issue individually – and it meant the machine learning platform could not schedule GPU computing power flexibly.

To tackle these issues, Oppo devised an innovative data-caching solution, built from the components described below:

Data cache

The purpose of data acceleration is to cache a file’s data extents on the local disk. Because disk resources are limited, multiple training tasks need to share the cache disk space. The data cache acts as a separate data read/write service that manages cached data, providing interfaces for read, write, and delete operations, as well as garbage collection (GC) of cached data.
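
A minimal sketch of what such a cache service’s interface could look like in Go (the CacheService name and method signatures are assumptions for illustration, not CubeFS’s actual API):

```go
package cache

import "io"

// CacheService manages extent data cached on a GPU node's local disk.
// Illustrative only; CubeFS's real cache service may differ.
type CacheService interface {
	// Read copies up to len(p) bytes of the cached extent at offset into p.
	Read(key string, offset int64, p []byte) (int, error)

	// Write stores extent data fetched from the remote CubeFS cluster.
	Write(key string, offset int64, r io.Reader) error

	// Delete evicts a single cached extent, e.g. after invalidation.
	Delete(key string) error

	// GC reclaims disk space so that concurrent training tasks can
	// share the limited local cache capacity.
	GC() error
}
```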

Cache layout

A file in CubeFS consists of data extents. To simplify the management of cached file data, a full extent is cached at a time, and each extent is uniquely identified by a shard key following the naming convention volumeName_inodeId_extentId_fileOffset. The extent’s location under the cache directory is computed from this key by hash rules.

Diagram showing Oppo's cache layout

Cache consistency

Although AI training is a write-less, read-more process, it is still necessary to ensure cache consistency for files that do get modified:

Diagram showing how Oppo ensures cache consistency
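
One common approach, sketched below, is a version-based check (an assumption for illustration, not necessarily the exact mechanism Oppo uses): each cached extent records the source extent’s version at cache time, and a cache hit is validated against the source’s current version before being served.

```go
package cache

import "errors"

// ErrStale signals that a cached extent no longer matches the source.
var ErrStale = errors.New("cached extent is stale")

// extentMeta records the version an extent had when it was cached.
// Illustrative only; CubeFS's actual consistency protocol may differ.
type extentMeta struct {
	key     string
	version uint64 // assumed to be bumped on every write to the source extent
}

// checkFresh validates a cache hit against the source's current version.
// On ErrStale the caller evicts the entry and re-fetches from CubeFS.
func checkFresh(cached extentMeta, sourceVersion uint64) error {
	if cached.version != sourceVersion {
		return ErrStale
	}
	return nil
}
```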

What’s next for Oppo?

Today, CubeFS is used as the storage engine of Oppo’s machine learning platform, supporting business data at the scale of tens of billions and storing hundreds of billions of training models while meeting requirements for service availability, performance, and scalability.

Because CubeFS’s client-side caching solution pulls remote data to computing nodes for storage acceleration, the current use case can be expanded further: training datasets can be stored in a lower-cost, HDD-based CubeFS cluster, with client-side cache acceleration enabled during AI training – better balancing storage cost and performance.

In the future, Oppo plans to further leverage CubeFS to realize elastic scheduling of GPU computing power across multiple private clouds.