What is Inference Parallelism and how it works

Posted on December 18, 2024

Member post originally published on the InfraCloud blog by Aman Juneja, Principal Solutions Engineer at InfraCloud Technologies

In recent years, we’ve witnessed two recurring trends: the release of increasingly powerful GPUs and the introduction of Large Language Models (LLMs) with billions or trillions of parameters and expansive context windows. Many businesses are leveraging these LLMs by fine-tuning them or building out apps with domain-specific knowledge using RAG and deploying them on dedicated GPU servers. Now, when it comes to deploying these models on GPU, one thing to notice is the model size, i.e., the space required (for storing the parameters and context tokens) to load the model into the GPU memory is too high compared to the memory available on the GPU.

Model Size vs GPU Memory

There are methods to reduce model sizes by using optimization techniques like quantization, pruning, distillation & compression, etc. But if you notice in the below comparison table between the latest GPU memory and space requirement for 70B models (FP16 quantized), it’s almost impossible to handle multiple requests at a time, or in some GPUs, the model will not even fit on the memory.

GPU	FP16 (TFLOPS) with sparsity	GPU Memory (GB)
B200	4500	192
B100	3500	192
H200	1979	141
H100	1979	80
L4	242	24
L40S	733	48
L40	362	48
A100	624	80

This is all with already applied FP16 quantization that incurs some loss of precision (which is usually acceptable in many of the generic use cases).

Models	KV Cache in GB for Parameters (FP16)
llama3-8B	16
llama3-70B	140
llama-2-13B	26
llama2-70B	140
mistral-7B	14

This brings us to the context of this blog post, i.e. how enterprises run large billion or trillion parameters LLM models on these modern datacenter GPUs. Are there any ways to split these models into smaller pieces and run only what is required at the moment, or can we distribute the parts of the model into different GPUs? I will try to answer these questions in this blog post with the current set of methods available to perform inference parallelization and also will try to highlight some of the tools/libraries that support these methods of parallelization.

Inference parallelism

Inference parallelism aims to distribute the computational workload of AI models, particularly deep learning models, across multiple processing units such as GPUs. This distribution allows for faster processing, reduced latency, and the ability to handle models that exceed the memory capacity of a single device.

Four primary methods have been developed to achieve inference parallelism, each with its strengths and applications:

Data Parallelism
Tensor Parallelism
Pipeline Parallelism
Expert Parallelism

Data parallelism

In data parallelism, we deploy multiple copies of models on different GPUs or GPU clusters. Each copy of the model independently processes the user request. In a simple analogy, this is like having multiple replicas of 1 microservice.

Now, a common question one might have is how it solves the problem of model size fitting into GPU memory, which we discussed at the start, and the short answer is that it doesn’t. This method is only recommended for smaller models that can fit into the GPU memory. In those cases, we can use multiple copies of the model deployed on different GPU instances and distribute the requests to different instances hence providing enough GPU resources for each request and also increasing the availability of the service. This will also increase the overall request throughput for the system as you have more instances to handle the traffic now.

Tensor parallelism

In tensor parallelism, we split each layer of the model across different GPUs. A single user request will be shared across multiple GPUs and the result of each request’s GPU computations will be recombined over a GPU-to-GPU network.

To understand it better, as the name suggests, we split the tensors into chunks along a particular dimension such that each device only holds 1/N chunk of the tensor. Computation is performed using this partial chunk to get partial output. These partial outputs are collected from all devices and then combined.

As you might have noticed already, the bottleneck to the performance of tensor parallelism is the speed of the network between GPU-to-GPU. As each request will be computed across different GPUs and then combined, we need a high-performance network to ensure low latency numbers.

Pipeline parallelism

In pipeline parallelism, we distribute a group of model layers across different GPUs. Layer-based partitioning is the fundamental approach in pipeline parallelism. The model’s layers are grouped into continuous blocks, forming stages. This partitioning is typically done vertically through the network’s architecture. Computational balance is a key consideration. Ideally, each stage should have an approximately equal computational load to prevent bottlenecks. This often involves grouping layers of varying complexities to achieve balance. Memory usage optimization is another critical factor. Stages are designed to fit within the memory constraints of individual devices while maximizing utilization. Communication overhead minimization is also important. The partitioning aims to reduce the amount of data transferred between stages, as inter-device communication can be a significant performance bottleneck.

So for example, if you are deploying LLaMA3-8B model that has 32 layers on a 4 GPU instance, you can split and distribute 8 layers of model on each GPU. The processing of requests happens in a sequential manner where the computation starts at one GPU and continues to the next GPU with point-to-point communication.

Again, as multiple GPU instances are involved, the networking can become a huge bottleneck if we do not have high-speed network communication between the GPUs.This parallelism can increase the GPU throughput as every request will need fewer resources from each GPU and should be easily available, but it will end up increasing the overall latency as the request will be processed sequentially, and delay in any GPU computation or network component will cause an overall surge in latency.

Model parallelism layer wise — (Image Source)

Expert parallelism

Expert parallelism, often implemented as a Mixture of Experts (MoE), is a technique that allows for the efficient use of large models during the inference process. It doesn’t solve the problem of fitting the models into the GPU memory but provides an option to have a broad capability model serving the requests based on the request context. In this technique, the model is divided into multiple expert sub-networks. Each expert is typically a neural network trained to handle specific types of inputs or subtasks within the broader problem domain. A gating network determines which expert to use for each input. Only a subset of experts is activated for any given input. Different experts can be distributed across different GPUs. Router/Gating network and active experts can operate in parallel. Inactive experts don’t consume computational resources. This greatly reduces the number of parameters that each request must interact with, as some experts are skipped. But like Tensor & Pipeline parallelism the overall request latency relies heavily on the GPU-to-GPU communication network. A request must be reconstituted back to their original GPUs after expert processing generating high networking communication over the GPU-to-GPU interconnect fabric.

This approach can lead to better utilization of hardware compared to Tensor parallelism as you don’t have to split the operations into smaller chunks.

Following is a summary and comparison of the methods we discussed. You can use it as a reference when planning to choose one for your use case.

Aspect	Data Parallelism	Tensor Parallelism	Pipeline Parallelism	Expert Parallelism
Basic Concept	Splits input data across multiple devices	Splits individual tensors/layers across devices	Splits model into sequential stages across devices	Splits model into multiple expert sub-networks
How it Works	The same model is replicated on each device, processing different data chunks	Single layer/operation distributed across multiple devices	Different parts of the model pipeline on different devices	The router selects specific experts for each input
Parallelization Unit	Batch of inputs	Individual tensors/layers	Model stages	Experts (sub-networks)
Scalability	Scales well with batch size	Scales well for very large models	Scales well for deep models	Scales well for wide models
Memory Efficiency	Low (full model on each device)	High (only part of each layer on each device)	High (only part of the model on each device)	Medium to High (experts distributed across devices)
Communication Overhead	Low	Medium to High	Low (only between adjacent stages)	Medium (router communication and expert selection)
Load Balancing	Generally balanced if data is evenly distributed	Balanced within operations	Can be challenging, and requires careful stage design	Can be challenging, and requires effective routing
Latency	Low for large batches	Can increase for small batches	Higher due to pipeline depth	Can be low if routing is efficient
Throughput	High for large batches	Can be high for large models	High, especially for deep models	Can be very high for diverse inputs
Typical Use Cases	Large batch inference, embarrassingly parallel tasks	Very large models that don’t fit on a single device	Deep models with sequential dependencies	Models with diverse sub-tasks or specializations
Challenges	Limited by batch size, high memory usage	Complex implementation, potential communication bottlenecks	Pipeline bubble, difficulty in optimal stage partitioning	Load balancing, routing overhead, training instability
Adaptability to Input Size	Highly adaptable	Less adaptable, fixed tensor partitioning	Less adaptable, fixed pipeline	Highly adaptable, different experts for different inputs
Suitable Model Types	Most model types	Transformer-based models, very large neural networks	Deep sequential models	Multi-task models, language models with diverse knowledge
Supported Inference Backends	TensonRT-LLM, vLLM, TGI	TensonRT-LLM, vLLM, TGI	TensonRT-LLM, vLLM, TGI	TensonRT-LLM, vLLM, TGI

Parallelism techniques combined

By now, you might already be thinking that if using the above parallelism methods means we are reducing the overall consumption or utilization of GPU, then can we not combine or replicate these to increase the overall GPU throughput? Combining Inference parallelism methods can lead to more efficient and scalable systems, especially for large and complex models.

In the table below, you can see 4 possible options but in actual scenarios based on the number of GPUs you have, this combination can grow to a very large number.

Data Parallelism (DP) + Pipeline Parallelism (PP)	Tensor Parallelism (TP) + Pipeline Parallelism (PP)	Expert Parallelism (EP) + Data Parallelism (DP)	Tensor Parallelism (TP) + Expert Parallelism (EP)
Split the model into stages (pipeline parallelism)	Divide the model into stages (pipeline parallelism)	Distribute experts across devices (expert parallelism)	Split large expert models across devices (tensor parallelism)
Replicate each stage across multiple devices (data parallelism)	Split large tensors within each stage across devices (tensor parallelism)	Process multiple inputs in parallel (data parallelism)	Split large expert models across devices (tensor parallelism)

So for example let’s say you have 64 GPU available and you are planning to deploy llama3-8b or Mistral 8*7B model on them. The following are some of the possible combinations of the parallelism methods. These are just examples to understand the parallelism combination strategies, for the actual use case you need to consider and benchmark other options as well.

LLaMA 3-8B (64 GPUs)

Strategy	GPU Allocation	Pros	Cons
PP8TP8	64 = 8 (pipeline) × 8 (tensor)	– Balanced distribution – Reduced Communication	– Pipeline bubbles – Increased latency
PP4TP4DP4	64 = 4 (pipeline) × 4 (tensor) × 4 (data)	– Higher throughput – Flexible for batch sizes	– Complex integration – Requires large batches
DP8TP8	64 = 8 (data) × 8 (tensor)	– Higher throughput – No pipeline bubbles	– Large batch size needed – High memory per GPU

Mistral 8*7B (64 GPUs)

Strategy	GPU Allocation	Pros	Cons
EP8TP8	64 = 8 (experts) × 8 (tensor per expert)	– Balanced memory distribution – Efficient memory use	– Complex load balancing – Potential compute underutilization
EP8TP4DP2	64 = 8 (experts) × 4 (tensor) × 2 (data)	– Higher throughput – Balanced utilization	– Needs careful load balancing – Large batch sizes required
EP8PP2TP4	64 = 8 (experts) × 2 (pipeline) × 4 (tensor)	– Supports deeper experts – Flexible scaling	– Increased latency – Complex synchronization

How to choose one?

Now we have covered 4 methods of Inference parallelism, and then we have multiple combinations of these methods, so a common question you might be having is how do you choose or identify which method to use. So, the choice of inference parallelism methods broadly depends on the following factors:

Model architecture
Use case requirements (Latency vs Throughput)
Hardware configuration

Model architecture

Model architecture plays a crucial role in determining the most effective inference parallelism strategy. You need to identify your model architecture and then choose the parallelism method or combination of them that fits well. Different model structures fit to different parallelization techniques:

Large, deep models (e.g., GPT-4, PaLM 2):
- Benefit from pipeline parallelism
- Can be split into stages across multiple GPUs
- Good for models with many sequential layers
Wide models with large layer sizes (LLaMA 2, BLOOM):
- Ideal for tensor parallelism
- Allow splitting individual layers across GPUs
- Effective for models with very large matrix operations
Ensemble models or Mixture of Experts (Mixtral 8x7B, Switch Transformers):
- Well-suited for expert parallelism
- Different experts can be distributed across GPUs
- Useful for models that use different sub-networks for various tasks
Models with small computational graphs (GPT-3.5-turbo, Falcon-7B):
- Often works well with simple data parallelism
- Can replicate the entire model across GPUs
- Effective when the model fits entirely in GPU memory
Attention-based models (e.g., Transformers):
- Can benefit from attention slicing or multi-head parallelism
- Allow distributing attention computations across GPUs

Use case requirements (Latency vs Throughput tradeoff)

You need to be familiar with your business requirements or use case, i.e., what matters to business more, the latency of the user requests, or utilization of the GPU. Or can you identify the mode of your application i.e., is it a real-time application where response time to the user request is the main driver for the user experience, or is it an offline system where response time to the user request is not the primary concern

The reason we need to be aware of this is the tradeoff between latency and GPU throughput. If you want to reduce the latency to the user request you need to allocate more GPU resources to each request and your choice of parallelism method will rely on that. You can do optimal batching so that requests are not struggling to acquire the GPU resource which increases the overall latency.

However, if latency is not the consideration, then your goal should be to achieve maximum throughput on the GPU by choosing the right parallelism method and appropriate batch sizes that utilize GPU resources to its maximum throughput.

Use Case	Preferred Parallelism	Latency vs Throughput Consideration
Real-time chatbots	Data parallelism or Tensor parallelism	Low latency priority; moderate throughput
Batch text processing	Pipeline parallelism	High throughput priority; latency less critical
Content recommendation	Expert parallelism	High throughput for diverse inputs; moderate latency
Sentiment analysis	Data parallelism	High throughput; latency less critical for bulk processing
Voice assistants	Tensor parallelism or Pipeline parallelism	Very low latency priority; moderate throughput

Hardware configuration

Hardware configuration often dictates the feasible parallelism strategies, and the choice should be optimized for the specific inference workload and model architecture. Following are some hardware component choices that impact the overall parallelism choices.

Hardware Component	Impact on Parallelism Choice	Examples
GPU Memory Capacity	Determines feasibility of data parallelism and influences the degree of model sharding	NVIDIA A100 (80GB) allows larger model chunks than A100 (40GB)
Number of GPUs	Affects the degree of parallelism possible across all strategies	8 GPUs enable more parallelism than 4 GPUs
GPU Interconnect Bandwidth	Influences efficiency of tensor and pipeline parallelism	NVLink offers higher bandwidth than PCIe, benefiting tensor parallelism
CPU Capabilities	Impacts data preprocessing and postprocessing in parallelism strategies	High-core count CPUs can better handle data parallelism overhead
System Memory	Affects the ability to hold large datasets for data parallelism	1TB system RAM allows larger batch sizes than 256GB
Storage Speed	Influences data loading speeds in data parallelism	NVMe SSDs provide faster data loading than SATA SSDs
Network Bandwidth	Critical for distributed inference across multiple nodes	Networks based on Infiniband, RoCE are faster than conventional networks for GPU fabrics
Specialized Hardware	Enables specific optimizations	Google TPUs are optimized for tensor operations

Conclusion

AI inference parallelism is a game-changer for running big AI models efficiently. We’ve looked at different ways to split up the work, like data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism. Each method has its own pros and cons, and choosing the right one depends on your specific needs and setup. It’s exciting to see tools like TensorRT-LLM, vLLM, and Hugging Face’s Text Generation Inference making these advanced techniques easier for more people to use. As AI models keep getting bigger and more complex, knowing how to use these parallelism techniques will be super important. They’re not just about handling bigger models – they’re about running AI smarter and more efficiently. By using these methods well, we can do amazing things with AI, making it faster, cheaper, and more powerful for all kinds of uses.

The future of AI isn’t just about bigger models; it’s about finding clever ways to use them in the real world. With these parallelism techniques, we’re opening doors to AI applications that were once thought impossible. If you’re looking for experts who can help you scale or build your AI infrastructure, reach out to our AI & GPU Cloud experts.

If you found this post valuable and informative, subscribe to our weekly newsletter for more posts like this. I’d love to hear your thoughts on this post, so do start a conversation on LinkedIn.

Hong Kong