Member post originally published on the InfraCloud blog by Aman Juneja, Principal Solutions Engineer at InfraCloud Technologies

In recent years, we’ve witnessed two recurring trends: the release of increasingly powerful GPUs and the introduction of Large Language Models (LLMs) with billions or trillions of parameters and expansive context windows. Many businesses are leveraging these LLMs by fine-tuning them or building out apps with domain-specific knowledge using RAG and deploying them on dedicated GPU servers. Now, when it comes to deploying these models on GPU, one thing to notice is the model size, i.e., the space required (for storing the parameters and context tokens) to load the model into the GPU memory is too high compared to the memory available on the GPU.

Model Size vs GPU Memory

There are methods to reduce model sizes by using optimization techniques like quantization, pruning, distillation & compression, etc. But if you notice in the below comparison table between the latest GPU memory and space requirement for 70B models (FP16 quantized), it’s almost impossible to handle multiple requests at a time, or in some GPUs, the model will not even fit on the memory.

GPUFP16 (TFLOPS)
with sparsity
GPU Memory
(GB)
B2004500192
B1003500192
H2001979141
H100197980
L424224
L40S73348
L4036248
A10062480

This is all with already applied FP16 quantization that incurs some loss of precision (which is usually acceptable in many of the generic use cases).

ModelsKV Cache in GB for Parameters (FP16)
llama3-8B16
llama3-70B140
llama-2-13B26
llama2-70B140
mistral-7B14

This brings us to the context of this blog post, i.e. how enterprises run large billion or trillion parameters LLM models on these modern datacenter GPUs. Are there any ways to split these models into smaller pieces and run only what is required at the moment, or can we distribute the parts of the model into different GPUs? I will try to answer these questions in this blog post with the current set of methods available to perform inference parallelization and also will try to highlight some of the tools/libraries that support these methods of parallelization.

Inference parallelism

Inference parallelism aims to distribute the computational workload of AI models, particularly deep learning models, across multiple processing units such as GPUs. This distribution allows for faster processing, reduced latency, and the ability to handle models that exceed the memory capacity of a single device.

Four primary methods have been developed to achieve inference parallelism, each with its strengths and applications:

Data parallelism

In data parallelism, we deploy multiple copies of models on different GPUs or GPU clusters. Each copy of the model independently processes the user request. In a simple analogy, this is like having multiple replicas of 1 microservice.

Now, a common question one might have is how it solves the problem of model size fitting into GPU memory, which we discussed at the start, and the short answer is that it doesn’t. This method is only recommended for smaller models that can fit into the GPU memory. In those cases, we can use multiple copies of the model deployed on different GPU instances and distribute the requests to different instances hence providing enough GPU resources for each request and also increasing the availability of the service. This will also increase the overall request throughput for the system as you have more instances to handle the traffic now.

Data parallelism
(ImageSource)

Tensor parallelism

In tensor parallelism, we split each layer of the model across different GPUs. A single user request will be shared across multiple GPUs and the result of each request’s GPU computations will be recombined over a GPU-to-GPU network.

To understand it better, as the name suggests, we split the tensors into chunks along a particular dimension such that each device only holds 1/N chunk of the tensor. Computation is performed using this partial chunk to get partial output. These partial outputs are collected from all devices and then combined.

Tensor parallelism
(Image Source)

As you might have noticed already, the bottleneck to the performance of tensor parallelism is the speed of the network between GPU-to-GPU. As each request will be computed across different GPUs and then combined, we need a high-performance network to ensure low latency numbers.

Model parallelism
(Image Source)

Pipeline parallelism

In pipeline parallelism, we distribute a group of model layers across different GPUs. Layer-based partitioning is the fundamental approach in pipeline parallelism. The model’s layers are grouped into continuous blocks, forming stages. This partitioning is typically done vertically through the network’s architecture. Computational balance is a key consideration. Ideally, each stage should have an approximately equal computational load to prevent bottlenecks. This often involves grouping layers of varying complexities to achieve balance. Memory usage optimization is another critical factor. Stages are designed to fit within the memory constraints of individual devices while maximizing utilization. Communication overhead minimization is also important. The partitioning aims to reduce the amount of data transferred between stages, as inter-device communication can be a significant performance bottleneck.

So for example, if you are deploying LLaMA3-8B model that has 32 layers on a 4 GPU instance, you can split and distribute 8 layers of model on each GPU. The processing of requests happens in a sequential manner where the computation starts at one GPU and continues to the next GPU with point-to-point communication.

Again, as multiple GPU instances are involved, the networking can become a huge bottleneck if we do not have high-speed network communication between the GPUs.This parallelism can increase the GPU throughput as every request will need fewer resources from each GPU and should be easily available, but it will end up increasing the overall latency as the request will be processed sequentially, and delay in any GPU computation or network component will cause an overall surge in latency.

Model parallelism layer wise
(Image Source)

Expert parallelism

Expert parallelism, often implemented as a Mixture of Experts (MoE), is a technique that allows for the efficient use of large models during the inference process. It doesn’t solve the problem of fitting the models into the GPU memory but provides an option to have a broad capability model serving the requests based on the request context. In this technique, the model is divided into multiple expert sub-networks. Each expert is typically a neural network trained to handle specific types of inputs or subtasks within the broader problem domain. A gating network determines which expert to use for each input. Only a subset of experts is activated for any given input. Different experts can be distributed across different GPUs. Router/Gating network and active experts can operate in parallel. Inactive experts don’t consume computational resources. This greatly reduces the number of parameters that each request must interact with, as some experts are skipped. But like Tensor & Pipeline parallelism the overall request latency relies heavily on the GPU-to-GPU communication network. A request must be reconstituted back to their original GPUs after expert processing generating high networking communication over the GPU-to-GPU interconnect fabric.

This approach can lead to better utilization of hardware compared to Tensor parallelism as you don’t have to split the operations into smaller chunks.

MoE LLM
(Image Source)

Following is a summary and comparison of the methods we discussed. You can use it as a reference when planning to choose one for your use case.

AspectData ParallelismTensor ParallelismPipeline ParallelismExpert Parallelism
Basic ConceptSplits input data across multiple devicesSplits individual tensors/layers across devicesSplits model into sequential stages across devicesSplits model into multiple expert sub-networks
How it WorksThe same model is replicated on each device, processing different data chunksSingle layer/operation distributed across multiple devicesDifferent parts of the model pipeline on different devicesThe router selects specific experts for each input
Parallelization UnitBatch of inputsIndividual tensors/layersModel stagesExperts (sub-networks)
ScalabilityScales well with batch sizeScales well for very large modelsScales well for deep modelsScales well for wide models
Memory EfficiencyLow (full model on each device)High (only part of each layer on each device)High (only part of the model on each device)Medium to High (experts distributed across devices)
Communication OverheadLowMedium to HighLow (only between adjacent stages)Medium (router communication and expert selection)
Load BalancingGenerally balanced if data is evenly distributedBalanced within operationsCan be challenging, and requires careful stage designCan be challenging, and requires effective routing
LatencyLow for large batchesCan increase for small batchesHigher due to pipeline depthCan be low if routing is efficient
ThroughputHigh for large batchesCan be high for large modelsHigh, especially for deep modelsCan be very high for diverse inputs
Typical Use CasesLarge batch inference, embarrassingly parallel tasksVery large models that don’t fit on a single deviceDeep models with sequential dependenciesModels with diverse sub-tasks or specializations
ChallengesLimited by batch size, high memory usageComplex implementation, potential communication bottlenecksPipeline bubble, difficulty in optimal stage partitioningLoad balancing, routing overhead, training instability
Adaptability to Input SizeHighly adaptableLess adaptable, fixed tensor partitioningLess adaptable, fixed pipelineHighly adaptable, different experts for different inputs
Suitable Model TypesMost model typesTransformer-based models, very large neural networksDeep sequential modelsMulti-task models, language models with diverse knowledge
Supported Inference BackendsTensonRT-LLM, vLLM, TGITensonRT-LLM, vLLM, TGITensonRT-LLM, vLLM, TGITensonRT-LLM, vLLM, TGI

Parallelism techniques combined

By now, you might already be thinking that if using the above parallelism methods means we are reducing the overall consumption or utilization of GPU, then can we not combine or replicate these to increase the overall GPU throughput? Combining Inference parallelism methods can lead to more efficient and scalable systems, especially for large and complex models.

In the table below, you can see 4 possible options but in actual scenarios based on the number of GPUs you have, this combination can grow to a very large number.

Data Parallelism (DP)
+
Pipeline Parallelism (PP)
Tensor Parallelism (TP)
+
Pipeline Parallelism (PP)
Expert Parallelism (EP)
+
Data Parallelism (DP)
Tensor Parallelism (TP)
+
Expert Parallelism (EP)
Split the model into stages (pipeline parallelism)Divide the model into stages (pipeline parallelism)Distribute experts across devices (expert parallelism)Split large expert models across devices (tensor parallelism)
Replicate each stage across multiple devices (data parallelism)Split large tensors within each stage across devices (tensor parallelism)Process multiple inputs in parallel (data parallelism)Split large expert models across devices (tensor parallelism)

So for example let’s say you have 64 GPU available and you are planning to deploy llama3-8b or Mistral 8*7B model on them. The following are some of the possible combinations of the parallelism methods. These are just examples to understand the parallelism combination strategies, for the actual use case you need to consider and benchmark other options as well.

LLaMA 3-8B (64 GPUs)

StrategyGPU AllocationProsCons
PP8TP864 = 8 (pipeline) × 8 (tensor)– Balanced distribution
– Reduced Communication
– Pipeline bubbles
– Increased latency
PP4TP4DP464 = 4 (pipeline) × 4 (tensor) × 4 (data)– Higher throughput
– Flexible for batch sizes
– Complex integration
– Requires large batches
DP8TP864 = 8 (data) × 8 (tensor)– Higher throughput
– No pipeline bubbles
– Large batch size needed
– High memory per GPU

Mistral 8*7B (64 GPUs)

StrategyGPU AllocationProsCons
EP8TP864 = 8 (experts) × 8 (tensor per expert)– Balanced memory distribution
– Efficient memory use
– Complex load balancing
– Potential compute underutilization
EP8TP4DP264 = 8 (experts) × 4 (tensor) × 2 (data)– Higher throughput
– Balanced utilization
– Needs careful load balancing
– Large batch sizes required
EP8PP2TP464 = 8 (experts) × 2 (pipeline) × 4 (tensor)– Supports deeper experts
– Flexible scaling
– Increased latency
– Complex synchronization

How to choose one?

Now we have covered 4 methods of Inference parallelism, and then we have multiple combinations of these methods, so a common question you might be having is how do you choose or identify which method to use. So, the choice of inference parallelism methods broadly depends on the following factors:

Model architecture

Model architecture plays a crucial role in determining the most effective inference parallelism strategy. You need to identify your model architecture and then choose the parallelism method or combination of them that fits well. Different model structures fit to different parallelization techniques:

  1. Large, deep models (e.g., GPT-4, PaLM 2):
    • Benefit from pipeline parallelism
    • Can be split into stages across multiple GPUs
    • Good for models with many sequential layers
  2. Wide models with large layer sizes (LLaMA 2, BLOOM):
    • Ideal for tensor parallelism
    • Allow splitting individual layers across GPUs
    • Effective for models with very large matrix operations
  3. Ensemble models or Mixture of Experts (Mixtral 8x7B, Switch Transformers):
    • Well-suited for expert parallelism
    • Different experts can be distributed across GPUs
    • Useful for models that use different sub-networks for various tasks
  4. Models with small computational graphs (GPT-3.5-turbo, Falcon-7B):
    • Often works well with simple data parallelism
    • Can replicate the entire model across GPUs
    • Effective when the model fits entirely in GPU memory
  5. Attention-based models (e.g., Transformers):
    • Can benefit from attention slicing or multi-head parallelism
    • Allow distributing attention computations across GPUs

Use case requirements (Latency vs Throughput tradeoff)

You need to be familiar with your business requirements or use case, i.e., what matters to business more, the latency of the user requests, or utilization of the GPU. Or can you identify the mode of your application i.e., is it a real-time application where response time to the user request is the main driver for the user experience, or is it an offline system where response time to the user request is not the primary concern

The reason we need to be aware of this is the tradeoff between latency and GPU throughput. If you want to reduce the latency to the user request you need to allocate more GPU resources to each request and your choice of parallelism method will rely on that. You can do optimal batching so that requests are not struggling to acquire the GPU resource which increases the overall latency.

However, if latency is not the consideration, then your goal should be to achieve maximum throughput on the GPU by choosing the right parallelism method and appropriate batch sizes that utilize GPU resources to its maximum throughput.

Use CasePreferred ParallelismLatency vs Throughput Consideration
Real-time chatbotsData parallelism or Tensor parallelismLow latency priority; moderate throughput
Batch text processingPipeline parallelismHigh throughput priority; latency less critical
Content recommendationExpert parallelismHigh throughput for diverse inputs; moderate latency
Sentiment analysisData parallelismHigh throughput; latency less critical for bulk processing
Voice assistantsTensor parallelism or Pipeline parallelismVery low latency priority; moderate throughput

Hardware configuration

Hardware configuration often dictates the feasible parallelism strategies, and the choice should be optimized for the specific inference workload and model architecture. Following are some hardware component choices that impact the overall parallelism choices.

Hardware ComponentImpact on Parallelism ChoiceExamples
GPU Memory CapacityDetermines feasibility of data parallelism and influences the degree of model shardingNVIDIA A100 (80GB) allows larger model chunks than A100 (40GB)
Number of GPUsAffects the degree of parallelism possible across all strategies8 GPUs enable more parallelism than 4 GPUs
GPU Interconnect BandwidthInfluences efficiency of tensor and pipeline parallelismNVLink offers higher bandwidth than PCIe, benefiting tensor parallelism
CPU CapabilitiesImpacts data preprocessing and postprocessing in parallelism strategiesHigh-core count CPUs can better handle data parallelism overhead
System MemoryAffects the ability to hold large datasets for data parallelism1TB system RAM allows larger batch sizes than 256GB
Storage SpeedInfluences data loading speeds in data parallelismNVMe SSDs provide faster data loading than SATA SSDs
Network BandwidthCritical for distributed inference across multiple nodesNetworks based on Infiniband, RoCE are faster than conventional networks for GPU fabrics
Specialized HardwareEnables specific optimizationsGoogle TPUs are optimized for tensor operations

Conclusion

AI inference parallelism is a game-changer for running big AI models efficiently. We’ve looked at different ways to split up the work, like data parallelism, tensor parallelism, pipeline parallelism, and expert parallelism. Each method has its own pros and cons, and choosing the right one depends on your specific needs and setup. It’s exciting to see tools like TensorRT-LLMvLLM, and Hugging Face’s Text Generation Inference making these advanced techniques easier for more people to use. As AI models keep getting bigger and more complex, knowing how to use these parallelism techniques will be super important. They’re not just about handling bigger models – they’re about running AI smarter and more efficiently. By using these methods well, we can do amazing things with AI, making it faster, cheaper, and more powerful for all kinds of uses.

The future of AI isn’t just about bigger models; it’s about finding clever ways to use them in the real world. With these parallelism techniques, we’re opening doors to AI applications that were once thought impossible. If you’re looking for experts who can help you scale or build your AI infrastructure, reach out to our AI & GPU Cloud experts.

If you found this post valuable and informative, subscribe to our weekly newsletter for more posts like this. I’d love to hear your thoughts on this post, so do start a conversation on LinkedIn.

References