Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking

ArXiv 2026

Abstract

Efficiently processing long sequences with Transformer models usually requires splitting the computation across accelerators via context parallelism. The dominant approaches in this family, such as Ring Attention or DeepSpeed Ulysses, enable scaling over the context dimension but do not focus on memory efficiency, which limits the sequence lengths they can support. More advanced techniques, such as the Fully Pipelined Distributed Transformer or activation offloading, can further extend the possible context length at the cost of training throughput. In this paper, we present UPipe, a simple yet effective context parallelism technique that performs fine-grained chunking at the attention head level. This technique significantly reduces the activation memory usage of self-attention, breaking the activation memory barrier and unlocking much longer context lengths. Our approach lowers the peak activation memory usage in the attention layer by as much as 87.5% for 32B Transformers, while matching previous context parallelism techniques in training speed. UPipe supports context lengths of up to 5M tokens when training 8B models on an 8×H100 node, improving upon prior methods by 25%.

The Memory Wall

The demand for training on longer sequences of data is growing rapidly, fueled by modern applications such as code assistance and video generation. However, training on very long sequences has a prohibitive memory cost, motivating the need for memory-efficient context parallelism. The figure below shows how memory usage changes as different optimization techniques are applied to a Llama3-8B model.


While techniques like FSDP help lower the memory demand from model parameters and the associated gradients, they do not address the activation memory footprint. DeepSpeed Ulysses is a context parallelism scheme that distributes the input sequence across multiple devices, reducing the activation memory footprint. ALST addresses the activation memory of non-attention stages via tiling, but overlooks the attention stage. In this work, we present UPipe, which tackles the attention-stage activation memory by performing fine-grained chunking at the attention head level.



Llama3-8B: activation memory breakdown for a context length of 3M tokens, training on 8×H100 GPUs. Attention activations dominate the overall memory footprint, motivating the need for memory-efficient context parallelism. ALST handles activation memory for non-attention stages via tiling, but overlooks the attention stage. UPipe lowers the peak activation memory usage in the attention stage via headwise chunking.

Untying the Attention Heads

DeepSpeed Ulysses, a widely used context parallelism technique, distributes the sequence across devices and uses All-to-All communication to exchange query, key, and value tensors before computing full multi-head attention. While effective for scaling, this method materializes the activations of all attention heads simultaneously, leading to a high memory peak that limits the maximum context length.
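As a concrete picture of the layout change, here is a toy, single-process sketch we wrote for illustration (not the paper's code): before the All-to-All, each "device" holds a sequence shard with all heads; afterwards, it holds the full sequence with only its H/P heads. Real implementations use a collective such as `torch.distributed.all_to_all`.

```python
# Toy, single-process simulation of the Ulysses All-to-All layout change.
P, S, H = 4, 8, 8                      # "devices", sequence length, attention heads

# Before: device p holds its sequence shard with ALL heads: shape [S/P][H].
# Head dim omitted; entries are just (device, position, head) tags.
shards = [[[(p, t, h) for h in range(H)] for t in range(S // P)] for p in range(P)]

def all_to_all_seq_to_head(shards):
    """After the exchange, device p holds the FULL sequence but only the
    H/P heads assigned to it: shape [S][H/P]."""
    hp = H // P
    out = []
    for p in range(P):                                  # destination device
        rows = []
        for src in shards:                              # every source device
            rows.extend(row[p * hp:(p + 1) * hp] for row in src)
        out.append(rows)
    return out

head_shards = all_to_all_seq_to_head(shards)
print(len(head_shards[0]), len(head_shards[0][0]))     # 8 2: full sequence, H/P heads
```

Note that peak memory on each device is unchanged by the exchange itself; it is the subsequent per-head attention computation, which materializes activations for all local heads at once, that UPipe chunks.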

Untied Ulysses (UPipe) addresses this by performing fine-grained chunking at the attention head level. Instead of computing all heads at once, we process a subset of attention heads at a time, reusing QKV projection and All-to-All communication buffers across chunks. This simple modification dramatically reduces the peak activation memory in the attention layer without affecting the mathematical correctness of the computation.

In the example depicted below for H = 4 heads, DeepSpeed Ulysses processes all 4 heads at once and therefore requires memory buffers for all 4 heads of QKV, as well as the corresponding All-to-All buffers. UPipe, on the other hand, processes the heads in chunks of U = 2. It thus performs attention in H/U = 2 stages, reusing the memory buffers from the previous stage to store the intermediate tensors of later stages.
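The staging can be sketched in a few lines of plain Python (a single-device toy we wrote for illustration; it omits projections and communication, and just shows that processing heads in chunks of U with a reused buffer yields the same result as computing all H heads at once):

```python
import math, random

random.seed(0)
H, U, S, d = 4, 2, 3, 2      # heads, chunk size, sequence length, head dim

def rand_mat(s, k):
    return [[random.uniform(-1, 1) for _ in range(k)] for _ in range(s)]

# Per-head Q/K/V, shape [H][S][d]
Q = [rand_mat(S, d) for _ in range(H)]
K = [rand_mat(S, d) for _ in range(H)]
V = [rand_mat(S, d) for _ in range(H)]

def attn_head(q, k, v):
    """Plain softmax attention for a single head."""
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * vj[t] for wi, vj in zip(w, v)) / z for t in range(d)])
    return out

# (a) Ulysses-style: all H heads materialized at once
full = [attn_head(Q[h], K[h], V[h]) for h in range(H)]

# (b) UPipe-style: H/U stages, a single buffer of U head slots reused per stage,
# so only U heads' worth of attention activations are live at any time
buf = [None] * U
chunked = []
for stage in range(H // U):
    for u in range(U):
        h = stage * U + u
        buf[u] = attn_head(Q[h], K[h], V[h])   # overwrite previous stage's slot
    chunked.extend(buf)

assert chunked == full       # headwise chunking is mathematically exact
```

Because each head's attention is independent of the others, the chunked schedule is bit-identical to the all-at-once computation; only the peak number of live head activations changes.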


Figure: (a) DeepSpeed Ulysses performs QKV projection and All-to-All communication followed by full multi-head attention. (b) UPipe processes attention heads in fine-grained chunks, reusing QKV and communication buffers to reduce peak activation memory usage, thereby allowing for longer context lengths.

GQA Scheduling

To maintain compatibility with Grouped Query Attention (GQA), we implement a GQA-aware scheduling strategy that allows for headwise chunking. As shown in the figure, we schedule attention for the heads out of order to avoid redundant communication of KV heads in subsequent stages.

For example, in stage 0 we perform attention for KV heads 0–3 and, correspondingly, pick the first query head from each of their groups: Q0, Q4, Q8, Q12. In stage 1, we no longer need to communicate KV heads 0–3, since they were already exchanged in stage 0 and reside on the correct devices; we simply pick the next query heads Q1, Q5, Q9, Q13. This saves the KV communication in stages 1–3, reducing both runtime and memory overhead.
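The schedule described above is easy to compute. The sketch below is our own illustration (the function name is hypothetical, not from the paper's code): stage s takes the s-th query head of every GQA group, and an inverse permutation restores the original head order at the end, matching the output rearrangement shown in the figure.

```python
def gqa_schedule(num_q_heads, num_kv_heads):
    """Hypothetical helper: stage s pairs KV heads 0..num_kv_heads-1 with the
    s-th query head of each GQA group, so KV is only communicated in stage 0."""
    group = num_q_heads // num_kv_heads          # query heads per KV head
    return [[g * group + s for g in range(num_kv_heads)]
            for s in range(group)]

sched = gqa_schedule(16, 4)
print(sched[0])    # [0, 4, 8, 12]  (stage 0, as in the example above)
print(sched[1])    # [1, 5, 9, 13]  (stage 1 reuses the already-exchanged KV)

# Outputs are produced in staged order; an inverse permutation restores
# the original head order before the output projection.
order = [h for stage in sched for h in stage]
restore = sorted(range(len(order)), key=order.__getitem__)
assert [order[i] for i in restore] == list(range(16))
```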


GQA Scheduling: UPipe schedules attention for different QKV heads out of order, to avoid redundant communication of KV in subsequent stages. Finally, we rearrange the output heads to maintain correctness.

Results

We evaluate Untied Ulysses against several context parallelism baselines on two model families across a wide range of context lengths (128K–5M tokens) on H100 GPUs. Training throughput is reported in tokens per second per GPU. Bold values indicate the best result for each context length.

Llama3-8B (8×H100)

| Method | 128K | 256K | 512K | 1M | 2M | 3M | 4M | 5M |
|---|---|---|---|---|---|---|---|---|
| Native PyTorch | 1373.87 | 845.99 | 474.30 | 249.85 | OOM | OOM | OOM | OOM |
| Ring | 2064.90 | 1387.67 | 841.05 | 458.51 | 237.99 | 159.96 | OOM | OOM |
| Ulysses | **2320.47** | **1503.80** | **878.63** | **475.33** | 246.05 | 162.41 | OOM | OOM |
| FPDT | 1171.68 | 884.75 | 621.20 | 382.42 | 219.53 | 153.48 | 119.76 | OOM |
| UPipe | 2281.05 | 1487.29 | 867.17 | 472.53 | **246.07** | **166.32** | **125.56** | **98.25** |

Qwen3-32B (16×H100)

| Method | 128K | 256K | 512K | 1M | 2M | 3M | 4M | 5M |
|---|---|---|---|---|---|---|---|---|
| Native PyTorch | 127.03 | 112.20 | 91.39 | OOM | OOM | OOM | OOM | OOM |
| Ring | 418.39 | 308.88 | 194.44 | 110.27 | 58.45 | OOM | OOM | OOM |
| Ulysses | **545.29** | **370.70** | **217.04** | **117.02** | **59.98** | OOM | OOM | OOM |
| FPDT | 286.40 | 217.85 | 151.91 | 95.88 | 55.41 | 38.86 | 27.66 | OOM |
| UPipe | 483.29 | 339.56 | 204.46 | 113.26 | 59.56 | **40.42** | **29.97** | OOM |

On Llama3-8B, Untied Ulysses matches the throughput of DeepSpeed Ulysses at shorter context lengths while being the only method that scales to 5M tokens, a 25% longer maximum context than the next-best method (FPDT, which reaches 4M). At 2M tokens and beyond, Untied Ulysses consistently achieves the highest throughput among all methods.

On the larger Qwen3-32B model, UPipe achieves performance on par with DeepSpeed Ulysses while consistently outperforming FPDT at all context lengths. UPipe also supports a 2× longer maximum context length than DeepSpeed Ulysses (4M vs. 2M tokens).

Llama3-8B Multi-node (16×H100)

We evaluate the performance of UPipe using Llama3-8B on 16×H100 GPUs. We use a hybrid context parallelism setup similar to USP-Hybrid, with 8-ulysses-2-ring, i.e., a Ulysses degree of 8 within each node and a ring degree of 2 across nodes. The figure below compares training throughput and memory usage across methods as the context length scales. UPipe consistently achieves throughput on par with DeepSpeed Ulysses while using less memory. Furthermore, UPipe supports a maximum context length of 8M tokens, improving upon USP-Hybrid's 6M tokens by 33%.

Multi-node memory comparison across methods

BibTeX

@misc{ghadia2026untiedulyssesmemoryefficientcontext,
        title={Untied Ulysses: Memory-Efficient Context Parallelism via Headwise Chunking}, 
        author={Ravi Ghadia and Maksim Abraham and Sergei Vorobyov and Max Ryabinin},
        year={2026},
        eprint={2602.21196},
        archivePrefix={arXiv},
        primaryClass={cs.LG},
        url={https://arxiv.org/abs/2602.21196},
}