LLM inference system showing compute memory and network resource sharing across GPU infrastructure
Published On: 9th April 2026 | Last Updated: 9th April 2026
Introduction

Large Language Model inference systems operate under strict performance constraints, in which latency, throughput, cost, and energy must be balanced across shared infrastructure. As deployment scales across multi-GPU servers and distributed environments, the problem becomes one of optimising resource sharing in LLM inference systems. This reflects broader AI system architecture and infrastructure scaling challenges observed on modern compute platforms.

The underlying issue is structural. Multiple concurrent requests compete for compute, memory, and network bandwidth. Each request progresses through different execution phases, and its behaviour is often unknown at the time of scheduling. As a result, system performance depends on how effectively resources are allocated, preempted, and shared across workloads. Addressing these challenges calls for the system-level design and verification approaches required for complex, heterogeneous systems.

This article draws on recent research discussions and system-level observations, highlighting that improvements in inference performance are primarily achieved through better coordination across the stack rather than through isolated optimisation.

Key learning points

Key learning point | Link to detailed explanation | External reference link
LLM inference performance depends on coordinated compute, memory, and network resource management | The system-level challenge in LLM inference | [1]
Prediction-based scheduling improves latency by estimating request characteristics early | Prediction-based scheduling in LLM systems | [1]
KV cache management introduces critical memory trade-offs during execution and tool calls | Memory management and KV cache trade-offs | [1]
Multi-GPU systems suffer from PCIe contention and unfair bandwidth allocation | Network and PCIe contention in multi-GPU systems | [1]
Future systems require unified intra- and inter-host resource coordination | Towards unified resource management | [1]

The system-level challenge in LLM inference

LLM inference systems must simultaneously optimise:

  • Latency, including time to first token and token generation rate
  • Throughput under concurrent request load
  • GPU memory utilisation, particularly KV cache
  • Energy and infrastructure cost

These metrics are interdependent. Improving one often degrades another.

System-level constraints in LLM inference, including latency, throughput, energy, and cost across shared GPU infrastructure
Figure 1: LLM inference system resource interaction [1]

Figure 1 shows that inference requests are not isolated. Each request consumes GPU compute cycles, occupies KV cache memory, and generates communication traffic across interconnects. When multiple requests are active, contention emerges across all three dimensions.

This leads to a central design question: How should resources be shared across requests with unknown and variable behaviour?

Prediction-based scheduling in LLM systems

Many traditional scheduling algorithms, such as Shortest Job First, assume that job size is known in advance. In LLM systems, this assumption breaks down:

  • Output token length is unknown at request arrival
  • Requests may pause due to tool calls or external dependencies
  • Preemption introduces memory overhead

Conventional approaches include:

  • First-come, first-served when the size is unknown
  • Shortest Job First when size is known
  • Shortest Remaining Processing Time when preemption is allowed

However, these do not map cleanly to LLM workloads.
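
To make the difference between these policies concrete, the following is a minimal sketch of a single-server simulation in unit time steps. The `Job` class, the `mean_completion_time` function, and the three example jobs are illustrative assumptions, not taken from the referenced presentation. Note that the simulation is told each job's true size, which is exactly the information an LLM scheduler lacks at request arrival.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    arrival: int  # time step at which the request arrives
    size: int     # total work in time steps (not known to a real LLM scheduler)


def mean_completion_time(jobs, policy):
    """Simulate one server in unit time steps; return mean (finish - arrival)."""
    remaining = {j.name: j.size for j in jobs}
    arrival = {j.name: j.arrival for j in jobs}
    finish = {}
    running = None  # name of the job currently holding the server
    t = 0
    while len(finish) < len(jobs):
        ready = [j for j in jobs if j.arrival <= t and j.name not in finish]
        if not ready:
            t += 1  # server idles until the next arrival
            continue
        if policy == "fcfs":
            # Non-preemptive: only choose when the server is free.
            if running is None:
                running = min(ready, key=lambda j: j.arrival).name
        elif policy == "sjf":
            # Non-preemptive: choose the smallest total size when free.
            if running is None:
                running = min(ready, key=lambda j: j.size).name
        elif policy == "srpt":
            # Preemptive: re-evaluate every step using remaining work.
            running = min(ready, key=lambda j: remaining[j.name]).name
        else:
            raise ValueError(f"unknown policy: {policy}")
        remaining[running] -= 1
        t += 1
        if remaining[running] == 0:
            finish[running] = t
            running = None
    return sum(finish[n] - arrival[n] for n in finish) / len(jobs)


if __name__ == "__main__":
    jobs = [Job("A", 0, 8), Job("B", 1, 2), Job("C", 2, 1)]
    for policy in ("fcfs", "sjf", "srpt"):
        print(policy, round(mean_completion_time(jobs, policy), 2))
```

With one long job arriving first and two short jobs arriving just behind it, the mean completion time falls from about 8.67 steps (FCFS) to 8.33 (SJF) to 5.0 (SRPT). In this toy case most of the gain comes from preemption, which is the capability that carries a memory cost in LLM serving.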

Prediction-driven approach

Modern systems introduce lightweight predictors that estimate request size using internal model signals. These predictions enable scheduling decisions closer to SRPT behaviour without full knowledge of execution time.

One example is a system that:

  • Predicts output token length early in execution
  • Enables preemption only when memory overhead is acceptable
  • Disables preemption once KV cache grows

This approach improves latency under load while controlling memory growth.
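
The gating rule described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular system: the `Request` fields, the `pick_next` function, and the `kv_preempt_limit` threshold are hypothetical names, and the predicted remaining length is assumed to come from some lightweight predictor that is not shown.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Request:
    rid: str
    predicted_remaining: int  # predicted output tokens still to generate (assumed predictor output)
    kv_tokens: int            # tokens currently held in this request's KV cache


def pick_next(running: Optional[Request],
              waiting: List[Request],
              kv_preempt_limit: int) -> Optional[Request]:
    """Choose the request to run next, SRPT-style, with memory-gated preemption.

    The running request is preempted only if (a) a waiting request has a
    shorter predicted remaining length and (b) the running request's KV
    cache is still below kv_preempt_limit, so parking it is cheap.
    """
    if not waiting:
        return running
    best = min(waiting, key=lambda r: r.predicted_remaining)
    if running is None:
        return best
    preemptible = running.kv_tokens < kv_preempt_limit
    if preemptible and best.predicted_remaining < running.predicted_remaining:
        return best
    return running


if __name__ == "__main__":
    short = Request("short", predicted_remaining=20, kv_tokens=0)
    early = Request("long-early", predicted_remaining=500, kv_tokens=50)
    late = Request("long-late", predicted_remaining=500, kv_tokens=400)
    print(pick_next(early, [short], kv_preempt_limit=256).rid)  # short
    print(pick_next(late, [short], kv_preempt_limit=256).rid)   # long-late
```

The same long request is preempted while its KV cache is small and left alone once the cache has grown past the threshold, which is how preemption is "disabled" as memory overhead rises.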

Prediction-based scheduling loop in LLM inference systems using request size estimation and dynamic prioritisation
Figure 2: Prediction-based scheduling loop [1]

Figure 2 shows a feedback-driven scheduler in which predictions guide request ordering. The key insight is that even imperfect predictions significantly improve scheduling quality compared with blind ordering. Acting on uncertain predictions in this way parallels risk-based verification strategies, where decision confidence must be weighed in complex systems.

Memory management and KV cache trade-offs

Memory is the dominant constraint in LLM inference systems. The KV cache grows with each generated token and remains allocated for the duration of a request.

This introduces a fundamental trade-off:

  • Preemption reduces latency for short requests but increases memory usage, because a preempted request's KV cache stays allocated while it waits
  • Larger KV cache reduces batch size and overall throughput
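
As a rough worked example of how KV cache size limits batch size, the sketch below uses the standard estimate for a transformer with conventional attention: each token stores one key and one value vector per layer per KV head. The model dimensions and the 40 GiB budget are illustrative assumptions, not figures from the referenced presentation.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
    """KV cache bytes added per token: 2 (key and value) x layers x heads x dim x width."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_batch_size(kv_budget_bytes, tokens_per_request, bytes_per_token):
    """How many requests of a given length fit in the KV cache budget."""
    return kv_budget_bytes // (tokens_per_request * bytes_per_token)


if __name__ == "__main__":
    # Assumed model shape: 32 layers, 32 KV heads, head dimension 128, 16-bit values.
    per_token = kv_bytes_per_token(32, 32, 128, 2)
    budget = 40 * 2**30  # assumed 40 GiB of GPU memory reserved for KV cache
    print(per_token)                                # 524288 bytes, i.e. 0.5 MiB per token
    print(max_batch_size(budget, 4096, per_token))  # 20 concurrent requests
    print(max_batch_size(budget, 8192, per_token))  # 10 concurrent requests
```

Under these assumptions, doubling the context held per request halves the number of requests that can be batched together.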

Tool call complexity

When requests invoke external tools:

  • Execution pauses
  • KV cache remains allocated
  • The held memory does no useful work until the tool returns, so GPU memory may be underutilised

At this point, systems must choose between:

  • Preserving KV cache in GPU memory
  • Discarding it and recomputing later
  • Swapping it to CPU or remote memory

Each option has different implications for latency and memory pressure.
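
The three options can be compared with a back-of-the-envelope cost model. The sketch below is a deliberate simplification: the function name, its parameters, and the example numbers are hypothetical, recomputation is modelled as a single prefill pass over the existing context, and swap-out is assumed to overlap with the pause so that only swap-in adds latency.

```python
def tool_call_options(kv_bytes, context_tokens, pause_s,
                      prefill_tokens_per_s, link_bytes_per_s):
    """Estimate added latency and GPU memory-time for each KV cache strategy."""
    transfer_s = kv_bytes / link_bytes_per_s
    return {
        # Keep the cache on the GPU: no resume cost, memory held for the whole pause.
        "preserve": {"extra_latency_s": 0.0,
                     "gpu_byte_seconds": kv_bytes * pause_s},
        # Drop the cache: memory freed at once, context recomputed on resume.
        "discard": {"extra_latency_s": context_tokens / prefill_tokens_per_s,
                    "gpu_byte_seconds": 0.0},
        # Swap to CPU or remote memory: memory held only while copying out,
        # and the copy back in delays the resumed request.
        "swap": {"extra_latency_s": transfer_s,
                 "gpu_byte_seconds": kv_bytes * min(pause_s, transfer_s)},
    }


if __name__ == "__main__":
    options = tool_call_options(
        kv_bytes=2**30,              # assumed 1 GiB of KV cache
        context_tokens=2048,
        pause_s=5.0,                 # assumed tool call duration
        prefill_tokens_per_s=8000.0, # assumed recompute speed
        link_bytes_per_s=16e9,       # assumed host link bandwidth
    )
    for name, cost in options.items():
        print(name, cost)
```

Which option is best depends on the inputs: a long pause favours freeing the memory, while a short pause or a slow link favours preserving the cache.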

Memory-aware scheduling using predicted KV cache usage over time for LLM inference request prioritisation
Figure 3: KV cache memory lifecycle during tool calls [1]

The key observation is that memory cost is not static. It depends on both size and duration. Two requests with the same memory footprint may have a very different system impact depending on how long that memory is occupied.

Advanced schedulers, therefore, rank requests based on predicted memory over time, rather than token count alone.
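
One way to express "memory over time" is as byte-seconds: the KV footprint integrated over the predicted lifetime of the request. The sketch below is an assumption-laden illustration, not the ranking function of any specific scheduler. It assumes a fixed time per decoded token, a single tool pause taken with the full cache held, and hypothetical names throughout.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    rid: str
    prompt_tokens: int
    predicted_output_tokens: int
    predicted_pause_s: float = 0.0  # predicted tool call wait with cache held


def kv_byte_seconds(c, bytes_per_token, s_per_token):
    """Integrate KV footprint over time for one request.

    At decode step i (1..n) the cache holds p + i tokens for s_per_token
    seconds, giving bytes * s * (n*p + n*(n+1)/2); the pause adds the full
    footprint for predicted_pause_s seconds.
    """
    p, n = c.prompt_tokens, c.predicted_output_tokens
    decode = bytes_per_token * s_per_token * (n * p + n * (n + 1) / 2)
    paused = bytes_per_token * (p + n) * c.predicted_pause_s
    return decode + paused


def rank_by_memory_over_time(candidates, bytes_per_token, s_per_token):
    """Order requests from cheapest to most expensive in byte-seconds."""
    return sorted(candidates,
                  key=lambda c: kv_byte_seconds(c, bytes_per_token, s_per_token))


if __name__ == "__main__":
    a = Candidate("no-tool", prompt_tokens=1000, predicted_output_tokens=100)
    b = Candidate("tool", prompt_tokens=1000, predicted_output_tokens=100,
                  predicted_pause_s=10.0)
    for c in rank_by_memory_over_time([b, a], bytes_per_token=1, s_per_token=0.02):
        print(c.rid, kv_byte_seconds(c, 1, 0.02))
```

Both requests reach the same peak footprint of 1,100 tokens, yet the one that waits ten seconds on a tool call costs several times more in byte-seconds, so a token-count ranking would treat them as equal while a memory-over-time ranking would not.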

Network and PCIe contention in multi-GPU systems

As systems scale, communication becomes a dominant factor.

LLM inference involves:

  • GPU-to-GPU communication for parallelism
  • KV cache movement between memory tiers
  • CPU-GPU transfers
  • Remote memory access via RDMA

These interactions create contention across heterogeneous interconnects.

Observed issues

From system-level measurements:

  • Multiple GPUs sharing the same CPU PCIe root complex experience degraded latency
  • Bandwidth sharing is not uniform due to protocol-level constraints
  • PCIe arbitration is transaction-based rather than byte-based

For example:

  • RDMA traffic may receive a larger share of bandwidth than expected
  • Smaller transactions may be disadvantaged under contention

These behaviours are not visible at the application level but have a direct impact on inference latency.
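
A short calculation shows why transaction-based arbitration can skew bandwidth. The sketch below assumes an idealised arbiter that grants each competing flow one transaction per round, regardless of payload size. This is a simplified model for intuition only, not a description of how any particular PCIe switch or root complex arbitrates, and the flow names and payload sizes are hypothetical.

```python
def byte_shares(transaction_sizes):
    """Byte share per flow when every flow gets one transaction per round."""
    total = sum(transaction_sizes.values())
    return {flow: size / total for flow, size in transaction_sizes.items()}


if __name__ == "__main__":
    # Assumed payload bytes per transaction for two competing flows.
    shares = byte_shares({"large_transfers": 512, "small_transfers": 128})
    for flow, share in shares.items():
        print(f"{flow}: {share:.0%}")
```

Each flow receives the same number of grants, but the flow issuing 512-byte transactions moves 80% of the bytes and the flow issuing 128-byte transactions moves 20%. Fairness per transaction is therefore not fairness per byte.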

Structural limitation

The PCIe hierarchy imposes constraints:

  • Devices under the same root complex compete for shared links
  • Paths to CPU memory are partitioned
  • Load balancing across links is limited

This leads to non-linear scaling behaviour when adding GPUs.
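
The non-linear scaling can be illustrated with an idealised upper bound: GPUs behind one root complex each have their own link, but all traffic to CPU memory funnels through a shared uplink. The sketch assumes perfectly even sharing, which the observations above suggest is optimistic, and the bandwidth figures are hypothetical.

```python
def host_bandwidth(num_gpus, device_link_gb_s, uplink_gb_s):
    """Return (per-GPU, aggregate) host bandwidth under ideal equal sharing."""
    per_gpu = min(device_link_gb_s, uplink_gb_s / num_gpus)
    return per_gpu, per_gpu * num_gpus


if __name__ == "__main__":
    # Assumed 32 GB/s per device link and 64 GB/s shared uplink.
    for n in (1, 2, 4, 8):
        per_gpu, total = host_bandwidth(n, 32.0, 64.0)
        print(f"{n} GPUs: {per_gpu:g} GB/s each, {total:g} GB/s aggregate")
```

With these numbers, aggregate host bandwidth doubles from one to two GPUs and then stays flat at 64 GB/s, while the per-GPU share drops to 16 GB/s at four GPUs and 8 GB/s at eight.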

Towards unified resource management

Current systems treat compute, memory, and network scheduling as separate problems. This separation limits optimisation potential.

A more effective approach requires:

  • Coordinating scheduling decisions across all resource domains
  • Integrating intra-host and inter-host networking awareness
  • Using global addressing and routing mechanisms

Two critical limitations exist today:

  • Inter-host networks terminate at the NIC and lack visibility into GPU topology
  • Intra-host protocols do not support forwarding across heterogeneous links

As a result, alternative communication paths often remain unused, even when available.

Future systems must address:

  • End-to-end routing across GPU, CPU, and network fabrics
  • Load balancing across PCIe trees and interconnects
  • Unified scheduling policies combining compute, memory, and network awareness

Constraints, trade-offs, and risks

Resource sharing optimisation introduces several constraints:

  • Prediction accuracy versus computational overhead
  • Preemption benefits versus memory growth
  • Throughput optimisation versus latency fairness
  • Hardware topology limitations versus scheduling flexibility

Trade-offs must be explicitly defined:

  • Prioritising short requests improves average latency but may penalise long workloads
  • Aggressive batching improves throughput but increases latency variability
  • Memory offloading reduces GPU pressure but introduces transfer overhead

Risks include:

  • Over-reliance on inaccurate predictions
  • Memory fragmentation under dynamic workloads
  • Unpredictable performance under heterogeneous communication patterns

Effective system design requires quantifying these trade-offs rather than assuming optimal behaviour.

Conclusion

Optimising resource sharing in LLM inference systems is fundamentally a system-level problem. Improvements in latency and throughput are achieved not by isolated optimisation, but by coordinated control of compute, memory, and network resources.

Prediction-based scheduling, adaptive memory management, and awareness of hardware topology all contribute to measurable gains. However, the current generation of systems remains limited by fragmented control across resource domains.

Future architectures will need to unify scheduling and routing decisions across the full stack, enabling consistent performance under scale and variability.

References

[1] Minlan Yu, "Optimizing Resource Sharing in LLM Inference Systems", Open Compute Project (OCP) presentation, 2026. Available at: https://www.dropbox.com/scl/fi/5hdzpy10i4nx9a9dyhhbp/ocp26-public.pptx?rlkey=3j705wiwg59ifwaq2w0p77m6u&e=1&st=tgfyw39t&dl=0

Written by: Mike Bartley

Mike started in software testing in 1988 after completing a PhD in Math, moving to semiconductor Design Verification (DV) in 1994, verifying designs (on Silicon and FPGA) going into commercial and safety-related sectors such as mobile phones, automotive, comms, cloud/data servers, and Artificial Intelligence. Mike built and managed state-of-the-art DV teams inside several companies, specialising in CPU verification.

Mike founded and grew a DV services company to 450+ engineers globally, successfully delivering services and solutions to over 50 clients.

Mike started Alpinum in April 2025 to deliver a range of state-of-the-art industry solutions:

  • Alpinum AI provides tools and automations using Artificial Intelligence to help companies reduce development costs (by up to 90%!).
  • Alpinum Services provides RTL to GDS VLSI services from nearshore and offshore centres in Vietnam, India, Egypt, Eastern Europe, Mexico and Costa Rica.
  • Alpinum Consulting also provides strategic board-level consultancy services, helping companies to grow.
  • Alpinum training department provides self-paced, fully online training in SystemVerilog, UVM Introduction and Advanced, Formal Verification, DV methodologies for SV, UVM, VHDL and OSVVM, and CPU/RISC-V.
  • Alpinum Events organises a number of free-to-attend industry events.

You can contact Mike (mike@alpinumconsulting.com or +44 7796 307958) or book a meeting with Mike using Calendly (https://calendly.com/mike-alpinumconsulting).
