Executive summary
As generative AI (gen AI) workloads become central to enterprise applications, benchmarking their inference performance has never been more critical for understanding the limits of their capabilities. In MLPerf Inference v5.1, Meta’s Llama 3.1-8B was featured for the first time. This post presents Red Hat’s submissions using the FP8-quantized Llama 3.1-8B model with vLLM 0.10.0 on Red Hat Enterprise Linux (RHEL) 9.6, outlining our approach to achieving reproducible, high-throughput LLM inference. Our configuration delivered competitive single-GPU performance, achieving 5777 tokens/sec (Offline) and 5103 tokens/sec (Server) on a single H100 GPU, and 1642 tokens/sec (Offline) and 1207 tokens/sec (Server) on a single L40S GPU.
In this article, you’ll hear about MLPerf test scenarios, benchmark setup, tuning methodology, and results, and learn how these efforts align with Red Hat’s broader strategy to deliver open, scalable, and production-grade AI inference platforms.
Introduction
The accelerated pace of innovation in large language models (LLMs) and their associated ecosystems has driven widespread adoption across virtually every industry. This momentum is fueled by continual advancements in hardware accelerators, including modern GPUs and custom AI Application-Specific Integrated Circuits (ASICs) that unlock an unprecedented level of parallelism and computational throughput. Complementing these advancements, software-layer innovations are transforming how efficiently these hardware capabilities are harnessed. At the heart of this software evolution are inference engines, which serve as the execution backbone for deploying foundation models. These engines optimize how model weights, activation patterns, and memory flows are mapped to underlying hardware, achieving substantial gains in latency, throughput, and energy efficiency. Modern inference engines dynamically adapt to heterogeneous environments across cloud, edge, and on-premise deployments. They also integrate advanced scheduling, quantization, caching, and compilation techniques to accelerate end-to-end inference pipelines.
As the gen AI ecosystem matures, businesses are starting to use enterprise-grade systems that bring together these inference engines with orchestration, observability, and lifecycle management capabilities. This combination allows for scalable, security-focused, and reliable deployment of LLM applications for use cases like conversational AI, code generation, document summarization, and more. Red Hat plays a pivotal role in this ecosystem, actively contributing to open source initiatives such as vLLM and llm-d, while simultaneously delivering enterprise-grade AI platforms such as Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI that integrate these inference technologies. Through this approach, Red Hat bridges the gap between community-based innovation and production-ready AI infrastructure, helping organizations build, optimize, and operationalize gen AI workloads at scale.
As these diverse inference systems evolve, each tuned for different models, hardware, and deployment goals, understanding how they perform in practice becomes more and more important. Measuring performance consistently across such variation requires a common, reproducible framework that goes beyond raw speed to capture accuracy, efficiency, and scalability under realistic conditions. To address these challenges, the MLPerf Inference benchmark suite, developed under the MLCommons® consortium, has become the de facto standard for transparent and reproducible performance evaluation. MLPerf defines clear rules for datasets, accuracy targets, quality thresholds, and test scenarios. This helps ensure that results can be compared and repeated across various hardware and software configurations, helping vendors and open source communities alike demonstrate end-to-end system efficiency.
MLPerf test scenarios and vLLM harness
MLPerf test scenarios
The MLPerf benchmarking framework provides the tools to benchmark models; in particular, a dedicated load generator (LoadGen) that produces traffic patterns mimicking real-world workloads. The original MLPerf Inference paper introduced 4 key scenarios: Single-Stream, Multi-Stream, Server, and Offline. The datacenter category primarily focuses on the Server and Offline scenarios.
In Offline mode, we assess the raw throughput of the system under test by batch processing the entire dataset of queries. In Server mode, the benchmark simulates an interactive environment: the load generator sends requests to the system one at a time, following a Poisson distribution, to achieve a specified average requests-per-second rate. This mode also requires adherence to specific Time To First Token (TTFT) and Time Per Output Token (TPOT) bounds. Additionally, there are optional interactive scenarios that resemble the Server scenario but impose stricter latency bounds, designed to mimic applications like chat assistants. In addition to measuring performance, each submission must also meet the accuracy criteria defined by the MLPerf community for the specific model and dataset.
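To make the Server scenario's traffic shape concrete, here is a minimal Python sketch of Poisson request arrivals at a target average QPS. It only illustrates the arrival pattern; in real submissions, MLPerf's LoadGen handles the scheduling, latency accounting, and validity checks, and the `send_request` helper below is a hypothetical stand-in.

```python
# Minimal sketch of Poisson request arrivals at a target QPS.
# Not MLPerf LoadGen; send_request() is a placeholder for dispatching
# a query to the system under test.
import random
import time


def send_request(query_id: int) -> None:
    # Placeholder: in a real harness this would issue the query to the endpoint.
    print(f"dispatch query {query_id} at t={time.monotonic():.3f}s")


def poisson_load(target_qps: float, num_queries: int, seed: int = 0) -> None:
    rng = random.Random(seed)
    for query_id in range(num_queries):
        # Exponentially distributed inter-arrival gaps => Poisson arrival process.
        time.sleep(rng.expovariate(target_qps))
        send_request(query_id)


if __name__ == "__main__":
    poisson_load(target_qps=5.0, num_queries=20)
```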
vLLM harness
The reference implementations provided by MLPerf are designed to be generic and consistent, but are often limited in speed, flexibility, and ease of integration. To better align with our submission goals and improve usability, Red Hat developed a custom vLLM-based harness to evaluate performance across both the Offline and Server scenarios. The harness streamlines benchmarking and supports both direct LLM API calls [generate(prompt="Summarize this text")] and OpenAI-compatible server endpoints [/v1/completions].
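As a rough illustration (not the harness code itself), the two invocation paths look like the sketch below. It assumes a local vLLM installation for the direct path and a vLLM OpenAI-compatible server already running on port 8000 for the endpoint path; prompts and sampling settings are placeholders.

```python
# Sketch of the two invocation paths the harness supports; not the harness itself.

# Path 1: direct vLLM LLM API (in-process engine).
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8")
params = SamplingParams(max_tokens=128, temperature=0.0)
outputs = llm.generate(["Summarize this text: ..."], params)
print(outputs[0].outputs[0].text)

# Path 2: OpenAI-compatible /v1/completions endpoint
# (assumes a server such as `vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8`
# is already listening on localhost:8000).
import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
        "prompt": "Summarize this text: ...",
        "max_tokens": 128,
        "temperature": 0.0,
    },
)
print(resp.json()["choices"][0]["text"])
```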
The harness is organized into 2 components (SUT_Offline.py and SUT_Server.py) that correspond to the MLPerf Offline and Server scenarios. The Offline harness supports both direct LLM API calls (passing prompt token IDs) and OpenAI-compatible endpoints (using raw prompt text). It includes flexible batching and sorting logic based on token length or similarity, allows setting custom engine parameters, and can export endpoint metrics to CSV for analysis. The Server harness extends this functionality by launching a dedicated thread for each incoming query, streaming responses back in real time while recording the first-token latency required by the MLPerf Server scenario. It also provides configurable controls for LoadGen tuning, including the ability to set target QPS and enable query coalescing, allowing finer control over LoadGen behavior. The harness remains under active development and will be extended to support additional models in the upcoming MLPerf Inference v6.0 round.
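The per-query streaming pattern used on the Server side can be approximated by the following sketch, which runs one thread per query, streams tokens from an assumed local OpenAI-compatible endpoint, and records time to first token. It is deliberately simplified relative to SUT_Server.py, which additionally reports tokens back to LoadGen and applies the configured QPS and coalescing controls.

```python
# Rough sketch of the Server-side pattern: one thread per query, streamed
# response, first-token latency capture. Simplified relative to SUT_Server.py;
# endpoint URL and model name are assumptions.
import json
import threading
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # assumed local vLLM server


def handle_query(prompt: str, results: dict, key: int) -> None:
    start = time.monotonic()
    ttft = None
    text = []
    with requests.post(
        ENDPOINT,
        json={"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
              "prompt": prompt, "max_tokens": 128, "stream": True},
        stream=True,
    ) as resp:
        for line in resp.iter_lines():
            # The streaming endpoint emits server-sent events: "data: {...}".
            if not line or not line.startswith(b"data: "):
                continue
            payload = line[len(b"data: "):]
            if payload == b"[DONE]":
                break
            if ttft is None:
                ttft = time.monotonic() - start  # first-token latency
            text.append(json.loads(payload)["choices"][0]["text"])
    results[key] = {"ttft_s": ttft, "output": "".join(text)}


results: dict = {}
prompts = ["Summarize this text: ...", "Summarize that text: ..."]
threads = [threading.Thread(target=handle_query, args=(p, results, i))
           for i, p in enumerate(prompts)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)
```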
Benchmarking setup and results
Model and dataset
The model under consideration is Llama-3.1-8B, an 8-billion-parameter model in the small LLM category. Llama-3.1-8B supports a large context window of approximately 131 thousand (128K) tokens and is therefore well suited to tasks such as summarization. The benchmark dataset is a subset of the CNN-DailyMail validation dataset, one of the most widely used publicly available text-summarization datasets. The evaluation set contains 13,368 samples, with an average input length of 778 tokens and an average output length of 73 tokens.
We used Red Hat’s FP8 quantized version [RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8] of the original model. This checkpoint has FP8 quantization applied to all weights and activations of linear layers, utilizing static per-tensor scales. FP8 quantization helps reduce model footprint and improve throughput and energy efficiency. The model was quantized using LLM Compressor, a library that provides a comprehensive set of quantization algorithms for weights and activations, streamlined Hugging Face integration, safetensors compatibility with vLLM, and CPU offloading for large model support.
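For readers unfamiliar with the workflow, a one-shot FP8 quantization run with LLM Compressor broadly follows the pattern below. This is an illustrative sketch, not the exact recipe used to produce the published RedHatAI checkpoint; the calibration dataset, sample count, and exact argument and import names follow common LLM Compressor examples and may differ across library versions, so treat them as assumptions.

```python
# Illustrative LLM Compressor sketch: one-shot FP8 (static, per-tensor)
# quantization of the Linear layers. Not the exact recipe behind the published
# checkpoint; calibration settings and dataset choice here are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Meta-Llama-3.1-8B-Instruct"
SAVE_DIR = "Meta-Llama-3.1-8B-Instruct-FP8"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# FP8 weights and activations for every Linear layer, keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8", ignore=["lm_head"])

# Static per-tensor activation scales need a short calibration pass over sample data.
oneshot(model=model, recipe=recipe, dataset="open_platypus",
        num_calibration_samples=512, max_seq_length=2048)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

The resulting checkpoint is saved in a compressed safetensors format that vLLM can load directly.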
Performance tuning: Autotune
In the context of AI infrastructure, where performance efficiency has a direct impact on power and operational costs, even small performance gains can translate into significant savings at scale. Given the rapid evolution and diversity of AI models and datasets, a “one-size-fits-all” approach is impractical.
The vLLM community continues to incorporate industry best practices into its architecture while maintaining the flexibility for users to fine-tune configurations for their unique workloads. To achieve optimal performance, vLLM exposes a wide range of tunable parameters across its execution engine. However, identifying the best configuration often demands deep engineering insight and extensive experimentation.
To streamline this process, we developed an autotuning system that integrates with both the test harness and the inference endpoint. This system automatically explores the parameter space, using performance metrics to suggest optimized settings. Users can specify vLLM tunables, environment variables, system-level configurations, and harness parameters, as well as the number of trials and GPUs to be used. The tuner systematically records each trial and observation in a structured database, enabling repeatable outcomes and analysis. Our autotuning experiments have shown 5–6% performance improvements over human-engineered performance parameters for the Server scenario on NVIDIA H100 GPUs, demonstrating the tangible benefits of automated, data-driven optimization in high-performance inference benchmarking. We welcome community collaboration on the autotuning repository.
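The core loop resembles the minimal sketch below, written here against Optuna for illustration; it is not the implementation in the autotuning repository. The tunables shown (max_num_seqs, max_num_batched_tokens, gpu_memory_utilization) are real vLLM engine arguments with illustrative ranges, while run_benchmark() is a hypothetical stand-in for launching vLLM with the sampled settings and measuring harness throughput.

```python
# Minimal sketch of parameter autotuning with Optuna; not the actual Red Hat
# autotuner. run_benchmark() is a hypothetical stand-in for starting vLLM with
# the sampled settings and returning measured throughput (tokens/sec).
import optuna


def run_benchmark(engine_args: dict) -> float:
    """Hypothetical helper: launch vLLM with engine_args, drive the harness,
    and return the measured throughput in tokens/sec."""
    raise NotImplementedError


def objective(trial: optuna.Trial) -> float:
    engine_args = {
        # Real vLLM engine arguments; the ranges are illustrative only.
        "max_num_seqs": trial.suggest_int("max_num_seqs", 64, 1024, step=64),
        "max_num_batched_tokens": trial.suggest_int(
            "max_num_batched_tokens", 2048, 32768, step=2048),
        "gpu_memory_utilization": trial.suggest_float(
            "gpu_memory_utilization", 0.80, 0.97),
    }
    return run_benchmark(engine_args)


# Persist every trial so runs are repeatable and analyzable, mirroring the
# structured-database approach described above.
study = optuna.create_study(direction="maximize",
                            storage="sqlite:///autotune.db",
                            study_name="llama31_8b_server",
                            load_if_exists=True)
study.optimize(objective, n_trials=50)
print(study.best_params)
```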
Results
| Llama-3.1-8B | Offline (Tokens/s) | Server (Tokens/s) |
|--------------|--------------------|-------------------|
| 1 x L40S     | 1642.22            | 1207.14           |
| 1 x H100     | 5777.08            | 5103.99           |
MLPerf v5.1 results were obtained from 2 distinct hardware testbeds: an IBM Cloud gx3 instance (1x L40S-PCIE-48GB) and a Dell PowerEdge XE8640 (1x H100-SXM-80GB). All evaluations were conducted using upstream vLLM 0.10.0 on RHEL 9.6. These results reflect single-GPU configurations. For fair comparison with other submissions, results should be normalized by the number of accelerators used.
Hardware specifications
- Dell PowerEdge XE8640:
- CPU: 2x Intel Xeon 6448Y 2.1GHz (32 cores each)
- Virtualization: Single Node OpenShift Virtualization VM (OCP version: 4.18.16, OCP virt version: 4.18.11, VM OS: RHEL 9.6)
- RAM: 128GB provisioned to the VM
- GPU: H100-SXM-80GB provisioned to the VM via PCI Passthrough
- GPU Software: NVIDIA driver 570.148.08, CUDA 12.8
- IBM Cloud gx3 instance:
- CPU: Intel Xeon Processor (Sapphire Rapids) with 48 cores
- RAM: 240GB
- GPU: 2x L40S with 48GB vRAM each (single GPU used for this submission)
- GPU Software: NVIDIA driver 575.57.08, CUDA 12.9
Software and vLLM configuration:
For both platforms, upstream vLLM 0.10.0 was used on RHEL 9.6.
- vLLM configuration: Refer to the README for detailed parameters.
- Additional hardware specifications: See hardware specs.
- Overall results: Access the full results here.
Performance achieved:
On a single H100 GPU, the system achieved 5777 tokens/sec in the Offline category and 5103 tokens/sec in the Server category. For the more vRAM-constrained L40S, the performance was 1642 tokens/sec (Offline) and 1207 tokens/sec (Server). Across comparable submissions, our system demonstrated leading performance for similar hardware configurations and remained competitive with other GPU configurations. In the Server scenario, results with the L40S indicate room for further fine-tuning to close the gap with our Offline performance.
Future scaling considerations:
The initial objective was to maximize single-GPU performance before scaling to larger GPU configurations. Due to the limited timeframe for Red Hat's studies, configurations with additional replicas were not run. However, with vLLM's data-parallel capabilities and load balancers, scaling is not anticipated to be a significant challenge, especially given that the model is small enough to fit within a single GPU.
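As a rough illustration of what such scaling could look like from the client side, the sketch below round-robins requests across multiple independent vLLM replicas (one per GPU), each exposing an OpenAI-compatible endpoint. The replica URLs are placeholders, and this is not the configuration used for the submission; in practice a proper load balancer or vLLM's own data-parallel deployment options would sit in front of the replicas.

```python
# Sketch of client-side round-robin across independent vLLM replicas
# (one replica per GPU), each serving an OpenAI-compatible endpoint.
# Replica URLs are placeholders; not the submission configuration.
import itertools

import requests

REPLICAS = ["http://localhost:8000", "http://localhost:8001"]  # assumed endpoints
_next_replica = itertools.cycle(REPLICAS)


def complete(prompt: str, max_tokens: int = 128) -> str:
    base = next(_next_replica)  # rotate through replicas per request
    resp = requests.post(
        f"{base}/v1/completions",
        json={"model": "RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8",
              "prompt": prompt, "max_tokens": max_tokens},
    )
    return resp.json()["choices"][0]["text"]


if __name__ == "__main__":
    print(complete("Summarize this text: ..."))
```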
Future outlook and concluding remarks
In our MLPerf Inference v5.1 submissions, Red Hat demonstrated the performance and scalability of open, enterprise-grade inference using vLLM 0.10.0 on RHEL 9.6. Running RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8, the system achieved 5777 tokens/sec (Offline) and 5103 tokens/sec (Server) on a single H100 SXM 80 GB GPU, with competitive efficiency also observed on the L40S PCIe 48 GB platform. These results highlight the benefits of FP8 quantization to improve throughput and energy efficiency.
Built on vLLM’s foundation, Red Hat AI Inference Server delivers a hardware-agnostic, production-ready inference platform on Red Hat Enterprise Linux. Together with ongoing contributions to vLLM and llm-d, Red Hat continues to advance open, reproducible, and high-performance inference for gen AI. Future MLPerf submissions will expand across more model categories (e.g., MoEs, speech-to-text, VLMs) and hardware (e.g., NVIDIA, AMD, custom ASICs), further showcasing Red Hat’s commitment to open innovation and enterprise-grade AI performance.
To learn more about Red Hat’s AI inference capabilities and its collaboration within the open source community, visit the Red Hat AI Inference Server page.
About the authors
Performance engineer working on MLPerf, LLM inference performance, and profiling. Previous experience in hardware performance modeling.
Diane Feddema is a Principal Software Engineer at Red Hat, Inc. on the Performance and Scale team, with a focus on AI/ML applications. She has submitted official results in multiple rounds of MLCommons MLPerf Inference and Training, dating back to the initial MLPerf rounds. Diane leads performance analysis and visualization for MLPerf benchmark submissions and collaborates with Red Hat hardware partners on joint MLPerf benchmark submissions.
Diane has a BS and MS in Computer Science and is presently co-chair of the Best Practices group of the MLPerf consortium.
Michey is a member of the Red Hat Performance Engineering team and works on bare metal/virtualization performance and machine learning performance. His areas of expertise include storage performance, Linux kernel performance, and performance tooling.
Keith is a member of the Performance and Scale Engineering team, and works on various aspects of LLM performance on Red Hat platforms. His expertise is in system configuration for both RHEL and OpenShift, and developing tooling to aid in performance analysis efforts.
Ashish Kamra is an accomplished engineering leader with over 15 years of experience managing high-performing teams in AI, machine learning, and cloud computing. He joined Red Hat in March 2017 and currently serves as the Senior Manager of AI Performance. In this role, Ashish heads up initiatives to optimize the performance and scale of Red Hat OpenShift AI, an end-to-end platform for MLOps, focusing specifically on large language model inference and training performance.
Prior to Red Hat, Ashish held leadership positions at Dell EMC, where he drove the development and integration of enterprise and cloud storage solutions and containerized data services. He also has a strong academic background, having earned a Ph.D. in Computer Engineering from Purdue University in 2010. His research focused on database intrusion detection and response, and he has published several papers in renowned journals and conferences.
Passionate about leveraging technology to drive business impact, Ashish is pursuing a Part-time Global Online MBA at Warwick Business School to complement his technical expertise. In his free time, he enjoys playing table tennis, exploring global cuisines, and traveling the world.