Red Hat is announcing the latest addition to our portfolio of validated and optimized models: a compressed version of NVIDIA Nemotron-Nano-9B-v2. By leveraging the open source LLM Compressor library, we have created a new INT4 (W4A16) variant that unlocks significant performance gains on NVIDIA AI infrastructure with negligible impact on the model's reasoning capabilities.

This release continues our commitment to providing enterprises with open, flexible, and efficient AI solutions that are ready for production deployment across the hybrid cloud. Key benefits include: 

  • Smaller and faster Nemotron: A new, compressed Nemotron-Nano-9B-v2 model in the INT4 (W4A16) format, created using LLM Compressor, offers a smaller variant that performs best in latency-sensitive use cases.
  • Optimized for NVIDIA accelerated computing: The INT4 model, coupled with Machete, our open source mixed-input GEMM kernel optimized for NVIDIA Hopper GPUs, is engineered to deliver substantial improvements in inference speed and efficiency in vLLM.
  • Validated for the enterprise: This model has been rigorously tested and validated by Red Hat, providing the confidence and predictability needed to deploy AI workloads in enterprise environments on Red Hat AI.
  • Open and extensible: The model is available in the Red Hat AI repository on Hugging Face, ready for deployment with vLLM and further customization using LLM Compressor.
  • Hybrid model support in vLLM: With the release of the vLLM V1 engine, hybrid models, like Nemotron-Nano-9B-v2, have become first-class citizens through an exceptionally efficient implementation that delivers blazing performance.

A more efficient Nemotron for all

NVIDIA Nemotron-Nano-9B-v2 stands out for its innovative hybrid Mamba-Transformer architecture, which is specifically designed to accelerate the long reasoning traces common in agentic AI tasks. This 9-billion-parameter model delivers exceptional performance, generating responses up to six times faster than comparable dense models while supporting a 128K token context length and a wide range of languages. It also includes a “thinking budget” feature that keeps the model from overthinking, letting users balance accuracy, performance, and inference cost.

To make this powerful model even more accessible and efficient for enterprise use, we have applied state-of-the-art compression techniques using LLM Compressor. Our new variant utilizes the INT4 (W4A16) quantization scheme. This is a mixed-precision format that uses 4-bit integers for the model's weights and 16-bit floating-point numbers for the activations.

The result is a model that is not only faster but also has a significantly smaller memory footprint, enabling more concurrent requests and longer context lengths on the same hardware, yielding gains in both performance and memory efficiency. Extensive evaluations have shown that W4A16 quantization maintains consistently negligible accuracy loss, keeping it highly competitive with the unquantized baseline model.
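
For readers who want to reproduce or adapt this kind of compression, the sketch below shows roughly what a W4A16 GPTQ run looks like with LLM Compressor. It is a minimal illustration, not the exact recipe behind the released model: the calibration dataset, sample counts, and GPTQ options are assumptions, the specialized scale optimization and weight-ordering enhancements discussed later in this post are not shown, and the API surface may vary slightly between LLM Compressor versions.

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Assumed upstream checkpoint and output directory (illustrative names).
MODEL_ID = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"
SAVE_DIR = "Nemotron-Nano-9B-v2-W4A16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# GPTQ with the W4A16 scheme: 4-bit integer weights, 16-bit activations.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",       # illustrative calibration dataset choice
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Save the compressed checkpoint in a vLLM-loadable format.
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)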

Why INT4 (W4A16)?

For this release, we specifically chose the INT4 (W4A16) quantization scheme to deliver maximum efficiency for latency-sensitive enterprise applications. This mixed-precision format offers several key advantages:

  • Maximum memory reduction: By compressing model weights to just 4 bits, W4A16 dramatically reduces the model's memory footprint. This is critical for deploying large models on resource-constrained hardware or for maximizing the number of models that can be served on a single GPU.
  • Optimized for low latency: The W4A16 format offers excellent performance in synchronous, single-stream deployments. This makes it an ideal choice for interactive applications like chatbots, coding assistants, and agentic workflows where fast response times are essential.
  • High accuracy preservation: Despite the aggressive 4-bit weight compression, the W4A16 scheme maintains a consistently low accuracy loss. By applying state-of-the-art quantization techniques implemented in LLM Compressor, the model preserves the nuance required for complex reasoning tasks, making it competitive with less aggressive 8-bit quantization methods.

While 8-bit formats like INT8 or FP8 can be more effective for high-throughput, asynchronous workloads on high-end GPUs, W4A16 provides a powerful solution for developers who need to prioritize low latency and minimize memory usage without a significant trade-off in accuracy.

Unlocking Long-Context performance: vLLM and the hybrid Mamba-2-Transformer architecture

The innovative hybrid architecture of NVIDIA Nemotron-Nano-9B-v2, which combines traditional attention with Mamba-2 layers, is key to its performance but also presents unique challenges for inference engines. Recent work in the vLLM community has elevated hybrid models from early experimental implementations to fully supported, first-class citizens, unlocking their potential for long-context applications.

Supporting hybrid models in vLLM requires careful treatment of their state. Attention layers rely on a paged KV cache, organized into blocks that are appended as the sequence grows. Mamba-2 layers, by contrast, maintain a large, fixed-size state for each sequence that is updated in place.

[FIGURE 1: KV cache blocks in attention, showing 16-token groups of ~64 KiB each.]
[FIGURE 2: Mamba-2 state structure, showing one large per-sequence state updated in place.]

In early vLLM versions (v0), hybrid support was achieved through a fragile hack. KV cache was managed efficiently, but Mamba-2 state was allocated separately based on a user-defined parameter. This forced users to guess the right value to avoid CUDA out-of-memory errors, creating a poor developer experience.

[FIGURE 3: V0 architecture showing paged KV cache for attention and separate per-sequence Mamba-2 state tensors.]

In vLLM V1, support was rebuilt around a unified allocator that manages both KV cache and Mamba-2 state. This design enables advanced features like prefix caching and allows hybrid models to benefit from V1-wide optimizations. However, Mamba-2 complicates this because the size of its state pages is much larger than that of attention blocks. To address this, V1 relaxes the requirement that all layers use the same block size. Instead, attention block sizes are automatically increased until their page size aligns with Mamba-2’s. While using unusual block sizes might seem inefficient, empirical testing shows little impact on performance.

[FIGURE 4: Hybrid page alignment showing how attention block sizes are increased and Mamba-2 pages padded to achieve alignment.]
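
To make the alignment idea concrete, the back-of-the-envelope sketch below grows an attention block size until its page is at least as large as a Mamba-2 state page, then pads the Mamba-2 page to match. The byte sizes are illustrative placeholders, not the real Nemotron-Nano-9B-v2 dimensions, and this is a simplified model of the idea rather than vLLM's internal code.

python
# Illustrative numbers only (not the actual Nemotron-Nano-9B-v2 dimensions).
BYTES_PER_TOKEN_KV = 4096        # assumed per-token KV-cache footprint for one attention layer
MAMBA2_PAGE_BYTES = 1_048_576    # assumed fixed per-sequence Mamba-2 state size for one layer

def aligned_attention_block_size(start_block_size: int = 16) -> int:
    """Double the attention block size (in tokens) until one attention page
    is at least as large as a Mamba-2 state page, mirroring the idea of
    growing attention blocks until the page sizes line up."""
    block_size = start_block_size
    while block_size * BYTES_PER_TOKEN_KV < MAMBA2_PAGE_BYTES:
        block_size *= 2
    return block_size

block = aligned_attention_block_size()
attn_page = block * BYTES_PER_TOKEN_KV
padding = attn_page - MAMBA2_PAGE_BYTES  # Mamba-2 pages are padded up to the attention page size
print(f"attention block: {block} tokens, page: {attn_page} B, Mamba-2 padding: {padding} B")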

This first-class Hybrid Mamba-2-Transformer support in vLLM delivers several key benefits:

  • Efficient long-context processing: Mamba-2's primary advantage is its linear-time processing and constant-size state, which avoids the memory and compute costs that cause traditional attention to slow down on long sequences. vLLM's unified memory management fully enables this capability, making Nemotron-Nano-9B-v2 highly effective for use cases like RAG and complex agentic workflows that rely on its 128K context window.
  • Optimized performance with CUDA graphs: Mamba-2 architectures can introduce significant CPU overhead due to multiple kernel calls. vLLM's full support for CUDA graphs is crucial for these models, as it captures the entire model execution graph and dramatically reduces CPU bottlenecks. This leads to substantial improvements in low-concurrency scenarios.
  • Foundation for future features: By treating hybrid models as a core part of the architecture, this unified approach ensures that future vLLM features like prefix caching and prefill-decode disaggregation will work seamlessly, further enhancing performance and efficiency.

Performance

Quantization is fundamentally a trade-off between inference performance and model accuracy, and the INT4 (W4A16) format is specifically engineered to optimize that trade-off for certain enterprise workloads. In single-stream, low-latency, synchronous deployments, reducing model weights to 4-bit integers translates directly to lower latency because it cuts the number of weight bits that must be read from memory by 75% relative to a 16-bit baseline. This makes it an excellent choice for interactive, real-time applications where a user is actively waiting for a response.
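
As a rough illustration of what that 75% reduction means in practice, the numbers below estimate the weight-only memory footprint of a 9-billion-parameter model at different precisions. They ignore activations, KV/Mamba state, and quantization metadata such as scales, so treat them as ballpark figures only.

python
params = 9e9  # ~9 billion parameters

# Approximate weight-only memory footprint (ignores activations, caches, and scale metadata).
bf16_gb = params * 2 / 1e9      # 16-bit weights: 2 bytes/param   -> ~18 GB
fp8_gb  = params * 1 / 1e9      #  8-bit weights: 1 byte/param    -> ~9 GB
int4_gb = params * 0.5 / 1e9    #  4-bit weights: 0.5 bytes/param -> ~4.5 GB

print(f"BF16 ~{bf16_gb:.1f} GB, FP8 ~{fp8_gb:.1f} GB, INT4 ~{int4_gb:.1f} GB")
print(f"INT4 moves ~{(1 - int4_gb / bf16_gb):.0%} fewer weight bytes than BF16")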

For use cases that demand high throughput with many concurrent users (asynchronous deployment), 8-bit formats like INT8 or FP8 often provide better overall performance on high-end GPUs. However, for developers prioritizing the fastest possible response time for individual queries or deploying on memory-constrained edge devices, the INT4 model delivers compelling performance gains.

To ensure a consistent and transparent evaluation, all benchmarks were conducted on a single NVIDIA H100 GPU (H100x1) using vLLM (v0.11.0) as the underlying inference backend.

Each model was launched with the following command:

bash
python3 -m vllm.entrypoints.openai.api_server \
    --model <model> \
    --max-model-len 16384 \
    --tensor-parallel-size 1 \
    --trust-remote-code


We used GuideLLM to drive benchmark traffic and measure performance at constant RPS (requests per second) values ranging from 1 to 9. The benchmark configuration for a given RPS was:

bash
guidellm benchmark \
    --target http://localhost:10000 \
    --output-path /tmp/outputs/guidellm_results/report.json \
    --rate-type constant \
    --rate <rate 1-9> \
    --data prompt_tokens=512,prompt_tokens_stdev=51,prompt_tokens_min=410,prompt_tokens_max=614,output_tokens=256,output_tokens_stdev=26,output_tokens_min=205,output_tokens_max=307,samples=1000 \
    --max-seconds 900 \
    --processor <model> \
    --model <model> \
    --request-samples 2147483647 \
    --stop-over-saturated \
    --max-error-rate 0.5

This configuration simulates realistic workloads with 512 input/256 output token lengths (with some standard deviation) at controlled RPS rates (1–9) to capture performance across different latency-sensitive workloads. As the graph below shows, the higher the RPS, the more the INT4 quantized model shines, delivering superior latency performance.

[FIGURE 5: Latency performance vs. requests per second (RPS) for Nemotron-Nano-9B-v2 models (unquantized BF16, FP8, and INT4) on an NVIDIA H100, with 512 input and 256 output tokens. Lower is better.]

Accuracy

While aggressive quantization can sometimes result in a noticeable degradation in model quality, our team has developed and fully open-sourced advanced methods in LLM Compressor that enhance state-of-the-art quantization algorithms such as GPTQ, pushing the boundaries of accuracy recovery in quantized models. For this Nemotron model in particular, we augment the standard GPTQ approach with mean-squared-error–optimal quantization scales and an importance-based ordering of weight quantization. This enables the algorithm to compensate for the quantization errors of more challenging weights by leveraging the redundancy of weights that are easier to quantize. 

The Nemotron-Nano-9B-v2 model itself achieves state-of-the-art accuracy on a wide range of reasoning benchmarks, outperforming other models in its class on tasks like math, coding, and general knowledge. Our INT4 variant retains the vast majority of this capability, ensuring that the benefits of quantization come with minimal trade-off in reasoning performance.

In the figure below, we present reasoning results for three configurations: the baseline (unquantized) Nemotron-Nano-9B-v2 model, its FP8 weight-and-activation quantized counterpart, and our INT4 weight-only quantized variant. As shown, the well-optimized INT4 model performs on par with both the FP8 and unquantized baselines on popular reasoning benchmarks including AIME25, GPQA-Diamond, and MATH-500. 

[FIGURE 6: Reasoning benchmark results (AIME25, GPQA-Diamond, MATH-500) comparing the baseline NVIDIA Nemotron-Nano-9B-v2 against the NVIDIA FP8 and Red Hat AI INT4 quantized variants.]
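
For readers who want to run their own accuracy checks, a minimal sketch with the Language Model Evaluation Harness (the same open source tool used in our validation process) is shown below. The task chosen here (gsm8k) is only a readily available example; the reasoning benchmarks reported above (AIME25, GPQA-Diamond, MATH-500) require their own task configurations and generation settings.

python
import lm_eval

# Evaluate the INT4 checkpoint through lm-eval's vLLM backend.
results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16,"
        "trust_remote_code=True,max_model_len=16384"
    ),
    tasks=["gsm8k"],  # illustrative task; swap in the benchmarks you care about
)
print(results["results"]["gsm8k"])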

Compressed Nemotron in action

The INT4 compressed Nemotron-Nano-9B-v2 model is ready for immediate deployment on Red Hat AI, integrating seamlessly with the vLLM serving engine in Red Hat AI Inference Server and Red Hat OpenShift AI. Developers can get started quickly with just a few lines of code.


Deploy on vLLM

bash
VLLM_USE_PRECOMPILED=1 uv pip install --no-cache git+https://github.com/vllm-project/vllm.git


vllm serve RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 \
    --trust-remote-code \
    --max-num-seqs 64 \
    --mamba_ssm_cache_dtype float32

Note:

  • Remember to add `--mamba_ssm_cache_dtype float32` to preserve output quality. Without this option, the model’s accuracy may degrade.
  • If you encounter a CUDA OOM issue, try `--max-num-seqs 64` and consider lowering the value further if the error persists.
  • See more ways to deploy with vLLM.
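
Once the server is up, you can exercise it through its OpenAI-compatible API. The snippet below is a small example that assumes the default local endpoint (http://localhost:8000/v1) and no API key configured; adjust the base URL and model name to match your deployment.

python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server (assumed default port 8000, no auth).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16",
    messages=[
        {"role": "user", "content": "Summarize the benefits of W4A16 quantization in two sentences."}
    ],
    max_tokens=256,
)
print(response.choices[0].message.content)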

Deploy on Red Hat AI

Deploy on Red Hat AI Inference Server

bash
podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
    --ipc=host \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    --env "HF_HUB_OFFLINE=0" \
    -v ~/.cache/vllm:/home/vllm/.cache \
    --name=vllm \
    registry.access.redhat.com/rhaiis/rh-vllm-cuda \
    vllm serve RedHatAI/NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 \
    --trust-remote-code \
    --max-num-seqs 64 \
    --mamba_ssm_cache_dtype float32

See Red Hat AI Inference Server documentation for more details.

Deploy on Red Hat OpenShift AI

1. Set up the vLLM server with ServingRuntime

Save the following as vllm-servingruntime.yaml:

YAML
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
 name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
 annotations:
   openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
   opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
 labels:
   opendatahub.io/dashboard: 'true'
spec:
 annotations:
   prometheus.io/port: '8080'
   prometheus.io/path: '/metrics'
 multiModel: false
 supportedModelFormats:
   - autoSelect: true
     name: vLLM
 containers:
   - name: kserve-container
     image: quay.io/modh/vllm:rhoai-2.25-cuda
     command:
       - python
       - -m
       - vllm.entrypoints.openai.api_server
     args:
       - "--port=8080"
       - "--model=/mnt/models"
       - "--served-model-name={{.Name}}"
     env:
       - name: HF_HOME
         value: /tmp/hf_home
     ports:
       - containerPort: 8080
         protocol: TCP

2. Attach the model to the vLLM server with InferenceService

Save the following as inferenceservice.yaml:

YAML
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: NVIDIA-Nemotron-Nano-9B-v2-quantized.w4a16 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: nvidia-Nemotron-Nano-9B-v2-quantized.w4a16          # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-nvidia-Nemotron-Nano-9B-v2-quantized.w4a16:1.5
    tolerations:
    - effect: NoSchedule
      key: nvidia.com/gpu
      operator: Exists

3. Apply the resources to your cluster

bash
# Make sure first to be in the project where you want to deploy the model
# oc project <project-name>
# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml
# Apply the InferenceService
oc apply -f inferenceservice.yaml

4. Call the server using curl

bash
# Replace <inference-service-name> and <domain> (your cluster ingress domain) below:
# - Run `oc get inferenceservice` to find your URL if unsure.
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
    "model": "nvidia-Nemotron-Nano-9B-v2-quantized.w4a16",
    "stream": true,
    "stream_options": {
        "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
        {
            "role": "user",
            "content": "How can a bee fly when its wings are so small?"
        }
    ]
}'

See Red Hat OpenShift AI documentation for more details.

Validated for the enterprise: Deploying with confidence

True enterprise AI requires more than just a high score on a public benchmark. It demands confidence, predictability, and flexibility. Deploying a model into production without rigorous testing can lead to spiraling inference costs, rising latency under load, and unexpected alignment issues—all of which pose a risk to business operations.

This is why Red Hat’s model validation process is so critical. We move beyond leaderboard hype to assess AI models with real-world data and tasks, across diverse hardware, targeting key enterprise use cases. Our validation is built on four pillars designed to accelerate your time to value:

  1. Clear deployment guidance: We provide the confidence and predictability you need to select and deploy the right third-party models for your specific needs on Red Hat AI. Our process offers guidance to right-size your deployments, helping you select the optimal combination of models and hardware for your use cases and accelerate your time to production.
  2. Concrete data: We offer a clear understanding of each model's scalable performance and accuracy, all tested within the context of specific hardware and real-world enterprise scenarios. By running workload-specific benchmarks, we provide transparent results that clarify the complex trade-offs between performance, accuracy, and cost.
  3. Reproducible results: We ensure that the performance you see in our benchmarks can be reliably reproduced in your own production environments. Our entire validation process uses open source tools like the Language Model Evaluation Harness and vLLM, ensuring our accuracy and performance results are transparent and repeatable.
  4. Enterprise-ready packaging: We provide standardized container formats in our production registry to create a version-controlled asset with an enhanced security footprint for your AI workloads. These assets are scanned for vulnerabilities and simplify deployment and lifecycle management, integrating into a trusted AI software supply chain from creation to deployment.

By running the INT4 Nemotron-Nano-9B-v2 variant through this process, we provide organizations with the empirical data needed to confidently deploy the best-fit models on their infrastructure of choice, including Red Hat OpenShift AI, Red Hat Enterprise Linux AI, and Red Hat AI Inference Server, with full visibility into expected performance and cost.

Open and scalable AI

The release of the compressed INT4 Nemotron-Nano-9B-v2 model exemplifies the power of a collaborative, open source ecosystem. By combining a state-of-the-art model from NVIDIA with open, extensible tools like LLM Compressor and vLLM, and backing it with Red Hat’s enterprise-grade validation and support, we are making powerful AI more accessible, efficient, and reliable for everyone.

Explore the new model and our compression recipes in the Red Hat AI Repository on Hugging Face or on the Red Hat Container Registry, and soon the Red Hat OpenShift AI 3.0 catalog. 


About the authors

My name is Rob Greenberg, Principal Product Manager for Red Hat AI, and I came over to Red Hat with the Neural Magic acquisition in January 2025. Prior to joining Red Hat, I spent 3 years at Neural Magic building and delivering tools that accelerate AI inference with optimized, open-source models. I've also had stints as a Digital Product Manager at Rocketbook and as a Technology Consultant at Accenture.

Eldar is a research scientist specializing in efficient inference techniques for large machine learning models, with a particular focus on sparsity, quantization, and speculative decoding. His work focuses on developing methods to accelerate inference within the vLLM engine, bridging cutting-edge research with practical deployment solutions. In his spare time, Eldar enjoys making GPUs go brrr.

Mark is a Senior Machine Learning Engineer at Red Hat. Driven by a deep passion for technology, he focuses on developing cutting-edge ML solutions to solve complex challenges and drive innovation.

I’m a Senior Software Engineer at Red Hat, where I focus on building scalable and reliable systems for running LLM benchmark pipelines at scale. Before joining Red Hat, I worked for four years as a Senior Data Engineer, specializing in data pipelines, infrastructure, and performance optimization.

I am a scientist specializing in the development of numerical models and computational algorithms. My expertise is in bridging the gap between academic research and industry innovation.

I am currently the Manager of Machine Learning Research at Red Hat, joining Red Hat as part of the acquisition of Neural Magic. I am proud to lead a world-class team of researchers and engineers that focuses on algorithms for AI inference optimization. We specialize in sparsity, quantization, knowledge distillation and speculative decoding. We work hand-in-hand with the vLLM developers to deliver optimizations that lead to real-world performance gains.

Tyler received a PhD in Computer Science from The University of Texas at Austin, studying high-performance dense linear algebra: microkernels, parallelism, and theoretical lower bounds on data movement. He later joined Neural Magic, first working on a graph compiler for sparse neural network inference on CPUs. He now works on large language model inference in vLLM and llm-d, especially large-scale serving.

Tomer is a Senior Machine Learning Engineer at Red Hat, where he is fascinated by the challenge of LLM benchmarking. He joined Red Hat in 2025 after the acquisition of the startup Jounce. Tomer's journey into AI began with an MSc in Electrical Engineering, specializing in generative AI for image restoration. Today, he is dedicated to finding out what LLMs can really do, ensuring they are reliable, and pushing their capabilities forward.
