vLLM is a popular open-source framework for large language model (LLM) inference that is widely leveraged across the industry. Its popularity comes from its ability to deliver high-throughput, low-latency serving for LLMs while keeping GPU memory usage efficient via techniques like PagedAttention. This makes it well-suited for real-time applications (e.g., chatbots, copilots) as well as large-scale batch inference (e.g., embedding generation, document processing).

Given these strengths, organizations are turning to vLLM because it helps solve key challenges in modern LLM deployment. At LinkedIn, we adopted vLLM to power a wide range of GenAI use cases, including text generation and embedding generation. Leveraging vLLM allowed us to create robust infrastructure that supports both online (real-time interactive experiences) and offline (large-scale data processing) GenAI applications.

Today, vLLM supports more than 50 different GenAI use cases at LinkedIn, including LinkedIn Hiring Assistant and AI Job Search. It also runs on thousands of hosts across our platform, enabling everything from search enhancements to AI-powered recommendations—at scale and with efficiency. We’ll dive deeper into these examples in the “LinkedIn vLLM use cases” section, where we explore how LinkedIn leverages vLLM across its AI applications.

In this article, we will share our journey with vLLM—how we've adopted and extended it, our success stories, and our contributions back to the open-source community.

Evolution of our GenAI serving stack with vLLM

We began leveraging vLLM in late 2023 to power large-scale GenAI workloads. From the very beginning, our approach wasn’t just about integrating vLLM—it was also about giving internal customers control over key vLLM parameters so they could tune performance without modifying the engine code. With these tuning capabilities, LinkedIn developers can build the configuration that best fits their application and achieve the best possible performance. In our serving stack, we’ve exposed several important settings, including:

  • DTYPE – Sets the precision format (e.g., FP16) to balance performance and numerical stability based on hardware capabilities.
  • ENABLE_PREFIX_CACHING – Reuses computation for shared input prefixes, reducing prefill latency and GPU load for workloads with overlapping input tokens.
  • ENABLE_CHUNKED_PREFILL – Breaks large prefills into smaller chunks, helping smooth GPU memory usage and avoid spikes.
  • GPU_MEMORY_UTILIZATION – Controls the fraction of GPU memory allocated for model execution, allowing us to push utilization higher without triggering out-of-memory errors.

By exposing these parameters from day one, we’ve enabled teams across LinkedIn to experiment with optimizations, fine-tune for their specific workloads, and adopt improvements as soon as they become available in vLLM.
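
As an illustration, here is a minimal sketch of how these settings map onto vLLM's engine arguments; the model name and values below are placeholders, not our production configuration.

```python
# Minimal sketch of exposing the settings above as vLLM engine arguments.
# Model name and values are illustrative placeholders.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dtype="float16",                # DTYPE
    enable_prefix_caching=True,     # ENABLE_PREFIX_CACHING
    enable_chunked_prefill=True,    # ENABLE_CHUNKED_PREFILL
    gpu_memory_utilization=0.90,    # GPU_MEMORY_UTILIZATION
)
```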

Phase 1 – First steps with offline mode

Our initial production deployment ran v0.6.1.post2 in offline mode using the LLM class and engine.step() for basic inference. This allowed us to validate performance and accuracy, but concurrency was limited, making it better suited for low-QPS workloads. 
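
For reference, a minimal sketch of this offline pattern looks like the following; the model and prompt are illustrative rather than our production values.

```python
# Hedged sketch of offline inference with vLLM's LLM class.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", dtype="float16")  # placeholder
sampling = SamplingParams(temperature=0.7, max_tokens=256)

# Prompts are processed as a single synchronous batch.
outputs = llm.generate(["Summarize this job posting: ..."], sampling)
for output in outputs:
    print(output.outputs[0].text)
```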

Phase 2 – Moving to async for higher concurrency

As workload demands increased, we transitioned to AsyncLLMEngine, unlocking asynchronous request handling and better parallelism. This upgrade allowed us to serve significantly higher concurrent request volumes while maintaining stable latency.
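
A simplified sketch of the async pattern, assuming the v0-era AsyncLLMEngine API; the model name and request IDs are illustrative.

```python
# Hedged sketch of asynchronous request handling with AsyncLLMEngine.
import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine


async def generate(engine: AsyncLLMEngine, prompt: str, request_id: str) -> str:
    params = SamplingParams(max_tokens=256)
    final = None
    # engine.generate() is an async generator that streams partial outputs.
    async for output in engine.generate(prompt, params, request_id):
        final = output
    return final.outputs[0].text


async def main():
    engine = AsyncLLMEngine.from_engine_args(
        AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
    )
    # Concurrent requests are scheduled into the same continuous batch.
    results = await asyncio.gather(
        generate(engine, "Prompt A", "req-1"),
        generate(engine, "Prompt B", "req-2"),
    )
    print(results)


asyncio.run(main())
```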

Phase 3 – Fine-tuning with v0.7.0

We upgraded to v0.7.0 and began performance tuning, notably increasing --num-scheduler-steps from the default to 8. This yielded a ~10% TPS improvement at 1.5 QPS—without any custom kernels or engine modifications—and the option was available for customers to adopt directly in their workloads.
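
For illustration, this tuning amounts to a single engine argument; the sketch below assumes the v0 multi-step scheduler, and the model name is a placeholder.

```python
# Hedged sketch: enabling multi-step scheduling on the v0 engine (vLLM 0.7.x).
from vllm.engine.arg_utils import AsyncEngineArgs

engine_args = AsyncEngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    num_scheduler_steps=8,  # default is 1; 8 gave ~10% TPS improvement at 1.5 QPS
)
```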

Phase 4 – Transition to v1 engine

We evaluated the v1 engine against v0.7.0 with tuning. Both reached ~1245 tokens/sec under saturation (~10% higher than v0.6.1.post2), saving over 60 GPUs for this workload. We ultimately chose the v1 engine for future readiness, simplified configuration, and better scheduling efficiency at high QPS.
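
In the versions we evaluated, opting into the v1 engine was controlled by an environment variable (it is the default in newer releases); a minimal sketch, with a placeholder model:

```python
# Hedged sketch: selecting the v1 engine on versions where it is still opt-in.
import os

os.environ["VLLM_USE_V1"] = "1"  # no-op on versions where v1 is already the default

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # placeholder model
```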

Phase 5 – Migrating to a modular, OpenAI-compatible architecture

We re-architected our serving stack with a modular design and an OpenAI-compatible API signature (Figure 1). Previously, our serving stack tightly coupled the server and the engine, which limited our flexibility in evolving individual components. The new design decouples our custom gRPC server from the vLLM engine: the gRPC server receives incoming requests, maps them to OpenAI-compatible API requests, and forwards them to the vLLM engine client. This decoupling dramatically reduces the maintenance burden and allows us to support advanced features—like image generation—without replicating vLLM server logic.
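
A simplified sketch of the forwarding layer is shown below; the gRPC message fields, endpoint, and model name are hypothetical placeholders rather than our actual schema.

```python
# Hedged sketch: map a decoded gRPC request onto an OpenAI-compatible call
# served by the vLLM engine's OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder endpoint

def handle_grpc_request(grpc_request):
    # grpc_request fields (model_name, prompt, max_tokens) are hypothetical.
    response = client.chat.completions.create(
        model=grpc_request.model_name,
        messages=[{"role": "user", "content": grpc_request.prompt}],
        max_tokens=grpc_request.max_tokens,
    )
    return response.choices[0].message.content
```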

Figure 1. LinkedIn GRPC Model Server breakdown

LinkedIn vLLM use cases 

LinkedIn Hiring Assistant

In October 2024, we introduced LinkedIn Hiring Assistant, highlighting how it streamlines recruiter workflows. This hiring agent assists hirers in their journey to find qualified candidates and automates repetitive tasks in the hiring process. A core functionality of these agents is the ability to take a set of qualifications the hirer has specified and assist in identifying potential candidates who meet each of those qualifications; from there, the most qualified candidates can be filtered for the hirer. Furthermore, we find that providing hirers with a textual explanation or evidence of how a candidate meets a qualification is crucial for building trust.

This kind of structured classification problem is quite complex. Traditional classification models can fail to capture the nuances of many qualifications, like adding up the years of experience a candidate has in a particular area. Furthermore, providing a convincing and terse explanation of this classification is tricky. An LLM can tackle both of these areas at once, providing both explanations and classifications with high accuracy. However, this requires outputting many tokens (nearly 1000 on average) per candidate, and a single hirer may need to evaluate hundreds to thousands of candidates to meet their hiring needs for a single job. Additionally, over 50% of requests share prefix tokens, making them highly amenable to prefix caching optimizations. vLLM reuses computation for the shared portion of the input instead of recalculating it for every request. This dramatically cuts prefill latency and GPU load, which is critical for inputs spanning thousands of tokens.
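
To illustrate the pattern (these are not our production prompts), requests that share a long qualification prefix let vLLM reuse the cached prefill:

```python
# Hedged sketch: candidate-evaluation prompts sharing a qualification prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,
)

shared_prefix = (
    "Qualifications:\n- 5+ years of backend experience\n- Production Kafka experience\n\n"
)
prompts = [
    shared_prefix
    + f"Candidate profile:\n{profile}\n\n"
    + "For each qualification, state whether the candidate meets it and explain why."
    for profile in ["<profile A>", "<profile B>", "<profile C>"]
]

# The shared prefix is prefilled once and its KV cache is reused across requests.
outputs = llm.generate(prompts, SamplingParams(max_tokens=1000))
```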

vLLM provides a high throughput engine for us to affordably serve such a use case at scale. Requests are continuously batched together at each token generation iteration, allowing us to support a very high concurrency of requests with great latency and throughput. As a result, this “high fanout” workload becomes considerably more manageable and affordable at scale.

LinkedIn AI Job Search

AI Job Search is engineered to connect members with highly relevant opportunities by deeply understanding the nuances of their search queries. This system relies on an advanced query understanding process to interpret the underlying intent from user inputs, which are often short, ambiguous, and context-dependent. The core goal is to translate this free-form text into structured interpretations and facet suggestions by cohesively analyzing the user's query along with contextual cues, such as profile attributes. This allows the platform to deliver more accurate, personalized, and context-aware job recommendations within its dynamic marketplace.

Sophisticated query interpretation faces significant challenges. Job search queries are often ambiguous (e.g., "Naples" could refer to Florida or Italy), requiring nuanced contextual reasoning. Traditional systems built on Named Entity Recognition (NER) models are brittle, costly to maintain, and slow to adapt to evolving language and job taxonomies. LLMs offer superior reasoning and generalization capabilities, handling intent interpretation, facet extraction, and facet suggestion in a single unified model. However, deploying these powerful LLMs to generate rich interpretations and suggestions while meeting stringent low-latency demands (<600ms p95 at thousands of QPS) for a global user base is a key engineering hurdle.

vLLM is pivotal in serving the LLM behind this advanced query understanding, offering a high-throughput, low-latency serving engine essential for managing such complex tasks affordably and at immense scale. By harnessing vLLM's efficient serving capabilities with both server-side and client-side batching, alongside architectural choices like streaming model outputs and executing recognized tools in parallel as soon as they are identified from the streamed output, we can support the high concurrency of requests for AI Job Search query understanding.
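
As a rough sketch of the streaming pattern (endpoint, model, and query are illustrative), downstream steps can begin as soon as partial output arrives:

```python
# Hedged sketch: stream query-understanding output from a vLLM
# OpenAI-compatible endpoint so downstream steps can start early.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # placeholder

stream = client.chat.completions.create(
    model="query-understanding-llm",  # hypothetical model name
    messages=[{"role": "user", "content": 'Interpret the job search query: "naples java developer"'}],
    stream=True,
)

partial = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    partial += delta
    # In production, recognized facets/tools would be parsed from `partial`
    # and executed in parallel as soon as they are complete.
print(partial)
```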

LinkedIn and vLLM: today and tomorrow

Today: our contributions to vLLM

At LinkedIn, we continuously work to improve the performance of our inference serving stack. We also actively encourage open-source contributions so the broader community can benefit from these improvements. Recently, two of our performance-focused contributions were released as part of vLLM 0.8.x.

Through profiling vLLM on small models, we observed repeated kernel launches during each forward pass—even with CUDA graphs enabled. The attention operation was excluded from the CUDA graph for flexibility, which left some performance on the table for small models. In collaboration with Tyler Michael Smith from Red Hat, we submitted a PR to support full CUDA graph coverage over the entire forward pass by introducing persistent memory buffers for the attention mechanism. This optimization yielded an overall 7% improvement in Time Per Output Token (TPOT).
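
The general idea behind the persistent buffers can be sketched in plain PyTorch (this illustrates CUDA graph capture in general, not the vLLM PR itself; shapes and the toy "attention" op are placeholders):

```python
# Illustration only: CUDA graph replay requires stable memory addresses, so the
# captured step must read from and write into persistent, preallocated buffers.
import torch

static_q = torch.randn(1, 8, 64, device="cuda")    # persistent input buffers
static_kv = torch.randn(1, 8, 64, device="cuda")
static_out = torch.empty(1, 8, 64, device="cuda")  # persistent output buffer

def toy_attention_step():
    # Writing into a preallocated buffer keeps addresses fixed across replays.
    scores = torch.softmax(static_q @ static_kv.transpose(-1, -2), dim=-1)
    static_out.copy_(scores @ static_kv)

# Warm up once, then capture the step into a single CUDA graph.
toy_attention_step()
torch.cuda.synchronize()
graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    toy_attention_step()

# Per iteration: refresh the persistent inputs in place, then replay the graph
# with one launch instead of many per-op kernel launches.
static_q.copy_(torch.randn(1, 8, 64, device="cuda"))
graph.replay()
```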

Figure 2. CUDA Graph before and after optimization

We also identified an unnecessary device-to-host memory synchronization in the sampler, where a tensor was being updated at certain indices via a mask (a known issue in PyTorch). By refactoring the code to avoid this indexing, we eliminated the sync, enabling significantly better CPU overlap with the GPU. This optimization resulted in an 8% improvement in decoding speed for smaller models.
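
The general pattern looks roughly like the following (an illustration of the PyTorch behavior, not the actual sampler code):

```python
# Hedged illustration: boolean-mask indexing materializes the selected indices,
# which forces a device-to-host synchronization; an equivalent torch.where
# keeps everything on the GPU and lets the CPU run ahead.
import torch

logits = torch.randn(8, 32000, device="cuda")
penalized = torch.full_like(logits, -1e4)
mask = torch.rand(8, device="cuda") > 0.5  # which sequences to update

# Masked indexed update: the nonzero() under the hood syncs CPU and GPU.
logits_sync = logits.clone()
logits_sync[mask] = penalized[mask]

# Sync-free alternative: branchless select over the whole tensor.
logits_nosync = torch.where(mask.unsqueeze(1), penalized, logits)
```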

Figure 3. Reduced extra compute diagram

Future plans

Backed by long-standing collaboration and trust from the community, we remain committed to advancing vLLM’s performance and operational robustness. We’ve partnered closely with organizations such as Red Hat, UC Berkeley Sky Computing, NVIDIA and LMCache to shape the future of vLLM. With our long history of contributing to Open Source, we believe in community over code. 

As we look ahead, we’ll continue pushing the boundaries of what’s possible in large-scale LLM serving—both for LinkedIn and for the broader GenAI community.