Batch Inference for Open-Source LLMs: Faster, Cheaper, Scalable - Detailed Analysis & Overview

A roundup of videos and talks on making large language model (LLM) inference faster and cheaper: batch inference as an alternative to real-time serving, vLLM and PagedAttention, speculative decoding, continuous batching, offline batch inference with Ray and Ray Data, quantization, LLM-as-a-judge evaluation, and data center-scale throughput optimizations presented by Maksim Khadkevich of NVIDIA Dynamo. Two entries focus on local setups, reporting a jump from ~120 tok/s to 1200+ tok/s and over 4000 tokens per second.

Batch Inference for Open-Source LLMs: Faster, Cheaper, Scalable

Run ...

What is vLLM? Efficient AI Inference for Large Language Models

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...

Faster LLMs: Accelerate Inference with Speculative Decoding

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...
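
As a rough illustration of the technique in the title, here is a minimal sketch of greedy speculative decoding, assuming toy integer "models" (`toy_target` and `toy_draft`, both hypothetical) that map a token sequence to its next token. A cheap draft model proposes up to `k` tokens, the target model checks them, the longest agreeing prefix is kept, and the target then supplies one more token. Real systems score all proposed positions in a single batched forward pass and use a probabilistic acceptance rule when sampling; this sketch covers only the greedy case, where the output must equal plain greedy decoding with the target model.

```python
from typing import Callable, List

# A "model" here is a hypothetical greedy next-token function: sequence -> next token id.
Model = Callable[[List[int]], int]

def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(max_new):
        seq.append(model(seq))
    return seq

def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: output matches greedy_decode(target, ...)."""
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        ctx = list(seq)
        for _ in range(min(k, max_new - produced)):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each position (a real system scores all of them in one pass).
        accepted = 0
        for i, tok in enumerate(proposal):
            if target(seq + proposal[:i]) != tok:
                break
            accepted += 1
        seq.extend(proposal[:accepted])
        produced += accepted
        # 3) The target contributes one token: a correction after a mismatch,
        #    or a bonus token when every proposal was accepted.
        if produced < max_new:
            seq.append(target(seq))
            produced += 1
    return seq

# Hypothetical toy models: the draft agrees with the target except right after token 7.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 10

def toy_draft(seq: List[int]) -> int:
    return 0 if seq[-1] == 7 else (seq[-1] * 3 + 1) % 10

if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [2], max_new=5))
```

With greedy acceptance the result is identical to `greedy_decode(toy_target, ...)`; in practice the speedup comes from the target model checking several draft tokens per forward pass instead of generating one token per pass.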

Stop Using Real-Time AI for Everything — Try Batch Inference Instead

Real-time AI is powerful but expensive. In this episode, we discuss how ...
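
A minimal sketch of the offline pattern, assuming a hypothetical `generate_batch` callable that stands in for whatever batched generation call your serving stack exposes: prompts are collected up front and sent in fixed-size batches instead of one request at a time, trading per-request latency for throughput. The helper names `batched` and `run_offline` are illustrative only.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[T] = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # trailing partial batch
        yield batch

def run_offline(prompts: Iterable[str],
                generate_batch: Callable[[List[str]], List[str]],
                batch_size: int = 32) -> List[str]:
    """Send prompts to a batched generator in fixed-size chunks; outputs keep input order."""
    results: List[str] = []
    for batch in batched(prompts, batch_size):
        outputs = generate_batch(batch)
        if len(outputs) != len(batch):
            raise RuntimeError("generate_batch must return one output per prompt")
        results.extend(outputs)
    return results

if __name__ == "__main__":
    # Hypothetical stand-in for a real batched model call.
    def fake_generate(batch: List[str]) -> List[str]:
        return [p.upper() for p in batch]

    print(run_offline(["a", "b", "c", "d", "e"], fake_generate, batch_size=2))
```

The batch size is the main knob: larger batches keep the accelerator busier, at the cost of each prompt waiting for its whole batch to finish.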

LLM Batch Inference in Python with Ray Data: Run Large Eval Jobs Faster

Scale LLM batch inference ...

Insanely Fast LLM Inference with this Stack

A walkthrough of some of the options developers are faced with when building applications that leverage ...

Inference Is the Bottleneck Now: How to Architect LLM Serving in 2026 (vLLM, GPUs, Decentralized)

Hey everyone, in this video, I showcase how ...

LLM as a Judge: Scaling AI Evaluation Strategies

Ready to become a certified watsonx AI Assistant Engineer? Register now and use code IBMTechYT20 for 20% off of your exam ...

Optimize LLM inference with vLLM

Ready to serve your large language models ...

Fast LLM Serving with vLLM and PagedAttention

LLMs ...
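
The bookkeeping idea behind PagedAttention can be sketched as a block allocator: each sequence's KV cache is split into fixed-size blocks, a per-sequence block table maps logical token positions to physical blocks, and blocks come from a shared free list only when needed. The `PagedKVCache` class below is a hypothetical toy that tracks block ids only; it is not vLLM's API and stores no key/value tensors.

```python
from typing import Dict, List, Tuple

class PagedKVCache:
    """Toy KV-cache block allocator illustrating paging (tracks block ids only)."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))  # shared free list
        self.block_tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids
        self.num_tokens: Dict[str, int] = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id: str) -> None:
        """Reserve space for one more token, taking a new block only when the last is full."""
        n = self.num_tokens.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("out of KV-cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.num_tokens[seq_id] = n + 1

    def physical_slot(self, seq_id: str, pos: int) -> Tuple[int, int]:
        """Map a logical token position to (physical block id, offset within block)."""
        if not 0 <= pos < self.num_tokens.get(seq_id, 0):
            raise IndexError("position not allocated")
        return self.block_tables[seq_id][pos // self.block_size], pos % self.block_size

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)

if __name__ == "__main__":
    cache = PagedKVCache(num_blocks=8, block_size=4)
    for _ in range(10):
        cache.append_token("req-a")
    print(cache.block_tables["req-a"], cache.physical_slot("req-a", 9))
```

Because a sequence holds at most one partially filled block, its wasted KV memory is bounded by one block rather than by a pre-reserved maximum-length buffer, which is what lets more sequences share the same memory.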

Improving LLM Throughput via Data Center-Scale Inference Optimizations

Speaker: Maksim Khadkevich, Sr. Software Engineering Manager, Dynamo, NVIDIA. Khadkevich discusses data center ...

Your local LLM is 10x slower than it should be

Here's the one change that took mine from ~120 tok/s to 1200+ without a new GPU. TryHackMe just launched Cyber Security 101 ...

How to Scale LLM Applications With Continuous Batching!

If you want to deploy an ...
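
To show why continuous batching helps, here is a toy step-count simulation, assuming each request needs a known number of decode steps (`lengths`, each at least 1) and the batch holds at most `max_batch` sequences. With static batching a batch runs until its longest request finishes; with continuous batching a finished request leaves immediately and a waiting one takes its slot at the next step. Both functions are hypothetical illustrations, not a real scheduler.

```python
from collections import deque
from typing import List

def static_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Fixed groups in arrival order; each group runs until its longest request ends."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching_steps(lengths: List[int], max_batch: int) -> int:
    """Slots are refilled from the queue as soon as a request finishes."""
    queue = deque(lengths)
    active: List[int] = []  # remaining decode steps per running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decode step: every running request emits one token.
        active = [remaining - 1 for remaining in active]
        steps += 1
        # Finished requests leave the batch immediately.
        active = [r for r in active if r > 0]
    return steps

if __name__ == "__main__":
    lengths = [5, 1, 1, 1]
    print(static_batching_steps(lengths, 2), continuous_batching_steps(lengths, 2))
```

For `lengths = [5, 1, 1, 1]` and `max_batch = 2`, static batching needs 6 steps while continuous batching needs 5, because the short requests are served while the long one is still decoding.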

THIS is the REAL DEAL 🤯 for local LLMs

This is the stack that gets me over 4000 tokens per second locally. Download Docker Desktop here: https://dockr.ly/4mOdGMO to ...

Faster and Cheaper Offline Batch Inference with Ray

The popularity of machine learning (ML) in the real world has exploded recently, with offline ...

AI Inference: The Secret to AI's Superpowers

Download the AI model guide to learn more → https://ibm.biz/BdaJTb Learn more about the technology → https://ibm.biz/BdaJTp ...

Mastering LLM Inference Optimization From Theory to Cost Effective Deployment: Mark Moyou

LLM inference ...

The Fastest LLMs in 2025

Fastest ...

Optimizing LLM Inference Requests

Our new book club series is about ...

Fast, Cheap, and Accurate: Optimizing LLM Inference with vLLM and Quantization by Legare Kerrison

Fast ...