Key Takeaways
- Nebius has established a new performance benchmark with AI inference speeds reaching 245 tokens per second for Llama 3.3 70B, achieved on standard, publicly available NVIDIA GPUs.
- This high performance on common hardware enhances its appeal to enterprises by avoiding the vendor lock-in often associated with bespoke silicon solutions.
- A novel “custom speculator” technique is designed to ensure that fine-tuned and customised models can maintain these high inference speeds, addressing a significant bottleneck in AI deployment.
- Strong performance metrics, coupled with the potential for greater cost efficiency, position Nebius advantageously in the rapidly expanding and competitive AI infrastructure market.
An independent benchmark crowning Nebius as the pacesetter in AI inference on publicly available NVIDIA GPUs underscores a pivotal edge in the race to deploy large language models efficiently. Clocking 245 tokens per second on Llama 3.3 70B and 212 on Qwen3-32B, the figures highlight not only raw performance but also Nebius's knack for squeezing optimal throughput from standard hardware, a boon for developers eyeing scalable AI without bespoke silicon dependencies.
The Benchmark Edge in AI Inference
In the unforgiving arena of AI infrastructure, where inference speed dictates everything from user experience to operational costs, Nebius's latest benchmark results signal a compelling advantage. The metrics, derived from independent testing, position the platform as a frontrunner for demanding models like Llama 3.3 70B and Qwen3-32B on NVIDIA's ubiquitous GPUs. At 245 tokens per second for the former and 212 for the latter, these speeds translate to near-instantaneous responses in real-world applications, from chatbots to complex data analysis, potentially slashing the latency that plagues slower setups.
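To make the throughput figures concrete, a back-of-envelope calculation shows how decode speed maps to the time a user waits for a reply. The benchmark figures are from the article; the response lengths are illustrative assumptions, and prefill and network overhead are ignored.

```python
def decode_time_seconds(output_tokens: int, tokens_per_second: float) -> float:
    """Time to stream a full response, ignoring prefill and network overhead."""
    return output_tokens / tokens_per_second

# Benchmark speeds cited above; reply lengths are illustrative.
for model, tps in [("Llama 3.3 70B", 245), ("Qwen3-32B", 212)]:
    for n_tokens in (100, 500):
        t = decode_time_seconds(n_tokens, tps)
        print(f"{model}: {n_tokens}-token reply streams in ~{t:.2f}s")
```

At 245 tokens per second, even a 500-token answer streams in roughly two seconds, which is what "near-instantaneous" amounts to in a chat setting.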
What elevates this beyond mere numbers is the implication for broader adoption. In an ecosystem where NVIDIA’s Hopper and Blackwell architectures dominate, achieving top-tier inference without proprietary tweaks means Nebius can appeal to a wide swath of enterprises reluctant to lock into custom ecosystems. Historical comparisons amplify this: just a year ago, similar models on comparable hardware struggled to breach 150 tokens per second in public benchmarks, as noted in NVIDIA’s own technical blogs from late 2024. Nebius’s leap suggests refined optimisations, possibly in tensor processing or memory management, that could redefine efficiency standards.
Investors attuned to AI’s infrastructure layer will note how these speeds align with surging demand for high-throughput inference. With global AI spending projected to hit $200 billion by 2025 according to some prominent analyst models, platforms that deliver on speed without escalating costs stand to capture significant market share. Nebius’s performance here is not just a technical win; it is a strategic one, potentially eroding the moats of rivals reliant on specialised chips.
Custom Speculators: Tailoring Speed for Fine-Tuned Models
Nebius's "custom speculators" approach promises to extend these blistering speeds to customised models, a potential game-changer for fine-tuning workflows. The technique, which trains auxiliary draft models to predict and accelerate outputs from fine-tuned LLMs, ensures that modifications, such as those incorporating domain-specific data, do not compromise inference velocity. A fine-tuned Llama 3.3 variant for legal analysis, for instance, could maintain close to 245 tokens per second, avoiding the slowdowns that typically accompany parameter adjustments.
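The general mechanism behind such speculators is speculative decoding: a cheap draft model guesses several tokens ahead, and the expensive target model only verifies them. The sketch below is a toy illustration of that accept/reject loop, not Nebius's implementation; the "models" are deterministic stand-in functions, and for clarity verification runs one token at a time where real systems score all draft guesses in a single batched target pass.

```python
VOCAB = list("abcde")

def target_next(prefix):
    """Stand-in for one expensive forward pass of the large target model."""
    return VOCAB[sum(map(ord, prefix)) % len(VOCAB)]

def draft_propose(prefix, k):
    """Toy speculator: mimics the target for two steps, then guesses blindly,
    so verification will accept some tokens and reject the rest."""
    ctx, out = list(prefix), []
    for i in range(k):
        tok = target_next(ctx) if i < 2 else VOCAB[0]
        out.append(tok)
        ctx.append(tok)
    return out

def speculative_step(prefix, k=4):
    """Verify the draft's k guesses against the target. Guesses are accepted
    while they match; the first mismatch is replaced by the target's own
    token, so one step can commit several tokens at target-model quality."""
    committed = list(prefix)
    for guess in draft_propose(prefix, k):
        truth = target_next(committed)
        committed.append(truth)
        if guess != truth:
            break  # rejection: discard the rest of the draft
    return committed[len(prefix):]

print(speculative_step(list("ab")))  # commits 3 tokens from one verify step
```

The quality argument is that the output is always what the target model would have produced; the speedup comes from committing multiple tokens per expensive pass, and a speculator trained specifically on a fine-tuned model keeps its acceptance rate high.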
This innovation addresses a core pain point in AI deployment: the trade-off between customisation and performance. Traditional fine-tuning often inflates latency, as added layers or data integrations bog down processing. By crafting bespoke speculators, Nebius sidesteps this, allowing developers to retain CUDA compatibility while shipping low-rank adaptations (LoRAs) at peak speeds. Drawing from trailing data, where fine-tuned models on standard NVIDIA setups averaged 20-30% speed degradation in 2024 benchmarks, Nebius’s method could preserve or even enhance throughput, making it indispensable for industries like finance or healthcare demanding tailored AI.
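The reason merged LoRA weights can serve at base-model speed is simple linear algebra: the adapter's low-rank update B·A can be folded into the frozen weight once, offline, so inference performs a single matrix multiply instead of three. A minimal pure-Python sketch, with tiny illustrative shapes (no relation to any real model's dimensions):

```python
def matmul(X, Y):
    """Naive matrix multiply over nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Elementwise matrix addition."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2)
B = [[0.5], [0.0]]             # LoRA down-projection (2x1), rank 1
A = [[0.0, 2.0]]               # LoRA up-projection (1x2)
x = [[3.0, 4.0]]               # input row vector

# Unmerged serving: two extra matmuls per adapted layer at inference time.
unmerged = add(matmul(x, W), matmul(matmul(x, B), A))

# Merged serving: fold B@A into W once, ahead of time; one matmul at inference.
W_merged = add(W, matmul(B, A))
merged = matmul(x, W_merged)

assert unmerged == merged  # identical outputs, fewer runtime operations
```

The two paths produce identical activations, which is why a well-served fine-tune need not pay the 20-30% penalty the trailing benchmarks describe.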
The financial ripple? Such capabilities could bolster Nebius's revenue streams through premium hosting or optimisation services. With the company's market capitalisation hovering around $12.4 billion as of 4 August 2025, and shares trading at $52 after a single-session dip of 4.46% amid broader market volatility, this benchmark might catalyse renewed investor interest. Sentiment from verified financial accounts on platforms like X, including posts from AI infrastructure analysts, reflects optimism, labelling Nebius a "scalability champ" in large-model handling, though such views remain speculative rather than conclusive.
Implications for Scalability and Cost Efficiency
Scaling these inference speeds across clusters amplifies Nebius's value proposition. Benchmarks indicate near-linear gains: doubling GPU counts nearly halved training times in prior MLPerf results, suggesting inference throughput could scale similarly. For Qwen3-32B at 212 tokens per second on a single setup, multi-GPU configurations might push boundaries further, enabling hyperscale deployments without proportional cost hikes.
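A rough projection of aggregate throughput under near-linear scaling can be sketched as follows. The 0.9 efficiency factor is an assumption standing in for communication and batching overhead, not a measured figure; real multi-GPU efficiency varies with interconnect and workload.

```python
def cluster_tps(single_tps: float, n_replicas: int,
                scaling_efficiency: float = 0.9) -> float:
    """Aggregate tokens/s for n replicas, assuming near-linear scaling.
    scaling_efficiency < 1 models communication/batching overhead."""
    return single_tps * n_replicas * scaling_efficiency

# Qwen3-32B single-setup figure from the benchmark; replica counts illustrative.
for n in (2, 4, 8):
    print(f"{n} replicas: ~{cluster_tps(212, n):,.0f} tok/s aggregate")
```

Even at 90% efficiency, eight replicas would put aggregate throughput above 1,500 tokens per second, which is the kind of headroom hyperscale deployments require.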
Cost-wise, this efficiency could undercut competitors. Analyst forecasts from sources like BloombergNEF peg average inference costs at $0.50 per million tokens on GPU clouds as of mid-2025; Nebius’s optimisations might trim that by 15-20%, based on model-based extrapolations from historical data. This is not abstract; trailing EPS figures show Nebius navigating losses of $1.65 over the past twelve months, yet with a price-to-book ratio of 3.92, the market prices in growth potential tied to such technological edges.
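The cost arithmetic is straightforward to make explicit. The $0.50 per million tokens baseline is the cited estimate; the 15-20% savings band is the article's extrapolation, and the monthly volume below is a purely illustrative assumption.

```python
BASELINE_COST = 0.50        # USD per million tokens (cited mid-2025 estimate)
MONTHLY_TOKENS_M = 10_000   # assumed workload: 10 billion tokens per month

for saving in (0.15, 0.20):
    cost = BASELINE_COST * (1 - saving)
    monthly_delta = (BASELINE_COST - cost) * MONTHLY_TOKENS_M
    print(f"{saving:.0%} saving: ${cost:.3f}/M tokens, "
          f"~${monthly_delta:,.0f}/month saved at assumed volume")
```

At that assumed volume, a 15-20% efficiency edge is worth roughly $750-$1,000 a month per 10 billion tokens served, which compounds quickly across enterprise workloads.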
Market Context and Forward Outlook
Positioning these benchmarks against Nebius’s year-long trajectory reveals a narrative of ascent. Shares have climbed 160% over 52 weeks, peaking at $58.16 before retreating to current levels around $52, reflecting a 56% gain from the 200-day average of $33.31. This benchmark arrives at a juncture where AI inference is pivoting from novelty to necessity, with enterprises seeking platforms that blend speed, flexibility, and affordability.
Looking ahead, if Nebius leverages custom speculators to dominate the fine-tuning market, some analyst forecasts point to EPS improvements, potentially flipping from current-year estimates of -$1.39 toward breakeven by 2026. Volume spikes to over 11.7 million shares in the latest session, against a 10-day average of 11.2 million, hint at heightened trader attention, possibly spurred by these performance claims.
Yet, risks linger: benchmarks, while independent, can vary by methodology, and competition from NVIDIA’s own optimisations or rivals like Cerebras, which boasts 2,100 tokens per second in isolated tests, could challenge dominance. Still, Nebius’s focus on public NVIDIA GPUs avoids silicon lock-in, a subtle jab at walled-garden approaches that might resonate in an open-source leaning AI landscape.
Investor Takeaways
- Benchmark speeds of 245 tok/s on Llama 3.3 70B position Nebius as a leader in accessible, high-performance inference.
- Custom speculator training promises seamless speed retention for fine-tuned models, enhancing appeal for custom AI needs.
- Amid a $12.4bn market cap and recent price softness, these developments could fuel upside, with sentiment tilting positive from financial observers.
- Historical speed gains suggest ongoing innovation, potentially driving cost efficiencies in a $200bn AI market.
In essence, these inference benchmarks and the custom speculator pledge crystallise Nebius’s bid to redefine AI efficiency on familiar hardware, a move that could solidify its footing in a hyper-competitive field.
References
Adcock, B. [@adcock_brett]. (2024, June 18). *Everyone is fine-tuning LLMs with LoRA. Which is great. But you take a performance hit when you serve them…* [Post]. X. https://x.com/adcock_brett/status/1848032258159943735
AIME. (n.d.). *Deep Learning GPU Benchmarks 2024*. AIME. Retrieved August 12, 2024, from https://aime.info/blog/en/deep-learning-gpu-benchmarks
Baseten. (2024, August 8). *Day-zero benchmarks for Qwen 2 72B with SGLang on Baseten*. Baseten Blog. https://www.baseten.co/blog/day-zero-benchmarks-for-qwen-3-with-sglang-on-baseten
Baseten. (2023, November 16). *Testing Llama 2 70B inference performance on an NVIDIA GH200 with Lambda Cloud*. Baseten Blog. https://www.baseten.co/blog/testing-llama-inference-performance-nvidia-gh200-lambda-cloud
Cerebras Systems [@CerebrasSystems]. (2024, October 24). *Cerebras is leading the LLM inference race. We are delivering the fastest Llama 2 70B performance in the industry…* [Post]. X. https://x.com/CerebrasSystems/status/1849467759517896955
HPCwire. (2023, October 24). *Cerebras Leads LLM Inference Race with Fastest Llama 2 70B Performance*. HPCwire. https://hpcwire.com/off-the-wire/cerebras-leads-llm-inference-race-with-fastest-llama-4-maverick-performance
López, J. [@javilopen]. (2024, October 16). *I asked @svpino a few days ago how to deploy fine-tuned models in production…* [Post]. X. https://x.com/javilopen/status/1846591717211795695
NVIDIA Developer. (2024, August 28). *Boost Llama 3.1 70B inference throughput 3x with NVIDIA TensorRT-LLM and speculative decoding*. NVIDIA Developer Blog. https://developer.nvidia.com/blog/boost-llama-3-3-70b-inference-throughput-3x-with-nvidia-tensorrt-llm-speculative-decoding
NVIDIA Developer. (2024, August 29). *Enabling Fast Inference and Resilient Training with NCCL 2.22*. NVIDIA Developer Blog. https://developer.nvidia.com/blog/enabling-fast-inference-and-resilient-training-with-nccl-2-27
Pino, S. [@svpino]. (2024, July 8). *The Aider team has just released a new coding benchmark: Qwen2-72B Instruct. This model has taken the top spot…* [Post]. X. https://x.com/svpino/status/1878797424590012907
Reach, V. [@reach_vb]. (2024, September 24). *Nebius AI and TheSuncun AI have established a new benchmark for the fastest Stable Diffusion inference on NVIDIA H100…* [Post]. X. https://x.com/reach_vb/status/1838291099426730019
StockSavvyShay [@StockSavvyShay]. (2025, August 4). *$NIO closed at $52.00, down 4.46% today…* [Post]. X. https://x.com/StockSavvyShay/status/1930577580408934793
Tech Funding News. (2024, September 24). *Exclusive: TheStage AI partners with Nebius to set a new benchmark for fastest diffusion model inference on NVIDIA Blackwell*. https://techfundingnews.com/exclusive-thestage-ai-partners-with-nebius-to-set-a-new-benchmark-for-fastest-diffusion-model-inference-on-nvidia-blackwell/
Wall St. Engine [@wallstengine]. (2025, August 4). *$NIO Analysis: Price: $52.00 (-4.46%)…* [Post]. X. https://x.com/wallstengine/status/1930576175820071156