SGLang Spins Out as RadixArk with $400 Million Valuation
By Nino, Senior Tech Editor
The landscape of Large Language Model (LLM) inference is undergoing a seismic shift. Project SGLang, which began as an ambitious open-source research initiative at Ion Stoica’s Sky Computing Lab at UC Berkeley, has officially transitioned into a commercial entity named RadixArk. This spin-out comes with a staggering $400 million valuation and a fresh injection of capital led by Accel, signaling that the battle for inference efficiency is the new frontline of the AI arms race.
The Genesis of RadixArk: Beyond Research
For developers and enterprises using n1n.ai to access high-speed models, the name SGLang is likely familiar. It has quickly become one of the most respected frameworks for serving LLMs, rivaling established giants like vLLM and NVIDIA’s TensorRT-LLM. The transition to RadixArk marks a pivot from academic exploration to industrial scaling.
Ion Stoica, a co-founder of Databricks and Anyscale, has a track record of turning Berkeley research projects into multi-billion dollar enterprises (Apache Spark and Ray being the most notable). With RadixArk, the focus is squarely on solving the 'Inference Bottleneck'—the high cost and technical complexity of running models like Llama 3 or DeepSeek at scale.
The Technical Edge: Why RadixArk Matters
What makes SGLang (now RadixArk) different from other inference engines? The secret sauce lies in its architecture, specifically RadixAttention. Unlike the PagedAttention scheme used in vLLM, RadixAttention organizes the KV (Key-Value) cache as a radix tree, which enables automatic prefix caching and reuse across multiple requests.
When you use an API aggregator like n1n.ai, latency is often determined by how quickly the engine can process the prompt prefix. If multiple users are sending requests with the same system prompt or few-shot examples, RadixArk doesn't recompute those tokens. It retrieves them from the cache instantly. This can lead to throughput improvements of up to 5x in complex, multi-turn conversations or structured data extraction tasks.
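The mechanism can be illustrated with a toy model. The sketch below is not SGLang's actual implementation (which stores KV-cache pages, not booleans, and uses a compressed radix tree); it only shows the core idea, which is that requests sharing a prompt prefix walk the same tree path, so only the unshared suffix costs fresh computation:

```python
# Toy radix-style prefix cache (illustrative, NOT SGLang's real code):
# shared prompt prefixes live once in a trie, so a repeated request only
# "computes" the tokens that diverge from previously seen prompts.

class RadixNode:
    def __init__(self):
        self.children = {}   # token -> RadixNode
        self.cached = False  # True once this prefix position is "computed"

class PrefixCache:
    def __init__(self):
        self.root = RadixNode()

    def process(self, tokens):
        """Return how many tokens had to be freshly computed."""
        node, computed = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                child = RadixNode()
                node.children[tok] = child
            if not child.cached:
                child.cached = True  # simulate computing + caching this KV entry
                computed += 1
            node = child
        return computed

cache = PrefixCache()
system_prompt = ["You", "are", "a", "helpful", "assistant", "."]
req1 = cache.process(system_prompt + ["What", "is", "AI", "?"])   # all 10 fresh
req2 = cache.process(system_prompt + ["Tell", "a", "joke", "."])  # only 4 fresh
print(req1, req2)
```

With a real 2000-token system prompt, the second request would skip 2000 prefill tokens the same way, which is where the multi-turn throughput gains come from.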
Key Architectural Innovations:
- Structured Generation: SGLang allows developers to define the output format (like JSON) using a domain-specific language that accelerates the decoding process.
- Compressed KV Cache: Efficient management of memory allows for larger batch sizes even on consumer-grade hardware.
- Multi-GPU Orchestration: Built-in support for tensor parallelism and pipeline parallelism ensures that massive models can be served with minimal overhead.
The $400M Valuation and the Inference Explosion
The funding from Accel highlights a critical trend: while 2023 was the year of training, 2024 and 2025 are the years of inference. Enterprises are moving from 'playing with models' to 'deploying production applications.' In this phase, the cost per 1M tokens becomes the most important metric.
RadixArk enters a market where efficiency translates directly into profit. By optimizing how tokens are generated, they allow companies to reduce their GPU cloud bills by 30-70%. For developers utilizing the n1n.ai platform, these optimizations mean lower prices and higher reliability for the top-tier LLMs they rely on.
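The arithmetic behind those savings is straightforward. The numbers below are hypothetical placeholders, not vendor pricing: if a GPU costs a fixed amount per hour, doubling sustained throughput halves the cost per million tokens.

```python
# Back-of-the-envelope inference cost model (hypothetical numbers,
# not actual GPU or API pricing).

def cost_per_million_tokens(gpu_hourly_usd, tokens_per_second):
    """Fixed hourly GPU cost divided across the tokens it generates."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

# Same $2.50/hr GPU, before and after a 2x throughput optimization
baseline  = cost_per_million_tokens(gpu_hourly_usd=2.50, tokens_per_second=1500)
optimized = cost_per_million_tokens(gpu_hourly_usd=2.50, tokens_per_second=3000)
savings = 1 - optimized / baseline

print(f"${baseline:.2f} -> ${optimized:.2f} per 1M tokens ({savings:.0%} saved)")
```

This is why engine-level optimizations like prefix caching flow directly to the bottom line: the hardware bill is constant, so every extra token per second is pure margin.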
Implementation: Leveraging SGLang/RadixArk Logic
To understand the power of the RadixArk approach, consider a scenario where you need to generate a structured JSON object. Using standard libraries, the model might waste tokens on unnecessary whitespace or incorrect formatting. With SGLang's logic, the constraints are baked into the inference process.
```python
# Example of SGLang structured generation logic
from sglang import function, system, user, assistant, select

@function
def multi_choice_qa(s, question, options):
    s += system("You are a helpful assistant.")
    s += user(question)
    # select() constrains decoding so the model can ONLY emit one of the
    # provided options, reducing latency and ruling out malformed answers.
    s += assistant("The answer is " + select("answer", options))
```
This level of control is what makes RadixArk a formidable competitor to OpenAI's proprietary structured output features.
Comparison: RadixArk vs. vLLM vs. TensorRT-LLM
| Feature | RadixArk (SGLang) | vLLM | TensorRT-LLM |
|---|---|---|---|
| Caching Mechanism | Radix Tree (Automatic) | PagedAttention | Manual Management |
| Ease of Use | High (Python-native) | High | Medium (Complex Build) |
| Structured Output | Native Support | Limited | Via external libs |
| Performance | Best for complex/long context | Best for general throughput | Best for raw NVIDIA speed |
Pro Tips for Developers
- Prefix Caching is King: If your application uses long system prompts (e.g., a 2000-token knowledge base), SGLang/RadixArk will be significantly faster than standard engines because it only processes those tokens once.
- Monitor Throughput: When scaling, don't just look at 'Time to First Token' (TTFT). Look at 'Tokens Per Second' (TPS) under load. RadixArk excels at maintaining high TPS even when the GPU is at 90% utilization.
- Use n1n.ai for Testing: Before committing to a specific hosting provider, use n1n.ai to test different backend implementations. This allows you to benchmark performance without managing the infrastructure yourself.
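When benchmarking as suggested above, TTFT and TPS fall out of the per-token timestamps that any streaming API already gives you. A minimal sketch (the timestamps here are illustrative, not real measurements):

```python
# Computing TTFT and decode TPS from per-token arrival times, e.g. as
# recorded while consuming a streaming completion. Values are synthetic.

def ttft_and_tps(request_start, token_timestamps):
    """TTFT: delay until the first token; TPS: decode rate over the stream."""
    ttft = token_timestamps[0] - request_start
    duration = token_timestamps[-1] - token_timestamps[0]
    n_decoded = len(token_timestamps) - 1  # tokens after the first
    tps = n_decoded / duration if duration > 0 else float("inf")
    return ttft, tps

# First token at t=0.25 s, then 9 more at 50 ms intervals
stamps = [0.25 + 0.05 * i for i in range(10)]
ttft, tps = ttft_and_tps(0.0, stamps)
print(f"TTFT={ttft:.2f}s  TPS={tps:.1f}")
```

Tracking both numbers under increasing concurrency is what reveals whether an engine degrades gracefully: TTFT tells you about queueing, TPS about sustained decode throughput.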
Conclusion: The Future of Efficient AI
The spin-out of RadixArk is more than just a corporate move; it is a maturation of the AI ecosystem. As SGLang evolves into a commercial product, we can expect even tighter integrations with hardware accelerators and better support for the latest models like DeepSeek-V3 and Llama-3.1.
For the developer community, this competition is a win. It drives down costs and raises the bar for what we consider 'fast' AI. Whether you are building an autonomous agent or a simple chatbot, keeping an eye on RadixArk's developments will be crucial for maintaining a competitive edge.
Get a free API key at n1n.ai