Introducing SGLang: Streamlined LLM Programming
Large language models (LLMs) are increasingly used for complex tasks that involve many generation calls, advanced prompting techniques, control flow, and interaction with external environments. Yet efficient systems for programming and executing such applications are still missing. SGLang, a Structured Generation Language, is designed to close this gap. It provides primitives for common LLM programming patterns and is implemented as a domain-specific language embedded in Python, with an interpreter, a compiler, and a high-performance runtime. Together, these components enable optimizations such as parallelism, batching, caching, sharing, and other compilation techniques.

We also introduce RadixAttention, a technique that keeps the Key-Value (KV) cache in a radix tree managed with a Least Recently Used (LRU) eviction policy, so that KV cache entries are automatically reused across multiple generation calls at runtime. The result is simpler LLM program development and much faster execution: in our experiments, SGLang achieves speedups of up to 5x on common LLM tasks, while reducing code complexity and giving finer control over generation.
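To give a sense of what programming with the frontend looks like, here is a minimal sketch of a multi-turn program using SGLang's Python primitives (sgl.function, sgl.gen, sgl.user, sgl.assistant). Exact argument names and defaults may differ across versions, and the local endpoint URL is an assumption for illustration.

```python
import sglang as sgl

# Point the frontend at a running SGLang server.
# The URL is assumed for illustration; adjust it to your deployment.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def multi_turn_qa(s, question_1, question_2):
    # Each sgl.gen is a separate generation call; the runtime can reuse the
    # KV cache of the shared earlier turns instead of recomputing it.
    s += sgl.system("You are a concise assistant.")
    s += sgl.user(question_1)
    s += sgl.assistant(sgl.gen("answer_1", max_tokens=128))
    s += sgl.user(question_2)
    s += sgl.assistant(sgl.gen("answer_2", max_tokens=128))

state = multi_turn_qa.run(
    question_1="What is a radix tree?",
    question_2="How does an LRU policy decide what to evict?",
)
print(state["answer_1"])
print(state["answer_2"])
```

Because both generation calls extend the same prompt prefix, the second call can start from the cached state of the first rather than re-prefilling the conversation from scratch.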
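To build intuition for RadixAttention, the following is a deliberately simplified sketch of prefix reuse with LRU eviction. It uses a flat map of token prefixes rather than a true radix tree, and placeholder strings instead of KV tensors; the actual RadixAttention runtime manages KV cache tensors on the GPU inside a radix tree.

```python
from collections import OrderedDict

class PrefixCache:
    """Toy illustration of prefix reuse with LRU eviction.

    This is not SGLang's implementation: it only maps token-prefix tuples
    to placeholder entries to show how shared prefixes avoid recomputation.
    """

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries: "OrderedDict[tuple, str]" = OrderedDict()

    def longest_cached_prefix(self, tokens: list) -> int:
        # Find the longest stored prefix of `tokens`; those positions
        # need no KV-cache recomputation.
        for length in range(len(tokens), 0, -1):
            key = tuple(tokens[:length])
            if key in self.entries:
                self.entries.move_to_end(key)  # mark as recently used
                return length
        return 0

    def insert(self, tokens: list) -> None:
        key = tuple(tokens)
        self.entries[key] = f"kv-cache-for-{len(tokens)}-tokens"
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used

cache = PrefixCache(capacity=4)
cache.insert([1, 2, 3])  # e.g. a shared system prompt
hit = cache.longest_cached_prefix([1, 2, 3, 4, 5])
print(f"Reused cached state for the first {hit} tokens")
```

The same idea, applied to real KV tensors and organized as a radix tree, is what lets SGLang share computation across generation calls that begin with the same prompt prefix.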