KVSplit: Differentiated KV Cache Quantization for Apple Silicon

2025-05-16

KVSplit reduces the memory footprint of LLM inference on Apple Silicon by quantizing the keys and the values in the attention mechanism's KV cache at different precisions. Memory savings reach up to 72% with minimal quality loss. The K8V4 configuration (8-bit keys, 4-bit values) offers the best balance: a 59% memory reduction with only a 0.86% perplexity increase, along with faster inference. KVSplit ships with an easy installer and a comprehensive benchmark suite for evaluating different configurations. The memory savings make longer context windows and larger models practical on Apple devices.
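The memory figures can be sanity-checked with some back-of-the-envelope arithmetic. The sketch below is not KVSplit's own code; it assumes an FP16 baseline and block-wise quantization with 32-element blocks and one 16-bit scale per block (the layout used by llama.cpp's q8_0 and q4_0 formats). The function names and the model shape are hypothetical, chosen only for illustration.

```python
BLOCK_SIZE = 32   # assumption: elements per quantization block
SCALE_BITS = 16   # assumption: one 16-bit scale stored per block


def effective_bits(nominal_bits: int) -> float:
    """Bits per element including per-block scale overhead.

    A value of 16 is treated as unquantized FP16 (no scale overhead).
    """
    if nominal_bits == 16:
        return 16.0
    return nominal_bits + SCALE_BITS / BLOCK_SIZE


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, k_bits, v_bits):
    """Approximate KV cache size in bytes for one sequence."""
    # Number of elements in the key tensor (the value tensor has the same count).
    elements = n_layers * n_kv_heads * head_dim * n_tokens
    return elements * (effective_bits(k_bits) + effective_bits(v_bits)) / 8


def reduction_vs_fp16(k_bits, v_bits):
    """Fraction of KV cache memory saved relative to FP16 keys and values."""
    return 1.0 - (effective_bits(k_bits) + effective_bits(v_bits)) / 32.0


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128, n_tokens=8192)
    for name, k, v in [("FP16", 16, 16), ("K8V4", 8, 4), ("K4V4", 4, 4)]:
        mib = kv_cache_bytes(**shape, k_bits=k, v_bits=v) / 2**20
        print(f"{name}: {mib:.0f} MiB, {reduction_vs_fp16(k, v):.1%} smaller than FP16")
```

Under these assumptions, K8V4 works out to roughly a 59% reduction and 4-bit keys with 4-bit values to roughly 72%, in line with the figures quoted above. The perplexity and speed numbers cannot be derived this way and have to come from benchmarks.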

Development