KVSplit: Differentiated KV Cache Quantization for Apple Silicon
2025-05-16
KVSplit reduces the memory footprint of LLM inference on Apple Silicon by quantizing the keys and the values in the attention mechanism's KV cache at different precisions. This cuts KV cache memory by up to 72% with minimal quality loss. The K8V4 configuration (8-bit keys, 4-bit values) is reported to offer the best balance: a 59% memory reduction with only a 0.86% increase in perplexity, along with faster inference. KVSplit ships with an installer and a benchmark suite for comparing configurations, and the memory it frees makes longer context windows and larger models practical on Apple devices.
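The memory figures can be sanity-checked with simple arithmetic. The sketch below is not KVSplit's code. It assumes llama.cpp-style block formats, where each block of 32 elements carries one 16-bit scale, so "8-bit" storage costs about 8.5 bits per element and "4-bit" storage about 4.5. The function names and the model dimensions in the usage lines are hypothetical, chosen only for illustration.

```python
# Effective bits per cached element. ASSUMPTION: llama.cpp-style block
# quantization (32 elements per block plus one 16-bit scale), which the
# summary above does not state explicitly.
BITS_PER_ELEMENT = {
    "f16": 16.0,   # unquantized baseline
    "q8_0": 8.5,   # (32 * 8 + 16) / 32
    "q4_0": 4.5,   # (32 * 4 + 16) / 32
}


def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, k_type, v_type):
    """Estimate KV cache size in bytes for one sequence.

    Keys and values each store n_layers * n_ctx * n_kv_heads * head_dim
    elements, possibly at different precisions.
    """
    elements = n_layers * n_ctx * n_kv_heads * head_dim
    total_bits = elements * (BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type])
    return total_bits / 8


def reduction_vs_f16(k_type, v_type):
    """Fractional memory saving relative to an all-FP16 cache.

    Model dimensions cancel out, so only the bit widths matter.
    """
    baseline = 2 * BITS_PER_ELEMENT["f16"]
    mixed = BITS_PER_ELEMENT[k_type] + BITS_PER_ELEMENT[v_type]
    return 1.0 - mixed / baseline


if __name__ == "__main__":
    # Hypothetical small model, used only to show absolute sizes.
    dims = dict(n_layers=22, n_ctx=8192, n_kv_heads=4, head_dim=64)
    for name, (k, v) in {
        "FP16": ("f16", "f16"),
        "K8V4": ("q8_0", "q4_0"),
        "K4V4": ("q4_0", "q4_0"),
    }.items():
        size_mib = kv_cache_bytes(k_type=k, v_type=v, **dims) / 2**20
        saving = 100 * reduction_vs_f16(k, v)
        print(f"{name}: {size_mib:.1f} MiB, {saving:.1f}% smaller than FP16")
```

Under these assumptions the K8V4 saving comes out near 59% and the all-4-bit saving near 72%, in line with the figures quoted above. Swapping the two precisions (4-bit keys, 8-bit values) would use the same amount of memory, so the choice of which side keeps the higher precision is purely a quality question.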
Development