SeedLM: A Novel LLM Weight Compression Method Using Pseudo-Random Number Generators

Large Language Models (LLMs) are hindered by high runtime costs, which limit widespread deployment. Apple researchers introduce SeedLM, a novel post-training compression method that uses seeds of a pseudo-random number generator to encode and compress model weights. During inference, SeedLM uses a Linear Feedback Shift Register (LFSR) to efficiently regenerate a random matrix, which is linearly combined with a small set of compressed coefficients to reconstruct each weight block. This trades compute for fewer memory accesses, using otherwise idle compute cycles to speed up memory-bound tasks. Unlike state-of-the-art methods that require calibration data, SeedLM is data-free and generalizes well across diverse tasks. Experiments on Llama 3 70B, which is particularly challenging to compress, show zero-shot accuracy at 4- and 3-bit compression that matches or exceeds state-of-the-art methods while remaining comparable to the FP16 baseline. FPGA tests demonstrate that 4-bit SeedLM approaches a 4x speed-up over an FP16 Llama 2/3 baseline as model size increases.
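To make the decode path concrete, here is a minimal NumPy sketch of the reconstruction idea: a stored seed drives a small LFSR that regenerates a pseudo-random matrix U, and the stored low-bit coefficients t recover a weight block as U @ t. The register width, tap positions, block size, coefficient count, and value mapping below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def lfsr_sequence(seed: int, length: int, nbits: int = 16,
                  taps=(16, 14, 13, 11)) -> np.ndarray:
    """Generate `length` pseudo-random values from a Fibonacci LFSR.

    The 16-bit register and tap set here are illustrative; SeedLM uses
    hardware-friendly LFSRs, but its exact parameters may differ.
    """
    mask = (1 << nbits) - 1
    state = seed & mask
    assert state != 0, "LFSR seed must be non-zero"
    values = []
    for _ in range(length):
        # XOR the tapped bits to form the feedback bit.
        fb = 0
        for t in taps:
            fb ^= (state >> (t - 1)) & 1
        state = ((state << 1) | fb) & mask
        # Map the register state to a value roughly in [-1, 1).
        values.append(state / float(1 << (nbits - 1)) - 1.0)
    return np.array(values)

def reconstruct_block(seed: int, coeffs: np.ndarray, block_size: int) -> np.ndarray:
    """Reconstruct one weight block as U @ t.

    U is regenerated on the fly from `seed`; only the seed and the
    low-bit coefficients `coeffs` need to be fetched from memory.
    """
    latent_dim = coeffs.shape[0]
    u = lfsr_sequence(seed, block_size * latent_dim).reshape(block_size, latent_dim)
    return u @ coeffs

# Toy usage: a block of 8 weights stored as one 16-bit seed plus 3 coefficients.
block = reconstruct_block(seed=0xBEEF, coeffs=np.array([0.5, -0.25, 0.125]), block_size=8)
print(block)
```

The point of the sketch is the memory/compute trade: at inference time only the seed and a handful of coefficients are read from memory, while the much larger matrix U is recomputed cheaply in hardware.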