Extracting Training Data from LLMs: Reversing Knowledge Compression
Researchers have developed a technique for extracting structured datasets from large language models (LLMs), effectively reversing the process by which LLMs compress massive amounts of training data into their parameters. The method uses hierarchical topic exploration to traverse the model's knowledge space systematically, generating training examples that capture both factual knowledge and reasoning patterns. Applied to open-source models such as Qwen3-Coder, GPT-OSS, and Llama 3, the technique has yielded tens of thousands of structured training examples. The resulting datasets have applications in model analysis, knowledge transfer, training-data augmentation, and model debugging, opening new avenues for model interpretability and cross-model knowledge transfer.
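To make the traversal concrete, here is a minimal sketch of hierarchical topic exploration: the model is recursively asked to enumerate subtopics of a root topic, and at the leaves it is prompted to produce instruction/response pairs that are written out as a JSONL dataset. The `llm(prompt) -> str` completion function, the prompt wording, the depth and branching limits, and the record schema are all illustrative assumptions, not the researchers' exact method.

```python
import json
from typing import Callable, Iterator


def explore(llm: Callable[[str], str], topic: str,
            depth: int = 0, max_depth: int = 2,
            branching: int = 5) -> Iterator[dict]:
    """Depth-first walk of the model's topic hierarchy.

    Interior nodes are expanded into subtopics; leaf nodes are turned
    into structured instruction/response training examples.
    """
    if depth < max_depth:
        # Ask the model to enumerate subtopics, one per line.
        listing = llm(
            f"List {branching} specific subtopics of '{topic}', one per line."
        )
        subtopics = (line.strip("-* ").strip() for line in listing.splitlines())
        for sub in filter(None, subtopics):
            yield from explore(llm, sub, depth + 1, max_depth, branching)
    else:
        # At a leaf, elicit a question and the model's own step-by-step
        # answer, capturing both factual content and reasoning patterns.
        question = llm(f"Write one challenging question about: {topic}")
        answer = llm(f"Answer step by step: {question}")
        yield {"topic": topic, "instruction": question, "response": answer}


def extract_dataset(llm: Callable[[str], str], root: str, path: str) -> None:
    """Stream extracted examples to a JSONL file, one record per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in explore(llm, root):
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Because the walk is depth-first and the records are streamed to disk as they are generated, memory use stays bounded even when the topic tree fans out to tens of thousands of leaves; a production pipeline would likely add deduplication and quality filtering on top of this skeleton.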