Anomalous Tokens in DeepSeek: A Catalog of Glitches

2025-01-25

A researcher has cataloged a set of 'anomalous tokens' in the open-source large language models DeepSeek-V3 and R1. When given as input, these tokens cause the model to behave erratically, for example by replacing words with unusual Unicode characters, acronyms, or emojis. The researcher tested every token in DeepSeek's vocabulary, then identified and categorized the glitches. Some tokens, dubbed 'fragment tokens,' show anomalies only in specific contexts. Others, like 'Nameeee' and 'EDMFunc', consistently produce peculiar substitutions: 'Nameeee' frequently yields 'M'-related words or symbols, while 'EDMFunc' favors words starting with 'H' and Japanese names. Numerous non-English anomalous tokens were also found, primarily from Cebuano and other Philippine languages. Special tokens like '<|end of thinking|>' can disrupt the model's functionality further. The catalog provides a starting point for further investigation into how individual vocabulary entries affect LLM behavior.
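
The token-by-token sweep described above can be sketched roughly as follows. This is a minimal illustration, not the researcher's actual code: the prompt wording, the "token missing from the reply" test, and the names find_anomalous_tokens, ask_model, and fake_model are all assumptions introduced here. The fake_model stand-in and its reply for 'Nameeee' are invented purely so the sketch runs without access to a real model.

```python
from typing import Callable, Dict, Iterable


def find_anomalous_tokens(
    vocab: Iterable[str],
    ask_model: Callable[[str], str],
) -> Dict[str, str]:
    """Return {token: reply} for every token the model fails to repeat back.

    `ask_model` is a hypothetical callable that sends a prompt to the model
    and returns its text reply; swap in a real client as needed.
    """
    anomalies: Dict[str, str] = {}
    for token in vocab:
        # Assumed prompt format; the original study may have used another.
        prompt = f"Repeat the following string exactly: '{token}'"
        reply = ask_model(prompt)
        # Flag the token if its visible text is absent from the reply.
        # Whitespace-only tokens strip to "" and are never flagged here.
        if token.strip() not in reply:
            anomalies[token] = reply
    return anomalies


def fake_model(prompt: str) -> str:
    """Toy stand-in for a model: echoes the quoted token, except one glitch."""
    token = prompt[prompt.index("'") + 1 : prompt.rindex("'")]
    if token == "Nameeee":
        return "M"  # invented reply, for demonstration only
    return token


result = find_anomalous_tokens(["hello", "Nameeee", " world"], fake_model)
print(result)
```

A substring check like this is deliberately crude. It would catch consistent substitutions, but 'fragment tokens' that misbehave only in certain contexts would need several prompts per token, and every flagged reply would still need manual review before being categorized.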