Tokenization Problem Proven NP-Complete, Underscoring Data Compression Challenges
2024-12-22
A paper published on arXiv proves that two variants of tokenization are NP-complete. Both are framed as the problem of compressing a dataset to at most δ symbols, either by finding a vocabulary directly (direct tokenization) or by selecting a sequence of merge operations (bottom-up tokenization). The result means that, unless P = NP, no polynomial-time algorithm can guarantee an optimal tokenizer, so practical methods for large-scale datasets must rely on heuristics without optimality guarantees. This has significant implications for data compression and natural language processing.
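The bottom-up variant corresponds to the setting of byte-pair-encoding-style tokenizers, which repeatedly merge the most frequent adjacent pair of symbols. The Python sketch below illustrates such a greedy merge loop driven toward a target of at most δ symbols; the function name greedy_bpe_merges, the toy input, and the stopping rule are illustrative assumptions rather than the paper's construction, and the greedy choice carries no optimality guarantee, which is exactly what the NP-completeness result says cannot be achieved efficiently unless P = NP.

```python
from collections import Counter

def greedy_bpe_merges(dataset, delta):
    """Greedily apply BPE-style merges until the dataset is at most
    `delta` symbols long (or no further merge is possible).

    This is a heuristic sketch, not the paper's algorithm: it returns
    *a* merge sequence, with no guarantee it is the shortest or optimal one.
    """
    symbols = list(dataset)  # start from individual characters
    merges = []
    while len(symbols) > delta:
        # Count adjacent symbol pairs and pick the most frequent one.
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break  # nothing left to merge
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # Replace non-overlapping occurrences of the pair with a merged symbol.
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                merged.append(a + b)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges, symbols

merges, compressed = greedy_bpe_merges("abababcababab", delta=6)
print(merges)      # [('a', 'b'), ('ab', 'ab')]
print(compressed)  # ['abab', 'ab', 'c', 'abab', 'ab'] -- 5 symbols, within delta
```

Even on this toy string the greedy rule happens to reach the target, but in general deciding whether *any* merge sequence reaches a given δ is exactly the bottom-up tokenization problem the paper shows to be NP-complete.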