The Curious Case of Emoji Length in JavaScript: UTF-8, UTF-16, UTF-32, and Grapheme Clusters

2025-08-22

This article examines why the same emoji string reports different lengths in different programming languages. For example, "🤦🏼‍♂️".length is 7 in JavaScript, while the same string has length 5 in Python and 17 in Rust. The difference comes from what each language counts: JavaScript counts UTF-16 code units, Python counts Unicode scalar values, and Rust counts UTF-8 bytes, whereas a reader sees a single extended grapheme cluster. The author argues that it is reasonable for a string to store its length in its native encoding, but that other lengths (such as the number of extended grapheme clusters) should be computed on demand to avoid extra storage overhead and synchronization issues. The article also weighs the pros and cons of the different encoding schemes, highlighting UTF-8's advantages for storage and interchange. Finally, it takes up the problem of fair length quotas, using translations of the Universal Declaration of Human Rights to show that no simple length measure treats all languages equally in terms of information density.
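The four counts mentioned above can be reproduced in JavaScript alone. The following is a minimal sketch, not code from the article: the helper name `lengths` is hypothetical, and it assumes a runtime that provides `TextEncoder` and `Intl.Segmenter` (recent browsers and Node.js releases do). The emoji is written with escapes so that its five code points are visible.

```javascript
// Hypothetical helper: measures one string in four different units.
function lengths(s) {
  // Assumes Intl.Segmenter is available in the runtime.
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    // What String.prototype.length reports: UTF-16 code units.
    utf16CodeUnits: s.length,
    // String iteration walks code points; for well-formed strings
    // (no lone surrogates) this equals the number of scalar values.
    scalarValues: [...s].length,
    // Bytes needed to store the string as UTF-8.
    utf8Bytes: new TextEncoder().encode(s).length,
    // Extended grapheme clusters, computed on demand.
    graphemeClusters: [...segmenter.segment(s)].length,
  };
}

// U+1F926 FACE PALM, U+1F3FC skin tone modifier, U+200D ZERO WIDTH JOINER,
// U+2642 MALE SIGN, U+FE0F VARIATION SELECTOR-16.
const facepalm = "\u{1F926}\u{1F3FC}\u{200D}\u{2642}\u{FE0F}";

console.log(lengths(facepalm));
// utf16CodeUnits: 7, scalarValues: 5, utf8Bytes: 17, graphemeClusters: 1
```

Only the first count is a stored property lookup; the other three walk the string each time they are called, which matches the article's point that non-native lengths are better computed when needed than kept alongside the string.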

Development String Encoding