Massive Dataset CommonPool Leaks Sensitive Personal Information

2025-07-31

A new study finds that CommonPool, a dataset of 12.8 billion image-text pairs, contains large amounts of sensitive personal information. The exposed material includes credit cards, driver's licenses, passports, birth certificates, and resumes, along with details such as medical history and race. CommonPool was built as a follow-up to LAION-5B, the dataset used to train models including Stable Diffusion and Midjourney, and it has been downloaded more than 2 million times, so this private information has likely spread widely and poses significant privacy risks. The researchers urge dataset builders to give greater attention to data privacy and ethics when assembling large-scale datasets.
