Massive Dataset CommonPool Leaks Sensitive Personal Information
2025-07-31

A new study finds that CommonPool, a dataset of 12.8 billion image-text pairs, contains large amounts of sensitive personal information, including credit cards, driver's licenses, passports, birth certificates, and resumes, along with details such as medical history and race. CommonPool was built as a follow-up to LAION-5B, the dataset used to train models including Stable Diffusion and Midjourney, and it has been downloaded more than 2 million times, so this private information has likely been widely disseminated, posing significant privacy risks. The researchers urge greater attention to data privacy and ethics when building large-scale datasets.
AI
dataset