Personally identifiable information has been found in DataComp CommonPool, one of the largest open-source data sets used to train image generation models. Millions of images of passports, credit cards ...
Old family documents and photos often contain valuable information, but handwriting can be hard to decipher, records may be in unfamiliar languages, and portraits or gravestones may lack context.