I’m considering ways to cut down the number of fingerprints we have, and wanted to solicit some feedback before I go ham.
For context, the number of unique fingerprints we have, by algorithm:
MD5: 2,604,879
OSHASH: 13,744,235
PHASH: 3,433,973
The number of user fingerprint submissions, by algorithm:
MD5: 3,320,305
OSHASH: 35,262,357
PHASH: 30,241,373
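To put the two lists side by side, here is a quick Python sketch that divides submissions by unique fingerprints for each algorithm. The counts are copied from above; the variable and function names are just made up for illustration.

```python
# Counts copied from the two lists above.
unique = {"MD5": 2_604_879, "OSHASH": 13_744_235, "PHASH": 3_433_973}
submissions = {"MD5": 3_320_305, "OSHASH": 35_262_357, "PHASH": 30_241_373}


def submissions_per_fingerprint(unique_counts, submission_counts):
    """Average number of user submissions behind each unique fingerprint."""
    return {
        algo: submission_counts[algo] / unique_counts[algo]
        for algo in unique_counts
    }


ratios = submissions_per_fingerprint(unique, submissions)
for algo, ratio in ratios.items():
    print(f"{algo}: {ratio:.2f} submissions per unique fingerprint")
```

That works out to roughly 1.3 submissions per unique MD5, 2.6 per OSHASH, and 8.8 per PHASH, so PHASH submissions are concentrated on far fewer distinct values.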
The question more or less is whether we should keep MD5 and OSHASH around, or purge them and restrict further submissions. For most practical purposes PHASH is superior to both. It doesn’t let you verify files, but that’s not something we can really use hashes for currently either.
The number of scenes which have either an OSHASH or an MD5 but no PHASH is 1,438, which isn’t nothing, but it’s a vanishingly small percentage of all scenes (0.148%).
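For anyone who wants to sanity check that kind of count, here is a minimal sketch of the query shape, run against a toy in-memory SQLite table. Only the scene_fingerprints table name comes from the real database; the column names (scene_id, algorithm, hash) and the sample rows are assumptions for illustration, and the real schema may well differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified schema: the real table likely has more columns.
conn.execute(
    "CREATE TABLE scene_fingerprints (scene_id INTEGER, algorithm TEXT, hash TEXT)"
)
sample_rows = [
    (1, "PHASH", "a1"), (1, "OSHASH", "b1"),  # has a PHASH, unaffected
    (2, "MD5", "c2"),                         # MD5 only
    (3, "OSHASH", "d3"), (3, "MD5", "e3"),    # OSHASH and MD5, no PHASH
    (4, "PHASH", "f4"),                       # PHASH only
]
conn.executemany("INSERT INTO scene_fingerprints VALUES (?, ?, ?)", sample_rows)

# Scenes with at least one MD5/OSHASH fingerprint and no PHASH at all.
QUERY = """
SELECT COUNT(DISTINCT scene_id)
FROM scene_fingerprints
WHERE algorithm IN ('MD5', 'OSHASH')
  AND scene_id NOT IN (
      SELECT scene_id FROM scene_fingerprints WHERE algorithm = 'PHASH'
  )
"""
(without_phash,) = conn.execute(QUERY).fetchone()
(total,) = conn.execute(
    "SELECT COUNT(DISTINCT scene_id) FROM scene_fingerprints"
).fetchone()

print(f"{without_phash} of {total} scenes would lose all fingerprints")
```

In this toy data the total only counts scenes that have at least one fingerprint; a percentage of all scenes would need the scene count from wherever scenes themselves are stored.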
The advantage of getting rid of them is that we save a ton of database space (the scene_fingerprints table indexes are 8GB alone), and it’ll be easier to manage bad fingerprints since there are way fewer PHASHes.