Fingerprint cleanup and algorithm restrictions

I’m considering ways to reduce the number of fingerprints we store, and wanted to solicit some feedback before I go ham.

For context, the number of unique fingerprints we have, by algorithm:

MD5: 2,604,879
OSHASH: 13,744,235
PHASH: 3,433,973

The number of user fingerprint submissions, by algorithm:

MD5: 3,320,305
OSHASH: 35,262,357
PHASH: 30,241,373
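
To put the two lists side by side, here is a rough sketch that divides submissions by unique fingerprints for each algorithm. The numbers are copied from the lists above; the variable and function names are only illustrative. It works out to roughly 1.3 submissions per unique MD5, 2.6 per OSHASH, and 8.8 per PHASH.

```python
# Counts copied from the two lists in this post.
unique = {"MD5": 2_604_879, "OSHASH": 13_744_235, "PHASH": 3_433_973}
submissions = {"MD5": 3_320_305, "OSHASH": 35_262_357, "PHASH": 30_241_373}


def submissions_per_fingerprint(unique, submissions):
    """Average number of user submissions per unique fingerprint."""
    return {algo: submissions[algo] / unique[algo] for algo in unique}


def share_without_phash(counts):
    """Fraction of a count that belongs to MD5 and OSHASH combined."""
    return (counts["MD5"] + counts["OSHASH"]) / sum(counts.values())


ratios = submissions_per_fingerprint(unique, submissions)
for algo, ratio in ratios.items():
    print(f"{algo}: {ratio:.2f} submissions per unique fingerprint")

print(f"MD5 + OSHASH share of unique fingerprints: {share_without_phash(unique):.1%}")
print(f"MD5 + OSHASH share of submissions: {share_without_phash(submissions):.1%}")
```

The last two lines give a sense of how much a purge would remove, both in unique fingerprints and in submissions.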

The question more or less is whether we should keep MD5 and OSHASH around, or purge them and restrict further submissions. For most practical purposes PHASH is superior to both. It doesn’t let you verify files, but that’s not something we can really use hashes for currently either.

The number of scenes which have either an OSHASH or an MD5 but no PHASH is 1,438, which isn’t nothing, but it’s a vanishingly small percentage of all scenes (0.148%).

The advantage of getting rid of them is that we save a ton of database space (the scene_fingerprints table indexes alone are 8GB), and it’ll be easier to manage bad fingerprints since there are far fewer PHASHes.

I would say to drop MD5 and OSHASH. They both change when you transcode anyway and are unreliable (in my opinion).

Adding the ability for users to vote to remove bad PHASHes would help clean up the bad fingerprints on the PHASH side of things. You could also allow admins/mods to just remove them without a vote, similar to how we have it at TPDB.

One clear advantage of OSHASH is that you can match without generating a PHASH. I’m all for dropping MD5; OSHASH effectively does the same thing at a much cheaper cost and is significantly less prone to fluctuation.
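
For anyone wondering why OSHASH is so cheap: as I understand it, it is the OpenSubtitles-style hash, which adds the file size to 64-bit sums of the first and last 64 KiB of the file, so it only ever reads 128 KiB no matter how large the video is. A minimal sketch under that assumption (the function names and the handling of files smaller than 64 KiB are my own choices, not necessarily what any particular client does):

```python
import os
import struct

CHUNK_SIZE = 64 * 1024  # bytes read from each end of the file
MASK_64 = 0xFFFFFFFFFFFFFFFF  # keep the running sum within 64 bits


def _sum_words(buf: bytes) -> int:
    """Sum a buffer as little-endian unsigned 64-bit integers."""
    count = len(buf) // 8
    return sum(struct.unpack(f"<{count}Q", buf[: count * 8]))


def oshash(path: str) -> str:
    """Return the hash as 16 hex digits: size + head sum + tail sum (mod 2^64)."""
    size = os.path.getsize(path)
    if size == 0:
        raise ValueError("cannot hash an empty file")
    # Assumption: for small files, shrink the chunk to a multiple of 8 bytes.
    chunk = min(CHUNK_SIZE, (size // 8) * 8)
    with open(path, "rb") as f:
        head = f.read(chunk)
        f.seek(size - chunk)
        tail = f.read(chunk)
    total = (size + _sum_words(head) + _sum_words(tail)) & MASK_64
    return f"{total:016x}"
```

MD5, by contrast, has to read every byte of the file, and PHASH has to decode video frames, which is why matching on OSHASH first is attractive. It also means any change to the file size or to either end of the file (as happens with a transcode) produces a different hash.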