Improving scanning / identifying performance for a large collection

Hi, I’ve been using Stash for a few weeks and I’m slowly getting the hang of it. I’ve been testing and learning how to use Stash and its plug-ins, and I’m ready to feed it a large collection of ~100 TB with maybe 100k files. Before I do, I want to find out more about how it works so I can optimize performance.

Right now when I bring in files, I go to Settings > Tasks > Scan, after which I go to Settings > Tasks > Identify. During the scan I only have Stash generate scene covers and phashes, which takes a long time.

Generating phashes seems to be the most time-consuming part of the process. I’m assuming it is fairly processor- and disk-read-intensive. Is the phash function optimized for multi-core / multi-threaded processors? If so, is there a point of diminishing returns for core / thread count? Right now I’m running it on a spare 8th-gen Intel box. I have some other spare PCs around, and I’m wondering if it’s worthwhile to throw a 12- or 16-core processor at it since I’ll be processing a large library. For the Identify part of the task, all Stash does is send the phash to StashDB and fetch data if a match is found, right? Modern home broadband shouldn’t bottleneck the Identify step?

The other question is whether it makes more sense to bring in scenes in batches. Right now the library is on a NAS, and Stash is running on a spare PC. If I want to start adding this whole ~100 TB collection, does it make sense to move a few TB at a time to the machine running Stash, have Stash run the scan / identify / rename / organize tasks off a local SSD, and once done, move them back to the NAS? If so, how should I set up the library, and what would the workflow look like?

Thanks

PHashing might not be absolutely required for matching against stash-boxes like StashDB. If the files are unmodified (not compressed, resized, or otherwise re-encoded) from a commonly used source (like, of course, the original studio site you bought them from), they might match by file hash alone, which is much, much quicker to calculate and is part of the usual scanning process anyway. So one approach would be to do a scan without phashing first, and then try to match.
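For context on why the file hash is so much cheaper: Stash’s default file hash (oshash) is based on the OpenSubtitles hash, which only reads the first and last 64 KiB of a file instead of decoding the whole video. A minimal Python sketch of the classic OpenSubtitles variant (Stash’s exact implementation may differ in details):

```python
import struct

CHUNK = 64 * 1024  # 64 KiB read from each end of the file

def os_hash(path: str) -> str:
    """OpenSubtitles-style hash: file size plus the sums of the
    first and last 64 KiB read as little-endian uint64s, mod 2**64."""
    with open(path, "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        if size < CHUNK:
            raise ValueError("file too small to hash")
        total = size
        for offset in (0, size - CHUNK):
            f.seek(offset)
            buf = f.read(CHUNK)
            # 8192 unsigned 64-bit little-endian integers per chunk
            for (value,) in struct.iter_unpack("<Q", buf):
                total = (total + value) & 0xFFFFFFFFFFFFFFFF
    return f"{total:016x}"
```

Two 64 KiB reads per file means even a 100k-file library can be hashed this way in a fraction of the time phashing takes.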

There is a setting somewhere where you can configure the number of parallel tasks; it might be worth increasing that, maybe up to your CPU core count, to maximize CPU utilization by running multiple jobs side by side. And obviously faster CPUs help, but keep in mind that more cores are not necessarily faster (a current 8-core Ryzen runs circles around a 3000-series 16-core Ryzen).

Perceptual hashing is essentially playing back the video as quickly as possible and feeding the frames through the hash function. So there are three components: reading the file from storage, decoding it, and calculating the phash. The slowest of these three steps will bottleneck the entire process. So finding out what your bottleneck is is key: monitor CPU utilization, network throughput, raw NAS read speeds, maybe even the NAS’s own CPU utilization.
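To illustrate why the hashing step itself is cheap compared to decode, here is a toy average hash (aHash) in plain Python. Stash’s real phash is a DCT-based perceptual hash computed over frames extracted by ffmpeg, but the principle is the same: visually similar frames produce hashes with a small Hamming distance, while different frames diverge.

```python
def average_hash(frame):
    """Average hash (aHash): threshold each pixel of a small
    grayscale frame against the frame's mean brightness."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return [1 if p > mean else 0 for p in pixels]

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return sum(x != y for x, y in zip(a, b))

# Toy 8x8 "frames": a brightness gradient, the same gradient slightly
# brightened (visually near-identical), and its inverse (very different).
frame = [[r * 8 + c for c in range(8)] for r in range(8)]
brighter = [[p + 1 for p in row] for row in frame]
inverted = [[63 - p for p in row] for row in frame]
```

The brightened frame hashes identically (distance 0), while the inverted frame flips every bit (distance 64); the hash itself is a handful of arithmetic operations per frame, which is why decode and storage reads usually dominate.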

For storage: even if you transfer files to a local SSD first, you have to read them from the NAS once either way, so you might as well let Stash read them directly. I’d explore options to make NAS access faster instead. Does your NAS support faster Ethernet speeds like 2.5G, 5G, or even 10G? If the network connection is the bottleneck, this can speed things up a lot (even 2.5G is already a huge improvement; 1G usually caps out around 100 MB/s over SMB). It only makes sense, though, if the NAS itself can sustain those speeds (which usually requires a lot of spinning disks, a decent CPU, and no internal bottlenecks like slow decryption - looking at you, Synology!).
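To put rough numbers on the network bottleneck, here is a back-of-the-envelope sketch. The throughput figures are assumptions about typical sustained SMB speeds, not measurements of any particular NAS:

```python
def days_to_read(total_bytes: float, bytes_per_second: float) -> float:
    """How long pure sequential reading takes, in days."""
    return total_bytes / bytes_per_second / 86_400

library = 100e12   # ~100 TB library
gig1 = 110e6       # assumed sustained SMB throughput on 1G Ethernet
gig25 = 280e6      # assumed sustained SMB throughput on 2.5G Ethernet

print(f"1G:   {days_to_read(library, gig1):.1f} days")
print(f"2.5G: {days_to_read(library, gig25):.1f} days")
```

Under those assumptions that is roughly 10.5 days of pure reading at 1G versus about 4 days at 2.5G, before decode or hashing even enter the picture.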

For decode: 8th-gen Intel should support Quick Sync. I’d look into whether there is a way to get phashing / ffmpeg to use the dedicated video-decode hardware in the CPU.
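As a standalone sanity check outside of Stash, you can compare software and Quick Sync decode throughput with ffmpeg directly. `sample.mp4` is a placeholder for one of your own files, and the `qsv` decoders require an ffmpeg build with Quick Sync support:

```shell
# Software decode benchmark: decode as fast as possible, discard output
ffmpeg -benchmark -i sample.mp4 -an -f null -

# Same file through the Quick Sync hardware H.264 decoder
ffmpeg -benchmark -hwaccel qsv -c:v h264_qsv -i sample.mp4 -an -f null -
```

If the second run is dramatically faster and your CPU utilization drops, decode was a meaningful part of your bottleneck.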

Sorry this is a bit unsorted, hope it gives you some starting points. 100 TB is definitely going to be a big project to organize into a neatly tagged collection.


phash generation is sequential and single-threaded, and like spaceyuck said, it’s also bottlenecked by IO. You’ll have to check your iowait to see whether moving files locally would reap any benefits.
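On Linux, a quick way to watch that while a scan runs is `iostat` from the sysstat package; sustained high `%iowait` with low CPU usage points at storage or network rather than the processor:

```shell
# Extended device stats every 5 seconds: watch %iowait in the CPU line
# and %util / await on the device holding (or mounting) the library
iostat -x 5
```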

you can look towards this issue to add hardware acceleration, which increases the speed tremendously. Unfortunately it is also extremely fragile so I can’t offer much advice on how to implement it properly.

Alternatively, there are projects like darklyfter’s videohash node that will let you process off-device.


My general strategy for this is to turn off as many of the “on scene add” items as I can: no thumbs, VTTs, etc.; just get the files into the library and hashed. I focus on setting my scan up to run as quickly as my hardware will allow it. My bottleneck is my array, which is all on spinning metal.

From there, I run batches of the generate tasks as I go (as in JUST generating scene covers and thumbs, for example). You can queue these up and just let it toil through them.

I also recommend checking your plugins; on large imports each one can add a lot of tasks to the queue that make processing scenes take even longer. On major imports it may be worth looking at which plugins you can turn off and back on when done.

My advice is very much tailored to my setup and my skill and willingness to deal with edge cases. There may be better ways to do this, but for me this has proven effective.


This is all very helpful. Much appreciated.

I’ve got an old 12-bay Synology and I’m maxed out at a 1 Gbps connection, so the network is my bottleneck.

Right now Stash is running on an i7 8700, but I could repurpose an AMD 5700G or 5900X.

The multi-node script looks very interesting. I’m already running Tdarr on 3 nodes, so I could use them for Stash, assuming I can figure out how to set it up. Unfortunately I’ve transcoded all of the files so far, so they all have to be phashed; file hash alone won’t work.

I’ve also turned off all the generate tasks except cover and phash, and the only plugin I’m running is the renamer tool that organizes the files. I might turn that off too, since it is adding duplicates to the file names and I have to figure out how to fix that.

Thanks for all the good info so far. Much appreciated.

Also, incomplete or corrupted files take forever to “process”. Identifying and removing those files from my library was the single biggest improvement to scanning speed.


I am running into the same issue here. I am currently “identifying” my entire library. I am using torrent files as my source and only adding tags, so no web sources are involved; I hope it isn’t some sort of hard-coded rate-limiting to prevent DDoSing. Stash is processing one file at a time and each is taking ~3 seconds.

Sounds like you didn’t generate phashes beforehand and it’s generating them as they’re being scraped. If you generate phashes first, it will take much less time between iterations.

I generated phashes while I was importing/scanning everything into Stash originally. This is a second “identify” pass I am doing against the torrent files strictly to update tags that are contained in the .torrent file metadata.

If you’re using the torrent scraper to read the torrents, then you’re not hitting a limitation of Stash but of Python and bencode. If you use Cython you’ll see much better speeds with bdecode.
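For context on why that parsing step can dominate per-file time, a minimal pure-Python bencode decoder looks like this (a sketch of the format, not the scraper’s actual code). Every byte is inspected in interpreted Python, which is exactly the overhead a Cython/C implementation avoids:

```python
def bdecode(data: bytes):
    """Decode a bencoded value: integers (i42e), byte strings
    (4:spam), lists (l...e), and dicts (d...e)."""
    def parse(i):
        c = data[i:i + 1]
        if c == b"i":                       # integer: i<digits>e
            end = data.index(b"e", i)
            return int(data[i + 1:end]), end + 1
        if c == b"l":                       # list: l<items>e
            i += 1
            items = []
            while data[i:i + 1] != b"e":
                item, i = parse(i)
                items.append(item)
            return items, i + 1
        if c == b"d":                       # dict: d<key><value>...e
            i += 1
            d = {}
            while data[i:i + 1] != b"e":
                key, i = parse(i)
                val, i = parse(i)
                d[key] = val
            return d, i + 1
        colon = data.index(b":", i)         # byte string: <len>:<bytes>
        length = int(data[i:colon])
        start = colon + 1
        return data[start:start + length], start + length

    value, _ = parse(0)
    return value
```

A real .torrent file is one large bencoded dict, so a library-wide identify pass ends up running this kind of recursive descent over every file’s metadata.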