Automated Mass Organization

I have a pretty big collection at this point (~60k scenes), and I’m trying to set up a way to automate the mass organization of scenes. I have about 11k that have been scanned in but not organized. I know there are some built-in tools for this, but they aren’t quite what I want. I have done some mass automation in the past using Selenium scripts that scrape via essentially the UI, but it’s a little weird using Selenium on something that runs more or less locally, and it’s pretty fragile.

I’m comfortable with coding, so I guess my main question is whether there is a way to do what the ‘scrape’ button does programmatically, comparing phashes against stashdb or tpdb using my API key and everything. This way I can just get a response, enter the metadata into my SQL database directly, and be done with it. I’m a little concerned about scene covers using this method, though. Anyway, any guidance would be appreciated.

Everything Stash does on the frontend is done programmatically using GraphQL. In the browser, you can use Developer Tools > Network tab to inspect what it does.

Check out the API documentation/playground for more details.

As @DogmaDragon mentioned, there’s the GraphQL API that gives you a lot more freedom to do what you want.

I’m more partial to SQL though. There’s nothing stopping you from using a SQLite client to connect to the database and bulk update records. It’s a lot faster too. Having said that, if you’re not familiar or comfortable with SQL I’d steer clear – there are absolutely no guardrails, and the potential to corrupt or lose all your (meta)data is high.
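To make the trade-off concrete, here is the kind of bulk update this means, run against an in-memory database with a simplified, made-up schema. The real Stash schema is more complex and the table/column names here (`scenes`, `organized`) are assumptions; inspect the actual schema and back up your database file before touching anything.

```python
import sqlite3

# Demo against an in-memory database with a simplified, assumed schema;
# the real Stash schema differs, so inspect it first and always work on
# a backup copy of the database file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE scenes (id INTEGER PRIMARY KEY, title TEXT, organized BOOLEAN)"
)
conn.executemany(
    "INSERT INTO scenes (title, organized) VALUES (?, ?)",
    [("Scene A", False), ("Scene B", False), ("Scene C", True)],
)

# Bulk update: mark every titled, unorganized scene as organized in one
# statement. This is the speed win over going through the API -- and also
# where the danger lies, since there is no validation and no undo.
updated = conn.execute(
    "UPDATE scenes SET organized = 1 WHERE title IS NOT NULL AND organized = 0"
).rowcount
conn.commit()
```

One statement touches every matching row at once, which is why it outruns per-scene API calls – and why a wrong `WHERE` clause is catastrophic.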


To add to that, the Stash API supports an execSQL mutation which can be used to run raw SQL too.
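For reference, a call to that mutation might be shaped like this. The mutation name comes from the post above, but the argument names (`sql`, `args`) and the result field are assumptions to confirm in your instance’s GraphQL playground.

```python
import json

# Shape of an execSQL call as a GraphQL payload. The mutation name comes
# from the post above; the argument names and result field are
# assumptions -- confirm them in your instance's GraphQL playground.
EXEC_SQL = """
mutation ExecSQL($sql: String!, $args: [Any]) {
  execSQL(sql: $sql, args: $args) {
    rows_affected
  }
}
"""

def exec_sql_payload(sql: str, args: list) -> bytes:
    """Serialize the mutation and its variables for an HTTP POST body."""
    return json.dumps({
        "query": EXEC_SQL,
        "variables": {"sql": sql, "args": args},
    }).encode()

# Parameterized statement keeps values out of the SQL string itself.
body = exec_sql_payload(
    "UPDATE scenes SET organized = ? WHERE studio_id = ?",
    [1, 42],
)
```

This gets you SQL-speed bulk changes while still going through Stash, rather than opening the database file directly.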

Dogma kindly re-opened this topic. There are ways to bulk tag and organize at scale. I’m talking about methods that scale above 100,000 scenes. The methods involve coding and data management. However, mostly it requires discipline.

I’m back and I can provide more guidance on these methods. However, you may not like the answers. The methods will not help you organize your existing stash. The methods work for downloading and tagging content.

You may be better off deleting everything you have and starting over using new methods. This is where things get tough. The coding and technical stuff is actually easier. The hard part is overcoming the sunk cost fallacy.

I suggest you read this post from Ronnie711.

Note the part about nuking his 32,000 scene stash and starting again.
If you read on you will note his point on tagging performers.

I have bad news. There is no good automated way to reliably tag performers at scale. Simple methods that work by matching names as ASCII strings have limitations. All scrapers have this limitation.

In order to tag performers I had to start my own project to catalog and manage performers from all my studios (and others). This involved creating my own performer database. The initial cost was around $1000 on AWS and I need to spend another $1000 to handle the indexing.

I decided to manage performers using facial recognition. It’s not perfect but it can handle issues which are hard using ASCII.

I’m surprised by the amount of your AWS costs. Would you mind sharing the breakdown?

Thanks for all your responses! I haven’t used the Identify feature basically at all because I was uncomfortable with the accuracy. My current process, after implementing some code thanks to Dogma’s answers, is this:

  • Run my code, which fetches data from stashdb if available, falls back to TPDB if not, and runs some basic cleanup of titles and tags to my preferences.
  • The code logs any problems, e.g. no data on either source, missing performers, missing studios, that sort of thing.
  • I resolve errors entirely manually. The biggest offender is usually Bang scenes that come from old movie studios. I use iafd to hopefully get some semblance of ok data on them, and adultdvdempire for movie box art. Otherwise I just try my best to research when and where the scene comes from.
  • Then I go through the non errored scenes one by one and make sure what I’m looking at makes sense. Right studio, right performers, etc, just in case there was a mismatch.
  • Then and only then I mark them as organized and move on.
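The fallback part of the flow above can be sketched roughly like this. The scraper functions here are hypothetical stand-ins (real ones would call stashdb/TPDB); only the try-in-order logic and the cleanup hook are the point.

```python
# Sketch of the source-fallback step described above. The fetchers in
# `sources` are hypothetical stand-ins for real stashdb/TPDB scraper
# calls; each returns a metadata dict, or None when it has no match.

def clean(data: dict) -> dict:
    """Basic cleanup to taste -- here, just strip whitespace from titles."""
    if isinstance(data.get("title"), str):
        data["title"] = data["title"].strip()
    return data

def identify(scene_id: str, sources: list) -> tuple:
    """Try each metadata source in order; unmatched scenes fall through
    to the manual-resolution pile as (None, None)."""
    for name, fetch in sources:
        data = fetch(scene_id)
        if data is not None:
            return name, clean(data)
    return None, None

# Hypothetical sources: "stashdb" knows scene "a", "tpdb" knows scene "b".
sources = [
    ("stashdb", lambda sid: {"title": " Scene A "} if sid == "a" else None),
    ("tpdb", lambda sid: {"title": "Scene B"} if sid == "b" else None),
]
```

Everything that comes back `(None, None)` is exactly the error log from the second bullet, ready for the manual pass.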

Compared to before, when I manually went through and scraped each scene, this is much faster. For some reason I’m more disciplined about my porn collection than half the other aspects of my life, but here we are.
That said there are still plenty of errors I’ve found, but I’m one human doing manual processes, there will always be errors. I just try to correct them and move on.
I basically have four pieces of data I require of a scene for me to consider it done and organized: Title, Performer(s), Studio, and Date. I also try to grab the movie if it had a DVD release. The only one I truly have trouble with is Date. The sheer amount of bad data out there is mind-boggling.

I guess what I was mainly looking for was to make the starting point of the manual review as high as possible, so that the manual part is as quick and painless as I can make it.

The biggest offender is usually Bang scenes that come from old movie studios

Bang can be completely scraped and easily downloaded via script. It is possible to directly load metadata into groups (fka movies) from Bang metadata and automatically associate the movies with the scenes. You do not need, nor should you use, AdultDVDEmpire for Bang-distributed scenes.

AdultDVDEmpire can also be scraped, and you can keep your own database of all its metadata, but you need to know a few tricks. It’s been done. It is also possible to mass download from ADE, but how you do it depends on your sub and access.

But Bang is easy. I have posted scrapes of their metadata and downloaded their content and assets.

I may have posted the breakdown before on Discord, but here are rough estimates:

  • It costs 1/10 cent to add a photo to the database and include it in an index of face vectors.
  • It costs 1/10 cent to search an indexed face against all other faces.

That means for 100,000 images of one performer each, I have to pay 2 × 1/10 cent per image.
You can work out the cost for a few hundred thousand performers.
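Using just the per-image figures above, the arithmetic works out like this (in integer tenths of a cent to avoid float rounding):

```python
# Cost arithmetic from the figures above: 1/10 cent to index a face,
# plus 1/10 cent to search it, per image.
INDEX_TENTH_CENTS = 1
SEARCH_TENTH_CENTS = 1

def catalog_cost_dollars(images: int) -> float:
    """Total cost in dollars to index and search each image once."""
    tenth_cents = images * (INDEX_TENTH_CENTS + SEARCH_TENTH_CENTS)
    return tenth_cents / 1000  # 1,000 tenth-cents per dollar

# 100,000 images -> $200; scale linearly for larger catalogs.
```

So a few hundred thousand performers at one image each lands in the high hundreds of dollars, consistent with the ~$1000 figures mentioned earlier.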

The issue isn’t getting the Bang metadata, it’s that Bang metadata is terrible. If you’re lucky it’s something like Red Light District and just the date is way off, and it’s pretty easy to find a decent date on it. Sometimes it’s from a redistribution of a redistribution and I try to track down the source of the scene manually, to wildly varying degrees of success (There’s a redistribution studio, Defiant Films, that is the bane of my existence. There’s a Dillion Harper scene that I think was literally birthed from the ether fully formed for all I can find out about it). I know iafd and ade have their issues with metadata as well, but Bang is almost guaranteed to be wrong, as all their dates are just when Bang acquired the scene.