Using TPDB as primary source for StashDB scene submissions

Is TPDB allowed to be used as the primary (or only) source when submitting scene data to StashDB?

(My guess: the official studio site, potentially including archived snapshots, should be used as the primary source for scene data whenever possible. The exception is that if the studio site no longer exists or doesn’t have the scene anymore, TPDB could be used as a source, and the edit comment should explain why this was necessary.)

I’d say no. We don’t want StashDB to become a copy of TPDB. Their standards for metadata are lower.

To quote from the Stash Docs:

TPDB has significantly more scenes than StashDB thanks to their automated scrapers, but their info isn’t always as complete or accurate compared to StashDB’s manually curated approach.

Anecdotal, but it seems like most of the time I’ve encountered bad data, TPDB was involved.

Most of the time, incorrect data will revolve around performers. The nature of automated scrapers makes performers kind of a pain point. The main scene data is normally taken directly from first-party sources and is, I would say, solid most of the time. If things change after the fact, that’s not something we can really be blamed for; we need to wait until someone brings it up before we can manually modify it or rerun the scrapers.

This is not me agreeing that TPDB should be used as a first party source, as I don’t think it should, just defending my fellow TPDB guys a little :stuck_out_tongue:

My 2c: Primary sources should always be used when available, following the same rules/guidelines as Split Movies:

An archived page (from the Wayback Machine, for example) is almost always the best primary source for info. A link to an archive of the scene’s page may be used as the studio link as well, but if your source is an archive of a different page (home page, tour page, performer profile, list of scenes, etc.) then it should only be used in your edit comment. If no archived page can be found, info may be sourced from third party databases (IAFD, Data18, etc.) instead.

and taking the priority from Scene Tags:

The primary source for manually curating a scene’s tags is the video itself. Other sources may be used as well, but the video supersedes any other conflicting sources.

Just be sure to mention in your edit comment what your source is and try not to overrule a source with a higher level of authority than yours.

(emphasis mine)

To counter this theory: mostly it’s casing or performer-related issues.

To double down on this theory - Data18 as a primary source is far more annoying!

Requiring a primary source is tricky, because often that’s easier said than done. Scrapers break, paywalls go up, more studios start geoblocking… it’s a lot to keep track of. And if the scene’s been taken down, or if the website simply doesn’t exist anymore, you’d better hope you can find a snapshot on the Wayback Machine.

So unfortunately, cobbling together scraps from secondary sources like IAFD, Data18, Indexxx, and (yes) TPDB could be your best option. Depending on the situation, any one of them could be more useful than the others. Each one has its own strengths, weaknesses, and blind spots, so it’s going to be different for every scene and studio.

(Sidenote: We’ve discussed using the Ministry category here to track quirks for particular studios. Maybe it would be worth doing the same for secondary sources too. Strengths vs. weaknesses, tips for finding things quickly, quirks to look out for, that sort of thing.)

I don’t think we’ve ever established a firm consensus on this sort of thing, unfortunately. If the studio link is publicly accessible and easily scraped, then yes, I think it’s fair to expect that to be used as the primary source in a scene creation. And seeing those scene creations use a secondary source alone (typically TPDB) is frustrating when you can see all the eccentricities that introduces (weird casing, altered URLs, undersized covers, missing tags, etc.), especially when you know fixing those could be as easy as clicking a single button.

But is that enough to justify a downvote? I’m not so sure. There’s no explicit guideline that’s being violated here. Maybe it falls under the umbrella of a “low-effort” submission. But like I said earlier, there’s a lot to keep track of. Are you sure they’re blindly scraping TPDB because they’re lazy? Or is there something else (broken scraper, paywall, geoblock) preventing them from scraping the studio directly? Because if it turns out there was no better option, then a downvote would be counter-productive. Do we expect OPs to know the difference? Do we expect voters to?

For me, I’ll only downvote if I can tell the data doesn’t match the primary source and they don’t say where the data came from instead. Because then the violation is actually an insufficient comment, not the use of a secondary source or a general lack of effort.

But if I know the studio link is scrapable and they comment something simple like “Scraped from TPDB” without any other explanation? Then I’ll leave an Abstain vote (ensures I get notifications), list any mismatched data I’ve noticed, and point out how scraping the studio link would’ve easily fixed all of that. Basically, I’ll resort to nagging them instead. That way it doesn’t block the edit with a downvote, doesn’t assume it’s either laziness or ignorance from the OP, provides a list of requests that are immediately actionable, and hopefully encourages higher quality edits moving forward.
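
As a rough sketch only, the voting heuristic described in the last two paragraphs could be written out as below. This is not an official StashDB rule or API: `SceneEdit`, its fields, `Vote`, and `suggest_vote` are all hypothetical names made up for illustration, and the final fallback branch is an assumption, since nothing above prescribes a vote for the remaining cases.

```python
from dataclasses import dataclass, field
from enum import Enum


class Vote(Enum):
    DOWNVOTE = "downvote"
    ABSTAIN = "abstain"
    JUDGMENT_CALL = "judgment call"  # assumption: left to the voter


@dataclass
class SceneEdit:
    """Hypothetical summary of what a voter can tell about an edit."""
    source_named: bool              # edit comment says where the data came from
    explains_why_not_primary: bool  # e.g. paywall, geoblock, broken scraper
    studio_link_scrapable: bool     # voter knows the studio page can be scraped
    mismatches: list[str] = field(default_factory=list)  # fields differing from the studio page


def suggest_vote(edit: SceneEdit) -> tuple[Vote, str]:
    """Return a suggested vote and a short comment for the edit."""
    # Data differs from the primary source and no source is given:
    # the problem is an insufficient edit comment.
    if edit.mismatches and not edit.source_named:
        return Vote.DOWNVOTE, "Data does not match the studio page and no source is given."

    # Secondary source named without explanation, although the studio link
    # is scrapable: abstain and list actionable requests instead of blocking.
    if edit.studio_link_scrapable and edit.source_named and not edit.explains_why_not_primary:
        listed = ", ".join(edit.mismatches) if edit.mismatches else "none noticed"
        return Vote.ABSTAIN, f"Mismatches: {listed}. Scraping the studio link would fix these."

    # Everything else is a matter of the voter's own judgment.
    return Vote.JUDGMENT_CALL, "No clear guideline applies; use your best judgment."
```

For example, an edit commented only with “Scraped from TPDB” that has wrong title casing, on a studio whose page is scrapable, would come back as an Abstain with the mismatch listed in the comment.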

Hopefully that’s helpful. I usually try not to dictate when voters “must” vote one way or another. Too much of the edit queue isn’t so black-and-white, requiring voters to be more flexible and use their best judgment. So without a hard guideline to fall back on, these are just my own thoughts on the matter. Downvoting as “low-effort”, upvoting because it’s close enough, or (respectfully) annoying them in the comments: to me, that’s all a matter of opinion and personal preference.

In my experience, these types of edits are generally low-quality: they are mostly missing tags and the male cast, and they are submitted without any verification.
For me, they are welcome if additional work has been done, such as verifying the tags or completing the cast from the video, and I would also make it mandatory to include the TPDB link.
It makes no sense to submit an edit and not include the source of the scrape, if only so that voters can verify it.

Since I just saw it: now that TPDB scrapes FansDB, some of those scenes are making it into the queue. We need to be extra critical of TPDB-only sources.
