Use caution when adding studios' tags to scenes with existing tags: they may dilute the quality

Studios may use tags in ways that are incompatible with the StashDB definition of that tag, and some studios (TeamSkeet, probably others) have a bunch of tags that are flat out incorrect.

I’ve accepted that for new scenes or scenes without existing tags, it’s probably better to have the studio’s tags rather than none at all. But for scenes that do have existing tags in StashDB, those tags were often (can be checked via edit log notes) manually added by editors who checked the video content itself, and are therefore very likely to be accurate. Adding the studio’s tags risks changing a high quality set of tags to low quality.

Therefore, I strongly recommend that editors who are adding studio scraped tags to existing tags check whether the existing ones are based on scene content, and if so, only add the studio tags that they can verify exist in the video.

This post is a PSA, but I’d also like to open discussion on:

  • How do you handle these cases when voting on edits? Personally, when I see this type of edit: if there is a particular tag that is incorrect or looks highly suspicious, I vote No and note that tag (example). Otherwise, I may add a note without a vote, or just ignore the edit.
  • Which studios tend to have low quality tags? In my experience, Aylo studios are usually ~accurate except that their definitions of tags conflict with StashDB’s (see previously linked example). TeamSkeet studios often have objectively inaccurate tags.

Aside from the fact that the tag on the example is technically valid (the wording is Character, not performers), the general scope of the discussion is 100% right.

From a technical point of view, both the Aylo & TeamSkeet scrapers are in a position where we can point incorrect/vague tags to the preferred tag (The Aylo scraper already has this up and running).
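
As a rough sketch of what that kind of remapping could look like (this is not the actual Aylo or TeamSkeet scraper code; the `STUDIO_TAG_MAP` entries and the `remap_tags` name are made up for illustration):

```python
# Hypothetical mapping from a studio's tag wording to the preferred tag.
# A value of None means "drop this tag"; which tags to drop is an assumption.
STUDIO_TAG_MAP = {
    "brown hair": "Brown Hair (Female)",
    "petite": "Short Woman",
    "4k": None,
}


def remap_tags(studio_tags, tag_map):
    """Return the preferred tags for a list of studio tags, keeping order."""
    result = []
    for tag in studio_tags:
        cleaned = tag.strip()
        # Tags without a mapping pass through unchanged.
        preferred = tag_map.get(cleaned.lower(), cleaned)
        if preferred is not None and preferred not in result:
            result.append(preferred)
    return result


print(remap_tags(["Brown Hair", "petite", "4k", "Barefoot"], STUDIO_TAG_MAP))
```

A lookup like this only helps when the studio's wording is consistently off. It can't tell whether a tag actually applies to a given scene.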

The issue will be with things like TeamSkeet throwing the Creampie tag around as if it’s confetti or their use of ‘Japanese’ just because the description has Japanese food mentioned.

Tags are always going to be an issue because everyone has different tagging standards. I’m with AdultSun in trying to obliterate vague tags, but if I replace ‘Brown Hair’ with ‘Brown Hair (Female)’, ‘Brown Hair’ comes back because “I don’t use those, I want ‘Brown Hair’ back”.

By and large, I just ignore what’s on StashDB for tags these days, mainly because it’s either missing half the studio tags or someone’s dicked around with them. Far less stressful to scrape StashDB, rescrape studio, tidy and be happy. :grinning_face_with_smiling_eyes:

That seems like a different discussion, as Brown Hair isn’t objectively incorrect as long as Brown Hair (Female) is correct.

To clarify(?), I’m talking about cases where a tag is:

  • “incompatible with the StashDB definition”: it would be correct under another reasonable interpretation (e.g. Tease has a StashDB definition of “Erotic performance near the beginning of a scene, often while alone” but could otherwise be interpreted to include playfully teasing someone or teasing one’s hair), or
  • “flat out incorrect”: it is incorrect under any reasonable interpretation (e.g. Blowjob on a scene that has no oral sex)

Both ought to be avoided where possible.

It was more to highlight that there are multiple people working in multiple ways toward much the same end goal, rather than a direct reference to an incorrect tag :slight_smile:

I’ve handled this once, notably scripts/bot-edit/ts-pumps at main · feederbox826/scripts · GitHub

I’d like to think that it was handled quite well. Bot edit permissions were the hardest part to get; the manual override of tags wasn’t too bad.

Using your example where there are a handful of manual tags but the vast majority of studio tags are missing, I typically don’t worry about it too much. If I spot any inaccurate tags, I’d probably call it out with a comment but wouldn’t downvote the rest of the edit over it. Better to fix those issues with an update or a follow-up edit instead of having to re-create and re-submit the whole thing later.

For me, it’s a “pick your battles” situation. We know where these tags are coming from. And due to some of the limitations in the software, it’s still a lot easier to refine a list of tags that may not be completely accurate than to add all of those tags manually, one-by-one. So right now, I’m a lot more forgiving of one or two bad tags in a scraped list of 20 than an edit that removes or replaces a few tags that already seem good.

See my proposal for a possible long-term solution: [RFC] More Accurate Tags by Using Tag Namespaces/Sources · Issue #848 · stashapp/stash-box · GitHub

I think it’s possible to start by implementing that only for new scenes and then gradually updating older ones as necessary.

That is a good idea for a fresh stash-box instance. I just don’t see how it would meaningfully work with an already existing dataset.

Currently, tags in stash-box instances are modified by the existing alias system and the original tags are not stored, so recovering them would require a complete re-scrape of the source URL. As a result, it wouldn’t even be possible to remap existing tags to a new system.

For StashDB to really use it, there would have to be a purge of all tags for it to be effective.

Thank you for your feedback. I agree with you that purging all tags and re-scraping everything would be extremely inefficient. That would require a huge amount of work for little benefit, and I’m against that.

I’m trying to reduce the amount of effort required for importing tags, while increasing the quality of the dataset.

In any case, it’s only an idea that I wanted to share in case it could be useful. :grinning_face: Just wanted to be clear about what I mean by it.

Applying this idea to some newly-imported scenes doesn’t require changing the existing tags. I can illustrate that.

Existing scene

Imagine there’s an existing scene in StashDB with some tags:

Barefoot, Brown Hair, Ponytail

After implementing the idea, the scene’s tags would look like this (in the database):

Source | Tag
StashDB | Barefoot
StashDB | Brown Hair
StashDB | Ponytail

So, existing tags stay the same. It’s just that now the source of the tags is explicit. As you’ve pointed out, at this point it’s impossible to tell what the original tags were. But a user made an effort to tag the scene, and the edit was checked by a moderator, so it’s meaningful to say that StashDB is the source and the owner of this list of scene’s tags.
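
To make that concrete, here is a minimal sketch of how existing tags could be labelled with a source. The `SourcedTag` type and the `label_existing_tags` function are hypothetical names I'm using for illustration, not anything that exists in stash-box today:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourcedTag:
    """A tag together with the source that owns it (hypothetical model)."""
    source: str
    name: str


def label_existing_tags(tag_names, source="StashDB"):
    """Attach a source to tags that currently have none.

    The tag names themselves are left untouched.
    """
    return [SourcedTag(source, name) for name in tag_names]


scene_tags = label_existing_tags(["Barefoot", "Brown Hair", "Ponytail"])
for tag in scene_tags:
    print(f"{tag.source} / {tag.name}")
```

Nothing about the tag list itself changes here. Each entry only gains a field saying where it came from.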

Importing a new scene

It would still be possible to import scenes like before, but a new way would be possible too.

Imagine that a studio lists these tags for a scene:

4k, small tits, petite

Then they can be automatically transformed into this list in StashDB’s scene information:

Source | Tag
Cool Studio | 4k
Cool Studio | small tits
Cool Studio | petite

And let’s suppose that it would make sense to make “Cool Studio / petite” a sub-tag of “StashDB / Short Woman”. That would make it possible to find this scene by the tag “StashDB / Short Woman”, too.
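
A small sketch of how that lookup might work, assuming sub-tag relations are stored as a simple child-to-parent mapping. The `PARENT_OF` table and the `scene_matches` function are illustrative names only:

```python
# Tags are (source, name) pairs. Each sub-tag points at its parent tag.
# This single relation is the hypothetical one from the example above.
PARENT_OF = {
    ("Cool Studio", "petite"): ("StashDB", "Short Woman"),
}


def scene_matches(scene_tags, wanted, parent_of):
    """Return True if any scene tag is the wanted tag or a sub-tag of it."""
    for tag in scene_tags:
        current = tag
        seen = set()
        # Walk up the parent chain; `seen` guards against accidental cycles.
        while current is not None and current not in seen:
            if current == wanted:
                return True
            seen.add(current)
            current = parent_of.get(current)
    return False


cool_studio_scene = [
    ("Cool Studio", "4k"),
    ("Cool Studio", "small tits"),
    ("Cool Studio", "petite"),
]
print(scene_matches(cool_studio_scene, ("StashDB", "Short Woman"), PARENT_OF))
```

The scene is found under “StashDB / Short Woman” even though nobody added that tag to it directly.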

Then, a user of StashDB decides to add some more tags to the video. They add the “Blonde Hair (Female)” tag. Now, the list of tags looks like this:

Source | Tag
Cool Studio | 4k
Cool Studio | small tits
Cool Studio | petite
StashDB | Blonde Hair (Female)

Another thing to notice is that it’s also possible (but not required) to automatically import original tags for existing videos that way, appending them to the list of existing tags. That is, if there’s a need for that, this way of doing things allows that.
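
Roughly, appending a studio's original tags to an existing scene could look like the sketch below. The `append_imported_tags` function is a hypothetical helper, and tags are again plain (source, name) pairs:

```python
def append_imported_tags(existing, studio, studio_tags):
    """Append a studio's original tags after the existing tags.

    Existing entries are never modified or removed, and a studio tag that
    is already present is not added twice.
    """
    merged = list(existing)
    for name in studio_tags:
        entry = (studio, name)
        if entry not in merged:
            merged.append(entry)
    return merged


existing_tags = [
    ("StashDB", "Barefoot"),
    ("StashDB", "Brown Hair"),
    ("StashDB", "Ponytail"),
]
merged_tags = append_imported_tags(
    existing_tags, "Cool Studio", ["4k", "small tits", "petite"]
)
for source, name in merged_tags:
    print(f"{source} / {name}")
```

Because the manually added StashDB tags stay separate from the studio's, importing can't silently turn a curated list into an unverified one.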

How that could look in StashDB's UI

Maybe it would also be meaningful to show the mapped tags to the user instead, when applicable:

But that’s a whole other story. I don’t have all the answers.
