Advanced YAML scraping

tl;dr

selectors and opengraph

The XPath cheatsheet is one of the best references for XPath.

XPath testing can be done locally in-browser, no extensions needed. Enter dev mode (F12) and, in the Elements tab, hit Ctrl + F. A search box pops up that allows finding by selector, string, or XPath. Plug your query in and it will validate it and highlight the results.

Selecting an element in the browser and copying the full XPath is not usually very helpful. Fortunately, you can use uBlock Origin instead: right click, choose “Block Element”, and a pop-up will appear that lets you hover over and select elements using a simplified selector you can later convert to XPath.
Just remove the two preceding ## from the selector, as that’s the filter syntax used by AdGuard (and uBO); what remains is a plain CSS selector.
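For instance, uBO’s picker might hand you a (hypothetical) filter like

example.com##div.scene-info > h1

Dropping the ## leaves the CSS selector div.scene-info > h1, which translates to an XPath along the lines of

//div[contains(@class, "scene-info")]/h1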

opengraph

Websites LOVE SEO, and one of the ways they chase it is by supporting OpenGraph. Let’s exploit it.

In dev mode, head to Elements and open up the <head> element. Look for <meta> blocks. These usually contain OpenGraph descriptions and machine-readable properties, already pre-formatted for our scraping needs. They can be selected with the following scraper syntax:

/html/head/meta[@property="og:title"]/@content
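As a sketch of how these slot into a scraper (the og: properties shown are common ones; check which your target site actually emits):

Title:
  selector: /html/head/meta[@property="og:title"]/@content
Details:
  selector: /html/head/meta[@property="og:description"]/@content
Image:
  selector: /html/head/meta[@property="og:image"]/@content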

If you can’t seem to find any obvious ones, go back to the search bar (Ctrl + F) and start typing the text content that you want to scrape (title, description, studio) and see if there are other, cleaner results that can be scraped from instead.

There are also ld+json and __NEXT_DATA__, but those require regex (covered below).

regex (and unholy sins)

regex is tricky; https://regexr.com/ and https://regex101.com/ are great resources for testing regex patterns.
Keep in mind that the postProcess replace: is more of a substitution. Anything not matched by the pattern will remain in the string, so match everything if you want it wholly gone.
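To illustrate with a hypothetical <title> of “My Scene | Example Studio”: the replace below only matches the suffix, so “My Scene” survives untouched.

Title:
  selector: //title/text()
  postProcess:
    - replace:
        # matches only " | Example Studio"; the rest of the string stays
        - regex: \s\|\s.*$
          with: ""

Conversely, to keep only part of the string, the pattern has to consume the whole string, as in the ld+json example below.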

ld+json

ld+json is a really popular and powerful way to get metadata and is already used in multiple scrapers.

Unfortunately, this requires us to commit some unholy sins.

Here’s my shortcut to sinning:

Title:
  selector: //script[@type="application/ld+json"]/text()
  postProcess:
    - replace:
        - regex: .*"fieldName":\s?"([^"]+)".*
          with: $1

Throw that in regexr and we can see that it matches the value of the fieldName property and captures it into $1. This has a lot of the same benefits as OpenGraph, but since it’s embedded in JSON it’s much harder to parse.
If you are reusing the selector for multiple properties, throwing it into common can save you a lot of wasted characters down the line.
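A sketch of what that might look like, keeping the fieldName placeholder from above ($ldjson is an arbitrary name; common entries are referenced with a $ prefix in selectors):

common:
  $ldjson: //script[@type="application/ld+json"]/text()

Title:
  selector: $ldjson
  postProcess:
    - replace:
        - regex: .*"fieldName":\s?"([^"]+)".*
          with: $1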

__NEXT_DATA__

This is where it gets weird and wacky. Next.js is a framework used by a few sites, and scraping it is very similar to ld+json. The main difference is that you’ll want to disable JavaScript in your browser for that site in order to see the page in most of its glory and to properly search it. __NEXT_DATA__ usually contains all the metadata displayed on the site, but it is usually nested multiple levels of JSON deep.

AD4X is a great example of scraping __NEXT_DATA__.

If we take a look at the regex:

.+"content":{(?:[^}]+)"publish_date":"(\d{4}\/\d{2}\/\d{2}).+

there is one key difference: "content":{ and [^}]+. This anchors the match inside the content: object and matches everything within it (no closing braces) until publish_date is reached. __NEXT_DATA__ is usually minified, so throwing it into a JSON prettifier really helps.
The reason for anchoring on the content block is that __NEXT_DATA__ includes everything, including data for other scenes (recommendations, etc.). Be very wary of sub-blocks and arrays when scraping __NEXT_DATA__; it’s much trickier than ld+json.
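Put together, a Date field built on that regex might look something like this (a sketch: the script id selector and the Go-style parseDate layout follow the usual Stash conventions, but verify against the actual scraper):

Date:
  selector: //script[@id="__NEXT_DATA__"]/text()
  postProcess:
    - replace:
        - regex: .+"content":{(?:[^}]+)"publish_date":"(\d{4}\/\d{2}\/\d{2}).+
          with: $1
    - parseDate: 2006/01/02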

troubleshooting

Disable JavaScript
Many sites have frontend loaders, and these can transform the HTML client-side. Stash does not see those changes unless useCDP is enabled, which is slower, RAM-intensive, and not as accessible. Disabling JavaScript and scraping the site’s raw elements usually does the trick.
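If disabling JavaScript gets you nowhere and you genuinely need the rendered page, the driver block is the escape hatch (with the costs noted above):

driver:
  useCDP: true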