JAV English scraper, how do I get it to work?

I just spent 16 hours crawling through various pages and discussions and trying everything. The only scrapers that work are StashDB and JAVstash (which, to my understanding, is more like a metadata warehouse), which is not ideal because StashDB has so few JAV titles and mixes English and Japanese. The other is Japanese only. I tried every built-in community scraper and none of them work; they return errors like “dial tcp: lookup www.javlibrary.com: no such host”. Then I tried installing skScraper, which I don’t understand well; no new scraper option showed up.
I’m at my wit’s end here.

The very least I want is just a cover image and the title in English.

I think the current architecture for scraping metadata through the methods already implemented in Stash has run its course and probably needs to be revisited. You are at the mercy of the web host’s or cloud provider’s security, which is a big no-no.

People can suggest StashDB, but I have no interest in using that feature. For users like us, there needs to be a more modular solution. I have got around this in the past by using local scripts and passing the browser session key. Perhaps I (or somebody, anyway) could lay the groundwork for a modular plugin that passes this key in to scrape sites. This should bypass Cloudflare.
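A rough sketch of the session-key idea in Python, using only the standard library. The function name `build_session_request` and the cookie names are made up for illustration; the assumption is that you copy the cookies and the exact User-Agent string out of the browser that already passed the site's check. Whether a site accepts this depends on its protections, so treat it as a starting point rather than a guaranteed bypass.

```python
import urllib.request


def build_session_request(url, cookies, user_agent):
    """Build a GET request that reuses cookies copied from a browser session.

    `cookies` is a dict of cookie name -> value (hypothetical names here).
    `user_agent` should match the browser the cookies were copied from.
    """
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return urllib.request.Request(
        url,
        headers={
            "Cookie": cookie_header,
            "User-Agent": user_agent,
        },
    )


# Example (placeholder values, not a real session):
# req = build_session_request(
#     "https://example.com/scene/123",
#     {"cf_clearance": "value-copied-from-browser"},
#     "Mozilla/5.0 ...same string as the browser...",
# )
# html = urllib.request.urlopen(req, timeout=30).read()
```

Building the request separately from sending it keeps the cookie handling easy to check without touching the network.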

To be honest, though, I haven’t really sat down and thought about it for a long time. This is merely an idea I’ve thought about in the last hour, but the problem is certainly real and it’s definitely time to rethink a solution.

JavLibrary_python works, but it requires setting up a FlareSolverr instance.

JavDB scraper works: https://scrape.feederbox.cc/scene?id=mYpUybMB
JAVDatabase scraper works: https://scrape.feederbox.cc/scene?id=6Rt8QMWh (requires valid User-Agent to be set in Stash)
Not sure which others you tried.

You can also scrape directly from R18.dev, who offer dumps of their metadata at https://r18.dev/dumps. You can then use the local Stash scraper from the JAV Stash admin to scrape against it.

I’m sure he means in the UI, which doesn’t work.

It’s more that scrapers requiring no user input are becoming rarer. Another thing we can thank LLMs for. Before, anti-scraper measures were an afterthought, as hobbyist scrapers weren’t a threat. Now that multiple LLM-powered bots mass-scrape the whole internet on a daily basis, it can’t be an afterthought: having no protection can lead to them effectively DDoSing you and, in some cases, to expensive bills.

Yeah, I’m well aware of the cause, but we need a solution. AI and scrapers aren’t going anywhere anytime soon. From some light reading, we might be able to use Playwright as a plugin. But it’s all guesswork for me for now.

Looks like it needs a valid User-Agent. Updated my comment.

I’ve documented it before, but it’s basically a range, from least to most protected:

  • No protections
  • Basic Referer/User-Agent bypass
  • IP rep block
  • TLS impersonation
  • simple CF challenge (cloudscraper)
  • advanced CF challenge (flaresolverr)
  • new advanced CF challenge (byparr)
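To figure out where a site sits on that range, a scraper first has to notice that it got a challenge page instead of content. A minimal sketch in Python; `looks_like_cf_challenge` is a made-up helper, and the checks (the `cf-mitigated: challenge` response header, or a Cloudflare-served 403/503 whose body contains "Just a moment") are heuristics that I'm assuming still hold, not a complete detector.

```python
def looks_like_cf_challenge(status, headers, body=""):
    """Heuristically decide whether a response is a Cloudflare challenge page.

    status:  HTTP status code (int)
    headers: dict of response headers (any casing)
    body:    response text, optional
    """
    # Header names are case-insensitive, so normalise before looking them up.
    h = {k.lower(): str(v).lower() for k, v in headers.items()}

    # Cloudflare marks challenge responses with this header.
    if h.get("cf-mitigated") == "challenge":
        return True

    # Fallback heuristic: Cloudflare-served error with the interstitial title.
    served_by_cf = "cloudflare" in h.get("server", "")
    return served_by_cf and status in (403, 503) and "just a moment" in body.lower()
```

A scraper could use a check like this to decide when plain requests are enough and when it has to hand the URL to something heavier.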

solutions:

CDP (Chrome DevTools Protocol) lies above TLS impersonation but below byparr/flaresolverr. Automation tools like CDP, Playwright and Selenium are all detectable; see ultrafunkamsterdam/undetected-chromedriver on GitHub for previous attempts at an undetected Selenium webdriver.

The ultimate solution is

  • residential SOCKS5 UDP proxy ($3/GB)
  • up-to-date TLS impersonation
  • fallback on byparr/flaresolverr for challenges
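The fallback step can be sketched like this in Python, assuming a FlareSolverr instance on its default local port (8191) and its `/v1` endpoint with the `request.get` command. The function names are made up, and the proxy and TLS impersonation parts are not shown.

```python
import json
import urllib.request

# Assumption: FlareSolverr is running locally on its default port.
FLARESOLVERR_URL = "http://localhost:8191/v1"


def build_flaresolverr_payload(url, max_timeout_ms=60000):
    """JSON body asking FlareSolverr to fetch `url` in its headless browser."""
    return {"cmd": "request.get", "url": url, "maxTimeout": max_timeout_ms}


def parse_flaresolverr_response(data):
    """Return (html, cookies, user_agent) from a decoded FlareSolverr reply."""
    if data.get("status") != "ok":
        raise RuntimeError(f"FlareSolverr failed: {data.get('message')}")
    solution = data["solution"]
    # Cookies come back as a list of objects; flatten to name -> value.
    cookies = {c["name"]: c["value"] for c in solution.get("cookies", [])}
    return solution["response"], cookies, solution.get("userAgent")


def fetch_via_flaresolverr(url, endpoint=FLARESOLVERR_URL):
    """POST the request to FlareSolverr and parse what it returns."""
    body = json.dumps(build_flaresolverr_payload(url)).encode("utf-8")
    req = urllib.request.Request(
        endpoint, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=90) as resp:
        return parse_flaresolverr_response(json.load(resp))
```

The returned cookies and User-Agent can in principle be reused for follow-up requests until the session expires, which keeps the heavy browser out of the hot path.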

Playwright/Selenium will not work for CF “under attack” Turnstile checks.

Nice to see it’s on your radar, but I’m assuming this is mostly with automation in mind? While I know that’s the goal for anything these days, I don’t think people would mind a manual option as long as the clicks are minimal. Like I mentioned before, passing a session key or browser session to impersonate could be a decent workaround.

Browser sessions/session keys are ephemeral (on the scale of 30 min to 1 h), so yes, complete automation is preferred.

TLS impersonation should be quite light and fast (negligible memory/binary increase) compared to the 300 MB of disk for chromedriver/byparr and 500 MB to 1 GB of memory usage.

I’ll back off for now then. You seem to have a much deeper understanding than I do. I appreciate you looking into it.