Haven VLM Connector

:placard: Summary Tag videos with Vision-Language Models using any OpenAI-compatible VLM endpoint.
:link: Repository https://github.com/stashapp/CommunityScripts/tree/main/plugins/AHavenVLMConnector
:information_source: Source URL https://stashapp.github.io/CommunityScripts/stable/index.yml
:open_book: Install How to install a plugin?

A Haven VLM Connector

A StashApp plugin for Vision-Language Model (VLM) based content tagging and analysis. This plugin is designed with a local-first philosophy, empowering users to run analysis on their own hardware (using CPU or GPU) and their local network. It also supports cloud-based VLM endpoints for additional flexibility. The Haven VLM Engine provides advanced automatic content detection and tagging, delivering superior accuracy compared to traditional image classification methods.

Features

  • Local Network Empowerment: Distribute processing across home/office computers without cloud dependencies
  • Context-Aware Detection: Leverages Vision-Language Models' understanding of visual relationships
  • Advanced Dependency Management: Uses PythonDepManager for automatic dependency installation
  • Enjoying Funscript Haven? Check out more tools and projects at Human Activity Valuation and Exploration Network · GitHub

Requirements

  • Python 3.8+
  • StashApp
  • PythonDepManager plugin (automatically handles dependencies)
  • OpenAI-compatible VLM endpoints (local or cloud-based)

Installation

  1. Clone or download this plugin to your StashApp plugins directory
  2. Ensure PythonDepManager is installed in your StashApp plugins
  3. Configure your VLM endpoints in haven_vlm_config.py (local network endpoints recommended)
  4. Restart StashApp

The plugin automatically manages all dependencies.

Why Local-First?

  • Complete Control: Process sensitive content on your own hardware
  • Cost Effective: Avoid cloud processing fees by using existing resources
  • Flexible Scaling: Add more computers to your local network for increased capacity
  • Privacy Focused: Keep your media completely private
  • Hybrid Options: Combine local and cloud endpoints for optimal flexibility

    graph LR
      A[User's Computer] --> B[Local GPU Machine]
      A --> C[Local CPU Machine 1]
      A --> D[Local CPU Machine 2]
      A --> E[Cloud Endpoint]
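
As a sketch of how the weight and is_fallback fields in haven_vlm_config.py could drive this topology (illustrative only; ENDPOINTS, pick_endpoint, and fallbacks are hypothetical names, not the plugin's actual API), weighted random selection spreads requests across primary machines while holding fallback endpoints in reserve:

```python
import random

# Illustrative endpoint entries mirroring the haven_vlm_config.py fields.
ENDPOINTS = [
    {"name": "local-gpu",   "weight": 5, "is_fallback": False},
    {"name": "local-cpu-1", "weight": 2, "is_fallback": False},
    {"name": "cloud",       "weight": 1, "is_fallback": True},
]

def pick_endpoint(endpoints):
    """Choose a primary endpoint at random, proportionally to its weight."""
    primaries = [e for e in endpoints if not e["is_fallback"]]
    return random.choices(primaries, weights=[e["weight"] for e in primaries])[0]

def fallbacks(endpoints):
    """Endpoints kept in reserve for when every primary is unreachable."""
    return [e for e in endpoints if e["is_fallback"]]
```

With these example weights, the GPU machine would receive roughly five of every seven requests, which matches the "assign higher weights to GPU-enabled machines" advice below.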

Configuration

Easy Setup with LM Studio

LM Studio provides the easiest way to configure local endpoints:

  1. Download and install LM Studio
  2. Search for or download a vision-capable model. Tested with (in order of highest to lowest accuracy): zai-org/glm-4.6v-flash, huihui-mistral-small-3.2-24b-instruct-2506-abliterated-v2, qwen/qwen3-vl-8b, lfm2.5-vl
  3. Load your desired model
  4. On the Developer tab, start the local server using the Start toggle
  5. Optionally, click the Settings gear and toggle Serve on Local Network
  6. Optionally configure haven_vlm_config.py:

By default, localhost is included in the config; remove the cloud endpoint if you don't want automatic failover.

{
    "base_url": "http://localhost:1234/v1",  # LM Studio default
    "api_key": "",                          # API key not required
    "name": "lm-studio-local",
    "weight": 5,
    "is_fallback": False
}

Tag Configuration

"tag_list": [
    "Basketball point", "Foul", "Break-away", "Turnover"
]
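
One simple way a model reply could be mapped back onto the configured tag_list is a case-insensitive substring match (a sketch only; match_tags is a hypothetical helper, not the plugin's actual matching logic):

```python
def match_tags(response_text, tag_list):
    """Return the configured tags that appear in a model reply, ignoring case."""
    lowered = response_text.lower()
    return [tag for tag in tag_list if tag.lower() in lowered]
```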

Processing Settings

VIDEO_FRAME_INTERVAL = 2.0  # Process every 2 seconds
CONCURRENT_TASK_LIMIT = 8   # Adjust based on local hardware
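
A hypothetical helper (frame_timestamps is an illustrative name, not the plugin's API) makes the effect of VIDEO_FRAME_INTERVAL concrete: a 60-second clip at a 2-second interval yields 31 sampled frames.

```python
def frame_timestamps(duration_s, interval_s=2.0):
    """Timestamps (in seconds) to sample: one frame per interval, inclusive."""
    count = int(duration_s // interval_s) + 1
    return [round(i * interval_s, 3) for i in range(count)]
```

For example, frame_timestamps(10.0) gives [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]; CONCURRENT_TASK_LIMIT then caps how many of those frames are in flight at once.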

Usage

Tag Videos

  1. Tag scenes with VLM_TagMe
  2. Run the "Tag Videos" task
  3. Plugin processes content using local/network resources

Performance Tips

  • Start with 2-3 local machines for load balancing
  • Assign higher weights to GPU-enabled machines
  • Adjust CONCURRENT_TASK_LIMIT based on total system resources
  • Use SSD storage for better I/O performance

File Structure

AHavenVLMConnector/
β”œβ”€β”€ ahavenvlmconnector.yml
β”œβ”€β”€ haven_vlm_connector.py
β”œβ”€β”€ haven_vlm_config.py
β”œβ”€β”€ haven_vlm_engine.py
β”œβ”€β”€ haven_media_handler.py
β”œβ”€β”€ haven_vlm_utility.py
β”œβ”€β”€ requirements.txt
└── README.md

Troubleshooting

Local Network Setup

  • Ensure firewalls allow communication between machines
  • Verify all local endpoints are running VLM services
  • Use static IPs for local machines
  • Check http://local-machine-ip:port/v1 responds correctly
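
The last check above can be scripted. This sketch probes the standard OpenAI-compatible GET /v1/models route on each machine (models_url and probe are illustrative helpers, not part of the plugin):

```python
import urllib.error
import urllib.request

def models_url(base_url):
    """Model-listing route that OpenAI-compatible servers expose."""
    return base_url.rstrip("/") + "/models"

def probe(base_url, timeout=5):
    """True if the endpoint answers GET {base_url}/models, else False."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run probe("http://local-machine-ip:port/v1") for each configured endpoint; any False indicates a firewall or service problem on that machine.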

Performance Optimization

  • Distribute Load: Use multiple mid-range machines instead of a single high-end one
  • GPU Prioritization: Assign highest weight to GPU machines
  • Network Speed: Use wired Ethernet connections for faster transfer
  • Resource Monitoring: Watch system resources during processing

Can this be run using llmster, instead of loading up a full GUI box?

Yep, you can even add a high-end mobile phone to the list of devices: anything on your local network that can run a GGUF and expose an endpoint.

@HavenCTO you mentioned some higher-end phones being able to act as a local source for the LLM. I assume it would be significantly slower, though. Do you have a guide for this? I've failed trying to set my Android up.

Here is a step-by-step guide, based on my own testing, on how to run LLM models on an Android device using Termux and Ollama. This method allows your phone to act as a local AI server accessible by other devices on your network.

TLDR: download Termux, then copy and paste this single-line bash command into the terminal:

mkdir -p ~/models ~/tmp && cd ~/models && wget -c --tries=10 --timeout=60 --show-progress -O GLM-4.6V-Flash.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/GLM-4.6V-Flash-UD-IQ2_M.gguf" && wget -c --tries=10 --timeout=60 --show-progress -O mmproj.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/mmproj-F16.gguf" && printf 'FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"""\nPARAMETER temperature 0.7\nPARAMETER top_p 0.8\nPARAMETER top_k 20\nPARAMETER num_ctx 8192\nPARAMETER stop "<|think|>"\nPARAMETER stop "<||>"\n' "$(pwd)" > Modelfile && OLLAMA_HOST=0.0.0.0:11434 ollama serve > ~/tmp/ollama.log 2>&1 & disown; sleep 3; until curl -s http://localhost:11434/api/tags > /dev/null 2>&1; do sleep 1; done; cd ~/models && ollama create glm-flash -f Modelfile && ollama list

Prerequisites

  • Device: A high-end Android phone (at least 8GB RAM recommended for better performance).
    • Performance: accurate benchmarking still needs to be done, but this project is designed to scale across multiple devices; more devices mean faster processing
  • App: Download Termux from the Google Play Store.
    • Note: While Termux is on the Play Store, for the absolute latest packages, users often install it from F-Droid.
  • Internet Connection: Required to download the model files.
  • Storage: Ensure you have enough space (approx. 4GB+ for the model weights).

Step 1: Install Termux and Update Packages

Open the Termux app you just downloaded from the Play Store.

  1. Type the following command to update your package list and upgrade existing packages:

    pkg update && pkg upgrade
    
  2. Press Enter to confirm the updates.

Step 2: Install Required Tools

You need to install wget (for downloading files), curl (for testing the server), proot, and ollama (the AI runtime).

Run the following command to install them:

pkg install wget curl proot ollama

Wait for the installation to complete.

Step 3: Create Directories and Download the Model

We will create the necessary folders and download the specific GLM-4.6V-Flash model files.

  1. Run this command to create the models and tmp directories and navigate to models:

    mkdir -p ~/models ~/tmp && cd ~/models
    
  2. Download the main model file (GLM-4.6V-Flash-UD-IQ2_M.gguf):

    wget -c --tries=10 --timeout=60 --show-progress -O GLM-4.6V-Flash.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/GLM-4.6V-Flash-UD-IQ2_M.gguf"
    

    Wait for the download to finish. This may take a few minutes depending on your speed.

  3. Download the projector file (mmproj.gguf):

    wget -c --tries=10 --timeout=60 --show-progress -O mmproj.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/mmproj-F16.gguf"
    

Step 4: Create the Ollama Modelfile

Now, we create a configuration file (Modelfile) that tells Ollama how to run this specific model.

  1. Run the following command to generate the Modelfile:

    printf 'FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"""\nPARAMETER temperature 0.7\nPARAMETER top_p 0.8\nPARAMETER top_k 20\nPARAMETER num_ctx 8192\nPARAMETER stop "<|think|>"\nPARAMETER stop "<||>"\n' "$(pwd)" > Modelfile
    

Step 5: Start the Ollama Server

We will start the Ollama server in the background so it runs continuously and listens on your local network.

  1. Run this command to kill any existing Ollama processes, start the server, and log the output:

    pkill -f ollama 2>/dev/null; OLLAMA_HOST=0.0.0.0:11434 ollama serve > ~/tmp/ollama.log 2>&1 & disown
    

    The disown command ensures the server keeps running even if you close Termux.

  2. Wait for the server to initialize. Run this command to check if the server is ready:

    sleep 3; until curl -s http://localhost:11434/api/tags > /dev/null 2>&1; do sleep 1; done
    

Step 6: Create and Verify the Model

Finally, we create the model within Ollama using the Modelfile we just made.

  1. Navigate back to the models directory (if not already there) and create the model:

    cd ~/models && ollama create glm-flash -f Modelfile
    

    This process will load the model into memory. It may take a moment.

  2. List the available models to confirm it was created successfully:

    ollama list
    

How to Connect Other Devices

Once the ollama list command shows glm-flash, your phone is acting as a local AI server.

  1. Find your phone's local IP address.
    • In Termux, type: ip addr show or hostname -I.
    • Look for an IP starting with 192.168.x.x or 10.x.x.x.
  2. On any other device (laptop, tablet, another phone) on the same Wi-Fi, open a browser or an Ollama client.
  3. Connect to the address: http://YOUR_PHONE_IP:11434.
  4. You can now chat with the GLM-4.6V-Flash model running on your Android device!
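
From another machine, the phone can then be queried with Ollama's native /api/generate route. A sketch of the non-streaming request body (the model name glm-flash matches the one created above; OLLAMA_URL is a placeholder you must fill in with your phone's IP):

```python
import json

OLLAMA_URL = "http://YOUR_PHONE_IP:11434/api/generate"  # substitute your phone's IP

def generate_body(prompt, model="glm-flash"):
    """JSON body for a non-streaming Ollama /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})
```

POST this body to OLLAMA_URL with Content-Type: application/json; the reply's "response" field holds the model output.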

Troubleshooting

  • Connection Refused: Ensure the server is running. Check the log file by typing cat ~/tmp/ollama.log in Termux.
  • Storage Full: Try a tiny model like the VL series from LFM (smaller model, less accurate)

Model Choice Considerations

  • The example uses unsloth/GLM-4.6V-Flash, which is fairly uncensored, but you can opt to use huihui-ai/Huihui-GLM-4.6V-Flash-abliterated-GGUF as an alternative
  • You may not need the mmproj file for the GLM models, but due to time constraints I opted to test with the mmproj file
  • I would recommend the latest Qwen35; however, it reasons too much and as a result is very slow. Support for disabling reasoning in ollama and llama.cpp is pending

The first line of the Modelfile, instead of

FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}

needs to be

FROM %s/\nTEMPLATE """{{ .System }}