Haven VLM Connector

:placard: Summary Tag videos with Vision-Language Models using any OpenAI-compatible VLM endpoint.
:link: Repository https://github.com/stashapp/CommunityScripts/tree/main/plugins/AHavenVLMConnector
:information_source: Source URL https://stashapp.github.io/CommunityScripts/stable/index.yml
:open_book: Install How to install a plugin?

A Haven VLM Connector

A StashApp plugin for Vision-Language Model (VLM) based content tagging and analysis. This plugin is designed with a local-first philosophy, empowering users to run analysis on their own hardware (using CPU or GPU) and their local network. It also supports cloud-based VLM endpoints for additional flexibility. The Haven VLM Engine provides advanced automatic content detection and tagging, delivering superior accuracy compared to traditional image classification methods.

Features

  • Local Network Empowerment: Distribute processing across home/office computers without cloud dependencies
  • Context-Aware Detection: Leverages Vision-Language Models' understanding of visual relationships
  • Advanced Dependency Management: Uses PythonDepManager for automatic dependency installation
  • Enjoying Funscript Haven? Check out more tools and projects at Human Activity Valuation and Exploration Network · GitHub

Requirements

  • Python 3.8+
  • StashApp
  • PythonDepManager plugin (automatically handles dependencies)
  • OpenAI-compatible VLM endpoints (local or cloud-based)

Installation

  1. Clone or download this plugin to your StashApp plugins directory
  2. Ensure PythonDepManager is installed in your StashApp plugins
  3. Configure your VLM endpoints in haven_vlm_config.py (local network endpoints recommended)
  4. Restart StashApp

The plugin automatically manages all dependencies.

Why Local-First?

  • Complete Control: Process sensitive content on your own hardware
  • Cost Effective: Avoid cloud processing fees by using existing resources
  • Flexible Scaling: Add more computers to your local network for increased capacity
  • Privacy Focused: Keep your media completely private
  • Hybrid Options: Combine local and cloud endpoints for optimal flexibility

    graph LR
      A[User's Computer] --> B[Local GPU Machine]
      A --> C[Local CPU Machine 1]
      A --> D[Local CPU Machine 2]
      A --> E[Cloud Endpoint]
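
As a sketch of how the weight and is_fallback fields in haven_vlm_config.py could drive this topology (illustrative only; ENDPOINTS, pick_endpoint, and fallbacks are hypothetical names, not the plugin's actual API), weighted random selection spreads requests across primary machines while holding fallback endpoints in reserve:

```python
import random

# Illustrative endpoint entries mirroring the haven_vlm_config.py fields.
ENDPOINTS = [
    {"name": "local-gpu",   "weight": 5, "is_fallback": False},
    {"name": "local-cpu-1", "weight": 2, "is_fallback": False},
    {"name": "cloud",       "weight": 1, "is_fallback": True},
]

def pick_endpoint(endpoints):
    """Choose a primary endpoint at random, proportionally to its weight."""
    primaries = [e for e in endpoints if not e["is_fallback"]]
    return random.choices(primaries, weights=[e["weight"] for e in primaries])[0]

def fallbacks(endpoints):
    """Endpoints kept in reserve for when every primary is unreachable."""
    return [e for e in endpoints if e["is_fallback"]]
```

With these example weights, the GPU machine would receive roughly five of every seven requests, which matches the "assign higher weights to GPU-enabled machines" advice below.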

Configuration

Easy Setup with LM Studio

LM Studio provides the easiest way to configure local endpoints:

  1. Download and install LM Studio
  2. Search for or download a vision-capable model. Tested with (in order of highest to lowest accuracy): zai-org/glm-4.6v-flash, huihui-mistral-small-3.2-24b-instruct-2506-abliterated-v2, qwen/qwen3-vl-8b, lfm2.5-vl
  3. Load your desired model
  4. On the Developer tab, start the local server using the Start toggle
  5. Optionally, click the Settings gear and toggle Serve on Local Network
  6. Optionally configure haven_vlm_config.py:

By default, localhost is included in the config; remove the cloud endpoint if you don't want automatic failover.

{
    "base_url": "http://localhost:1234/v1",  # LM Studio default
    "api_key": "",                          # API key not required
    "name": "lm-studio-local",
    "weight": 5,
    "is_fallback": False
}

Tag Configuration

"tag_list": [
    "Basketball point", "Foul", "Break-away", "Turnover"
]
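
One simple way a model reply could be mapped back onto the configured tag_list is a case-insensitive substring match (a sketch only; match_tags is a hypothetical helper, not the plugin's actual matching logic):

```python
def match_tags(response_text, tag_list):
    """Return the configured tags that appear in a model reply, ignoring case."""
    lowered = response_text.lower()
    return [tag for tag in tag_list if tag.lower() in lowered]
```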

Processing Settings

VIDEO_FRAME_INTERVAL = 2.0  # Process every 2 seconds
CONCURRENT_TASK_LIMIT = 8   # Adjust based on local hardware
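
A hypothetical helper (frame_timestamps is an illustrative name, not the plugin's API) makes the effect of VIDEO_FRAME_INTERVAL concrete: a 60-second clip at a 2-second interval yields 31 sampled frames.

```python
def frame_timestamps(duration_s, interval_s=2.0):
    """Timestamps (in seconds) to sample: one frame per interval, inclusive."""
    count = int(duration_s // interval_s) + 1
    return [round(i * interval_s, 3) for i in range(count)]
```

For example, frame_timestamps(10.0) gives [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]; CONCURRENT_TASK_LIMIT then caps how many of those frames are in flight at once.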

Usage

Tag Videos

  1. Tag scenes with VLM_TagMe
  2. Run the "Tag Videos" task
  3. Plugin processes content using local/network resources

Performance Tips

  • Start with 2-3 local machines for load balancing
  • Assign higher weights to GPU-enabled machines
  • Adjust CONCURRENT_TASK_LIMIT based on total system resources
  • Use SSD storage for better I/O performance

File Structure

AHavenVLMConnector/
β”œβ”€β”€ ahavenvlmconnector.yml
β”œβ”€β”€ haven_vlm_connector.py
β”œβ”€β”€ haven_vlm_config.py
β”œβ”€β”€ haven_vlm_engine.py
β”œβ”€β”€ haven_media_handler.py
β”œβ”€β”€ haven_vlm_utility.py
β”œβ”€β”€ requirements.txt
└── README.md

Troubleshooting

Local Network Setup

  • Ensure firewalls allow communication between machines
  • Verify all local endpoints are running VLM services
  • Use static IPs for local machines
  • Check http://local-machine-ip:port/v1 responds correctly
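
The last check above can be scripted. This sketch probes the standard OpenAI-compatible GET /v1/models route on each machine (models_url and probe are illustrative helpers, not part of the plugin):

```python
import urllib.error
import urllib.request

def models_url(base_url):
    """Model-listing route that OpenAI-compatible servers expose."""
    return base_url.rstrip("/") + "/models"

def probe(base_url, timeout=5):
    """True if the endpoint answers GET {base_url}/models, else False."""
    try:
        with urllib.request.urlopen(models_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Run probe("http://local-machine-ip:port/v1") for each configured endpoint; any False indicates a firewall or service problem on that machine.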

Performance Optimization

  • Distribute Load: Use multiple mid-range machines instead of a single high-end one
  • GPU Prioritization: Assign highest weight to GPU machines
  • Network Speed: Use wired Ethernet connections for faster transfer
  • Resource Monitoring: Watch system resources during processing

Can this be run using llmster, instead of loading up a full GUI box?

Yep, you can even add a high-end mobile phone to the list of devices: anything on your local network that can run a GGUF and expose an endpoint.

@HavenCTO you mentioned some higher-end phones being able to act as a local source for the LLM. I assume it would be significantly slower, though. Do you have a guide for this? I've failed trying to set my Android up.

Here is a step-by-step guide, based on my own testing, on how to run LLM models on an Android device using Termux and Ollama. This method allows your phone to act as a local AI server accessible by other devices on your network.

TLDR: download Termux, then copy and paste this single-line bash command into the terminal:

mkdir -p ~/models ~/tmp && cd ~/models && wget -c --tries=10 --timeout=60 --show-progress -O GLM-4.6V-Flash.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/GLM-4.6V-Flash-UD-IQ2_M.gguf" && wget -c --tries=10 --timeout=60 --show-progress -O mmproj.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/mmproj-F16.gguf" && printf 'FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"""\nPARAMETER temperature 0.7\nPARAMETER top_p 0.8\nPARAMETER top_k 20\nPARAMETER num_ctx 8192\nPARAMETER stop "<|think|>"\nPARAMETER stop "<||>"\n' "$(pwd)" > Modelfile && OLLAMA_HOST=0.0.0.0:11434 ollama serve > ~/tmp/ollama.log 2>&1 & disown; sleep 3; until curl -s http://localhost:11434/api/tags > /dev/null 2>&1; do sleep 1; done; cd ~/models && ollama create glm-flash -f Modelfile && ollama list

Prerequisites

  • Device: A high-end Android phone (at least 8GB RAM recommended for better performance).
    • Performance: accurate benchmarking still needs to be done, but this project is designed to scale across multiple devices; more devices mean faster processing
  • App: Download Termux from the Google Play Store.
    • Note: While Termux is on the Play Store, for the absolute latest packages, users often install it from F-Droid.
  • Internet Connection: Required to download the model files.
  • Storage: Ensure you have enough space (approx. 4GB+ for the model weights).

Step 1: Install Termux and Update Packages

Open the Termux app you just downloaded from the Play Store.

  1. Type the following command to update your package list and upgrade existing packages:

    pkg update && pkg upgrade
    
  2. Press Enter to confirm the updates.

Step 2: Install Required Tools

You need to install wget (for downloading files), curl (for testing the server), proot, and ollama (the AI runtime).

Run the following command to install them:

pkg install wget curl proot ollama

Wait for the installation to complete.

Step 3: Create Directories and Download the Model

We will create the necessary folders and download the specific GLM-4.6V-Flash model files.

  1. Run this command to create the models and tmp directories and navigate to models:

    mkdir -p ~/models ~/tmp && cd ~/models
    
  2. Download the main model file (GLM-4.6V-Flash-UD-IQ2_M.gguf):

    wget -c --tries=10 --timeout=60 --show-progress -O GLM-4.6V-Flash.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/GLM-4.6V-Flash-UD-IQ2_M.gguf"
    

    Wait for the download to finish. This may take a few minutes depending on your speed.

  3. Download the projector file (mmproj.gguf):

    wget -c --tries=10 --timeout=60 --show-progress -O mmproj.gguf "https://huggingface.co/unsloth/GLM-4.6V-Flash-GGUF/resolve/main/mmproj-F16.gguf"
    

Step 4: Create the Ollama Modelfile

Now, we create a configuration file (Modelfile) that tells Ollama how to run this specific model.

  1. Run the following command to generate the Modelfile:

    printf 'FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}<|im_start|>user\n{{ .Prompt }}<|im_end|>\n<|im_start|>assistant\n"""\nPARAMETER temperature 0.7\nPARAMETER top_p 0.8\nPARAMETER top_k 20\nPARAMETER num_ctx 8192\nPARAMETER stop "<|think|>"\nPARAMETER stop "<||>"\n' "$(pwd)" > Modelfile
    

Step 5: Start the Ollama Server

We will start the Ollama server in the background so it runs continuously and listens on your local network.

  1. Run this command to kill any existing Ollama processes, start the server, and log the output:

    pkill -f ollama 2>/dev/null; OLLAMA_HOST=0.0.0.0:11434 ollama serve > ~/tmp/ollama.log 2>&1 & disown
    

    The disown command ensures the server keeps running even if you close Termux.

  2. Wait for the server to initialize. Run this command to check if the server is ready:

    sleep 3; until curl -s http://localhost:11434/api/tags > /dev/null 2>&1; do sleep 1; done
    

Step 6: Create and Verify the Model

Finally, we create the model within Ollama using the Modelfile we just made.

  1. Navigate back to the models directory (if not already there) and create the model:

    cd ~/models && ollama create glm-flash -f Modelfile
    

    This process will load the model into memory. It may take a moment.

  2. List the available models to confirm it was created successfully:

    ollama list
    

How to Connect Other Devices

Once the ollama list command shows glm-flash, your phone is acting as a local AI server.

  1. Find your phone's local IP address.
    • In Termux, type: ip addr show or hostname -I.
    • Look for an IP starting with 192.168.x.x or 10.x.x.x.
  2. On any other device (laptop, tablet, another phone) on the same Wi-Fi, open a browser or an Ollama client.
  3. Connect to the address: http://YOUR_PHONE_IP:11434.
  4. You can now chat with the GLM-4.6V-Flash model running on your Android device!
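
From another machine, the phone can then be queried with Ollama's native /api/generate route. A sketch of the non-streaming request body (the model name glm-flash matches the one created above; OLLAMA_URL is a placeholder you must fill in with your phone's IP):

```python
import json

OLLAMA_URL = "http://YOUR_PHONE_IP:11434/api/generate"  # substitute your phone's IP

def generate_body(prompt, model="glm-flash"):
    """JSON body for a non-streaming Ollama /api/generate request."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False})
```

POST this body to OLLAMA_URL with Content-Type: application/json; the reply's "response" field holds the model output.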

Troubleshooting

  • Connection Refused: Ensure the server is running. Check the log file by typing cat ~/tmp/ollama.log in Termux.
  • Storage Full: Try a tiny model like the VL series from LFM (smaller model, less accurate)

Model Choice Considerations

  • The example uses unsloth/GLM-4.6V-Flash, which is fairly uncensored, but you can opt to use huihui-ai/Huihui-GLM-4.6V-Flash-abliterated-GGUF as an alternative
  • You may not need the mmproj file for the GLM models, but due to time constraints I opted to test with the mmproj file
  • I would recommend the latest Qwen35; however, it reasons too much and as a result is very slow. Support for disabling reasoning in ollama and llama.cpp is pending

The first line of the Modelfile, instead of

FROM %s/GLM-4.6V-Flash.gguf\nTEMPLATE """{{ .System }}

needs to be

FROM %s/\nTEMPLATE """{{ .System }}