KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

Among local LLM backends for SillyTavern and MiniTavern, KoboldCpp is the veteran power-user choice: a single portable executable built on llama.cpp, no installer required, tuned for GGUF models, and wired into the tavern ecosystem since the KoboldAI days. If you want character-card roleplay with zero cloud API keys and full control over GPU layers, context size, and sampling—KoboldCpp deserves a spot on your shortlist.

This guide explains what KoboldCpp is, how it differs from LM Studio and Ollama, and walks through setup with SillyTavern and MiniTavern in 2026.

What Is KoboldCpp?

KoboldCpp (by LostRuins/koboldcpp) is a self-contained local inference server for GGUF and legacy GGML models. Download the binary for your OS, pick a model, click Launch, and you get:

Kobold API at http://localhost:5001/api/ (native text-completion protocol SillyTavern knows well)
OpenAI-compatible API at http://localhost:5001/v1/ (chat/completions for newer ST connectors)
KoboldAI Lite — embedded browser UI to sanity-check generations before opening the tavern

Unlike a cloud endpoint, your character cards, World Info, and chat logs never leave your machine.

KoboldCpp vs KoboldAI (Classic)

	KoboldCpp	KoboldAI United (classic)
Distribution	Single portable `.exe` / binary	Heavier install, Colab notebooks
Models	GGUF focus	Mixed formats
API	Kobold + OpenAI on :5001	Kobold API
SillyTavern	First-class KoboldCpp API type	Legacy KoboldAI connector
Maintenance	Active 2026 releases	Largely superseded by KoboldCpp for local use

When guides say “connect SillyTavern to KoboldAI locally,” they usually mean KoboldCpp today.

Key KoboldCpp Terminology

Term	Meaning
Quick Launch	GUI tab to browse a GGUF file, set context, GPU layers, and launch
GPU Layers (`n_gpu_layers`)	How many model layers run on GPU vs CPU—critical for VRAM tuning
Context Size	Max tokens KoboldCpp allocates—must be set before launch (defaults can cap at 4K)
CuBLAS / CUDA backend	NVIDIA GPU acceleration build (`koboldcpp.exe`)
nocuda build	Smaller binary; use Vulkan for AMD or CPU-only rigs
Kobold API	Text-completion endpoint ST uses with API Type = KoboldCpp
Remote Tunnel	KoboldCpp feature to expose a temporary public URL (e.g. Cloudflare) for off-LAN access
.kcppt	KoboldCpp preset/template file bundling model + launch settings
KoboldAI Lite	Built-in lightweight chat page for testing after launch

Why Privacy-Focused Tavern Users Choose KoboldCpp

No account, no telemetry to OpenAI — inference stays on your GPU/CPU.
Fine-grained hardware control — layer split, context, quant choice—popular with 8–12 GB VRAM rigs.
Native SillyTavern integration — Text Completion → KoboldCpp is the documented path on docs.ST.app.
Portable — copy one folder to a gaming PC or offline laptop; launch and play.

MiniTavern users benefit the same way: point the Multi-Model Hub at http://192.168.x.x:5001/v1 on your LAN, or use Remote Tunnel when you need phone access to a home PC (see our LM Studio LM Link guide for another encrypted remote pattern).

Prerequisites

OS: Windows, Linux, or macOS (ARM Mac builds available).
GPU: NVIDIA with 6 GB+ VRAM for 7B Q4 models; 12 GB+ comfortable for 8B–14B roleplay.
RAM: 16 GB system RAM minimum; 32 GB helps CPU offload.
Model: GGUF file from Hugging Face (e.g. Mistral 7B Instruct, Qwen2.5 7B, Llama 3.1 8B).
SillyTavern or MiniTavern with character cards ready (Card Quest Market or Chrome extension import).

Step 1: Download the Right KoboldCpp Build

Get the latest release from GitHub Releases:

Your hardware	Recommended file
Modern NVIDIA GPU	`koboldcpp.exe` (CUDA 12)
Older NVIDIA / weak CPU	`oldpc` variant (CUDA 11 + AVX)
AMD GPU	`nocuda` + Vulkan backend in GUI
Apple Silicon Mac	`koboldcpp-mac-arm64`
Linux NVIDIA	`koboldcpp-linux-x64`

Windows may show a SmartScreen warning—Run anyway (you are executing a local tool you downloaded).

Step 2: Download a GGUF Model

Search Hugging Face for roleplay-friendly instruct models:

Mistral-7B-Instruct-v0.3-GGUF
Qwen2.5-7B-Instruct-GGUF
Llama-3.1-8B-Instruct-GGUF

Pick Q4_K_M or Q5_K_M quants for 8 GB VRAM. Save the .gguf file somewhere memorable.

Step 3: Configure Quick Launch

Open KoboldCpp.
Quick Launch tab → Browse → select your .gguf.
Set Context Size to match your VRAM (4096–8192 for RP with World Info; higher = more VRAM).
GPU Layers: leave auto-filled value first run; tune later if you OOM or see CPU fallback slowness.
NVIDIA: enable Use CuBLAS; confirm GPU ID matches your card.
Hardware tab → enable High Priority (optional, reduces stutter).
Click Save so settings persist → Launch.

Wait for:

Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Test in KoboldAI Lite (opens in browser) before touching SillyTavern.

Step 4: Connect SillyTavern (Text Completion — Recommended)

Open SillyTavern → plug icon → API Connections.
API: Text Completion.
API Type: KoboldCpp.
Server URL: http://127.0.0.1:5001/ (or http://localhost:5001/).
Connect — ST should detect your loaded .gguf filename.
Import a character card → send a greeting.

Roleplay tuning:

Shorten card system prompts for 7B locals.
Match ST context to KoboldCpp context (ST cannot exceed what KoboldCpp launched with).
Temperature 0.7–0.9; rep pen 1.05–1.15 for less repetition.
More card tips: local LLM privacy guide.

Alternative: Chat Completion (OpenAI-Compatible)

API: Chat Completion.
Source: Custom (OpenAI-compatible).
Base URL: http://127.0.0.1:5001/v1.
Connect and select the model.

Use this if your preset expects chat-format APIs or you connect MiniTavern’s OpenAI-compatible hub.

Step 5: Connect MiniTavern on Mobile / LAN

Same Wi-Fi (recommended for phones):

Note your PC’s LAN IP (e.g. 192.168.1.50).
KoboldCpp must listen on the network (check launch flags / firewall; allow port 5001).
MiniTavern → custom endpoint → http://192.168.1.50:5001/v1.

Away from home:

Enable KoboldCpp Remote Tunnel for a temporary HTTPS link (convenience over raw port forward).
Or run SillyTavern on a VPN/Tailscale-connected laptop hitting localhost:5001.

Workflow: Character Card Market → Chrome Extension → MiniTavern iOS/Android with your home KoboldCpp endpoint.

VRAM & GPU Layers Cheat Sheet

VRAM	Suggested starting point
6 GB	7B Q4, context 4096, reduce GPU layers if OOM
8 GB	7B Q4/Q5 or 8B Q4, context 4096–6144
12 GB	8B–14B Q4, context 8192
16 GB+	14B Q4, higher context for lore-heavy cards

If layers spill to CPU, generation slows sharply—lower GPU Layers or use a smaller quant.

Recommended Models for Character Cards

Model	Quant	Notes
Qwen2.5 7B Instruct	Q4_K_M	Strong instruction following for cards
Mistral 7B Instruct v0.3	Q4_K_M	Fast, classic RP choice
Llama 3.1 8B Instruct	Q4_K_M	Balanced quality
Tiefighter / RP fine-tunes	Q4+	Community RP merges on Hugging Face

Avoid sub-3B for complex personalities and World Info.

Troubleshooting

Issue	Fix
ST context ignored above 4K	Raise Context Size in KoboldCpp before Launch
Connection refused	Confirm KoboldCpp running; URL `http://127.0.0.1:5001/`
CUDA error on launch	Try `oldpc` build or `nocuda` + Vulkan
Gibberish / wrong format	Use Text Completion + KoboldCpp type, or fix chat template
Slow after long chat	Context full—start new chat or summarize
Model not listed in ST	Reconnect after Launch completes

KoboldCpp vs LM Studio vs Ollama

	KoboldCpp	LM Studio	Ollama
Install	Portable binary	Desktop app	CLI/daemon
Default port	5001	1234	11434
ST native connector	KoboldCpp API type	KoboldAI / OpenAI	Ollama
GPU tuning	Deep (layers, quants)	GUI-friendly	Simpler
Remote mobile	Remote Tunnel	LM Link (Tailscale)	LAN mainly
Best for	Power users, ST veterans	GUI + model browser	Quick local pull

Many users keep KoboldCpp on a gaming PC and MiniTavern on phone over LAN—maximum privacy, no subscription.

Privacy Best Practices

Block outbound cloud fallbacks in ST/MiniTavern API settings.
Download models from trusted Hugging Face repos (check SHA / author).
Remote Tunnel exposes an endpoint—disable when not needed.
Encrypt sensitive PNG cards if storing personal lore on disk.
Update KoboldCpp regularly—security and speed fixes ship often.

Conclusion

KoboldCpp remains one of the most capable ways to run local LLM APIs for SillyTavern and MiniTavern character-card roleplay in 2026: portable, private, and deeply integrated with the tavern stack. Download a GGUF, launch on port 5001, connect ST with Text Completion → KoboldCpp, and your home GPU becomes the only inference provider you need.

Ready to build your library? Grab cards from the Character Card Market, install MiniTavern for mobile play, and point your connector at localhost:5001.

KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

What Is KoboldCpp?

KoboldCpp vs KoboldAI (Classic)

Key KoboldCpp Terminology

Why Privacy-Focused Tavern Users Choose KoboldCpp

Prerequisites

Step 1: Download the Right KoboldCpp Build

Step 2: Download a GGUF Model

Step 3: Configure Quick Launch

Step 4: Connect SillyTavern (Text Completion — Recommended)

Alternative: Chat Completion (OpenAI-Compatible)

Step 5: Connect MiniTavern on Mobile / LAN

VRAM & GPU Layers Cheat Sheet

Recommended Models for Character Cards

Troubleshooting

KoboldCpp vs LM Studio vs Ollama

Privacy Best Practices

Conclusion

SillyTavern Character Cards on Android: How to Use and Optimize for Mobile

Mastering SillyTavern Character Card Rules: How to Define Roleplay Rules in Your Cards for Better AI Behavior

How to Create a Roleplay Character: A Step-by-Step Guide for AI Roleplay in 2026

KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

What Is KoboldCpp?

KoboldCpp vs KoboldAI (Classic)

Key KoboldCpp Terminology

Why Privacy-Focused Tavern Users Choose KoboldCpp

Prerequisites

Step 1: Download the Right KoboldCpp Build

Step 2: Download a GGUF Model

Step 3: Configure Quick Launch

Step 4: Connect SillyTavern (Text Completion — Recommended)

Alternative: Chat Completion (OpenAI-Compatible)

Step 5: Connect MiniTavern on Mobile / LAN

VRAM & GPU Layers Cheat Sheet

Recommended Models for Character Cards

Troubleshooting

KoboldCpp vs LM Studio vs Ollama

Privacy Best Practices

Conclusion

Keep reading

SillyTavern Character Cards on Android: How to Use and Optimize for Mobile

Mastering SillyTavern Character Card Rules: How to Define Roleplay Rules in Your Cards for Better AI Behavior

How to Create a Roleplay Character: A Step-by-Step Guide for AI Roleplay in 2026