KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)
KoboldCpp is a portable llama.cpp inference server with Kobold and OpenAI-compatible APIs on port 5001—ideal for privacy-focused SillyTavern and MiniTavern character-card roleplay without cloud keys.
- koboldcpp
- local llm
- privacy
- sillytavern
- minitavern
- tutorial
KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)
Among local LLM backends for SillyTavern and MiniTavern, KoboldCpp is the veteran power-user choice: a single portable executable built on llama.cpp, no installer required, tuned for GGUF models, and wired into the tavern ecosystem since the KoboldAI days. If you want character-card roleplay with zero cloud API keys and full control over GPU layers, context size, and sampling—KoboldCpp deserves a spot on your shortlist.
This guide explains what KoboldCpp is, how it differs from LM Studio and Ollama, and walks through setup with SillyTavern and MiniTavern in 2026.
What Is KoboldCpp?
KoboldCpp (by LostRuins/koboldcpp) is a self-contained local inference server for GGUF and legacy GGML models. Download the binary for your OS, pick a model, click Launch, and you get:
- Kobold API at
http://localhost:5001/api/(native text-completion protocol SillyTavern knows well) - OpenAI-compatible API at
http://localhost:5001/v1/(chat/completions for newer ST connectors) - KoboldAI Lite — embedded browser UI to sanity-check generations before opening the tavern
Unlike a cloud endpoint, your character cards, World Info, and chat logs never leave your machine.
KoboldCpp vs KoboldAI (Classic)
| KoboldCpp | KoboldAI United (classic) | |
|---|---|---|
| Distribution | Single portable .exe / binary | Heavier install, Colab notebooks |
| Models | GGUF focus | Mixed formats |
| API | Kobold + OpenAI on :5001 | Kobold API |
| SillyTavern | First-class KoboldCpp API type | Legacy KoboldAI connector |
| Maintenance | Active 2026 releases | Largely superseded by KoboldCpp for local use |
When guides say “connect SillyTavern to KoboldAI locally,” they usually mean KoboldCpp today.
Key KoboldCpp Terminology
| Term | Meaning |
|---|---|
| Quick Launch | GUI tab to browse a GGUF file, set context, GPU layers, and launch |
GPU Layers (n_gpu_layers) | How many model layers run on GPU vs CPU—critical for VRAM tuning |
| Context Size | Max tokens KoboldCpp allocates—must be set before launch (defaults can cap at 4K) |
| CuBLAS / CUDA backend | NVIDIA GPU acceleration build (koboldcpp.exe) |
| nocuda build | Smaller binary; use Vulkan for AMD or CPU-only rigs |
| Kobold API | Text-completion endpoint ST uses with API Type = KoboldCpp |
| Remote Tunnel | KoboldCpp feature to expose a temporary public URL (e.g. Cloudflare) for off-LAN access |
| .kcppt | KoboldCpp preset/template file bundling model + launch settings |
| KoboldAI Lite | Built-in lightweight chat page for testing after launch |
Why Privacy-Focused Tavern Users Choose KoboldCpp
- No account, no telemetry to OpenAI — inference stays on your GPU/CPU.
- Fine-grained hardware control — layer split, context, quant choice—popular with 8–12 GB VRAM rigs.
- Native SillyTavern integration — Text Completion → KoboldCpp is the documented path on docs.ST.app.
- Portable — copy one folder to a gaming PC or offline laptop; launch and play.
MiniTavern users benefit the same way: point the Multi-Model Hub at http://192.168.x.x:5001/v1 on your LAN, or use Remote Tunnel when you need phone access to a home PC (see our LM Studio LM Link guide for another encrypted remote pattern).
Prerequisites
- OS: Windows, Linux, or macOS (ARM Mac builds available).
- GPU: NVIDIA with 6 GB+ VRAM for 7B Q4 models; 12 GB+ comfortable for 8B–14B roleplay.
- RAM: 16 GB system RAM minimum; 32 GB helps CPU offload.
- Model: GGUF file from Hugging Face (e.g. Mistral 7B Instruct, Qwen2.5 7B, Llama 3.1 8B).
- SillyTavern or MiniTavern with character cards ready (Card Quest Market or Chrome extension import).
Step 1: Download the Right KoboldCpp Build
Get the latest release from GitHub Releases:
| Your hardware | Recommended file |
|---|---|
| Modern NVIDIA GPU | koboldcpp.exe (CUDA 12) |
| Older NVIDIA / weak CPU | oldpc variant (CUDA 11 + AVX) |
| AMD GPU | nocuda + Vulkan backend in GUI |
| Apple Silicon Mac | koboldcpp-mac-arm64 |
| Linux NVIDIA | koboldcpp-linux-x64 |
Windows may show a SmartScreen warning—Run anyway (you are executing a local tool you downloaded).
Step 2: Download a GGUF Model
Search Hugging Face for roleplay-friendly instruct models:
Mistral-7B-Instruct-v0.3-GGUFQwen2.5-7B-Instruct-GGUFLlama-3.1-8B-Instruct-GGUF
Pick Q4_K_M or Q5_K_M quants for 8 GB VRAM. Save the .gguf file somewhere memorable.
Step 3: Configure Quick Launch
- Open KoboldCpp.
- Quick Launch tab → Browse → select your
.gguf. - Set Context Size to match your VRAM (4096–8192 for RP with World Info; higher = more VRAM).
- GPU Layers: leave auto-filled value first run; tune later if you OOM or see CPU fallback slowness.
- NVIDIA: enable Use CuBLAS; confirm GPU ID matches your card.
- Hardware tab → enable High Priority (optional, reduces stutter).
- Click Save so settings persist → Launch.
Wait for:
Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/
Test in KoboldAI Lite (opens in browser) before touching SillyTavern.
Step 4: Connect SillyTavern (Text Completion — Recommended)
- Open SillyTavern → plug icon → API Connections.
- API: Text Completion.
- API Type: KoboldCpp.
- Server URL:
http://127.0.0.1:5001/(orhttp://localhost:5001/). - Connect — ST should detect your loaded
.gguffilename. - Import a character card → send a greeting.
Roleplay tuning:
- Shorten card system prompts for 7B locals.
- Match ST context to KoboldCpp context (ST cannot exceed what KoboldCpp launched with).
- Temperature 0.7–0.9; rep pen 1.05–1.15 for less repetition.
- More card tips: local LLM privacy guide.
Alternative: Chat Completion (OpenAI-Compatible)
- API: Chat Completion.
- Source: Custom (OpenAI-compatible).
- Base URL:
http://127.0.0.1:5001/v1. - Connect and select the model.
Use this if your preset expects chat-format APIs or you connect MiniTavern’s OpenAI-compatible hub.
Step 5: Connect MiniTavern on Mobile / LAN
Same Wi-Fi (recommended for phones):
- Note your PC’s LAN IP (e.g.
192.168.1.50). - KoboldCpp must listen on the network (check launch flags / firewall; allow port 5001).
- MiniTavern → custom endpoint →
http://192.168.1.50:5001/v1.
Away from home:
- Enable KoboldCpp Remote Tunnel for a temporary HTTPS link (convenience over raw port forward).
- Or run SillyTavern on a VPN/Tailscale-connected laptop hitting
localhost:5001.
Workflow: Character Card Market → Chrome Extension → MiniTavern iOS/Android with your home KoboldCpp endpoint.
VRAM & GPU Layers Cheat Sheet
| VRAM | Suggested starting point |
|---|---|
| 6 GB | 7B Q4, context 4096, reduce GPU layers if OOM |
| 8 GB | 7B Q4/Q5 or 8B Q4, context 4096–6144 |
| 12 GB | 8B–14B Q4, context 8192 |
| 16 GB+ | 14B Q4, higher context for lore-heavy cards |
If layers spill to CPU, generation slows sharply—lower GPU Layers or use a smaller quant.
Recommended Models for Character Cards
| Model | Quant | Notes |
|---|---|---|
| Qwen2.5 7B Instruct | Q4_K_M | Strong instruction following for cards |
| Mistral 7B Instruct v0.3 | Q4_K_M | Fast, classic RP choice |
| Llama 3.1 8B Instruct | Q4_K_M | Balanced quality |
| Tiefighter / RP fine-tunes | Q4+ | Community RP merges on Hugging Face |
Avoid sub-3B for complex personalities and World Info.
Troubleshooting
| Issue | Fix |
|---|---|
| ST context ignored above 4K | Raise Context Size in KoboldCpp before Launch |
| Connection refused | Confirm KoboldCpp running; URL http://127.0.0.1:5001/ |
| CUDA error on launch | Try oldpc build or nocuda + Vulkan |
| Gibberish / wrong format | Use Text Completion + KoboldCpp type, or fix chat template |
| Slow after long chat | Context full—start new chat or summarize |
| Model not listed in ST | Reconnect after Launch completes |
KoboldCpp vs LM Studio vs Ollama
| KoboldCpp | LM Studio | Ollama | |
|---|---|---|---|
| Install | Portable binary | Desktop app | CLI/daemon |
| Default port | 5001 | 1234 | 11434 |
| ST native connector | KoboldCpp API type | KoboldAI / OpenAI | Ollama |
| GPU tuning | Deep (layers, quants) | GUI-friendly | Simpler |
| Remote mobile | Remote Tunnel | LM Link (Tailscale) | LAN mainly |
| Best for | Power users, ST veterans | GUI + model browser | Quick local pull |
Many users keep KoboldCpp on a gaming PC and MiniTavern on phone over LAN—maximum privacy, no subscription.
Privacy Best Practices
- Block outbound cloud fallbacks in ST/MiniTavern API settings.
- Download models from trusted Hugging Face repos (check SHA / author).
- Remote Tunnel exposes an endpoint—disable when not needed.
- Encrypt sensitive PNG cards if storing personal lore on disk.
- Update KoboldCpp regularly—security and speed fixes ship often.
Conclusion
KoboldCpp remains one of the most capable ways to run local LLM APIs for SillyTavern and MiniTavern character-card roleplay in 2026: portable, private, and deeply integrated with the tavern stack. Download a GGUF, launch on port 5001, connect ST with Text Completion → KoboldCpp, and your home GPU becomes the only inference provider you need.
Ready to build your library? Grab cards from the Character Card Market, install MiniTavern for mobile play, and point your connector at localhost:5001.
Keep reading
More guides you might like
SillyTavern Character Cards on Android: How to Use and Optimize for Mobile
If you’ve ever tried running SillyTavern on your Android phone, you know the magic of having AI character conversations in your pocket. But getting the mos…
- android
- mobile
- sillytavern
- character-cards
Mastering SillyTavern Character Card Rules: How to Define Roleplay Rules in Your Cards for Better AI Behavior
Creating a compelling character card in SillyTavern is an art form. You can craft the most detailed personality, backstory, and appearance, but if your car…
- sillytavern
- character cards
- roleplay rules
- ai behavior
How to Create a Roleplay Character: A Step-by-Step Guide for AI Roleplay in 2026
Creating a compelling character for AI roleplay is more than just writing a name and a backstory. In 2026, character creation has evolved into a nuanced cr…
- roleplay
- character-creation
- ai-roleplay
- guide