← Back to blog

KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

KoboldCpp is a portable llama.cpp inference server with Kobold and OpenAI-compatible APIs on port 5001—ideal for privacy-focused SillyTavern and MiniTavern character-card roleplay without cloud keys.

Published
  • koboldcpp
  • local llm
  • privacy
  • sillytavern
  • minitavern
  • tutorial

KoboldCpp Guide: Run Local LLMs for SillyTavern & MiniTavern (Privacy-First Setup in 2026)

Among local LLM backends for SillyTavern and MiniTavern, KoboldCpp is the veteran power-user choice: a single portable executable built on llama.cpp, no installer required, tuned for GGUF models, and wired into the tavern ecosystem since the KoboldAI days. If you want character-card roleplay with zero cloud API keys and full control over GPU layers, context size, and sampling—KoboldCpp deserves a spot on your shortlist.

This guide explains what KoboldCpp is, how it differs from LM Studio and Ollama, and walks through setup with SillyTavern and MiniTavern in 2026.

What Is KoboldCpp?

KoboldCpp (by LostRuins/koboldcpp) is a self-contained local inference server for GGUF and legacy GGML models. Download the binary for your OS, pick a model, click Launch, and you get:

  • Kobold API at http://localhost:5001/api/ (native text-completion protocol SillyTavern knows well)
  • OpenAI-compatible API at http://localhost:5001/v1/ (chat/completions for newer ST connectors)
  • KoboldAI Lite — embedded browser UI to sanity-check generations before opening the tavern

Unlike a cloud endpoint, your character cards, World Info, and chat logs never leave your machine.

KoboldCpp vs KoboldAI (Classic)

KoboldCppKoboldAI United (classic)
DistributionSingle portable .exe / binaryHeavier install, Colab notebooks
ModelsGGUF focusMixed formats
APIKobold + OpenAI on :5001Kobold API
SillyTavernFirst-class KoboldCpp API typeLegacy KoboldAI connector
MaintenanceActive 2026 releasesLargely superseded by KoboldCpp for local use

When guides say “connect SillyTavern to KoboldAI locally,” they usually mean KoboldCpp today.

Key KoboldCpp Terminology

TermMeaning
Quick LaunchGUI tab to browse a GGUF file, set context, GPU layers, and launch
GPU Layers (n_gpu_layers)How many model layers run on GPU vs CPU—critical for VRAM tuning
Context SizeMax tokens KoboldCpp allocates—must be set before launch (defaults can cap at 4K)
CuBLAS / CUDA backendNVIDIA GPU acceleration build (koboldcpp.exe)
nocuda buildSmaller binary; use Vulkan for AMD or CPU-only rigs
Kobold APIText-completion endpoint ST uses with API Type = KoboldCpp
Remote TunnelKoboldCpp feature to expose a temporary public URL (e.g. Cloudflare) for off-LAN access
.kcpptKoboldCpp preset/template file bundling model + launch settings
KoboldAI LiteBuilt-in lightweight chat page for testing after launch

Why Privacy-Focused Tavern Users Choose KoboldCpp

  1. No account, no telemetry to OpenAI — inference stays on your GPU/CPU.
  2. Fine-grained hardware control — layer split, context, quant choice—popular with 8–12 GB VRAM rigs.
  3. Native SillyTavern integrationText Completion → KoboldCpp is the documented path on docs.ST.app.
  4. Portable — copy one folder to a gaming PC or offline laptop; launch and play.

MiniTavern users benefit the same way: point the Multi-Model Hub at http://192.168.x.x:5001/v1 on your LAN, or use Remote Tunnel when you need phone access to a home PC (see our LM Studio LM Link guide for another encrypted remote pattern).

Prerequisites

  • OS: Windows, Linux, or macOS (ARM Mac builds available).
  • GPU: NVIDIA with 6 GB+ VRAM for 7B Q4 models; 12 GB+ comfortable for 8B–14B roleplay.
  • RAM: 16 GB system RAM minimum; 32 GB helps CPU offload.
  • Model: GGUF file from Hugging Face (e.g. Mistral 7B Instruct, Qwen2.5 7B, Llama 3.1 8B).
  • SillyTavern or MiniTavern with character cards ready (Card Quest Market or Chrome extension import).

Step 1: Download the Right KoboldCpp Build

Get the latest release from GitHub Releases:

Your hardwareRecommended file
Modern NVIDIA GPUkoboldcpp.exe (CUDA 12)
Older NVIDIA / weak CPUoldpc variant (CUDA 11 + AVX)
AMD GPUnocuda + Vulkan backend in GUI
Apple Silicon Mackoboldcpp-mac-arm64
Linux NVIDIAkoboldcpp-linux-x64

Windows may show a SmartScreen warning—Run anyway (you are executing a local tool you downloaded).

Step 2: Download a GGUF Model

Search Hugging Face for roleplay-friendly instruct models:

  • Mistral-7B-Instruct-v0.3-GGUF
  • Qwen2.5-7B-Instruct-GGUF
  • Llama-3.1-8B-Instruct-GGUF

Pick Q4_K_M or Q5_K_M quants for 8 GB VRAM. Save the .gguf file somewhere memorable.

Step 3: Configure Quick Launch

  1. Open KoboldCpp.
  2. Quick Launch tab → Browse → select your .gguf.
  3. Set Context Size to match your VRAM (4096–8192 for RP with World Info; higher = more VRAM).
  4. GPU Layers: leave auto-filled value first run; tune later if you OOM or see CPU fallback slowness.
  5. NVIDIA: enable Use CuBLAS; confirm GPU ID matches your card.
  6. Hardware tab → enable High Priority (optional, reduces stutter).
  7. Click Save so settings persist → Launch.

Wait for:

Starting Kobold API on port 5001 at http://localhost:5001/api/
Starting OpenAI Compatible API on port 5001 at http://localhost:5001/v1/

Test in KoboldAI Lite (opens in browser) before touching SillyTavern.

  1. Open SillyTavern → plug iconAPI Connections.
  2. API: Text Completion.
  3. API Type: KoboldCpp.
  4. Server URL: http://127.0.0.1:5001/ (or http://localhost:5001/).
  5. Connect — ST should detect your loaded .gguf filename.
  6. Import a character card → send a greeting.

Roleplay tuning:

  • Shorten card system prompts for 7B locals.
  • Match ST context to KoboldCpp context (ST cannot exceed what KoboldCpp launched with).
  • Temperature 0.7–0.9; rep pen 1.05–1.15 for less repetition.
  • More card tips: local LLM privacy guide.

Alternative: Chat Completion (OpenAI-Compatible)

  1. API: Chat Completion.
  2. Source: Custom (OpenAI-compatible).
  3. Base URL: http://127.0.0.1:5001/v1.
  4. Connect and select the model.

Use this if your preset expects chat-format APIs or you connect MiniTavern’s OpenAI-compatible hub.

Step 5: Connect MiniTavern on Mobile / LAN

Same Wi-Fi (recommended for phones):

  1. Note your PC’s LAN IP (e.g. 192.168.1.50).
  2. KoboldCpp must listen on the network (check launch flags / firewall; allow port 5001).
  3. MiniTavern → custom endpoint → http://192.168.1.50:5001/v1.

Away from home:

  • Enable KoboldCpp Remote Tunnel for a temporary HTTPS link (convenience over raw port forward).
  • Or run SillyTavern on a VPN/Tailscale-connected laptop hitting localhost:5001.

Workflow: Character Card MarketChrome Extension → MiniTavern iOS/Android with your home KoboldCpp endpoint.

VRAM & GPU Layers Cheat Sheet

VRAMSuggested starting point
6 GB7B Q4, context 4096, reduce GPU layers if OOM
8 GB7B Q4/Q5 or 8B Q4, context 4096–6144
12 GB8B–14B Q4, context 8192
16 GB+14B Q4, higher context for lore-heavy cards

If layers spill to CPU, generation slows sharply—lower GPU Layers or use a smaller quant.

ModelQuantNotes
Qwen2.5 7B InstructQ4_K_MStrong instruction following for cards
Mistral 7B Instruct v0.3Q4_K_MFast, classic RP choice
Llama 3.1 8B InstructQ4_K_MBalanced quality
Tiefighter / RP fine-tunesQ4+Community RP merges on Hugging Face

Avoid sub-3B for complex personalities and World Info.

Troubleshooting

IssueFix
ST context ignored above 4KRaise Context Size in KoboldCpp before Launch
Connection refusedConfirm KoboldCpp running; URL http://127.0.0.1:5001/
CUDA error on launchTry oldpc build or nocuda + Vulkan
Gibberish / wrong formatUse Text Completion + KoboldCpp type, or fix chat template
Slow after long chatContext full—start new chat or summarize
Model not listed in STReconnect after Launch completes

KoboldCpp vs LM Studio vs Ollama

KoboldCppLM StudioOllama
InstallPortable binaryDesktop appCLI/daemon
Default port5001123411434
ST native connectorKoboldCpp API typeKoboldAI / OpenAIOllama
GPU tuningDeep (layers, quants)GUI-friendlySimpler
Remote mobileRemote TunnelLM Link (Tailscale)LAN mainly
Best forPower users, ST veteransGUI + model browserQuick local pull

Many users keep KoboldCpp on a gaming PC and MiniTavern on phone over LAN—maximum privacy, no subscription.

Privacy Best Practices

  1. Block outbound cloud fallbacks in ST/MiniTavern API settings.
  2. Download models from trusted Hugging Face repos (check SHA / author).
  3. Remote Tunnel exposes an endpoint—disable when not needed.
  4. Encrypt sensitive PNG cards if storing personal lore on disk.
  5. Update KoboldCpp regularly—security and speed fixes ship often.

Conclusion

KoboldCpp remains one of the most capable ways to run local LLM APIs for SillyTavern and MiniTavern character-card roleplay in 2026: portable, private, and deeply integrated with the tavern stack. Download a GGUF, launch on port 5001, connect ST with Text Completion → KoboldCpp, and your home GPU becomes the only inference provider you need.

Ready to build your library? Grab cards from the Character Card Market, install MiniTavern for mobile play, and point your connector at localhost:5001.

More guides you might like