Concepts

Understand the key ideas behind LiteRTLM Gateway before building with it.

Conversations

A conversation is the primary entity in the gateway. It holds configuration (system instruction, sampler settings, tools) and a message history. Every conversation has a unique name you choose at creation time.

Each inference turn consists of exactly one user message and one model reply. The stored history (bounded by the history window, described below) is replayed into the model at the start of every turn so it has context — but the engine itself holds no persistent state between turns. Every turn creates a fresh native Conversation object, uses it, and destroys it.

Key insight: The gateway is the memory. The model engine is stateless. History lives in SQLite and is injected fresh on every request.
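The turn lifecycle can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `Conversation`, `run_turn`, and the dict-based store are stand-ins (the real gateway persists to SQLite and drives a native engine object).

```python
# Sketch: the gateway is the memory, the engine is stateless.
# History lives in a store (a dict here, SQLite in the real gateway);
# a fresh engine Conversation is created and destroyed every turn.

history_store = {}  # conversation name -> list of (role, text)

class Conversation:
    """Stand-in for the native engine conversation object."""
    def __init__(self):
        self.messages = []

    def add(self, role, text):
        self.messages.append((role, text))

    def generate(self):
        # Placeholder "model": reports how much context it was given.
        return f"reply (saw {len(self.messages)} messages)"

def run_turn(name, user_text):
    history = history_store.setdefault(name, [])
    conv = Conversation()               # fresh native object every turn
    for role, text in history:          # replay stored history
        conv.add(role, text)
    conv.add("user", user_text)
    reply = conv.generate()
    history.append(("user", user_text)) # persist the new turn
    history.append(("model", reply))
    del conv                            # engine keeps nothing between turns
    return reply

run_turn("demo", "hello")
print(run_turn("demo", "second turn"))  # sees the replayed first turn
```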
Stateless Mode

By default, conversations are stateful — every turn is persisted to the database and replayed as history on the next turn. The model remembers what was said earlier in the conversation.

In stateless mode, messages are never saved. Each turn is treated as a completely fresh context — the model has no memory of previous turns.

Stateful
Messages persisted to DB. History replayed on every turn. Model remembers the conversation.
Stateless
Messages never saved. Each turn is independent. Model has no memory between turns.
⚠ Warning: Stateless conversations cannot be resumed. There is no history to load. Use stateless mode for one-shot queries, automated pipelines, or when privacy requires no persistence.

Enable stateless mode at conversation creation via the UI checkbox or the API:

POST /api/conversations
{
  "name":      "ephemeral-query",
  "config":    "assistant",
  "stateless": true
}
History Window

The gateway does not replay the entire conversation history on every turn — only the last 10 messages (the 5 most recent user messages and their 5 model replies) are included. This is the history window.

The window slides as new messages arrive — older messages fall off. This keeps inference latency predictable and bounds the token count sent to the model regardless of how long the conversation has been running.

Practical effect: In a very long conversation, the model may "forget" things discussed many turns ago. If long-term memory matters, consider using a tool (e.g. rogo_read_doc) to inject relevant information rather than relying on conversation history alone.
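The sliding window is simple to picture in code. A minimal sketch, assuming the documented limit of 10 messages (`WINDOW` and `windowed` are illustrative names, not gateway API):

```python
# Sketch of the sliding history window.
WINDOW = 10  # documented limit: last 10 messages

def windowed(history):
    """Return only the messages that are replayed into the model."""
    return history[-WINDOW:]

# 8 full turns = 16 messages; only the last 10 survive the window.
history = []
for i in range(8):
    history.append(("user", f"question {i}"))
    history.append(("model", f"answer {i}"))

replayed = windowed(history)
print(len(replayed))   # 10
print(replayed[0])     # oldest surviving message: turn 3's question
```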
Presets

Presets are named bundles of sampler settings and system instructions tuned for a specific use case. They are shortcuts — instead of specifying temperature, topK, and topP manually, you pick a preset that has sensible defaults for the task.

Preset      Top-K   Top-P   Temperature   Best for
assistant   40      0.95    0.8           General-purpose Q&A, everyday chat
coder       10      0.90    0.3           Code generation, technical explanations — low variance, deterministic
concise     10      0.85    0.2           Brief factual answers, summaries — minimal elaboration
creative    80      0.98    1.2           Storytelling, brainstorming — high variance, surprising output

You can override individual parameters (topK, topP, temperature) when creating a conversation to tune a preset further. A custom system instruction sets the preset label to custom.

Temperature

Temperature controls how random the model's token selection is. At each step, the model assigns a probability to every possible next token. Temperature scales those probabilities before sampling.

Low 0.1 – 0.4
Probabilities are sharpened — the highest-probability token wins almost every time. Output is predictable, consistent, and repetitive. Good for code, factual answers.
Medium 0.6 – 0.9
Balanced. The model follows likely paths but occasionally picks interesting alternatives. Good for general conversation.
High 1.0 – 1.5+
Probabilities are flattened — unlikely tokens get a real chance. Output is varied, surprising, and sometimes incoherent. Good for creative writing, brainstorming.
⚠ Note: Very high temperatures (> 1.5) can cause the model to produce nonsensical output. Keep it below 1.2 unless experimenting.
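The scaling step can be sketched in plain Python. This is an illustrative standalone helper, not the engine's implementation: logits are divided by the temperature before softmax, so low values sharpen the distribution and high values flatten it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then softmax.
    Low temperature sharpens the distribution; high flattens it."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                         # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.2)  # sharp: top token dominates
hot  = softmax_with_temperature(logits, 1.5)  # flat: others get real mass
print(round(cold[0], 3), round(hot[0], 3))
```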
Top-K

Top-K limits the model's token selection to the K most probable next tokens at each step. All tokens outside the top K are discarded — their probability is set to zero — before sampling.

Low K 5 – 15
Very narrow selection. The model stays close to what it considers most likely. Deterministic and focused. Can feel repetitive.
Medium K 20 – 50
Balanced vocabulary. The model can choose from a reasonable range of plausible words. Good default.
High K 60 – 100
Wide selection pool. The model can pick from many alternatives including less probable ones. More diverse output. Combine with high temperature for maximum creativity.
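The filtering step described above can be sketched as a small helper (illustrative, not the engine's code): everything outside the top K is zeroed, and the survivors are renormalized.

```python
def top_k_filter(probs, k):
    """Zero out all but the k highest-probability tokens, renormalize.
    Note: ties at the cutoff may keep slightly more than k tokens."""
    if k >= len(probs):
        return probs[:]
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))  # only the top 2 tokens survive
```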
Top-P (Nucleus Sampling)

Top-P (also called nucleus sampling) is an alternative way to limit the token pool. Instead of taking the top K tokens by count, it takes the smallest set of tokens whose cumulative probability reaches P.

For example, at top_p = 0.9: the model ranks all tokens by probability, then keeps adding them from most to least probable until their combined probability reaches 90%. Only those tokens are considered for sampling.

Low P 0.7 – 0.85
Tight nucleus. Only the most confident tokens are included. Focused and conservative.
High P 0.92 – 0.99
Wide nucleus. More tokens included, including less likely options. More varied output.
Top-P vs Top-K: Top-K is a fixed count; Top-P is adaptive — it naturally includes more tokens when the model is uncertain (flat distribution) and fewer when it is confident (sharp distribution). Most models use both together.
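The adaptive behavior is easy to demonstrate. A minimal sketch (illustrative helper, not the engine's implementation) that keeps the smallest prefix of tokens, sorted by probability, whose cumulative mass reaches P:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (the 'nucleus'); zero the rest and renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]

# Sharp distribution: the nucleus is tiny (2 tokens reach 90%).
print(top_p_filter([0.85, 0.10, 0.03, 0.02], 0.9))
# Flat distribution: the nucleus adapts and includes all 4 tokens.
print(top_p_filter([0.3, 0.3, 0.2, 0.2], 0.9))
```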
How Sampling Parameters Work Together

Temperature, Top-K, and Top-P are applied in sequence on every token generation step:

// At each step:
1. Model computes raw logits for all tokens in vocabulary
2. Apply temperature  → scale logits (lower = sharper, higher = flatter)
3. Apply Top-K        → zero out all tokens outside the top K
4. Apply Top-P        → zero out tokens that fall outside the nucleus
5. Sample             → pick one token from the remaining distribution

In practice, Top-K acts as a hard ceiling on vocabulary size, and Top-P then trims it further based on cumulative probability. Temperature controls how aggressively the model favors its top choices within that pool.
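The five steps above can be sketched end to end. This is a simplified model of the pipeline, not the engine's code; one assumption is flagged inline (the nucleus is measured relative to the post-Top-K pool, where real implementations may renormalize differently).

```python
import math, random

def sample_token(logits, temperature, top_k, top_p, rng):
    """Apply temperature -> Top-K -> Top-P in sequence, then sample."""
    # Steps 1-2. Temperature: scale logits, convert to probabilities.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Step 3. Top-K: zero everything outside the k most probable tokens.
    threshold = sorted(probs, reverse=True)[min(top_k, len(probs)) - 1]
    probs = [p if p >= threshold else 0.0 for p in probs]
    # Step 4. Top-P: trim the remaining pool to the nucleus.
    # (Assumption: nucleus measured against the post-Top-K mass.)
    remaining = sum(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        if probs[i] == 0.0:
            break
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p * remaining:
            break
    probs = [probs[i] if i in kept else 0.0 for i in range(len(probs))]
    # Step 5. Sample one token from what is left.
    return rng.choices(range(len(probs)), weights=probs)[0]

rng = random.Random(0)
token = sample_token([3.0, 1.0, 0.5, -1.0],
                     temperature=0.3, top_k=2, top_p=0.9, rng=rng)
print(token)  # low temperature + tight pool: the top logit wins
```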

Goal                   Temperature   Top-K      Top-P
Maximum determinism    0.1 – 0.3     5 – 10     0.8 – 0.85
Balanced general use   0.7 – 0.9     30 – 50    0.92 – 0.95
Maximum creativity     1.0 – 1.3     60 – 100   0.96 – 0.99
Tools

Tools let the model call server-side functions during inference. When the model decides it needs external information (e.g. the current time, a calculation result, or a document), it emits a tool call. The gateway intercepts it, runs the tool, and returns the result to the model — all before any token reaches the client.

From the client's perspective, the reply just arrives as normal streaming tokens. Tool calls are completely transparent.

Binding
Tools are bound at conversation creation time via the tools array. A conversation can only use tools that were bound when it was created. The tool list is stored in the DB and applied on every inference turn.
Execution
Tool calls happen inside the LiteRTLM SDK's automaticToolCalling loop — the model may call multiple tools in a single turn before generating its final answer.
Errors
If a tool fails, the error message is returned to the model as the tool result. The model reads it and responds accordingly — tool errors never crash inference.
When to bind tools: Only bind tools the model actually needs for that conversation. Unnecessary tools waste context tokens in the system prompt and may confuse the model into calling tools when it shouldn't.
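The shape of the tool-calling loop can be sketched as follows. This is not the LiteRTLM SDK's actual API; the model, tool registry, and message shapes here are hypothetical stand-ins showing the pattern: run tools (including turning tool failures into results) until the model produces a final answer.

```python
# Hypothetical stand-in for a server-side tool.
def clock_tool(_args):
    return "2025-01-01T00:00:00Z"

TOOLS = {"get_time": clock_tool}

def scripted_model(steps):
    """Fake model: emits scripted tool calls, then a final answer."""
    def step(_context):
        return steps.pop(0)
    return step

def run_inference(model_step, user_message):
    context = [("user", user_message)]
    while True:                          # automatic tool-calling loop
        out = model_step(context)
        if out["type"] == "tool_call":
            tool = TOOLS.get(out["name"])
            try:
                result = tool(out.get("args", {}))
            except Exception as e:       # tool errors go back to the model
                result = f"error: {e}"
            context.append(("tool", result))
        else:
            return out["text"]           # final answer streams to the client

model = scripted_model([
    {"type": "tool_call", "name": "get_time"},
    {"type": "final", "text": "It is 2025-01-01T00:00:00Z."},
])
print(run_inference(model, "What time is it?"))
```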
Inference Queue

The gateway runs a single engine on one dedicated thread. Only one inference task runs at a time. When multiple conversations send messages simultaneously, tasks queue up and are processed in submission order.

This is intentional — running two inferences simultaneously on a single CPU/GPU would slow both down. Serialized execution gives each turn full engine resources and predictable latency.

Queued
When your message is submitted while the engine is busy, the WebSocket receives a { "type": "queued", "position": N } frame immediately. The reply will arrive once all tasks ahead in the queue complete.
Detached
Inference runs detached from the WebSocket connection. If the client disconnects, the inference keeps running and the reply is persisted to the database. The client can reconnect and load the completed reply via GET /messages.
Queue monitor: The GET /api/queue endpoint (and the Queue tab in the UI) shows live queue state — which conversations are processing or waiting and their positions.
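Serialized execution is the classic single-worker queue pattern. A minimal sketch (not the gateway's code) using Python's standard library: one dedicated worker thread drains a FIFO queue, so tasks run one at a time in submission order no matter how many clients submit.

```python
import queue
import threading

tasks = queue.Queue()   # FIFO: processed in submission order
results = []

def engine_worker():
    """Single dedicated thread: only one inference runs at a time."""
    while True:
        conversation, message = tasks.get()
        if conversation is None:          # shutdown sentinel
            break
        results.append(f"{conversation}: reply to {message!r}")
        tasks.task_done()

worker = threading.Thread(target=engine_worker)
worker.start()

# Three conversations submit "simultaneously"; replies come back in order.
for name in ("alpha", "beta", "gamma"):
    tasks.put((name, "hello"))

tasks.join()            # wait until every queued task is processed
tasks.put((None, None)) # stop the worker
worker.join()
print(results)
```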
Rogo Docs (Enterprise Documents)

The Rogo Docs tools (rogo_list_docs and rogo_read_doc) give the model access to files in a designated directory on the server. This is the primary way to feed frequently-updated enterprise documents — policies, onboarding guides, runbooks, product specs — to the model without redeploying.

Directory
Files live in ./rogodocs/ next to the running JAR, or at a custom path set via the ROGO_DOCS_DIR environment variable.
Live reload
Drop a new file into rogodocs/ and it is immediately visible to rogo_list_docs. No restart needed.
Flow
The model calls rogo_list_docs to discover what is available, then rogo_read_doc with the exact filename to read the content. The two-step design lets the model decide which document is relevant.
Supported formats: Any plain-text file — .md, .txt, .json, .yaml, .csv, etc. Binary files (PDFs, images) are listed but their content will not be readable as text.
⚠ Security: Only bind rogo_list_docs and rogo_read_doc to conversations that should have access to these documents. Path traversal is prevented server-side, but any file placed in rogodocs/ is accessible to any conversation with these tools bound.
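The two-step flow can be sketched in a few lines. These are illustrative re-implementations, not the gateway's actual tool code; the basename check stands in for the server-side path-traversal guard mentioned above.

```python
import os
import tempfile

def rogo_list_docs(docs_dir):
    """Step 1: the model discovers which documents exist."""
    return sorted(os.listdir(docs_dir))

def rogo_read_doc(docs_dir, filename):
    """Step 2: the model reads one document by exact filename."""
    if os.path.basename(filename) != filename:   # reject path traversal
        return "error: invalid filename"
    path = os.path.join(docs_dir, filename)
    if not os.path.isfile(path):
        return "error: no such document"
    with open(path, encoding="utf-8") as f:
        return f.read()

with tempfile.TemporaryDirectory() as docs:
    with open(os.path.join(docs, "onboarding.md"), "w", encoding="utf-8") as f:
        f.write("Welcome aboard.")
    print(rogo_list_docs(docs))                 # ['onboarding.md']
    print(rogo_read_doc(docs, "onboarding.md")) # Welcome aboard.
    print(rogo_read_doc(docs, "../etc/passwd")) # error: invalid filename
```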