Understand the key ideas behind LiteRTLM Gateway before building with it.
A conversation is the primary entity in the gateway. It holds configuration (system instruction, sampler settings, tools) and a message history. Every conversation has a unique name you choose at creation time.
Each inference turn consists of exactly one user message and one model reply. The recent
history (bounded by the history window) is replayed into the model at the start of every
turn so it has context — but the engine itself holds no persistent state between turns.
Every turn creates a fresh native Conversation object, uses it, and destroys it.
By default, conversations are stateful — every turn is persisted to the database and replayed as history on the next turn. The model remembers what was said earlier in the conversation.
In stateless mode, messages are never saved. Each turn is treated as a completely fresh context — the model has no memory of previous turns.
Enable stateless mode at conversation creation via the UI checkbox or the API:
```
POST /api/conversations
{
  "name": "ephemeral-query",
  "config": "assistant",
  "stateless": true
}
```
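As a sketch of building that request from a client, the payload below uses the field names from the docs; the base URL in the comment is a placeholder for your deployment, and the HTTP client is left to you.

```python
import json

# Creation payload for a stateless conversation (field names from the docs).
payload = {
    "name": "ephemeral-query",
    "config": "assistant",   # preset name
    "stateless": True,       # never persist messages for this conversation
}

body = json.dumps(payload)
# POST it with your HTTP client of choice, e.g.:
#   requests.post("http://localhost:8080/api/conversations", data=body,
#                 headers={"Content-Type": "application/json"})
print(body)
```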
The gateway does not replay the entire conversation history on every turn — only the last 10 messages (5 user + 5 model turns) are included. This is the history window.
The window slides as new messages arrive — older messages fall off. This keeps inference latency predictable and bounds the token count sent to the model regardless of how long the conversation has been running.
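The sliding window is easy to picture as a slice over the stored history. This is an illustrative helper, not gateway source code:

```python
# Sketch of the sliding history window: only the most recent `window`
# messages are replayed into the model on the next turn.
WINDOW = 10  # 5 user + 5 model turns

def history_window(messages, window=WINDOW):
    """Return the slice of history that gets replayed on the next turn."""
    return messages[-window:]

# A long-running conversation: 14 messages alternating user/model.
history = [f"{'user' if i % 2 == 0 else 'model'}-{i}" for i in range(14)]
replayed = history_window(history)
print(len(replayed))   # 10 — older messages have fallen off
print(replayed[0])     # "user-4"
```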
For context that falls outside the window, use tools (such as rogo_read_doc) to inject relevant information rather than relying
on conversation history alone.
Presets are named bundles of sampler settings and system instructions tuned for a specific use case. They are shortcuts — instead of specifying temperature, topK, and topP manually, you pick a preset that has sensible defaults for the task.
| Preset | Top-K | Top-P | Temperature | Best for |
|---|---|---|---|---|
| assistant | 40 | 0.95 | 0.8 | General-purpose Q&A, everyday chat |
| coder | 10 | 0.90 | 0.3 | Code generation, technical explanations — low variance, near-deterministic |
| concise | 10 | 0.85 | 0.2 | Brief factual answers, summaries — minimal elaboration |
| creative | 80 | 0.98 | 1.2 | Storytelling, brainstorming — high variance, surprising output |
You can override individual parameters (topK, topP, temperature) when creating a
conversation to tune a preset further. A custom system instruction sets the preset
label to custom.
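The resolution logic can be sketched as a dictionary merge. The dict shapes and function below are illustrative assumptions, not the gateway's actual schema; the values match the preset table above.

```python
# Preset values from the table above (sketch, not the gateway's schema).
PRESETS = {
    "assistant": {"topK": 40, "topP": 0.95, "temperature": 0.8},
    "coder":     {"topK": 10, "topP": 0.90, "temperature": 0.3},
    "concise":   {"topK": 10, "topP": 0.85, "temperature": 0.2},
    "creative":  {"topK": 80, "topP": 0.98, "temperature": 1.2},
}

def resolve_config(preset, system_instruction=None, **overrides):
    """Start from a preset and layer on per-conversation overrides.
    Only a custom system instruction flips the label to 'custom'."""
    cfg = dict(PRESETS[preset])
    cfg.update(overrides)
    cfg["label"] = "custom" if system_instruction is not None else preset
    return cfg

# Tuning a preset keeps its label; a custom instruction does not.
print(resolve_config("coder", temperature=0.1)["label"])  # coder
```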
Temperature controls how random the model's token selection is. At each step, the model assigns a probability to every possible next token. Temperature scales those probabilities before sampling.
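A minimal sketch of that scaling, assuming the usual divide-logits-then-softmax formulation:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then softmax (illustrative sketch)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, 0.2)   # sharper: favors the top token
hot  = softmax_with_temperature(logits, 1.5)   # flatter: spreads probability
print(cold[0] > hot[0])  # True
```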
Top-K limits the model's token selection to the K most probable next tokens at each step. All tokens outside the top K are discarded — their probability is set to zero — before sampling.
Top-P (also called nucleus sampling) is an alternative way to limit the token pool. Instead of keeping a fixed number of tokens, it keeps the smallest set of tokens whose cumulative probability reaches P.
For example, at top_p = 0.9: the model ranks all tokens by probability,
then keeps adding them from most to least probable until their combined probability
reaches 90%. Only those tokens are considered for sampling.
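The worked example above can be sketched directly. This is an illustrative implementation of nucleus selection, not the engine's:

```python
def nucleus(probs, top_p=0.9):
    """Keep the smallest prefix of probability-sorted tokens whose
    cumulative probability reaches top_p (illustrative sketch)."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return kept

# Mirrors the example above: keep adding tokens until 90% is covered.
probs = [0.5, 0.25, 0.15, 0.07, 0.03]
print(nucleus(probs))  # [0, 1, 2] — 0.5 + 0.25 + 0.15 = 0.90
```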
Temperature, Top-K, and Top-P are applied in sequence on every token generation step:
```
// At each step:
1. Model computes raw logits for all tokens in the vocabulary
2. Apply temperature → scale logits (lower = sharper, higher = flatter)
3. Apply Top-K      → zero out all tokens outside the top K
4. Apply Top-P      → zero out tokens that fall outside the nucleus
5. Sample           → pick one token from the remaining distribution
```
In practice, Top-K acts as a hard ceiling on vocabulary size, and Top-P then trims it further based on cumulative probability. Temperature controls how aggressively the model favors its top choices within that pool.
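Putting the three filters together in the order described above, a minimal end-to-end sketch (an illustrative implementation, not the engine's actual sampler):

```python
import math
import random

def sample_token(logits, temperature=0.8, top_k=40, top_p=0.95, rng=None):
    """Temperature → Top-K → Top-P → sample, as described above (sketch)."""
    rng = rng or random.Random()
    # Temperature: scale logits, then softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Top-K: hard ceiling on the candidate pool.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    survivors = order[:top_k]
    # Top-P: trim the survivors to the smallest nucleus reaching P.
    kept, cum = [], 0.0
    for i in survivors:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Sample one token from the remaining (renormalized) distribution.
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]
```

With top_k=1 the pool collapses to the single most probable token, which is why low Top-K values in the table below push toward determinism.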
| Goal | Temperature | Top-K | Top-P |
|---|---|---|---|
| Maximum determinism | 0.1 – 0.3 | 5 – 10 | 0.8 – 0.85 |
| Balanced general use | 0.7 – 0.9 | 30 – 50 | 0.92 – 0.95 |
| Maximum creativity | 1.0 – 1.3 | 60 – 100 | 0.96 – 0.99 |
Tools let the model call server-side functions during inference. When the model decides it needs external information (e.g. the current time, a calculation result, or a document), it emits a tool call. The gateway intercepts it, runs the tool, and returns the result to the model — all before any token reaches the client.
From the client's perspective, the reply just arrives as normal streaming tokens. Tool calls are completely transparent.
Bind tools to a conversation at creation time by listing them in the tools array.
A conversation can only use tools that were bound when it was created.
The tool list is stored in the DB and applied on every inference turn.
Tool execution happens inside an automaticToolCalling loop —
the model may call multiple tools in a single turn before generating its final answer.
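The shape of that loop can be sketched with stand-ins. The model function and tool registry below are hypothetical, not the gateway's real interfaces:

```python
# Minimal sketch of an automatic tool-calling loop (stand-in interfaces).
TOOLS = {"get_time": lambda: "12:00 UTC"}

def fake_model(history):
    """Stand-in for inference: emits a tool call first, then a final answer."""
    if not any(m.startswith("tool:") for m in history):
        return {"tool_call": "get_time"}
    return {"text": f"The time is {history[-1].removeprefix('tool:')}."}

def run_turn(user_message):
    history = [f"user:{user_message}"]
    while True:
        out = fake_model(history)
        if "tool_call" in out:                # intercept, execute, loop again
            result = TOOLS[out["tool_call"]]()
            history.append(f"tool:{result}")
        else:                                 # final answer streams to client
            return out["text"]

print(run_turn("what time is it?"))  # The time is 12:00 UTC.
```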
The gateway runs a single engine on one dedicated thread. Only one inference task runs at a time. When multiple conversations send messages simultaneously, tasks queue up and are processed in submission order.
This is intentional — running two inferences simultaneously on a single CPU/GPU would slow both down. Serialized execution gives each turn full engine resources and predictable latency.
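The single-worker pattern can be sketched with a FIFO queue drained by one thread; this is illustrative, not gateway code:

```python
import queue
import threading

# One worker drains a FIFO queue, so tasks run strictly in submission order.
tasks = queue.Queue()
completed = []

def worker():
    while True:
        name = tasks.get()
        if name is None:          # sentinel: shut the worker down
            break
        completed.append(name)    # stand-in for running one inference turn
        tasks.task_done()

t = threading.Thread(target=worker)
t.start()
for conv in ["conv-a", "conv-b", "conv-c"]:
    tasks.put(conv)
tasks.put(None)
t.join()
print(completed)  # ['conv-a', 'conv-b', 'conv-c'] — strict submission order
```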
If the engine is busy, the client receives a { "type": "queued", "position": N } frame immediately. The reply will
arrive once all tasks ahead in the queue complete.
Completed replies are stored in the conversation history and can be retrieved via GET /messages.
The GET /api/queue endpoint (and the Queue tab
in the UI) shows live queue state — which conversations are processing or waiting, and their positions.
The Rogo Docs tools (rogo_list_docs and rogo_read_doc)
give the model access to files in a designated directory on the server. This is the primary
way to feed frequently-updated enterprise documents — policies, onboarding guides, runbooks,
product specs — to the model without redeploying.
The docs directory lives at ./rogodocs/ next to the running JAR, or at a custom path
set via the ROGO_DOCS_DIR environment variable.
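That resolution order can be sketched as follows; the helper is illustrative (the gateway itself is a JAR), but the variable name and default come from the docs:

```python
import os
from pathlib import Path

def docs_dir():
    """Resolve the Rogo Docs directory: ROGO_DOCS_DIR if set,
    otherwise ./rogodocs next to the process (illustrative sketch)."""
    custom = os.environ.get("ROGO_DOCS_DIR")
    return Path(custom) if custom else Path("./rogodocs")
```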
Drop a file into rogodocs/ and it is immediately visible to
rogo_list_docs. No restart needed.
The model first calls rogo_list_docs to discover what is available,
then rogo_read_doc with the exact filename to read the content.
The two-step design lets the model decide which document is relevant.
Any plain-text format works: .md, .txt,
.json, .yaml, .csv, etc. Binary files (PDFs, images)
are listed but their content will not be readable as text.
Bind rogo_list_docs and rogo_read_doc
only to conversations that should have access to these documents. Path traversal is prevented
server-side, but any file placed in rogodocs/ is accessible to any conversation
with these tools bound.
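A typical traversal check resolves the requested name inside the root and rejects anything that escapes it. This is a sketch of the pattern, not the gateway's actual server-side code:

```python
from pathlib import Path

def safe_resolve(docs_root, requested_name):
    """Resolve a requested filename inside the docs root, rejecting any
    path that escapes it (sketch of a server-side traversal check)."""
    root = Path(docs_root).resolve()
    target = (root / requested_name).resolve()
    if root not in target.parents and target != root:
        raise ValueError(f"path escapes docs root: {requested_name}")
    return target

print(safe_resolve("/srv/rogodocs", "handbook.md"))   # allowed
# safe_resolve("/srv/rogodocs", "../etc/passwd")      # raises ValueError
```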