Use the REST and WebSocket APIs to integrate LiteRTLM into your application.

```text
http://<host>:<port>
```

All REST endpoints are prefixed with `/api`. The WebSocket endpoint is at `/ws`.
Two credential types are accepted, depending on the endpoint:

| Type | Header | Endpoints |
|---|---|---|
| JWT | `Authorization: Bearer <accessToken>` | All `/api/conversations/*`, `/api/api-key/*`, WebSocket |
| API Key | `Authorization: Bearer lrtlm_<key>` | All `/api/conversations/*`, WebSocket |

API key management endpoints (`/api/api-key/*`) require JWT only.
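Both credential types travel in the same header, so a single client-side helper covers them. A minimal sketch (illustrative code, not part of any LiteRTLM SDK):

```javascript
// Both JWTs and API keys are sent as a Bearer token in the same header.
// API keys are recognisable by their "lrtlm_" prefix.
function authHeader(credential) {
  return { Authorization: `Bearer ${credential}` };
}

function isApiKey(credential) {
  return credential.startsWith('lrtlm_');
}

console.log(authHeader('lrtlm_abc123').Authorization); // 'Bearer lrtlm_abc123'
console.log(isApiKey('eyJhbGciOi...'));                // false
```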
Authenticate and receive a JWT pair. No auth header required.

Request body:

```json
{
  "username": "admin",
  "password": "your-password"
}
```

Response:

```json
{
  "ok": true,
  "accessToken": "eyJ...",
  "refreshToken": "eyJ..."
}
```
Exchange a refresh token for a new JWT pair. Refresh tokens rotate on every use.

Request body:

```json
{
  "refreshToken": "eyJ..."
}
```

Response:

```json
{
  "ok": true,
  "accessToken": "eyJ...",
  "refreshToken": "eyJ..."
}
```
Invalidate the current session. The access token expires naturally after 15 minutes.

Request body:

```json
{
  "refreshToken": "eyJ..."
}
```

Response:

```json
{ "ok": true }
```
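Because refresh tokens rotate on every use, the client must replace both tokens after each refresh. A minimal sketch of a client-side token store (all names here are illustrative, not part of the LiteRTLM API):

```javascript
// Holds the current JWT pair and applies the rotated pair returned by
// a refresh response. The old refresh token is invalid after rotation,
// so it must be discarded immediately.
class TokenStore {
  constructor({ accessToken, refreshToken }) {
    this.accessToken = accessToken;
    this.refreshToken = refreshToken;
  }

  applyRefresh(response) {
    if (!response.ok) throw new Error('refresh failed');
    this.accessToken = response.accessToken;
    this.refreshToken = response.refreshToken;
  }
}

const store = new TokenStore({ accessToken: 'eyJ.old', refreshToken: 'eyJ.r1' });
store.applyRefresh({ ok: true, accessToken: 'eyJ.new', refreshToken: 'eyJ.r2' });
console.log(store.accessToken); // 'eyJ.new'
```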
All conversation endpoints accept a JWT or an API key via `Authorization: Bearer`.

List all conversations.

Response:

```json
{
  "ok": true,
  "conversations": ["my-chat", "code-review"]
}
```
Create a new conversation. Choose a builtin preset or supply a custom system instruction. Optionally bind tools by name.

Request body (builtin preset):

```json
{
  "name": "my-chat",
  "config": "assistant" // "assistant" | "coder" | "concise" | "creative"
}
```

Request body (preset with tools):

```json
{
  "name": "my-chat",
  "config": "assistant",
  "tools": ["datetime", "calculator"] // optional — bind tools by name
}
```

Request body (custom system instruction):

```json
{
  "name": "my-chat",
  "systemInstruction": "You are a pirate. Respond only in pirate speak.",
  "tools": ["datetime"] // tools work with custom instructions too
}
```

Response:

```json
{
  "ok": true,
  "name": "my-chat",
  "config": "assistant"
}
```
Retrieve full message history for a conversation, ordered oldest-first.

Response:

```json
{
  "ok": true,
  "messages": [
    { "role": "user", "text": "Hello", "seq": 0, "createdAt": 1700000000000 },
    { "role": "model", "text": "Hi! How can I help?", "seq": 1, "createdAt": 1700000000000 }
  ]
}
```
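A client can turn this response into a readable transcript. An illustrative helper, assuming only the response shape shown above (the sort by `seq` just makes the documented oldest-first ordering explicit):

```javascript
// Render a history response as "role: text" lines, ordered by seq.
function renderHistory(response) {
  return [...response.messages]
    .sort((a, b) => a.seq - b.seq)
    .map((m) => `${m.role}: ${m.text}`)
    .join('\n');
}

const history = {
  ok: true,
  messages: [
    { role: 'user', text: 'Hello', seq: 0, createdAt: 1700000000000 },
    { role: 'model', text: 'Hi! How can I help?', seq: 1, createdAt: 1700000000000 },
  ],
};

console.log(renderHistory(history));
// user: Hello
// model: Hi! How can I help?
```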
Send a message and receive the full reply in a single blocking response. For streaming, use the WebSocket endpoint instead.

Request body:

```json
{
  "message": "Explain binary search trees"
}
```

Response:

```json
{
  "ok": true,
  "reply": "A binary search tree is a data structure..."
}
```
Permanently delete a conversation and all its message history.

Response:

```json
{ "ok": true }
```
Tools let the model call server-side functions during inference. When a tool is bound to a conversation, the model can invoke it automatically — the entire call/response loop happens inside the SDK before any token reaches the client. From the client's perspective, the reply arrives as normal streaming tokens.
List all tools currently registered on the server. No authentication required.

Response:

```json
{
  "ok": true,
  "tools": [
    {
      "name": "datetime",
      "description": "Get the current date and time...",
      "parameters": [
        { "name": "format", "type": "STRING", "description": "Date format pattern", "required": false },
        { "name": "timezone", "type": "STRING", "description": "IANA timezone ID", "required": false }
      ]
    },
    {
      "name": "calculator",
      "description": "Evaluate a mathematical expression...",
      "parameters": [
        { "name": "expression", "type": "STRING", "description": "A math expression", "required": true }
      ]
    }
  ]
}
```
Pass `"tools"` when creating a conversation to bind tools by name:

- Tools are resolved from the server registry — unknown names are silently skipped.
- Omit `"tools"` or pass `[]` for a tool-free conversation.

```json
{
  "name": "research-chat",
  "config": "assistant",
  "tools": ["datetime", "calculator"]
}
```
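The resolution rule above can be sketched as a simple filter. This is illustrative client-side logic mirroring the documented behaviour, not the server's actual implementation; the registry contents are taken from the builtin tools listed below:

```javascript
// Mirror of the documented rule: requested names are matched against the
// server registry, and unknown names are silently skipped (no error).
const registry = new Set(['datetime', 'calculator', 'rogo_list_docs', 'rogo_read_doc']);

function resolveTools(requested = []) {
  return requested.filter((name) => registry.has(name));
}

console.log(resolveTools(['datetime', 'no-such-tool'])); // ['datetime']
console.log(resolveTools());                             // []  (tool-free conversation)
```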
| Name | Description | Parameters |
|---|---|---|
| datetime | Returns the current date and time | `format` (optional) — Java date pattern, e.g. `yyyy-MM-dd HH:mm:ss`; `timezone` (optional) — IANA ID, e.g. `Asia/Tokyo` |
| calculator | Evaluates a math expression and returns the result | `expression` (required) — e.g. `(3 + 5) * 2`, `Math.sqrt(16)` |
| rogo_list_docs | Lists all enterprise documents available in the rogodocs directory. Returns name, size, and last-modified date for each file. | None |
| rogo_read_doc | Reads the full content of an enterprise document by filename. Call rogo_list_docs first to discover available files. | `filename` (required) — exact filename, e.g. `hr-policy.md` |
```text
// 1. Create a conversation with tools bound
POST /api/conversations
{ "name": "helper", "config": "assistant", "tools": ["datetime", "calculator"] }

// 2. Send a message — tool calls happen inside the model, transparently
WS /ws/conversations/helper?token=lrtlm_...
→ { "message": "What is 2 to the power of 10, and what time is it in Tokyo?" }

// Model calls: calculator("Math.pow(2,10)") → "1024"
//              datetime(timezone="Asia/Tokyo") → "2026-04-11 18:30:00"
// Then generates the final reply using both results.

← { "type": "token", "token": "2 to the power of 10 is 1024..." }
← { "type": "done" }

// 3. The client sees only the final answer — tool calls are invisible
```
If a tool fails (bad parameters, runtime error), it returns a descriptive error string
to the model — e.g. "Error: could not evaluate expression '1/0': Division by zero".
The model reads this as the tool result and responds accordingly.
Tool errors never interrupt streaming or cause a 503.
The gateway processes inference tasks one at a time on a single engine. When multiple conversations send messages simultaneously, tasks queue and are processed in submission order.
Returns the current queue snapshot — which conversations are processing or waiting, and their positions. No authentication required.
Response (idle):

```json
{
  "ok": true,
  "size": 0,
  "queue": []
}
```

Response (two tasks queued):

```json
{
  "ok": true,
  "size": 2,
  "queue": [
    { "name": "chat-1", "position": 1, "status": "processing" },
    { "name": "chat-5", "position": 2, "status": "waiting" }
  ]
}
```

`position` is 1-based; position 1 is the conversation currently being processed. The `status` field is either `"processing"` or `"waiting"`.
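A client polling this endpoint typically wants to know how many tasks sit ahead of its own conversation. An illustrative helper over the snapshot shape above (not part of the API):

```javascript
// Count the tasks ahead of a given conversation in a queue snapshot.
// Returns null if the conversation is not queued at all.
function tasksAhead(snapshot, name) {
  const entry = snapshot.queue.find((t) => t.name === name);
  if (!entry) return null;
  return entry.position - 1; // position is 1-based
}

const snapshot = {
  ok: true,
  size: 2,
  queue: [
    { name: 'chat-1', position: 1, status: 'processing' },
    { name: 'chat-5', position: 2, status: 'waiting' },
  ],
};

console.log(tasksAhead(snapshot, 'chat-5')); // 1
console.log(tasksAhead(snapshot, 'chat-9')); // null
```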
Stream model replies token-by-token over a persistent WebSocket connection.
Pass your JWT access token or API key as the token query parameter.
The connection stays open across multiple turns — send a new message after receiving "done".
Inference runs detached — if you disconnect mid-generation, the reply still completes and persists to DB.
Send a JSON frame with your message:

```json
{ "message": "What is Kotlin coroutines?" }
```
| Type | When | Fields |
|---|---|---|
| queued | Engine is busy with another conversation — your task is waiting | position — 1-based queue position |
| busy | Inference is now running for your message | — |
| token | One partial token streamed from the model | token — partial text string |
| done | Turn complete — authoritative full reply | reply — full reply text |
| error | Recoverable error — connection stays open | error — error message |
```text
// Client sends:
{ "message": "Explain binary search" }

// Server responds:
{ "type": "busy" }
{ "type": "token", "token": "Binary" }
{ "type": "token", "token": " search" }
{ "type": "token", "token": " is..." }
// ... one frame per token ...
{ "type": "done", "reply": "Binary search is a fast algorithm..." }
```
```text
// Server responds immediately with queue position:
{ "type": "queued", "position": 2 }

// Then when it's your turn:
{ "type": "busy" }
{ "type": "token", "token": "Binary" }
// ...
{ "type": "done", "reply": "Binary search is a fast algorithm..." }
```
```js
const ws = new WebSocket(
  'ws://localhost:8080/ws/conversations/my-chat?token=lrtlm_...'
);

let streamBuffer = '';

ws.onopen = () => {
  ws.send(JSON.stringify({ message: 'Hello!' }));
};

ws.onmessage = (evt) => {
  const frame = JSON.parse(evt.data);
  if (frame.type === 'queued') {
    console.log('Queued at position', frame.position);
  } else if (frame.type === 'busy') {
    console.log('Inference running...');
    streamBuffer = '';
  } else if (frame.type === 'token') {
    streamBuffer += frame.token;
    process.stdout.write(frame.token); // live streaming
  } else if (frame.type === 'done') {
    console.log('\n[done]', frame.reply); // authoritative full reply
  } else if (frame.type === 'error') {
    console.error('[error]', frame.error);
  }
};
```
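The frame handling in the client above can also be factored into a small pure reducer, which is easier to unit-test than socket callbacks. An illustrative sketch, assuming only the frame shapes documented in the table above:

```javascript
// Pure reducer over server frames: "busy" starts a fresh turn, "token"
// frames accumulate into a buffer, and "done" carries the authoritative
// full reply (which should be preferred over the accumulated buffer).
function reduceFrame(state, frame) {
  switch (frame.type) {
    case 'busy':  return { ...state, buffer: '', done: false };
    case 'token': return { ...state, buffer: state.buffer + frame.token };
    case 'done':  return { ...state, reply: frame.reply, done: true };
    case 'error': return { ...state, error: frame.error };
    default:      return state; // e.g. "queued" does not change the text state
  }
}

const frames = [
  { type: 'busy' },
  { type: 'token', token: 'Binary' },
  { type: 'token', token: ' search' },
  { type: 'done', reply: 'Binary search is...' },
];

const final = frames.reduce(reduceFrame, { buffer: '', reply: null, done: false });
console.log(final.reply);  // 'Binary search is...'
console.log(final.buffer); // 'Binary search'
```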
| Scenario | Behaviour |
|---|---|
| Disconnect mid-inference | Inference keeps running. Reply persists to DB on completion. Reload history via GET /messages. |
| Reconnect while inference running | Server sends busy, streams tokens from that point on, then done with full reply. |
| Reconnect after inference done | No frames sent. Load history via GET /api/conversations/{name}/messages. |
These endpoints require JWT only (`Authorization: Bearer <accessToken>`).
Generate a new API key. The raw key is returned once — store it immediately.

Request body:

```json
{ "name": "mobile-client" }
```

Response:

```json
{
  "ok": true,
  "key": "lrtlm_...", // raw key — shown once only
  "id": "uuid",
  "prefix": "lrtlm_Xx",
  "name": "mobile-client"
}
```
List all API keys (active and revoked). Raw keys are never returned.

Response:

```json
{
  "ok": true,
  "keys": [
    {
      "id": "uuid",
      "prefix": "lrtlm_Xx",
      "name": "mobile-client",
      "active": true,
      "createdAt": 1700000000000,
      "lastUsedAt": 1700000001000
    }
  ]
}
```
Look up metadata for a specific key by its raw value.
Soft-revoke an API key. The record is kept for audit purposes.

Request body:

```json
{ "key": "lrtlm_..." }
```

Response:

```json
{ "ok": true }
```
All errors follow the same structure:

```json
{ "ok": false, "error": "Human-readable message" }
```
| Status | Meaning |
|---|---|
| 400 | Bad request — missing or invalid field |
| 401 | Unauthorized — missing, invalid, or expired token |
| 404 | Resource not found |
| 409 | Conflict — resource already exists |
| 503 | Engine not ready — model is still loading |
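Because every error carries both an HTTP status and the uniform `{ "ok": false, "error": ... }` body, a client can fold both into one check. An illustrative helper (the `ApiError` and `unwrap` names are not part of the API):

```javascript
// Wraps the HTTP status and the error envelope's message in one exception.
class ApiError extends Error {
  constructor(status, message) {
    super(`${status}: ${message}`);
    this.status = status;
  }
}

// Returns the body on success; throws ApiError on the uniform error envelope.
function unwrap(status, body) {
  if (body.ok) return body;
  throw new ApiError(status, body.error);
}

// Usage:
try {
  unwrap(409, { ok: false, error: 'Conversation already exists' });
} catch (e) {
  console.log(e.status); // 409
}
```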