Quick Start
curl -X POST https://api.cuadra.ai/v1/chats \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"modelId": "model_abc123",
"messages": [{"role": "user", "content": "Hello!"}]
}'
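The same request from Python, using the requests library (a minimal sketch; the shape of the response body beyond what this guide shows is not specified here, so it is simply printed):

import requests

API_URL = "https://api.cuadra.ai/v1/chats"
HEADERS = {
    "Authorization": "Bearer YOUR_TOKEN",
    "Content-Type": "application/json",
}

resp = requests.post(API_URL, headers=HEADERS, json={
    "modelId": "model_abc123",
    "messages": [{"role": "user", "content": "Hello!"}],
})
resp.raise_for_status()
print(resp.json())  # inspect the completion payload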
Streaming Responses
Enable stream: true for real-time responses via Server-Sent Events:
curl -X POST https://api.cuadra.ai/v1/chats \
-H "Authorization: Bearer YOUR_TOKEN" \
-H "Content-Type: application/json" \
-d '{"modelId": "model_abc", "messages": [...], "stream": true}'
Stream format:
data: {"id":"chat_xyz","delta":"Once","finished":false}
data: {"id":"chat_xyz","delta":" upon","finished":false}
data: {"id":"chat_xyz","delta":"","finished":true,"usage":{...}}
data: [DONE]
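A minimal Python sketch for consuming this stream (assumes the requests library; the event fields follow the delta/finished format shown above):

import json
import requests

resp = requests.post(
    "https://api.cuadra.ai/v1/chats",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "modelId": "model_abc",
        "messages": [{"role": "user", "content": "Tell me a story"}],
        "stream": True,
    },
    stream=True,  # keep the connection open and read incrementally
)

full_text = []
for line in resp.iter_lines(decode_unicode=True):
    if not line or not line.startswith("data: "):
        continue  # skip blank lines between events
    payload = line[len("data: "):]
    if payload == "[DONE]":
        break
    event = json.loads(payload)
    full_text.append(event["delta"])   # incremental text
    if event["finished"]:
        usage = event.get("usage")     # final chunk carries token usage
print("".join(full_text))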
For Vercel AI SDK compatibility, add the corresponding request header. The stream then uses the AI SDK event types: start, text-delta, source-document, reasoning-delta, tool-input-delta, finish.
Reasoning (Extended Thinking)
Set enableReasoning: true to see the model’s thinking process. Supported by Claude (Sonnet/Opus), OpenAI o1/o3, and Gemini thinking models.
{
"modelId": "model_claude",
"messages": [...],
"enableReasoning": true,
"reasoningBudget": 10000
}
Reasoning tokens are billed separately. Use reasoningBudget to cap costs.
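A sketch of a reasoning-enabled request in Python (assumes the requests library; the exact shape of reasoning data in the response is not specified in this guide, so only usage is inspected, and a top-level usage field is an assumption):

import requests

resp = requests.post(
    "https://api.cuadra.ai/v1/chats",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "modelId": "model_claude",
        "messages": [{"role": "user", "content": "Plan a 3-city rail trip."}],
        "enableReasoning": True,
        "reasoningBudget": 10000,  # cap on reasoning tokens to control cost
    },
)
data = resp.json()
print(data.get("usage"))  # reasoning tokens are billed separately, so check usage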
Structured Outputs (JSON Mode)
Force JSON schema compliance with responseFormat:
{
"modelId": "model_abc",
"messages": [{"role": "user", "content": "Extract: iPhone 15 Pro costs $999"}],
"responseFormat": {
"type": "json_schema",
"json_schema": {
"name": "product",
"strict": true,
"schema": {
"type": "object",
"properties": {
"name": {"type": "string"},
"price": {"type": "number"}
},
"required": ["name", "price"]
}
}
}
}
Response content will be valid JSON: {"name": "iPhone 15 Pro", "price": 999}
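Because strict: true guarantees schema-valid output, the returned content can be parsed directly. A Python sketch (the path used to pull the assistant content out of the response envelope is an assumption, since the full response shape is not shown here):

import json
import requests

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "price": {"type": "number"}},
    "required": ["name", "price"],
}

resp = requests.post(
    "https://api.cuadra.ai/v1/chats",
    headers={"Authorization": "Bearer YOUR_TOKEN"},
    json={
        "modelId": "model_abc",
        "messages": [{"role": "user", "content": "Extract: iPhone 15 Pro costs $999"}],
        "responseFormat": {
            "type": "json_schema",
            "json_schema": {"name": "product", "strict": True, "schema": schema},
        },
    },
)
content = resp.json()["content"]   # assumed field; adjust to the actual response envelope
product = json.loads(content)      # e.g. {"name": "iPhone 15 Pro", "price": 999}
print(product["name"], product["price"])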
Tool Calling
Define tools the model can invoke:
{
"modelId": "model_abc",
"messages": [{"role": "user", "content": "Weather in Paris?"}],
"tools": [{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get current weather",
"parameters": {
"type": "object",
"properties": {"location": {"type": "string"}},
"required": ["location"]
}
}
}]
}
When the model calls a tool, execute it and send the result back in a tool message:
{
"chatId": "chat_xyz",
"messages": [
{"role": "user", "content": "Weather in Paris?"},
{"role": "assistant", "toolCalls": [{"id": "call_1", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}]},
{"role": "tool", "toolCallId": "call_1", "content": "{\"temp\": 18, \"conditions\": \"sunny\"}"}
]
}
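A sketch of the full tool-calling round trip in Python. It assumes the API returns the assistant's tool calls as toolCalls and the chat id as id (both are assumptions matching the shapes echoed back above and in the stream events); get_weather is a stand-in for your own function:

import json
import requests

API_URL = "https://api.cuadra.ai/v1/chats"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

def get_weather(location: str) -> dict:
    # Stand-in implementation; call a real weather service here.
    return {"temp": 18, "conditions": "sunny"}

# 1. Ask the question with the tool available (same definition as above).
first = requests.post(API_URL, headers=HEADERS, json={
    "modelId": "model_abc",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }],
}).json()

# 2. Run the requested tool locally and send the result back.
tool_calls = first["toolCalls"]                    # assumed response field
call = tool_calls[0]
args = json.loads(call["function"]["arguments"])   # e.g. {"location": "Paris"}
result = get_weather(**args)

second = requests.post(API_URL, headers=HEADERS, json={
    "chatId": first["id"],                         # assumed response field
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {"role": "assistant", "toolCalls": tool_calls},
        {"role": "tool", "toolCallId": call["id"], "content": json.dumps(result)},
    ],
}).json()
print(second)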
Continuing Conversations
Use chatId to continue an existing chat:
{
"chatId": "chat_xyz789",
"messages": [{"role": "user", "content": "Tell me more"}]
}
Previous messages are automatically included in context.
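A short multi-turn sketch in Python (assumes the chat id is returned as id, matching the stream events):

import requests

API_URL = "https://api.cuadra.ai/v1/chats"
HEADERS = {"Authorization": "Bearer YOUR_TOKEN"}

first = requests.post(API_URL, headers=HEADERS, json={
    "modelId": "model_abc123",
    "messages": [{"role": "user", "content": "Summarize the French Revolution."}],
}).json()

# Follow-up: only the new message is sent; earlier turns are pulled in server-side.
follow_up = requests.post(API_URL, headers=HEADERS, json={
    "chatId": first["id"],  # assumed field name for the chat id
    "messages": [{"role": "user", "content": "Tell me more"}],
}).json()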
FAQ
How does streaming work?
The API sends Server-Sent Events (SSE) with incremental content. Each data: line contains a JSON object with delta (new text) and finished (boolean). Parse events as they arrive for real-time display.
What’s the max conversation length?
Limited by the model’s context window. GPT-4o supports 128K tokens. The API automatically truncates old messages if needed.
Are responses cached?
No. Each request generates a fresh completion. For idempotent behavior, send the same Idempotency-Key header with each retry of the request.
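For example, a simple retry loop can reuse one key per logical request (the Idempotency-Key header is named above; the retry policy here is just an illustration):

import uuid
import requests

key = str(uuid.uuid4())  # one key per logical request, reused on every retry
body = {"modelId": "model_abc123", "messages": [{"role": "user", "content": "Hello!"}]}

for attempt in range(3):
    try:
        resp = requests.post(
            "https://api.cuadra.ai/v1/chats",
            headers={"Authorization": "Bearer YOUR_TOKEN", "Idempotency-Key": key},
            json=body,
            timeout=30,
        )
        resp.raise_for_status()
        break
    except requests.RequestException:
        if attempt == 2:
            raise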
How do I count tokens before sending?
Use a tokenizer library (tiktoken for OpenAI, anthropic-tokenizer for Claude) to estimate. The response includes actual token counts in usage.
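For example, with tiktoken (o200k_base is the encoding used by GPT-4o; other models may need a different encoding, and the per-message overhead below is only a rough assumption):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding

messages = [{"role": "user", "content": "Hello!"}]
# Estimate: token count of each message's text plus a few tokens of
# per-message overhead; the exact overhead varies by model and provider.
estimate = sum(len(enc.encode(m["content"])) + 4 for m in messages)
print(estimate)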