
Quick Start

curl -X POST https://api.cuadra.ai/v1/chats \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "modelId": "model_abc123",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
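The same request in TypeScript with the built-in fetch (a minimal sketch; only the endpoint and fields shown above are assumed):

// Create a chat with a single user message.
const res = await fetch("https://api.cuadra.ai/v1/chats", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CUADRA_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    modelId: "model_abc123",
    messages: [{ role: "user", content: "Hello!" }],
  }),
});
console.log(await res.json());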

Streaming Responses

Set stream: true to receive responses in real time via Server-Sent Events:
curl -X POST https://api.cuadra.ai/v1/chats \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"modelId": "model_abc", "messages": [...], "stream": true}'
Stream format:
data: {"id":"chat_xyz","delta":"Once","finished":false}
data: {"id":"chat_xyz","delta":" upon","finished":false}
data: {"id":"chat_xyz","delta":"","finished":true,"usage":{...}}
data: [DONE]
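A minimal reader for this stream in TypeScript, assuming a fetch response with a readable body and the delta/finished fields shown above:

const res = await fetch("https://api.cuadra.ai/v1/chats", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CUADRA_TOKEN}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    modelId: "model_abc",
    messages: [{ role: "user", content: "Tell me a story" }],
    stream: true,
  }),
});

const reader = res.body!.getReader();
const decoder = new TextDecoder();
let buffer = "";

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });
  const lines = buffer.split("\n");
  buffer = lines.pop() ?? ""; // keep any partial trailing line for the next chunk
  for (const line of lines) {
    if (!line.startsWith("data: ")) continue;
    const payload = line.slice("data: ".length);
    if (payload === "[DONE]") continue; // end-of-stream sentinel
    const event = JSON.parse(payload);
    if (!event.finished) process.stdout.write(event.delta);
  }
}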

AI SDK Format

For Vercel AI SDK compatibility, add the header:
X-Stream-Format: ai-sdk
Events: start, text-delta, source-document, reasoning-delta, tool-input-delta, finish
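When calling the API directly with fetch, the header sits alongside the usual ones (a sketch; the ai-sdk event payloads themselves are not documented here):

const res = await fetch("https://api.cuadra.ai/v1/chats", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CUADRA_TOKEN}`,
    "Content-Type": "application/json",
    "X-Stream-Format": "ai-sdk", // switch the SSE stream to AI SDK events
  },
  body: JSON.stringify({
    modelId: "model_abc",
    messages: [{ role: "user", content: "Hello!" }],
    stream: true,
  }),
});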

Reasoning (Extended Thinking)

Set enableReasoning: true to see the model’s thinking process. Supported by Claude (Sonnet/Opus), OpenAI o1/o3, and Gemini thinking models.
{
  "modelId": "model_claude",
  "messages": [...],
  "enableReasoning": true,
  "reasoningBudget": 10000
}
Reasoning tokens are billed separately. Use reasoningBudget to cap costs.

Structured Outputs (JSON Mode)

Force JSON schema compliance with responseFormat:
{
  "modelId": "model_abc",
  "messages": [{"role": "user", "content": "Extract: iPhone 15 Pro costs $999"}],
  "responseFormat": {
    "type": "json_schema",
    "json_schema": {
      "name": "product",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": {"type": "string"},
          "price": {"type": "number"}
        },
        "required": ["name", "price"]
      }
    }
  }
}
Response content will be valid JSON: {"name": "iPhone 15 Pro", "price": 999}
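Because strict mode forces schema compliance, the content can be parsed directly; a TypeScript sketch (the Product interface is illustrative, and where the text sits in the response body is an assumption):

interface Product {
  name: string;
  price: number;
}

// "text" stands in for the assistant's content from the response body.
const text = '{"name": "iPhone 15 Pro", "price": 999}';
const product: Product = JSON.parse(text);
console.log(product.price); // 999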

Tool Calling (Function Calling)

Define tools the model can invoke:
{
  "modelId": "model_abc",
  "messages": [{"role": "user", "content": "Weather in Paris?"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {"location": {"type": "string"}},
        "required": ["location"]
      }
    }
  }]
}
When the model returns a tool call, execute the function yourself, then send the result back as a tool message:
{
  "chatId": "chat_xyz",
  "messages": [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "toolCalls": [{"id": "call_1", "function": {"name": "get_weather", "arguments": "{\"location\":\"Paris\"}"}}]},
    {"role": "tool", "toolCallId": "call_1", "content": "{\"temp\": 18, \"conditions\": \"sunny\"}"}
  ]
}
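The full round trip in TypeScript, as a sketch: getWeather is a hypothetical local implementation, and the fields read from the first response (message, id) are assumptions about the response shape:

// Hypothetical local implementation of the get_weather tool.
async function getWeather(location: string): Promise<string> {
  return JSON.stringify({ temp: 18, conditions: "sunny" }); // stand-in for a real lookup
}

const headers = {
  Authorization: `Bearer ${process.env.CUADRA_TOKEN}`,
  "Content-Type": "application/json",
};

// 1. Send the user message along with the tool definition from above.
const first = await fetch("https://api.cuadra.ai/v1/chats", {
  method: "POST",
  headers,
  body: JSON.stringify({
    modelId: "model_abc",
    messages: [{ role: "user", content: "Weather in Paris?" }],
    tools: [/* get_weather definition as above */],
  }),
}).then((r) => r.json());

// 2. If the model asked for a tool, run it and send the result back.
const assistant = first.message; // assumed field; adjust to the actual response shape
if (assistant?.toolCalls?.length) {
  const call = assistant.toolCalls[0];
  const args = JSON.parse(call.function.arguments);
  const result = await getWeather(args.location);

  await fetch("https://api.cuadra.ai/v1/chats", {
    method: "POST",
    headers,
    body: JSON.stringify({
      chatId: first.id, // assumed: the chat id from the first response
      messages: [
        { role: "user", content: "Weather in Paris?" },
        { role: "assistant", toolCalls: [call] },
        { role: "tool", toolCallId: call.id, content: result },
      ],
    }),
  });
}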

Continuing Conversations

Use chatId to continue an existing chat:
{
  "chatId": "chat_xyz789",
  "messages": [{"role": "user", "content": "Tell me more"}]
}
Previous messages are automatically included in context.

FAQ

How does streaming work?

The API sends Server-Sent Events (SSE) with incremental content. Each data: line contains a JSON object with delta (new text) and finished (boolean). Parse events as they arrive for real-time display.

What’s the max conversation length?

Conversation length is limited by the model’s context window; GPT-4o, for example, supports 128K tokens. The API automatically truncates the oldest messages when the conversation exceeds it.

Are responses cached?

No. Each request generates a fresh completion. To make retries idempotent, send the same Idempotency-Key header on each attempt.
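For example, reusing one key across a retry in TypeScript (the UUID choice is a convention, not a requirement of the API):

// Retrying with the same Idempotency-Key avoids generating a second completion.
const idempotencyKey = crypto.randomUUID();

const res = await fetch("https://api.cuadra.ai/v1/chats", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.CUADRA_TOKEN}`,
    "Content-Type": "application/json",
    "Idempotency-Key": idempotencyKey, // send the same value on retry
  },
  body: JSON.stringify({
    modelId: "model_abc",
    messages: [{ role: "user", content: "Hello!" }],
  }),
});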

How do I count tokens before sending?

Use a tokenizer library (tiktoken for OpenAI models, anthropic-tokenizer for Claude) to estimate counts client-side. The response reports the actual token counts in usage.
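A sketch using the npm js-tiktoken port, assuming the underlying model is gpt-4o (the library and model name are assumptions, not part of this API):

import { encodingForModel } from "js-tiktoken";

// Estimate prompt tokens locally before sending the request.
const enc = encodingForModel("gpt-4o");
const prompt = "Hello!";
console.log(enc.encode(prompt).length); // rough estimate; per-message overhead not included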