How RAG Works
| Step | Process |
|---|---|
| 1. Upload | Add files or connect external sources |
| 2. Chunk | Documents split into ~250 token segments |
| 3. Embed | Vector embeddings generated for each chunk |
| 4. Search | User query matched against embeddings |
| 5. Retrieve | Top chunks injected into LLM context |
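The chunking step above (~250-token segments) can be sketched with a simple whitespace-token splitter. This is an illustration only: production pipelines use model-specific tokenizers and often overlap adjacent chunks.

```python
def chunk_text(text: str, chunk_size: int = 250) -> list[str]:
    """Split text into segments of roughly `chunk_size` whitespace tokens.

    Illustration only: real systems use model-specific tokenizers and
    typically overlap adjacent chunks to preserve context at boundaries.
    """
    tokens = text.split()
    return [
        " ".join(tokens[i:i + chunk_size])
        for i in range(0, len(tokens), chunk_size)
    ]

# 600 tokens -> chunks of 250, 250, and 100 tokens
chunks = chunk_text("word " * 600)
```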
Create Dataset
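A minimal sketch of creating a dataset, assuming a hypothetical REST endpoint (`POST /v1/datasets`) with bearer-token auth; the real base URL, field names, and auth scheme may differ, so check the API reference. The example builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical base URL; substitute the real one from the API reference.
API_BASE = "https://api.example.com/v1"

def build_create_dataset_request(name: str, description: str, api_key: str):
    """Build (but do not send) a dataset-creation request."""
    body = json.dumps({"name": name, "description": description}).encode()
    return urllib.request.Request(
        f"{API_BASE}/datasets",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

req = build_create_dataset_request("product-docs", "Docs for RAG", "YOUR_KEY")
# To send: urllib.request.urlopen(req)  (requires a valid key and endpoint)
```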
Upload Documents
Adding documents is a two-step process: upload the file, then associate it with a dataset.

Step 1: Upload File
Step 2: Associate with Dataset
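The two steps above can be sketched as follows, assuming hypothetical endpoints (`POST /v1/files`, then `POST /v1/datasets/{id}/files`) and placeholder IDs; real endpoint paths, field names, and response shapes may differ. The requests are built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoints and field names; consult the API reference.
API_BASE = "https://api.example.com/v1"
AUTH = {"Authorization": "Bearer YOUR_API_KEY"}

# Step 1: upload the raw file bytes.
file_bytes = b"# Product manual\n..."  # stand-in for real file contents
upload = urllib.request.Request(
    f"{API_BASE}/files",
    data=file_bytes,
    method="POST",
    headers={**AUTH, "Content-Type": "application/octet-stream"},
)

# The upload response would include a file id, e.g. {"id": "file_123"}.
file_id = "file_123"  # placeholder id

# Step 2: associate the uploaded file with a dataset.
associate = urllib.request.Request(
    f"{API_BASE}/datasets/ds_456/files",  # "ds_456" is a placeholder
    data=json.dumps({"file_id": file_id}).encode(),
    method="POST",
    headers={**AUTH, "Content-Type": "application/json"},
)
# To send either request: urllib.request.urlopen(...)
```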
Supported Formats
| Format | Extensions | Max Size |
|---|---|---|
| PDF | .pdf | 50MB |
| Word | .docx | 50MB |
| Text | .txt, .md | 50MB |
| Data | .csv, .json | 50MB |
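A client-side pre-check mirroring the table above can catch rejected uploads early. This is a sketch of the documented limits, not part of the API itself:

```python
from pathlib import Path

# Limits taken from the supported-formats table above.
MAX_BYTES = 50 * 1024 * 1024  # 50MB per file
ALLOWED = {".pdf", ".docx", ".txt", ".md", ".csv", ".json"}

def is_uploadable(path: str, size_bytes: int) -> bool:
    """Check extension and size against the documented limits."""
    return Path(path).suffix.lower() in ALLOWED and size_bytes <= MAX_BYTES
```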
External Connectors
Sync content from external data sources.

| Connector | Status |
|---|---|
| Google Drive | ✅ Available |
| Notion | ✅ Available |
Connect Google Drive
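A sketch of what a Google Drive connector configuration might contain, assuming a hypothetical `POST /v1/connectors` endpoint; every field name here is an assumption, so check the connector documentation for the real schema.

```python
import json

# Hypothetical connector configuration; all field names are assumptions.
connector = {
    "type": "google_drive",
    "dataset_id": "ds_456",            # placeholder dataset id
    "folder_id": "DRIVE_FOLDER_ID",    # which Drive folder to sync
    "sync_interval_hours": 6,          # syncs typically run every 1-24 hours
}
body = json.dumps(connector)
# This body would be POSTed to a connectors endpoint with your API key.
```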
Link to Model
Connect a dataset to a model to enable RAG over its contents.

Best Practices
| Do | Avoid |
|---|---|
| Clean formatting before upload | Scanned images without OCR |
| Use descriptive filenames | Duplicate content across files |
| Split large docs into sections | Mixing unrelated topics |
| Group related content | PII or sensitive data |
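Returning to the Link to Model step above: a minimal sketch of attaching a dataset to a model, assuming a hypothetical `POST /v1/models/{id}/datasets` endpoint and placeholder IDs (the real route and payload may differ). The request is built but not sent.

```python
import json
import urllib.request

# Hypothetical endpoint and IDs; consult the API reference for real names.
link_req = urllib.request.Request(
    "https://api.example.com/v1/models/model_789/datasets",
    data=json.dumps({"dataset_id": "ds_456"}).encode(),
    method="POST",
    headers={
        "Authorization": "Bearer YOUR_API_KEY",
        "Content-Type": "application/json",
    },
)
# To send: urllib.request.urlopen(link_req)
```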
Specifications
| Spec | Value |
|---|---|
| Max file size | 50MB |
| Chunk size | ~250 tokens |
| Search latency | 40-120ms |
FAQ
What file formats work best?
Markdown and plain text yield the best results. PDFs work well if they’re text-based (not scanned images). Use OCR preprocessing for scanned documents.
How often is content re-indexed?
Uploaded files are indexed once at upload. External connectors (Google Drive) sync based on your configuration, typically every 1-24 hours.
Can I preview what chunks were created?
Not currently via the API. Use the Dashboard → Datasets → View to inspect chunks.
How do I improve retrieval quality?
- Use specific, descriptive filenames
- Add summaries at the start of documents
- Remove boilerplate/headers that repeat across pages
- Split very long documents into logical sections
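The third tip, removing boilerplate that repeats across pages, can be sketched as a simple heuristic: drop any line that recurs on several pages. This is an illustration, not a prescribed preprocessing step:

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], min_repeats: int = 3) -> list[str]:
    """Drop lines (e.g. headers/footers) that recur on many pages.

    A simple heuristic for removing repeating boilerplate before upload.
    """
    # Count on how many pages each distinct line appears.
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    repeated = {
        line for line, n in counts.items() if n >= min_repeats and line.strip()
    }
    return [
        "\n".join(l for l in page.splitlines() if l not in repeated)
        for page in pages
    ]

# The repeated footer appears on all four pages and is stripped.
pages = [f"ACME Corp Confidential\nPage content {i}" for i in range(4)]
cleaned = strip_repeated_lines(pages)
```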