Available Models
The Available Models endpoint lists the LLM and embedding models that can be used with the Agentic RAG endpoint.
Endpoint: GET /list_models
Authentication
All requests require an Authorization header. See the GenAI RAG API overview for full details.
Request
```bash
curl https://api-genai.nexla.io/list_models \
  -H "Authorization: Basic YOUR_API_KEY"
```
Response
Content-Type: application/json
Status: 200 OK
Returns an object mapping each provider to its available models. The exact models available depend on your organization's configuration and credential setup.
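The exact response shape may vary by deployment; a minimal sketch, assuming a flat mapping of provider names to model lists, might look like this:
```json
{
  "openai": ["gpt-5.2", "gpt-5-nano"],
  "anthropic": ["claude-sonnet-4-6"],
  "google": ["gemini-3-flash-preview"]
}
```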
LLM Models by Provider
OpenAI
| Model | Max Tokens |
|---|---|
| gpt-5.4 | 400,000 |
| gpt-5.2 | 400,000 |
| gpt-5.2-codex | 400,000 |
| gpt-5.1 | 400,000 |
| gpt-5.4-mini | 400,000 |
| gpt-5-nano | 400,000 |
| gpt-5.4-nano | 400,000 |
| o3-deep-research | 200,000 |
| o4-mini-deep-research | 128,000 |
Anthropic
| Model | Max Tokens |
|---|---|
| claude-sonnet-4-6 | 200,000 |
| claude-opus-4-6 | 200,000 |
| claude-haiku-4-5-20251001 | 200,000 |
Google
| Model | Max Tokens |
|---|---|
| gemini-3.1-pro-preview | 1,048,576 |
| gemini-3-pro-preview | 1,048,576 |
| gemini-3-flash-preview | 1,048,576 |
| gemini-2.5-flash-lite | 1,048,576 |
Azure OpenAI
| Model | Max Tokens |
|---|---|
| gpt-5.2 | 400,000 |
| gpt-5.2-codex | 400,000 |
| o3-deep-research | 200,000 |
| o4-mini-deep-research | 128,000 |
Azure AI
| Model | Max Tokens |
|---|---|
| Llama-4-Maverick-17B-128E-Instruct | 135,232 |
| Mistral-Large-3 | 256,000 |
Mistral
| Model | Max Tokens |
|---|---|
| devstral-2512 | 256,000 |
| magistral-medium-1.2 | 128,000 |
| magistral-small-1.2 | 128,000 |
| mistral-large-2512 | 256,000 |
| mistral-small-latest | 128,000 |
| ministral-8b-2512 | 128,000 |
Nvidia
| Model | Max Tokens |
|---|---|
| nvidia/llama-3.1-nemotron-ultra-253b-v1 | 128,000 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | 128,000 |
| nvidia/llama-3.1-nemotron-70b-instruct | 128,000 |
Together AI (Open Source)
| Model | Max Tokens |
|---|---|
| Qwen/Qwen3.5-397B-A17B | 256,000 |
| Qwen/Qwen3-VL-32B-Thinking | 262,144 |
| openai/gpt-oss-120b | 128,000 |
| openai/gpt-oss-20b | 128,000 |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 262,144 |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 262,144 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 135,232 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 135,232 |
| meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | 64,000 |
| meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 64,000 |
| deepseek-ai/DeepSeek-R1 | 163,840 |
| deepseek-ai/DeepSeek-V3.2 | 131,072 |
Embedding Models
When embedding_config is omitted from the Agentic RAG request, the embedding model is inferred per-nexset based on the vector store configuration.
OpenAI & Azure
| Model | Dimensions |
|---|---|
| text-embedding-3-small | 1,536 |
| text-embedding-3-large | 3,072 |
| text-embedding-ada-002 | 1,536 |
Other Providers
| Provider | Supported Models |
|---|---|
| Nvidia | Nvidia NIM embedding endpoints |
| Voyage | Voyage AI embedding endpoints |
To use a specific embedding model, provide embedding_config:
```json
{
  "embedding_config": {
    "credential_id": "cred-789",
    "model": "text-embedding-3-small"
  }
}
```
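The same structure applies to the other embedding providers. For example, assuming a Voyage credential, a request might look like the following (the credential ID and model name are illustrative; use a model your credential actually exposes):
```json
{
  "embedding_config": {
    "credential_id": "cred-voyage-001",
    "model": "voyage-3"
  }
}
```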
Model Inference
When llm_config.model and llm_config.provider are omitted from the Agentic RAG request, the model is inferred from the credential_id. Each credential is associated with a specific provider and model at creation time.
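For example, a minimal llm_config that relies entirely on credential-based inference needs only the credential ID (value illustrative):
```json
{
  "llm_config": {
    "credential_id": "cred-456"
  }
}
```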
To override the inferred model, explicitly set llm_config.model and/or llm_config.provider:
```json
{
  "llm_config": {
    "credential_id": "cred-456",
    "model": "gpt-5.2",
    "provider": "openai"
  }
}
```
Reasoning Configuration
Some models support reasoning or thinking parameters that control how deeply the model reasons before responding:
| Provider | Models | Parameter | Values |
|---|---|---|---|
| OpenAI | o3, o4-mini, gpt-5 series | reasoning_effort | minimal, low, medium, high |
| Anthropic | claude-sonnet-4-6, claude-opus-4-6 | Extended thinking | Budget tokens (default: 4096) |
| Google | gemini-3.x | thinking_level | low, medium, high |
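As a rough sketch, a reasoning parameter could be supplied alongside the model selection as shown below. Note that model_params is a hypothetical field name used only for illustration; consult the Agentic RAG request reference for the exact parameter your deployment expects.
```json
{
  "llm_config": {
    "credential_id": "cred-456",
    "model": "gpt-5.2",
    "provider": "openai",
    "model_params": {
      "reasoning_effort": "medium"
    }
  }
}
```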
Choosing a Model
Consider these trade-offs when selecting a model:
| Factor | Guidance |
|---|---|
| Accuracy | Larger models (e.g., gpt-5.4, claude-opus-4-6, gemini-3.1-pro-preview) produce more accurate answers and better handle complex multi-nexset queries |
| Cost | Smaller models (e.g., gpt-5-nano, claude-haiku-4-5-20251001, gemini-2.5-flash-lite) are significantly cheaper per token — suitable for high-volume or lower-stakes queries |
| Latency | Smaller models respond faster, especially for streaming use cases |
| Extended thinking | Anthropic Claude and Gemini 3.x models emit thinking_delta SSE events during streaming, providing visibility into the agent's reasoning |
| Tool calls | All supported models can perform tool calls, but larger models tend to make better routing decisions when querying multiple nexsets |
| Context window | Google Gemini 3.x models offer up to 1M token context windows — useful for nexsets with very large schemas or many columns |