
Available Models

The Available Models endpoint lists the LLM and embedding models available for use with the Agentic RAG endpoint.

Endpoint: GET /list_models

Authentication

All requests require an Authorization header. See the GenAI RAG API overview for full details.

Request

curl https://api-genai.nexla.io/list_models \
  -H "Authorization: Basic YOUR_API_KEY"
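The same request can be issued from Python's standard library. A minimal sketch (replace `YOUR_API_KEY` with a real key before sending):

```python
import urllib.request

# Build (but do not send) the request, to show the exact header shape
# the endpoint expects.
req = urllib.request.Request(
    "https://api-genai.nexla.io/list_models",
    headers={"Authorization": "Basic YOUR_API_KEY"},
)

# To execute the call:
#   import json
#   with urllib.request.urlopen(req) as resp:
#       models = json.load(resp)
```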

Response

Content-Type: application/json

Status: 200 OK

Returns an object mapping each provider to its available models. The exact models available depend on your organization's configuration and credential setup.
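As an illustration only, a truncated response might look like the fragment below. The provider keys and the shape of each entry are assumptions based on the description above; your actual payload depends on your organization's configuration:

```json
{
  "openai": ["gpt-5.2", "gpt-5-nano"],
  "anthropic": ["claude-sonnet-4-6"],
  "google": ["gemini-3-pro-preview"]
}
```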

LLM Models by Provider

OpenAI

| Model | Max Tokens |
| --- | --- |
| gpt-5.4 | 400,000 |
| gpt-5.2 | 400,000 |
| gpt-5.2-codex | 400,000 |
| gpt-5.1 | 400,000 |
| gpt-5.4-mini | 400,000 |
| gpt-5-nano | 400,000 |
| gpt-5.4-nano | 400,000 |
| o3-deep-research | 200,000 |
| o4-mini-deep-research | 128,000 |

Anthropic

| Model | Max Tokens |
| --- | --- |
| claude-sonnet-4-6 | 200,000 |
| claude-opus-4-6 | 200,000 |
| claude-haiku-4-5-20251001 | 200,000 |

Google

| Model | Max Tokens |
| --- | --- |
| gemini-3.1-pro-preview | 1,048,576 |
| gemini-3-pro-preview | 1,048,576 |
| gemini-3-flash-preview | 1,048,576 |
| gemini-2.5-flash-lite | 1,048,576 |

Azure OpenAI

| Model | Max Tokens |
| --- | --- |
| gpt-5.2 | 400,000 |
| gpt-5.2-codex | 400,000 |
| o3-deep-research | 200,000 |
| o4-mini-deep-research | 128,000 |

Azure AI

| Model | Max Tokens |
| --- | --- |
| Llama-4-Maverick-17B-128E-Instruct | 135,232 |
| Mistral-Large-3 | 256,000 |

Mistral

| Model | Max Tokens |
| --- | --- |
| devstral-2512 | 256,000 |
| magistral-medium-1.2 | 128,000 |
| magistral-small-1.2 | 128,000 |
| mistral-large-2512 | 256,000 |
| mistral-small-latest | 128,000 |
| ministral-8b-2512 | 128,000 |

Nvidia

| Model | Max Tokens |
| --- | --- |
| nvidia/llama-3.1-nemotron-ultra-253b-v1 | 128,000 |
| nvidia/llama-3.3-nemotron-super-49b-v1.5 | 128,000 |
| nvidia/llama-3.1-nemotron-70b-instruct | 128,000 |

Together AI (Open Source)

| Model | Max Tokens |
| --- | --- |
| Qwen/Qwen3.5-397B-A17B | 256,000 |
| Qwen/Qwen3-VL-32B-Thinking | 262,144 |
| openai/gpt-oss-120b | 128,000 |
| openai/gpt-oss-20b | 128,000 |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | 262,144 |
| Qwen/Qwen3-30B-A3B-Instruct-2507 | 262,144 |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8 | 135,232 |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | 135,232 |
| meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo | 64,000 |
| meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo | 64,000 |
| deepseek-ai/DeepSeek-R1 | 163,840 |
| deepseek-ai/DeepSeek-V3.2 | 131,072 |

Embedding Models

When embedding_config is omitted from the Agentic RAG request, the embedding model is inferred per-nexset based on the vector store configuration.

OpenAI & Azure

| Model | Dimensions |
| --- | --- |
| text-embedding-3-small | 1,536 |
| text-embedding-3-large | 3,072 |
| text-embedding-ada-002 | 1,536 |

Other Providers

| Provider | Supported Models |
| --- | --- |
| Nvidia | Nvidia NIM embedding endpoints |
| Voyage | Voyage AI embedding endpoints |

To use a specific embedding model, provide embedding_config:

{
  "embedding_config": {
    "credential_id": "cred-789",
    "model": "text-embedding-3-small"
  }
}
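The embedding model determines the output dimensionality, which must match your vector store's configuration. A small lookup helper built from the OpenAI & Azure table above:

```python
# Output dimensionality per embedding model, taken from the table above.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}

def dimensions_for(model: str) -> int:
    """Return the vector dimensionality for a known OpenAI/Azure embedding model."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model}")
```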

Model Inference

When llm_config.model and llm_config.provider are omitted from the Agentic RAG request, the model is inferred from the credential_id. Each credential is associated with a specific provider and model at creation time.

To override the inferred model, explicitly set llm_config.model and/or llm_config.provider:

{
  "llm_config": {
    "credential_id": "cred-456",
    "model": "gpt-5.2",
    "provider": "openai"
  }
}
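The inference behavior above can be captured in a small helper: pass only the credential to rely on inference, or add `model`/`provider` to override it. A sketch (the function name is ours, not part of the API):

```python
def build_llm_config(credential_id, model=None, provider=None):
    """Build an llm_config block.

    Omitting model and provider lets the API infer them from the credential;
    supplying either one overrides the inferred value.
    """
    config = {"credential_id": credential_id}
    if model is not None:
        config["model"] = model
    if provider is not None:
        config["provider"] = provider
    return {"llm_config": config}
```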

Reasoning Configuration

Some models support reasoning or thinking parameters that control how deeply the model reasons before responding:

| Provider | Models | Parameter | Values |
| --- | --- | --- | --- |
| OpenAI | o3, o4-mini, gpt-5 series | reasoning_effort | minimal, low, medium, high |
| Anthropic | claude-sonnet-4-6, claude-opus-4-6 | Extended thinking | Budget tokens (default: 4096) |
| Google | gemini-3.x | thinking_level | low, medium, high |
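This document does not specify where these parameters go in the request. As a purely hypothetical illustration, assuming reasoning options are accepted inside llm_config, an OpenAI request fragment might look like the following; the `reasoning_effort` value comes from the table above, but confirm the actual field placement in the Agentic RAG request reference:

```json
{
  "llm_config": {
    "credential_id": "cred-456",
    "model": "gpt-5.2",
    "provider": "openai",
    "reasoning_effort": "medium"
  }
}
```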

Choosing a Model

Consider these trade-offs when selecting a model:

| Factor | Guidance |
| --- | --- |
| Accuracy | Larger models (e.g., gpt-5.4, claude-opus-4-6, gemini-3.1-pro-preview) produce more accurate answers and better handle complex multi-nexset queries |
| Cost | Smaller models (e.g., gpt-5-nano, claude-haiku-4-5-20251001, gemini-2.5-flash-lite) are significantly cheaper per token, making them suitable for high-volume or lower-stakes queries |
| Latency | Smaller models respond faster, especially for streaming use cases |
| Extended thinking | Anthropic Claude and Gemini 3.x models emit thinking_delta SSE events during streaming, providing visibility into the agent's reasoning |
| Tool calls | All supported models can perform tool calls, but larger models tend to make better routing decisions when querying multiple nexsets |
| Context window | Google Gemini 3.x models offer up to 1M token context windows, which is useful for nexsets with very large schemas or many columns |