Agentic RAG Query
The Agentic RAG endpoint uses an AI agent to search across one or more nexsets, reason over the retrieved data, and generate a natural language answer with inline citations. The agent dynamically decides which nexsets to query, what search terms to use, and how to combine results.
The endpoint supports both synchronous JSON responses and real-time Server-Sent Events (SSE) streaming. Multi-turn conversations are supported via session IDs.
Endpoint: POST /v2/agentic-rag
Pipeline Flow
The Agentic RAG pipeline processes each request through these steps:
- Verify auth — Validate the Authorization header
- Load admin token — Decrypt the service key for downstream Nexla API calls
- Parallel request preparation — Resolve LLM credentials, embedding credentials (if provided), and dataset metadata
- Dataset queryable check — Determine if each nexset supports SQL, Pinecone, DataFeed, or static fallback
- Parallel context enrichment — Resolve filters from user_context and registered schemas, load conversation history, load Pinecone credentials
- Build per-nexset filters — Combine ACL filters, access rules/scopes, and pre-retrieval filters
- Tool routing — Route each nexset to the appropriate tool (SQL, Pinecone, DataFeed, or static)
- Build agent — Construct the agent with the resolved model, system prompt, and nexset tools
- Agent reasoning loop — The agent interprets the user's intent, calls relevant tools (first round in parallel), and may perform a second targeted retrieval round (max 2 rounds, 8 tool calls total)
- Synthesize and return — Generate the final answer from cross-nexset evidence, attach citations, and return the response
Authentication
All requests require an Authorization header. See the GenAI RAG API overview for full details.
Request
Content-Type: application/json
Top-Level Fields
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
user_prompt | string | Yes | — | The natural language question to answer |
system_prompt | string | No | null | Additional request-specific instructions appended to the built-in V2 agent system prompt. This is additive, not a replacement, and is not persisted across later session turns — resend on follow-ups if the same instruction should keep applying |
nexsets | array | Yes | — | Nexset IDs to search. Accepts plain string IDs (["123"]), integers ([123]), or full NexsetSpec objects |
user_context | UserContext | Yes | — | User identity and access control context |
llm_config | LLMConfig | Yes | — | LLM credential configuration |
embedding_config | EmbeddingConfig | No | null | Embedding model credential configuration. When omitted, the embedding model is inferred from the nexset |
stream | boolean | No | false | When true, the response is an SSE stream |
debug | boolean | No | false | When true, includes diagnostic information in the response. Ignored when stream: true — the two modes are mutually exclusive |
cache_policy | string | No | default | Cache behavior: default (read + write), refresh (skip reads, overwrite), or bypass (skip reads and writes) |
skip_cache | boolean | No | false | Legacy bypass flag. When true, takes precedence over cache_policy and maps to bypass |
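As a sketch, the required fields above can be assembled and validated client-side before sending. build_agentic_rag_payload is an illustrative helper, not part of any SDK; it enforces the two cheapest server-side rejections (empty user_prompt, empty nexsets) locally:

```python
def build_agentic_rag_payload(user_prompt, nexsets, user_id, credential_id,
                              stream=False, cache_policy="default"):
    """Assemble a minimal /v2/agentic-rag request body.

    Fails fast on the two cheapest server-side rejections: an empty
    user_prompt (422) and an empty nexsets list (400).
    """
    if not user_prompt or not user_prompt.strip():
        raise ValueError("user_prompt must be non-empty")
    if not nexsets:
        raise ValueError("at least one nexset is required")
    return {
        "user_prompt": user_prompt,
        # Plain IDs are normalized to strings; dict entries are assumed to be
        # full NexsetSpec objects and passed through untouched.
        "nexsets": [n if isinstance(n, dict) else str(n) for n in nexsets],
        "user_context": {"user_id": user_id},
        "llm_config": {"credential_id": credential_id},
        "stream": stream,
        "cache_policy": cache_policy,
    }
```

The resulting dict can be serialized as the JSON request body directly.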
NexsetSpec
Each entry in the nexsets array can be a plain string/integer ID or a full object:
| Field | Type | Required | Description |
|---|---|---|---|
id | string | Yes | The unique nexset identifier |
filters | NexsetFilters | No | Per-nexset filters for access control and pre-retrieval narrowing |
NexsetFilters
| Field | Type | Required | Description |
|---|---|---|---|
acl_filter | array of FilterCondition | No | Access control filters. Restricts results based on authorization rules. Multiple conditions are combined with AND logic |
pre_filter | array of FilterCondition | No | Pre-retrieval metadata filters. Narrows the search space before semantic retrieval. Multiple conditions are combined with AND logic |
FilterCondition
| Field | Type | Required | Description |
|---|---|---|---|
key | string | Yes | The metadata field name to filter on (e.g., "tenant_id", "document_type") |
operator | string | Yes | The comparison operator. See Filter Operators |
value | any | Depends on operator | The value(s) to compare against. See Filter Operators for type requirements per operator |
UserContext
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
user_id | string | Yes | — | Unique identifier for the requesting user. Overridden by the user_id claim when the request is authenticated with a JWT |
session_id | string | No | null | Session ID for multi-turn conversation continuity. See Multi-Turn Conversations |
access_rules | object | No | null | Policy-level access gates. Keys and valid values are defined at filter registration time |
access_scope | object | No | null | Data ownership scope. Values are always treated as arrays with IN semantics |
filters | object | No | null | Page or session context filters. String values use EQ matching; array values use IN matching |
LLMConfig
| Field | Type | Required | Description |
|---|---|---|---|
credential_id | string | Yes | Credential ID for the LLM |
model | string | No | Model name override. When omitted, inferred from the credential |
provider | string | No | Provider override (e.g., "openai", "anthropic", "google", "azure", "mistral"). When omitted, inferred from the credential |
EmbeddingConfig
| Field | Type | Required | Description |
|---|---|---|---|
credential_id | string | Yes | Credential ID for the embedding model |
model | string | No | Embedding model name override |
provider | string | No | Provider override |
Response — JSON (Non-Streaming)
Returned when stream is false (default).
Content-Type: application/json
Response Fields
| Field | Type | Always Present | Description |
|---|---|---|---|
answer | string | Yes | The generated answer. Contains [N] markers referencing entries in the citations array |
citations | array of Citation | Yes | Source metadata for each cited reference |
usage | Usage | Yes | Token usage statistics for the request |
cost | Cost or null | Yes | Estimated cost breakdown for the request (LLM + embedding). null when cost accounting is unavailable for the selected credentials |
model | string | Yes | The LLM model name used |
provider | string | Yes | The LLM provider used |
warnings | array of Warning | No | Present when recoverable degradations occurred. Absent on fully successful requests |
intermediate_responses | object | No | Present only when debug: true. Contains: tool_calls (array of {tool_name, tool_call_id, display_name, arguments, result}), thinking (string or null), all_sources (array), skipped_filter_keys (array of strings), and agent_duration_ms (integer) |
Citation
| Field | Type | Description |
|---|---|---|
index | integer | 1-based citation number matching [N] markers in the answer text |
nexset_id | string | The nexset this citation originated from |
nexset_name | string | Display name of the source nexset |
nexset_source_type | string or null | Underlying source type of the nexset (e.g., FILE, SQL_DATABASE, STATIC) |
nexset_connector_type | string or null | Connector identifier used to reach the source (e.g., s3, postgres, snowflake) |
nexset_tags | array of string | Free-form tags attached to the nexset. Empty array when none are set |
document_id | string or null | Document identifier, if available |
source_url | string or null | URL of the source document, if available |
title | string or null | Title of the source document. Falls back to nexset name if no title is available |
page_numbers | array | Page numbers where the cited content appears |
bounding_boxes | array | Bounding box coordinates for the cited content (e.g., for PDFs) |
relevance_score | float or null | Semantic similarity score of the best-matching chunk for this source |
Per-org citation processors may add additional fields (e.g., chunks, citation_id) to each citation. The fields above are the contract guaranteed for every caller; extra per-org fields are additive only and never break the base shape.
Usage
| Field | Type | Description |
|---|---|---|
requests | integer or null | Number of LLM and tool-call requests made |
tool_calls | integer or null | Number of tool invocations the agent performed |
input_tokens | integer or null | Total input tokens consumed |
output_tokens | integer or null | Total output tokens generated |
cache_read_tokens | integer or null | Prompt-cache read tokens (for providers that report them, e.g., Anthropic) |
cache_write_tokens | integer or null | Prompt-cache write tokens |
total_tokens | integer or null | Sum of input and output tokens |
details | object or null | Provider-specific token breakdowns, when available |
Cost
| Field | Type | Description |
|---|---|---|
llm_cost | string or null | Estimated LLM cost for the request, as a decimal string in the reported currency |
embedding_cost | string or null | Estimated embedding cost for the request |
total_cost | string or null | Sum of LLM and embedding cost |
currency | string | ISO 4217 currency code (currently USD) |
Warnings
Non-fatal degradations are surfaced via an optional top-level warnings array. The array is absent on a fully successful request and present (with at least one entry) when any recoverable degradation occurred.
| Field | Type | Description |
|---|---|---|
code | string | Warning code (see table below) |
message | string | Human-readable description of the degradation |
| Code | Meaning |
|---|---|
SESSION_HISTORY_UNAVAILABLE | Prior-turn conversation history could not be loaded. This request proceeded without it |
SESSION_HISTORY_SAVE_FAILED | This turn's conversation could not be saved. Subsequent turns in the same session may not include it |
CITATIONS_UNRESOLVED | The answer contains citation markers ([1], [2], …) but the citations array is empty — citation resolution failed |
USAGE_METRICS_UNAVAILABLE | Usage and cost metrics could not be computed; the usage / cost fields may be empty or partial |
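A caller can branch on these codes before trusting the answer. A minimal sketch in Python (the response shape follows the tables above; the handling policy is illustrative):

```python
def warning_codes(response):
    """Collect warning codes from a /v2/agentic-rag JSON response.

    warnings is absent on fully successful requests, so a missing key
    is treated the same as an empty list.
    """
    return {w["code"] for w in response.get("warnings", [])}


def citations_trustworthy(response):
    """False when the answer contains [N] markers that resolved to nothing."""
    return "CITATIONS_UNRESOLVED" not in warning_codes(response)
```

When citations_trustworthy is false, a reasonable client hides or strips the [N] markers rather than rendering dangling references.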
Example JSON Response
{
"answer": "The Q2 sales figures show a 12% increase over Q1, reaching $4.2M in total revenue [1]. The Western region contributed the most growth at 18% [2].",
"citations": [
{
"index": 1,
"nexset_id": "10000",
"nexset_name": "Sales Reports",
"nexset_source_type": "FILE",
"nexset_connector_type": "s3",
"nexset_tags": ["quarterly", "revenue"],
"document_id": "doc-q2-2025",
"source_url": null,
"title": "Q2 2025 Revenue Summary",
"page_numbers": [3],
"bounding_boxes": [],
"relevance_score": 0.94
},
{
"index": 2,
"nexset_id": "10000",
"nexset_name": "Sales Reports",
"nexset_source_type": "FILE",
"nexset_connector_type": "s3",
"nexset_tags": [],
"document_id": "doc-regional-breakdown",
"source_url": null,
"title": "Regional Sales Breakdown",
"page_numbers": [1, 2],
"bounding_boxes": [],
"relevance_score": 0.87
}
],
"usage": {
"requests": 3,
"tool_calls": 2,
"input_tokens": 1250,
"output_tokens": 340,
"cache_read_tokens": 0,
"cache_write_tokens": 0,
"total_tokens": 1590,
"details": null
},
"cost": {
"llm_cost": "0.0147",
"embedding_cost": "0.0002",
"total_cost": "0.0149",
"currency": "USD"
},
"model": "gpt-4o",
"provider": "openai"
}
Response — SSE (Streaming)
Returned when stream is true.
Content-Type: text/event-stream
Headers: Cache-Control: no-cache, Connection: keep-alive, X-Accel-Buffering: no
Each event follows the standard SSE format:
event: <event_type>
data: <json_payload>
Event Lifecycle
The typical event sequence for a successful request:
message_start
[tool_call_start -> tool_call_delta* -> tool_call_result]* (zero or more tool cycles)
[thinking_delta]* (interleaved with tool cycles)
generation_start
content_block_start
[content_block_delta | inline_citation]* (interleaved text and citations)
content_block_stop
citation_block
message_delta
message_stop
The agent may perform multiple tool-call cycles before generating the final answer. Each cycle queries a different nexset or refines the search.
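The event/data pairs above can be consumed with a small parser. This is a minimal sketch for this endpoint's two-line events; a production client should use a real SSE library that also handles comments, multi-line data fields, and reconnection:

```python
import json

def parse_sse(raw):
    """Parse a raw SSE body into (event_type, payload_dict) pairs."""
    events = []
    event_type = None
    for line in raw.splitlines():
        if line.startswith("event: "):
            event_type = line[len("event: "):].strip()
        elif line.startswith("data: ") and event_type is not None:
            events.append((event_type, json.loads(line[len("data: "):])))
            event_type = None
    return events
```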
Event Types
message_start
Emitted once at the start of the stream. Contains the message ID and model info.
{
"type": "message_start",
"message": {
"id": "msg_abc123...",
"type": "message",
"role": "assistant",
"content": [],
"model": "gpt-4o",
"stop_reason": null,
"usage": { "input_tokens": null, "output_tokens": null }
}
}
tool_call_start
Emitted when the agent begins a tool call (nexset search).
{
"type": "tool_call_start",
"tool_name": "search_sales_reports_10000",
"tool_call_id": "call_abc123",
"display_name": "Sales Reports"
}
In deployments with extended tool metadata enabled, tool_call_start events may also include input_query (the search query) and tool_args (partial arguments) on a best-effort basis. These fields are optional and should not be relied upon for critical logic.
tool_call_delta
Streamed fragments of the tool call arguments.
{
"type": "tool_call_delta",
"args_delta": "{\"query\": \"Q2 sales\"}"
}
tool_call_result
The result returned by a tool call. Content is serialized as JSON and truncated to approximately 500 KB; if truncation occurs, the server appends a trailing "... [truncated]" marker inside the serialized payload.
{
"type": "tool_call_result",
"tool_call_id": "call_abc123",
"content": { "nexset_id": "10000", "total_results": 5, "chunks": [...] }
}
thinking_delta
Model reasoning content (emitted only for models that support extended thinking, e.g., Anthropic Claude with thinking enabled). thinking_delta events can be interleaved with tool-call events — they may appear during tool-call cycles, not only as a contiguous preamble.
{
"type": "thinking_delta",
"thinking": "Let me search for Q2 sales data..."
}
generation_start
Signals that the agent has finished tool calls and is generating the final answer.
{
"type": "generation_start"
}
content_block_start
Marks the beginning of a text content block.
{
"type": "content_block_start",
"index": 0,
"content_block": { "type": "text", "text": "" }
}
content_block_delta
An incremental text fragment of the answer.
{
"type": "content_block_delta",
"index": 0,
"delta": { "type": "text_delta", "text": "The Q2 sales figures show" }
}
inline_citation
Emitted when a [N] citation marker is detected in the text stream. Sent only on the first occurrence of each citation index.
{
"type": "inline_citation",
"citation_index": 1,
"source": {
"index": 1,
"nexset_id": "10000",
"nexset_name": "Sales Reports",
"document_id": "doc-q2-2025",
"source_url": null,
"title": "Q2 2025 Revenue Summary",
"page_numbers": [3],
"bounding_boxes": [],
"relevance_score": 0.94
}
}
content_block_stop
Marks the end of a text content block.
{
"type": "content_block_stop",
"index": 0
}
citation_block
Emitted after all content blocks. Contains the complete list of cited sources.
{
"type": "citation_block",
"citations": [
{
"index": 1,
"nexset_id": "10000",
"nexset_name": "Sales Reports",
"document_id": "doc-q2-2025",
"title": "Q2 2025 Revenue Summary",
"relevance_score": 0.94
}
]
}
message_delta
Emitted near the end of the stream. Contains the stop reason and final token usage.
{
"type": "message_delta",
"delta": { "stop_reason": "end_turn", "stop_sequence": null },
"usage": { "input_tokens": 1250, "output_tokens": 340 }
}
In deployments with cost accounting enabled, message_delta events also include a cost object matching the Cost schema from the non-streaming response.
stop_reason values:
"end_turn"— Successful completion"error"— The stream terminated due to an error
message_stop
Final event in the stream.
{
"type": "message_stop"
}
error
Emitted when an error occurs during streaming.
{
"type": "error",
"error": { "type": "server_error", "message": "Agent stream timed out after 300s" }
}
When an error occurs mid-stream, the server attempts to close any open content blocks, emit the error event, and then send message_delta (with stop_reason: "error") and message_stop to cleanly terminate the stream.
Streaming Error Types
Mid-stream errors are emitted as SSE error events with a machine-parseable error.type value:
| error_type | Equivalent HTTP | Meaning |
|---|---|---|
tool_limit_exceeded | 429 | Agent-side: tool-call / request budget hit |
llm_usage_limit_exceeded | 429 | LLM provider rate-limited the request |
upstream_llm_error | 502 | Upstream LLM/HTTP error other than rate-limit |
agent_run_failed | 500 | Unclassified agent execution error |
stream_timeout | — | SSE stream wall-clock timeout reached (300s) |
queue_failure | — | Internal SSE event queue failure |
all_tools_failed | 502 | Every data-source tool call returned an error |
For all_tools_failed and queue_failure, content deltas may have already been flushed to the client before the error event arrives — the agent can produce plausible text from zero evidence, or a queue fault can occur after useful content has streamed. Clients must treat any error event followed by message_delta with stop_reason=error as invalidating the preceding content block. Surface an explicit error to the end user rather than rendering the partial answer, even if the rendered text looks complete.
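The invalidation rule above can be implemented by buffering deltas and only committing the text once the stream ends cleanly. A minimal sketch (illustrative client policy, not part of the API):

```python
class StreamBuffer:
    """Accumulate answer text from a stream, discarding it on error.

    Any error event, or a message_delta with stop_reason "error",
    invalidates all previously streamed text even if it reads as a
    complete answer.
    """

    def __init__(self):
        self.parts = []
        self.errored = False

    def on_event(self, event_type, payload):
        if event_type == "content_block_delta":
            self.parts.append(payload["delta"]["text"])
        elif event_type == "error":
            self.errored = True
        elif event_type == "message_delta":
            if payload.get("delta", {}).get("stop_reason") == "error":
                self.errored = True

    def final_answer(self):
        # None signals the caller to surface an error, not the partial text.
        return None if self.errored else "".join(self.parts)
```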
Filters
Filters control which data the agent can access during retrieval. There are two layers:
ACL Filters (Access Control)
Restrict which records a user is authorized to see. Applied as hard constraints — results that don't match are excluded regardless of relevance. Configured via nexsets[].filters.acl_filter.
Pre-Retrieval Filters
Narrow the search space before semantic retrieval. Used for scoping queries to specific document types, date ranges, or other metadata dimensions. Configured via nexsets[].filters.pre_filter.
Filter Resolution
Filters can be applied in two ways:
- Explicit per-nexset filters — Passed directly in nexsets[].filters using FilterCondition objects
- Server-side resolution from user context — When user_context contains access_rules, access_scope, or filters, the server resolves these values against registered filter schemas for each nexset and generates the appropriate filter conditions automatically
Both sources are merged. Explicit filters and server-resolved filters are combined with AND logic.
Filter Operators
| Operator | Value Type | Description |
|---|---|---|
EQ | single value | Equals |
NEQ | single value | Not equals |
IN | array | Value is in the provided list |
NOT_IN | array | Value is not in the provided list |
GT | single value | Greater than |
GTE | single value | Greater than or equal |
LT | single value | Less than |
LTE | single value | Less than or equal |
CONTAINS | single value | Field contains the value (substring match) |
NOT_CONTAINS | single value | Field does not contain the value |
EXISTS | (ignored) | Field exists and is not null |
NOT_EXISTS | (ignored) | Field does not exist or is null |
BETWEEN | array of two values [min, max] | Value is within the inclusive range |
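For example, a pre_filter mixing several of these operators might look like the following (field names are illustrative; the value for NOT_EXISTS is ignored per the table above):

```json
[
  { "key": "document_type", "operator": "IN", "value": ["lease", "amendment"] },
  { "key": "effective_date", "operator": "BETWEEN", "value": ["2024-01-01", "2024-12-31"] },
  { "key": "archived_at", "operator": "NOT_EXISTS", "value": null }
]
```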
Filter Value Semantics at Query Time
When using server-side resolution (no explicit operator in user_context), the value type determines the operator:
| Value Type | Inferred Operator | Example |
|---|---|---|
| String | EQ (equals) | "tenant_id": "17001" |
| Array | IN (contains) | "property_id": ["42001", "99001"] |
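The inference rule fits in a tiny function. This is an illustrative client-side mirror of the server-side behavior described above, not the server's actual code:

```python
def infer_condition(key, value):
    """Map one user_context filter entry to a FilterCondition dict:
    strings get EQ, arrays get IN."""
    if isinstance(value, (list, tuple)):
        return {"key": key, "operator": "IN", "value": list(value)}
    return {"key": key, "operator": "EQ", "value": value}
```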
Execution Order
Nexla applies filters in deterministic order — the caller does not control this:
- access_rules — hard gate, fail fast if role or policy is violated
- access_scope — applied as data-level restriction across all relevant nexsets
- filters — intersected with access_scope result, applied per nexset based on registration
- Embedding / retrieval — vector and/or SQL based on nexset type
- LLM generation — answer synthesized from retrieved context
Multi-Turn Conversations
To maintain conversation context across multiple requests, set user_context.session_id to a stable identifier.
- The server persists conversation history and includes it in subsequent requests within the same session
- Sessions are scoped by the combination of your API key, user_id, and session_id
- Sessions expire after 7 days of inactivity
- If conversation history storage is unavailable on load, the request proceeds without history (HTTP 200) and the response carries a warnings[] entry with code SESSION_HISTORY_UNAVAILABLE. If a save fails after the answer is generated, the response carries SESSION_HISTORY_SAVE_FAILED and subsequent turns may miss this turn's context
- To start a fresh conversation, use a new session_id
- system_prompt is request-scoped: it is appended to the built-in V2 agent prompt for this request only and is not persisted in conversation history. Resend it on follow-up turns if the same instruction should keep applying
Cache Management
Two operator-facing endpoints exist to evict server-side cache entries when the upstream data they reference has changed. Both require the same Authorization credential as /v2/agentic-rag and operate on entries scoped to the caller's service-key namespace.
POST /v2/agentic-rag/cache/clear
Clears every nexset-scoped bucket for a single nexset in one shot.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
nexset_id | string | Yes (when clear_all=false) | Nexset to clear |
clear_all | boolean | No | Reserved. Currently rejected with 403 — global clears are not exposed on this endpoint |
Response:
{
"status": "ok",
"scope": "nexset",
"nexset_id": "10000",
"deleted": {
"pinecone": 3,
"filter_schema": 1,
"normalization_map": 1,
"dataset_info": 1
}
}
A pinecone_partial: true field may additionally appear in the response when a Redis SCAN-based delete was truncated before completing.
POST /v2/agentic-rag/cache/invalidate
Targets specific buckets rather than every nexset-scoped bucket.
Request body:
| Field | Type | Required | Description |
|---|---|---|---|
buckets | array | Yes | One or more bucket names from the table below |
nexset_id | string | Conditional | Required when buckets contains pinecone, filter_schema, normalization_map, or dataset_info |
credential_id | string | Conditional | Required when buckets contains credentials |
credential_mode | string | No | llm or embedding. When omitted, both modes are cleared for the given credential_id |
Cache buckets:
| Bucket | Scope Key | Description |
|---|---|---|
credentials | credential_id (+ optional credential_mode) | Cached LLM/embedding credential resolutions |
pinecone | nexset_id | Cached Pinecone query results for the nexset (pattern delete) |
filter_schema | nexset_id | Cached filter schema rows for the nexset |
normalization_map | nexset_id | Cached field normalization map for the nexset |
dataset_info | nexset_id | Cached dataset metadata (name, schema, source/connector type) used by the resolver |
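For example, to evict only the cached filter schema and dataset metadata for one nexset, the request body would be (IDs are placeholders):

```json
{
  "buckets": ["filter_schema", "dataset_info"],
  "nexset_id": "10000"
}
```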
Response:
{
"status": "ok",
"deleted": {
"filter_schema": 1,
"dataset_info": 1
}
}
Unsupported bucket names are rejected with HTTP 400.
Cache Policy
The cache_policy request field on POST /v2/agentic-rag controls how the server uses its execution caches for a single request:
| Value | Behavior |
|---|---|
default | Read from cache when present; write fresh entries on misses. Standard production behavior |
refresh | Skip cache reads, force live fetches, then overwrite the cache with the fresh result. Use after upstream data has changed but you do not want to manually invalidate |
bypass | Skip cache reads and writes. Each cache lookup is a live fetch and nothing is persisted. Use for one-off debugging or for callers that should never touch shared cache state |
The legacy skip_cache: true flag is still accepted and maps to bypass. It takes precedence over cache_policy if both are provided.
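The precedence rule can be stated as a one-liner. A sketch of how the server resolves the effective policy from a request body, for illustration:

```python
def effective_cache_policy(body):
    """Resolve the cache policy applied for one request body: the legacy
    skip_cache=true flag wins over cache_policy and maps to bypass;
    otherwise cache_policy applies, defaulting to "default"."""
    if body.get("skip_cache"):
        return "bypass"
    return body.get("cache_policy", "default")
```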
Limits and Timeouts
| Limit | Value | Description |
|---|---|---|
| Tool calls per request | 8 | Maximum number of search operations the agent can perform in a single request. Exceeding this raises HTTP 429 (non-streaming) or emits an SSE error event with error_type=tool_limit_exceeded (streaming) |
| Stream timeout | 300 seconds | Maximum wall-clock time for an SSE stream. The stream is terminated with error_type=stream_timeout if no events are produced within this window |
Error Reference
Error responses are returned as standard HTTP error responses with a JSON body:
{
"detail": "Error message describing the issue"
}
| Status | Condition | Detail | Retryable |
|---|---|---|---|
| 400 | No nexsets provided | "At least one nexset is required" | No |
| 401 | Missing or invalid Authorization header; JWT missing org_id | "org_id missing from JWT claims" (JWT case) | No |
| 403 | LLM credentials could not be resolved to a valid API key | "LLM credentials required: credential_id must resolve to valid API key" | No |
| 422 | Model name cannot be determined from the credential or request; user_prompt empty or whitespace-only | Varies | No |
| 429 | Agent exceeded its tool-call / request budget (agent-side limit) | "Agent exceeded tool-call / request limit" | Yes — split the query or reduce scope |
| 429 | LLM provider rate-limited the request (provider-side quota) | "LLM provider rate-limited the request" (passes through upstream Retry-After when present) | Yes — honor Retry-After |
| 500 | Internal server error (e.g., credential resolution failure, unclassified agent run error) | Varies | Maybe |
| 502 | Upstream LLM provider error during agent run; OR all data-source tools failed for this request | Varies | Maybe |
| 503 | Filter enforcement service unavailable, or any nexset's filter schema could not be loaded (fail-closed to prevent unauthorized access) | "Server-side filter enforcement is temporarily unavailable..." | Yes |
Retry Guidance
There is no explicit retryable field in error responses. Use standard HTTP semantics:
- 400, 401, 403, 422 — client-side issues. Fix the request before retrying
- 429 — rate-limited. Both agent-side ("Agent exceeded tool-call / request limit") and provider-side ("LLM provider rate-limited the request") are retryable. Honor the Retry-After header if present. For the agent-side case, consider splitting the query
- 500, 502 — may be transient. Retry with exponential backoff
- 503 — explicitly transient. Retry after a short delay
For streaming responses, errors that occur after the stream has opened are delivered as typed error SSE events rather than HTTP status codes (HTTP 200 has already been sent). The stream terminates cleanly with message_delta (stop_reason=error) and message_stop. See Streaming Error Types for the typed error_type values.
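The guidance above can be collapsed into a small classifier. This is an illustrative client policy, not part of the API contract:

```python
def retry_strategy(status, retry_after=None):
    """Classify an HTTP error status into a retry decision.

    Returns (action, delay_seconds) where action is one of
    "no_retry", "retry_after", or "backoff".
    """
    if status in (400, 401, 403, 422):
        return ("no_retry", None)          # client-side issue: fix the request first
    if status == 429:
        delay = float(retry_after) if retry_after else 1.0
        return ("retry_after", delay)      # honor Retry-After when present
    if status in (500, 502):
        return ("backoff", 1.0)            # exponential backoff, starting at 1s
    if status == 503:
        return ("retry_after", 2.0)        # explicitly transient
    return ("no_retry", None)
```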
Examples
Minimal Request (Non-Streaming)
curl -X POST https://api-genai.nexla.io/v2/agentic-rag \
-H "Content-Type: application/json" \
-H "Authorization: Basic YOUR_API_KEY" \
-d '{
"user_prompt": "What are the latest sales figures for Q2?",
"system_prompt": "Focus on a concise executive summary.",
"nexsets": ["10000", "10001"],
"user_context": {
"user_id": "user-123"
},
"llm_config": {
"credential_id": "cred-456"
}
}'
Request with Filters and Streaming
curl -N -X POST https://api-genai.nexla.io/v2/agentic-rag \
-H "Content-Type: application/json" \
-H "Authorization: Basic YOUR_API_KEY" \
-d '{
"user_prompt": "Find lease renewal terms for building A",
"system_prompt": "Return the result as short bullet points.",
"nexsets": [
{
"id": "10000",
"filters": {
"acl_filter": [
{ "key": "tenant_id", "operator": "EQ", "value": "tenant-1" }
],
"pre_filter": [
{ "key": "document_type", "operator": "EQ", "value": "lease" }
]
}
}
],
"user_context": {
"user_id": "user-123",
"session_id": "session-abc",
"access_rules": { "tenant_id": "tenant-1" }
},
"llm_config": {
"credential_id": "cred-456",
"model": "gpt-4o",
"provider": "openai"
},
"stream": true
}'
Multi-Turn Conversation
Send the first question:
curl -X POST https://api-genai.nexla.io/v2/agentic-rag \
-H "Content-Type: application/json" \
-H "Authorization: Basic YOUR_API_KEY" \
-d '{
"user_prompt": "What were Q2 sales?",
"system_prompt": "Respond in executive-summary style.",
"nexsets": ["10000"],
"user_context": {
"user_id": "user-123",
"session_id": "conv-001"
},
"llm_config": { "credential_id": "cred-456" }
}'
Then ask a follow-up using the same session_id:
curl -X POST https://api-genai.nexla.io/v2/agentic-rag \
-H "Content-Type: application/json" \
-H "Authorization: Basic YOUR_API_KEY" \
-d '{
"user_prompt": "How does that compare to Q1?",
"nexsets": ["10000"],
"user_context": {
"user_id": "user-123",
"session_id": "conv-001"
},
"llm_config": { "credential_id": "cred-456" }
}'
The agent will have access to the conversation history and understand "that" refers to Q2 sales. If you want the same request-specific system_prompt to apply on the follow-up, include it again in the second request.
Example SSE Stream
event: message_start
data: {"type":"message_start","message":{"id":"msg_a1b2c3","type":"message","role":"assistant","content":[],"model":"gpt-4o","stop_reason":null,"usage":{"input_tokens":null,"output_tokens":null}}}
event: tool_call_start
data: {"type":"tool_call_start","tool_name":"search_sales_reports_10000","tool_call_id":"call_x1","display_name":"Sales Reports"}
event: tool_call_delta
data: {"type":"tool_call_delta","args_delta":"{\"query\":\"Q2 sales figures\"}"}
event: tool_call_result
data: {"type":"tool_call_result","tool_call_id":"call_x1","content":{"nexset_id":"10000","total_results":5,"chunks":[{"text":"Q2 revenue reached $4.2M...","citation_index":1}]}}
event: generation_start
data: {"type":"generation_start"}
event: content_block_start
data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"The Q2 sales figures show a 12% increase "}}
event: content_block_delta
data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"over Q1, reaching $4.2M in total revenue [1]."}}
event: inline_citation
data: {"type":"inline_citation","citation_index":1,"source":{"index":1,"nexset_id":"10000","nexset_name":"Sales Reports","document_id":"doc-q2-2025","source_url":null,"title":"Q2 2025 Revenue Summary","page_numbers":[3],"bounding_boxes":[],"relevance_score":0.94}}
event: content_block_stop
data: {"type":"content_block_stop","index":0}
event: citation_block
data: {"type":"citation_block","citations":[{"index":1,"nexset_id":"10000","nexset_name":"Sales Reports","document_id":"doc-q2-2025","source_url":null,"title":"Q2 2025 Revenue Summary","page_numbers":[3],"bounding_boxes":[],"relevance_score":0.94}]}
event: message_delta
data: {"type":"message_delta","delta":{"stop_reason":"end_turn","stop_sequence":null},"usage":{"input_tokens":1250,"output_tokens":340}}
event: message_stop
data: {"type":"message_stop"}