Document Chunking
Before documents can be searched by the Agentic RAG agent, they pass through a chunking pipeline that splits each file into semantically meaningful chunks, enriches them with metadata, and verifies that the metadata is grounded in the source text. The quality of these chunks directly affects retrieval precision and answer quality.
This page describes how the pipeline processes documents, what each chunk contains, and what you can configure.
Supported File Types
| File Type | Extensions | Processing Approach |
|---|---|---|
| PDF | .pdf | Text extraction via pdfplumber. Pages with sparse text but rich visual content (charts, scanned pages) are rendered to images and processed with Gemini vision OCR |
| CSV | .csv | Parsed into a DataFrame. Rows are grouped into blocks and converted from tabular format into readable prose chunks |
| Excel | .xlsx, .xls | Each sheet is processed independently. Row blocks are chunked with tabular-specific prompts, preserving sheet context |
| Image | .png, .jpg, .jpeg | Sent directly to Gemini vision. Returns a single chunk with the extracted content |
For PDFs, the pipeline profiles each page to determine whether it contains primarily text or visual content (charts, diagrams, scanned text). Pages that fall below a configurable text threshold (default: 200 characters) are processed via vision OCR at 300 DPI, in batches of up to 4 pages per call.
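The page routing described above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `plan_pdf_pages` is hypothetical, and it assumes the threshold is compared against the stripped character count of each page's extracted text.

```python
from typing import List, Tuple

TEXT_THRESHOLD_CHARS = 200  # default text threshold from the description above
VISION_BATCH_SIZE = 4       # maximum pages per vision OCR call


def plan_pdf_pages(page_texts: List[str]) -> Tuple[List[int], List[List[int]]]:
    """Return (text page numbers, batches of vision page numbers), 1-based."""
    text_pages: List[int] = []
    vision_pages: List[int] = []
    for page_number, text in enumerate(page_texts, start=1):
        # Assumption: sparse pages are detected by stripped character count.
        if len(text.strip()) < TEXT_THRESHOLD_CHARS:
            vision_pages.append(page_number)
        else:
            text_pages.append(page_number)
    # Group the vision pages into batches of at most VISION_BATCH_SIZE.
    batches = [
        vision_pages[i:i + VISION_BATCH_SIZE]
        for i in range(0, len(vision_pages), VISION_BATCH_SIZE)
    ]
    return text_pages, batches
```

Given the extracted text of each page, the function reports which pages keep their text extraction and how the remaining pages would be grouped for vision calls.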
Chunking Strategies
The pipeline supports two chunking strategies. Heading-driven chunking is the default; density-driven chunking activates automatically as a fallback.
Heading-Driven Chunking
The primary strategy. It detects headings throughout the document, classifies them by type and level, and uses heading boundaries to split content into structurally coherent chunks.
How it works:
- Heading detection — An LLM pass identifies headings with their text, hierarchy level (1–6), type (section, page, bridge, global, footer), and content type
- Section building — Content between heading boundaries becomes a section
- Section merging — Single-line sections and sections below the minimum size (default: 400 characters) are merged with adjacent sections
- Size carving — Sections exceeding the maximum size (default: 1,800 characters) are split at natural break points. Up to 4 passes handle outliers
- Parent-child linking — A heading hierarchy is built so child chunks carry their parent section title for context
- Metadata enrichment — Each chunk receives LLM-generated metadata (topic, summary, content type, potential questions)
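The merging and carving steps can be approximated with the sketch below. It is an illustration under stated assumptions: the helper names are hypothetical, small sections are merged into the preceding section (or the following one when they come first), and "natural break points" are taken to be paragraph breaks, line breaks, sentence ends, and spaces, in that order of preference.

```python
from typing import List

MIN_CHARS = 400    # CHUNK_MIN_CHARS default
MAX_CHARS = 1800   # CHUNK_MAX_CHARS default
MAX_PASSES = 4     # carving passes


def merge_small_sections(sections: List[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Merge single-line or undersized sections into a neighbouring section."""
    merged: List[str] = []
    pending = ""  # small leading content waiting for the next section
    for section in sections:
        if pending:
            section = pending + "\n" + section
            pending = ""
        is_small = len(section) < min_chars or "\n" not in section.strip()
        if is_small and merged:
            merged[-1] += "\n" + section
        elif is_small:
            pending = section
        else:
            merged.append(section)
    if pending:
        merged.append(pending)
    return merged


def find_break(text: str, limit: int) -> int:
    """Return a cut position at or before limit, preferring stronger breaks."""
    window = text[:limit]
    for separator in ("\n\n", "\n", ". ", " "):
        position = window.rfind(separator)
        if position > 0:
            return position + len(separator)
    return limit  # no break found: hard cut


def carve(section: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split oversized text in up to MAX_PASSES passes."""
    pieces = [section]
    for _ in range(MAX_PASSES):
        if all(len(piece) <= max_chars for piece in pieces):
            break
        next_pieces: List[str] = []
        for piece in pieces:
            if len(piece) <= max_chars:
                next_pieces.append(piece)
                continue
            cut = find_break(piece, max_chars)
            head, tail = piece[:cut].rstrip(), piece[cut:].lstrip()
            next_pieces.extend(part for part in (head, tail) if part)
        pieces = next_pieces
    return pieces
```

Because each pass peels at most one piece off an oversized remainder, a very long section can still exceed the maximum after four passes, which is consistent with the passes being a bounded effort rather than a guarantee.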
Density-Driven Chunking
The fallback strategy. It activates when a document block has fewer than 2 boundary headings (section, page, or bridge type), making structural chunking unreliable.
How it works:
- Page grouping — Pages are grouped into blocks by estimated token count, using median absolute deviation (MAD) thresholding to determine target block size
- LLM chunking — Each block is sent to the LLM with a chunking prompt that produces semantically coherent chunks
- Timeout recovery — If a block times out and exceeds 4,000 characters, it is recursively split and retried (max depth: 3)
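The fallback trigger and the timeout recovery step can be sketched as below. The function names are hypothetical, `chunk_fn` stands in for the LLM chunking call, and splitting a timed-out block at its midpoint is an assumption; the source only states that the block is recursively split and retried.

```python
from typing import Callable, Iterable, List

BOUNDARY_TYPES = {"section", "page", "bridge"}
MIN_SPLIT_CHARS = 4000  # only blocks larger than this are split on timeout
MAX_DEPTH = 3           # maximum recursion depth


def pick_strategy(heading_types: Iterable[str], use_heading_driven: bool = True) -> str:
    """Choose 'heading' when at least 2 boundary headings exist, else 'density'."""
    boundary_count = sum(1 for kind in heading_types if kind in BOUNDARY_TYPES)
    if use_heading_driven and boundary_count >= 2:
        return "heading"
    return "density"


def chunk_with_recovery(
    block: str,
    chunk_fn: Callable[[str], List[str]],
    depth: int = 0,
) -> List[str]:
    """Call chunk_fn; on timeout, halve large blocks and retry each half."""
    try:
        return chunk_fn(block)
    except TimeoutError:
        if len(block) <= MIN_SPLIT_CHARS or depth >= MAX_DEPTH:
            raise  # too small or too deep to split further
        middle = len(block) // 2  # assumption: split at the midpoint
        left = chunk_with_recovery(block[:middle], chunk_fn, depth + 1)
        right = chunk_with_recovery(block[middle:], chunk_fn, depth + 1)
        return left + right
```

A stub that raises `TimeoutError` for long inputs is enough to exercise the recursion without calling a model.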
Chunk Schema
Every chunk produced by the pipeline contains these fields:
Core Fields
| Field | Type | Description |
|---|---|---|
| chunk_number | string | Sequential chunk number within the document, starting from "1" |
| page_number | string | Source page number (PDF only). Derived from page markers in the extracted text |
| sheet_name | string | Source sheet name (Excel only) |
| chunk_text | string | The chunk content as well-structured prose. Tables and lists are converted to flowing text with all data preserved |
| source_info | string | Source filename |
Metadata Fields
| Field | Type | Description |
|---|---|---|
| topic | string | Concise topic title (3–10 words), grounded in the chunk text |
| summary | string | 1–2 sentence summary, traceable to the chunk text |
| section | string | Parent section heading. For child chunks in a heading hierarchy, this is the enclosing section title |
| sub_topics | string | Comma-separated list of sub-topics, or empty |
| universal | string | Pipe-separated universal headings (e.g., "Acme Corp \| 2024-03-15 \| Confidential"). These are headings that apply across the entire document |
| section_reference | string | Exact legal or structural citation found in the text (e.g., "Section 3(c)(ii)", "Article 12.4", "Exhibit G"), or empty |
| content_type | string | Dominant content type of the chunk. See Content Types |
| potential_questions | array | 3–5 specific questions answerable from this chunk's text alone |
| customer_specific_tags | array | Custom tags derived from domain-specific instructions, or empty array |
Document-Level Fields
These fields are populated once at the document level and attached to every chunk:
| Field | Type | Description |
|---|---|---|
| freshness_date | string | Most relevant date in the document, normalized to YYYY-MM-DD. Supports ISO dates, "DD Month YYYY", "Month YYYY", quarter formats (Q1 2024, 2024-Q2, 1Q24), and year-only |
| freshness_date_unix | integer | Unix timestamp of freshness_date |
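To make the supported date formats concrete, here is a rough normalization sketch. Several choices are assumptions that the description above does not settle: period-style dates (month, quarter, year) map to the first day of the period, two-digit years in the 1Q24 form are read as 20YY, and the Unix timestamp is taken at midnight UTC. The function names are hypothetical.

```python
import re
from datetime import date, datetime, timezone
from typing import Optional

MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]


def _month_number(name: str) -> Optional[int]:
    name = name.lower()
    return MONTHS.index(name) + 1 if name in MONTHS else None


def normalize_freshness_date(raw: str) -> Optional[str]:
    """Return YYYY-MM-DD, or None for an unrecognised format.

    Impossible calendar dates (for example 2024-02-31) raise ValueError.
    """
    text = raw.strip()
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text):  # ISO date
        year, month, day = (int(group) for group in m.groups())
        return date(year, month, day).isoformat()
    if m := re.fullmatch(r"(\d{1,2}) ([A-Za-z]+) (\d{4})", text):  # DD Month YYYY
        month = _month_number(m.group(2))
        if month:
            return date(int(m.group(3)), month, int(m.group(1))).isoformat()
    if m := re.fullmatch(r"([A-Za-z]+) (\d{4})", text):  # Month YYYY
        month = _month_number(m.group(1))
        if month:
            return date(int(m.group(2)), month, 1).isoformat()
    quarter = year = None
    if m := re.fullmatch(r"Q([1-4]) (\d{4})", text):  # Q1 2024
        quarter, year = int(m.group(1)), int(m.group(2))
    elif m := re.fullmatch(r"(\d{4})-Q([1-4])", text):  # 2024-Q2
        year, quarter = int(m.group(1)), int(m.group(2))
    elif m := re.fullmatch(r"([1-4])Q(\d{2})", text):  # 1Q24, assumed 20YY
        quarter, year = int(m.group(1)), 2000 + int(m.group(2))
    if quarter:
        return date(year, 3 * (quarter - 1) + 1, 1).isoformat()
    if re.fullmatch(r"\d{4}", text):  # year only
        return date(int(text), 1, 1).isoformat()
    return None


def freshness_date_unix(freshness_date: str) -> int:
    """Unix timestamp of a YYYY-MM-DD date at midnight UTC (assumed)."""
    parsed = datetime.strptime(freshness_date, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())
```

If the pipeline anchors quarters or months to the end of the period instead, only the day and month arithmetic in the period branches would change.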
Structured Data Fields
| Field | Type | Description |
|---|---|---|
| structured_tables | string | JSON-encoded array of extracted tables (see Structured Table Extraction). Empty string when no tables are extracted |
Debug Fields
| Field | Type | Description |
|---|---|---|
| debug_info | object | Diagnostic metadata. Contains chunker_result ("Success" or "Fail"), file_type, model, and llm_stats with token counts and page-level statistics |
debug_info.llm_stats
| Field | Type | Description |
|---|---|---|
| input_tokens_per_file | integer | Total input tokens consumed across all LLM calls for this file |
| output_tokens_per_file | integer | Total output tokens generated |
| total_tokens_per_file | integer | Sum of input and output tokens |
| total_llm_calls_per_file | integer | Number of LLM API calls made |
| total_pages | integer | Total pages in the source document (PDF only) |
| pages_with_chunks | integer | Pages that produced at least one chunk |
| pages_with_zero_chunks | integer | Pages that produced no chunks (e.g., blank pages, headers-only pages) |
Content Types
The content_type field classifies each chunk's dominant content:
| Value | Description |
|---|---|
| narrative | Explanatory prose, background information, descriptions |
| metrics | Quantitative data, tables, numerical figures, financial data |
| instructions | Step-by-step procedures, how-to content |
| comparisons | Parallel structures comparing items, side-by-side analyses |
| legal | Disclaimers, clauses, formal legal language, terms and conditions |
| concept | Term definitions, glossary entries, conceptual explanations |
| other | Content that does not fit the categories above |
Metadata Enrichment
Chunks are enriched through a multi-stage pipeline:
- Document metadata — The LLM extracts the file title, freshness date, and a document-level summary from the first pages of the document. Custom instructions can declare additional document-level fields
- Structure planning — A document plan is generated from a sample of pages, identifying the document type, heading hierarchy depth, table density, figure count, and whether scanned pages are suspected
- Chunk-level metadata — After chunking, each chunk is sent through a batch metadata enrichment pass (up to 6 chunks per LLM call) that generates the topic, summary, sub_topics, section_reference, content_type, potential_questions, and customer_specific_tags fields
Groundedness Verification
When enabled, the pipeline verifies that chunk metadata is faithful to the source text. This catches hallucinated summaries, inaccurate topics, or questions that cannot actually be answered from the chunk.
How it works:
- Sampling — A deterministic sample of chunks is selected (default: 35% of chunks, minimum 6)
- Verification — Each sampled chunk is evaluated by the LLM, which checks whether the summary, topic, content type, and potential questions are grounded in the chunk text. Each chunk receives a severity rating: ok, minor, or major
- Drift assessment — The fraction of major failures in the sample is calculated. If drift is below 5%, no retry is needed. If drift exceeds 20% and major failures are present, the retry loop activates
- Retry loop — Failed chunks are re-sent for metadata generation with the specific issues flagged. The loop stops when drift improves, or when guardrails are hit:
  - Maximum 30 chunks retried per document
  - Maximum $0.10 USD cost for the verify-and-retry cycle
  - Maximum 180 seconds wall-clock time
  - Monotonic improvement required (no improvement = stop)
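The sampling, drift, and guardrail logic above might look roughly like this. It is a sketch with hypothetical names: hash-based ranking is only one way to get a deterministic sample, rounding the 35% up is an assumption, and the limits are treated as inclusive. Behaviour for drift between 5% and 20% is not stated in this page, so the sketch reports it as unspecified rather than guessing.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional

SAMPLE_PERCENT = 35
SAMPLE_MINIMUM = 6
DRIFT_OK = 0.05
DRIFT_RETRY = 0.20


def sample_chunks(chunk_ids: List[str]) -> List[str]:
    """Pick a repeatable sample: the same ids always yield the same subset."""
    target = max(SAMPLE_MINIMUM, (len(chunk_ids) * SAMPLE_PERCENT + 99) // 100)
    target = min(target, len(chunk_ids))
    # Assumption: rank ids by a stable hash, then keep document order.
    ranked = sorted(chunk_ids, key=lambda cid: hashlib.sha256(cid.encode()).hexdigest())
    chosen = set(ranked[:target])
    return [cid for cid in chunk_ids if cid in chosen]


def drift(severities: List[str]) -> float:
    """Fraction of sampled chunks rated 'major'."""
    return severities.count("major") / len(severities) if severities else 0.0


def retry_decision(severities: List[str]) -> str:
    value = drift(severities)
    if value < DRIFT_OK:
        return "skip"
    if value > DRIFT_RETRY and "major" in severities:
        return "retry"
    return "unspecified"  # the 5% to 20% range is not described in the source


@dataclass
class RetryBudget:
    chunks_retried: int = 0
    cost_usd: float = 0.0
    elapsed_seconds: float = 0.0


def guardrail_hit(budget: RetryBudget, previous_drift: float, current_drift: float) -> Optional[str]:
    """Return the name of the first guardrail that stops the loop, if any."""
    if budget.chunks_retried >= 30:
        return "max_chunks"
    if budget.cost_usd >= 0.10:
        return "max_cost"
    if budget.elapsed_seconds >= 180:
        return "max_time"
    if current_drift >= previous_drift:
        return "no_improvement"
    return None
```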
Verification adds latency and cost but significantly improves metadata quality. For high-stakes document collections where answer accuracy is critical, keep it enabled (the default).
Structured Table Extraction
For chunks classified as metrics or comparisons, the pipeline can extract structured table data as machine-readable JSON.
How it works:
- Candidate selection — Chunks with content_type of metrics or comparisons are identified
- Batch extraction — Candidate chunks are sent in batches of 4 to the LLM, which extracts tables as JSON objects with title, headers, rows, and notes
- Validation — Each extracted cell value is checked against the source chunk text. Tables where more than 30% of cells are not found in the source are dropped. Tables with 10–30% ungrounded cells are flagged
- Cross-chunk bridging — When a table spans multiple chunks, the pipeline matches tables across chunk boundaries by header similarity (threshold: 75%). Matched continuation rows are merged into the parent table with provenance annotations
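The validation thresholds and the header match can be illustrated as follows. The names are hypothetical, a cell counts as grounded when it appears as a substring of the chunk text, and header similarity is computed with `difflib.SequenceMatcher` over the joined headers; both are stand-ins, since the actual matching methods are not described here.

```python
from difflib import SequenceMatcher
from typing import Any, Dict, List

DROP_ABOVE = 0.30    # more than 30% ungrounded cells: drop the table
FLAG_FROM = 0.10     # 10% to 30% ungrounded cells: flag the table
HEADER_MATCH = 0.75  # header similarity threshold for bridging


def ungrounded_fraction(table: Dict[str, Any], chunk_text: str) -> float:
    """Fraction of cells that do not appear in the source chunk text."""
    cells = [str(cell) for row in table["rows"] for cell in row]
    if not cells:
        return 0.0
    # Assumption: substring containment stands in for the real grounding check.
    missing = sum(1 for cell in cells if cell not in chunk_text)
    return missing / len(cells)


def validate_table(table: Dict[str, Any], chunk_text: str) -> str:
    fraction = ungrounded_fraction(table, chunk_text)
    if fraction > DROP_ABOVE:
        return "drop"
    if fraction >= FLAG_FROM:
        return "flag"
    return "keep"


def headers_match(first: List[str], second: List[str]) -> bool:
    """True when two header rows are similar enough to bridge tables."""
    left = "|".join(first).lower()
    right = "|".join(second).lower()
    return SequenceMatcher(None, left, right).ratio() >= HEADER_MATCH
```

Substring matching is lenient (for example, "8%" is found inside "18%"), so a stricter tokenized comparison may be preferable in practice.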
The extracted tables are stored as a JSON string in the structured_tables field. Each table object has this shape:
{
  "title": "Q2 2024 Revenue by Region",
  "headers": ["Region", "Revenue", "Growth"],
  "rows": [
    ["West", "$1.8M", "18%"],
    ["East", "$1.2M", "8%"]
  ],
  "notes": "Source: Q2 Quarterly Report, p. 7"
}
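Because structured_tables is a JSON string that is empty when no tables were extracted, consumers need to guard against the empty case before decoding. A minimal sketch, with hypothetical helper names:

```python
import json
from typing import Any, Dict, List


def load_structured_tables(chunk: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Decode the structured_tables field; an empty string means no tables."""
    raw = chunk.get("structured_tables", "")
    return json.loads(raw) if raw else []


def rows_as_records(table: Dict[str, Any]) -> List[Dict[str, str]]:
    """Pair each row's cells with the table headers."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]
```

Calling `rows_as_records` on the table shown above would pair "West" with "Region", "$1.8M" with "Revenue", and so on.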
Configuration
The pipeline is configured through environment variables. All settings have sensible defaults.
Strategy and Model
| Variable | Type | Default | Description |
|---|---|---|---|
| GEMINI_MODEL | string | gemini-2.5-flash | Primary Gemini model used for chunking, metadata, and verification |
| CHUNKER_USE_HEADING_DRIVEN | boolean | true | Enable heading-driven chunking as the primary strategy. When false, all documents use density-driven chunking |
Chunk Sizing
| Variable | Type | Default | Description |
|---|---|---|---|
| CHUNK_MAX_CHARS | integer | 1800 | Sections larger than this (in characters) are split via size-aware carving |
| CHUNK_MIN_CHARS | integer | 400 | Minimum chunk size. Fragments smaller than this are merged with adjacent content |
Verification
| Variable | Type | Default | Description |
|---|---|---|---|
| CHUNKER_VERIFY_CHUNKS | boolean | true | Enable groundedness verification and retry loop |
Structured Tables
| Variable | Type | Default | Description |
|---|---|---|---|
| CHUNKER_STRUCTURED_TABLES | boolean | true | Enable structured table extraction for metrics and comparisons chunks |
Boolean environment variables accept 1, true, or yes (case-insensitive) as truthy values. Any other value is treated as false.
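The boolean rule and the documented defaults can be expressed as a small loader. This is an illustrative sketch, not the pipeline's own configuration code: `ChunkerConfig`, `env_bool`, and `load_config` are hypothetical names, and an unset variable is assumed to fall back to the default listed in the tables above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

TRUTHY = {"1", "true", "yes"}


def env_bool(env: Mapping[str, str], name: str, default: bool) -> bool:
    """Unset: use the default. Set: true only for 1, true, or yes (any case)."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.lower() in TRUTHY


@dataclass(frozen=True)
class ChunkerConfig:
    gemini_model: str
    use_heading_driven: bool
    chunk_max_chars: int
    chunk_min_chars: int
    verify_chunks: bool
    structured_tables: bool


def load_config(env: Mapping[str, str] = os.environ) -> ChunkerConfig:
    return ChunkerConfig(
        gemini_model=env.get("GEMINI_MODEL", "gemini-2.5-flash"),
        use_heading_driven=env_bool(env, "CHUNKER_USE_HEADING_DRIVEN", True),
        chunk_max_chars=int(env.get("CHUNK_MAX_CHARS", "1800")),
        chunk_min_chars=int(env.get("CHUNK_MIN_CHARS", "400")),
        verify_chunks=env_bool(env, "CHUNKER_VERIFY_CHUNKS", True),
        structured_tables=env_bool(env, "CHUNKER_STRUCTURED_TABLES", True),
    )
```

Note that values such as "on" or "enabled" are not in the truthy set, so setting a variable to one of them disables the feature.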
Custom Instructions
You can pass domain-specific instructions to customize how documents are chunked and what metadata is extracted. Custom instructions are injected into the LLM prompts at every stage — document metadata extraction, chunking, metadata enrichment, and verification.
What You Can Do
- Guide chunking behavior — Provide context about the document domain (e.g., "These are pharmaceutical trial reports; preserve dosage tables verbatim")
- Declare custom per-chunk fields — Define additional fields to extract by specifying them with a "field_name": syntax. Field names must be lowercase snake_case
- Declare custom document-level fields — Add fields to the document metadata stage (title, dates, custom identifiers)
Instruction Precedence
When multiple instruction sources are available, the pipeline uses the first non-empty source:
- extra_data["instructions"] passed in the API call
- customInstructions parameter (legacy)
- Deployment-level inline instructions (baked into the deployment)
- No custom instructions (generic mode)
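The first-non-empty rule can be sketched as a simple fallthrough. The function and parameter names are hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
from typing import Any, Mapping, Optional


def resolve_instructions(
    extra_data: Optional[Mapping[str, Any]],
    custom_instructions: Optional[str],
    deployment_instructions: Optional[str],
) -> str:
    """Return the first non-empty instruction source, or '' for generic mode."""
    candidates = [
        (extra_data or {}).get("instructions"),  # API call
        custom_instructions,                     # legacy parameter
        deployment_instructions,                 # deployment-level inline
    ]
    for candidate in candidates:
        # Assumption: whitespace-only instructions count as empty.
        if isinstance(candidate, str) and candidate.strip():
            return candidate
    return ""
```

Sources are never merged in this sketch: a non-empty higher-priority source fully replaces the lower ones, which matches the "first non-empty source" wording.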
Example
Consider the following custom instructions:
These documents are commercial real estate leases. Extract the following per-chunk fields:
"lease_clause_type": The type of lease clause (rent, maintenance, termination, renewal, etc.)
"effective_date_range": The date range this clause applies to, if stated.
These instructions would produce chunks with the standard fields plus lease_clause_type and effective_date_range as additional keys.
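To check which custom fields a set of instructions declares before sending documents through the pipeline, a small parser for the "field_name": syntax can help. This is a hypothetical client-side helper, not part of the pipeline; it assumes each declaration starts its own line and that snake_case means lowercase letters and digits separated by single underscores.

```python
import re
from typing import Dict

# A quoted lowercase snake_case name at the start of a line, followed by a colon.
FIELD_DECLARATION = re.compile(
    r'^[ \t]*"([a-z][a-z0-9]*(?:_[a-z0-9]+)*)"[ \t]*:[ \t]*(.*)$',
    re.MULTILINE,
)


def declared_fields(instructions: str) -> Dict[str, str]:
    """Map each declared field name to its description text."""
    return {
        name: description.strip()
        for name, description in FIELD_DECLARATION.findall(instructions)
    }
```

Applied to the lease instructions above, the helper would report lease_clause_type and effective_date_range, while a name such as "LeaseType" would be ignored because it is not lowercase snake_case.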