Document Chunking

Before the Agentic RAG agent can search documents, they pass through a chunking pipeline that splits each file into semantically meaningful chunks, enriches them with metadata, and verifies that the generated metadata is grounded in the source text. The quality of these chunks directly affects retrieval precision and answer quality.

This page describes how the pipeline processes documents, what each chunk contains, and what you can configure.

Supported File Types

File Type	Extensions	Processing Approach
PDF	`.pdf`	Text extraction via pdfplumber. Pages with sparse text but rich visual content (charts, scanned pages) are rendered to images and processed with Gemini vision OCR
CSV	`.csv`	Parsed into a DataFrame. Rows are grouped into blocks and converted from tabular format into readable prose chunks
Excel	`.xlsx`, `.xls`	Each sheet is processed independently. Row blocks are chunked with tabular-specific prompts, preserving sheet context
Image	`.png`, `.jpg`, `.jpeg`	Sent directly to Gemini vision. Returns a single chunk with the extracted content

note

For PDFs, the pipeline profiles each page to determine whether it contains primarily text or visual content (charts, diagrams, scanned text). Pages that fall below a configurable text threshold (default: 200 characters) are processed via vision OCR at 300 DPI, in batches of up to 4 pages per call.
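
The routing described in this note can be sketched as follows. This is a minimal illustration rather than the pipeline's actual code: `profile_pages` and `batch_pages` are hypothetical names, and measuring text length after stripping whitespace is an assumption.

```python
from typing import List, Tuple

TEXT_THRESHOLD_CHARS = 200  # documented default text threshold
VISION_BATCH_SIZE = 4       # documented maximum pages per vision call


def profile_pages(page_texts: List[str],
                  threshold: int = TEXT_THRESHOLD_CHARS) -> Tuple[List[int], List[int]]:
    """Return (text_pages, vision_pages) as 1-based page numbers."""
    text_pages: List[int] = []
    vision_pages: List[int] = []
    for number, text in enumerate(page_texts, start=1):
        # Assumption: surrounding whitespace does not count as extracted text.
        if len(text.strip()) < threshold:
            vision_pages.append(number)
        else:
            text_pages.append(number)
    return text_pages, vision_pages


def batch_pages(pages: List[int], size: int = VISION_BATCH_SIZE) -> List[List[int]]:
    """Group page numbers into batches of at most `size` pages."""
    return [pages[i:i + size] for i in range(0, len(pages), size)]
```

A page with exactly 200 characters of text stays on the text path in this sketch, because only pages below the threshold are routed to vision OCR.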

Chunking Strategies

The pipeline supports two chunking strategies. Heading-driven chunking is the default; density-driven chunking activates automatically as a fallback.

Heading-Driven Chunking

The primary strategy. It detects headings throughout the document, classifies them by type and level, and uses heading boundaries to split content into structurally coherent chunks.

How it works:

Heading detection — An LLM pass identifies headings with their text, hierarchy level (1–6), type (section, page, bridge, global, footer), and content type
Section building — Content between heading boundaries becomes a section
Section merging — Single-line sections and sections below the minimum size (default: 400 characters) are merged with adjacent sections
Size carving — Sections exceeding the maximum size (default: 1,800 characters) are split at natural break points. Up to 4 passes handle outliers
Parent-child linking — A heading hierarchy is built so child chunks carry their parent section title for context
Metadata enrichment — Each chunk receives LLM-generated metadata (topic, summary, content type, potential questions)
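
The section merging and size carving steps above can be approximated with the sketch below. It rests on several assumptions that the text does not confirm: undersized sections are merged into the preceding section, "natural break points" are taken to be paragraph or sentence boundaries, and carving runs as a single loop rather than the documented multi-pass approach. The function names are hypothetical.

```python
from typing import List

MIN_CHARS = 400   # documented default minimum section size
MAX_CHARS = 1800  # documented default maximum section size


def merge_small_sections(sections: List[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Fold undersized sections into a neighbor (assumed: the preceding one)."""
    merged: List[str] = []
    for section in sections:
        too_small = len(section) < min_chars
        previous_too_small = bool(merged) and len(merged[-1]) < min_chars
        if merged and (too_small or previous_too_small):
            merged[-1] = merged[-1] + "\n\n" + section
        else:
            merged.append(section)
    return merged


def carve(section: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split an oversized section at the last paragraph or sentence break
    that fits within max_chars, falling back to a hard cut."""
    pieces: List[str] = []
    rest = section.strip()
    while len(rest) > max_chars:
        window = rest[:max_chars]
        paragraph_break = window.rfind("\n\n")
        sentence_break = window.rfind(". ")
        # Keep the period with the sentence it ends.
        sentence_cut = sentence_break + 1 if sentence_break != -1 else -1
        cut = max(paragraph_break, sentence_cut)
        if cut <= 0:
            cut = max_chars  # no natural break found in the window
        pieces.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        pieces.append(rest)
    return pieces
```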

Density-Driven Chunking

The fallback strategy. It activates when a document block has fewer than 2 boundary headings (section, page, or bridge type), making structural chunking unreliable.
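
As a rough sketch of the fallback rule, the selection could look like this. The heading dictionaries and the `choose_strategy` helper are hypothetical; only the boundary types and the threshold of 2 come from the description above.

```python
from typing import Dict, List

BOUNDARY_TYPES = {"section", "page", "bridge"}
MIN_BOUNDARY_HEADINGS = 2


def choose_strategy(headings: List[Dict[str, str]], heading_driven_enabled: bool = True) -> str:
    """Pick a chunking strategy for one document block."""
    boundary_count = sum(1 for h in headings if h.get("type") in BOUNDARY_TYPES)
    if heading_driven_enabled and boundary_count >= MIN_BOUNDARY_HEADINGS:
        return "heading_driven"
    return "density_driven"
```

Headings of type `global` or `footer` do not count toward the threshold, so a block with many running footers but a single section heading would still fall back.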

How it works:

Page grouping — Pages are grouped into blocks by estimated token count, using median absolute deviation (MAD) thresholding to determine target block size
LLM chunking — Each block is sent to the LLM with a chunking prompt that produces semantically coherent chunks
Timeout recovery — If a block times out and exceeds 4,000 characters, it is recursively split and retried (max depth: 3)
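
Two of these steps lend themselves to a short sketch. The first function computes the median absolute deviation of per-page token estimates; how the pipeline converts that statistic into a target block size is not specified here, so the sketch stops at the statistic. The second shows timeout recovery with the documented 4,000-character threshold and depth limit of 3. The `chunk_fn` callable, its use of `TimeoutError`, and splitting at a paragraph break near the midpoint are all assumptions.

```python
from statistics import median
from typing import Callable, List

SPLIT_THRESHOLD_CHARS = 4000  # documented threshold for recursive splitting
MAX_SPLIT_DEPTH = 3           # documented maximum recursion depth


def median_absolute_deviation(values: List[float]) -> float:
    """MAD: the median of absolute deviations from the median."""
    center = median(values)
    return median([abs(v - center) for v in values])


def chunk_with_recovery(block: str,
                        chunk_fn: Callable[[str], List[str]],
                        depth: int = 0) -> List[str]:
    """Call chunk_fn; on timeout, halve a large block and retry each half."""
    try:
        return chunk_fn(block)
    except TimeoutError:
        if len(block) <= SPLIT_THRESHOLD_CHARS or depth >= MAX_SPLIT_DEPTH:
            raise  # too small to split, or recursion budget exhausted
        middle = len(block) // 2
        # Assumption: prefer a paragraph break before the midpoint.
        cut = block.rfind("\n\n", 0, middle)
        if cut <= 0:
            cut = middle
        left = chunk_with_recovery(block[:cut], chunk_fn, depth + 1)
        right = chunk_with_recovery(block[cut:], chunk_fn, depth + 1)
        return left + right
```

MAD is less sensitive to a single outlier page than the standard deviation, which is presumably why it suits documents with one unusually dense page.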

Chunk Schema

Every chunk produced by the pipeline contains these fields:

Core Fields

Field	Type	Description
`chunk_number`	string	Sequential chunk number within the document, starting from `"1"`
`page_number`	string	Source page number (PDF only). Derived from page markers in the extracted text
`sheet_name`	string	Source sheet name (Excel only)
`chunk_text`	string	The chunk content as well-structured prose. Tables and lists are converted to flowing text with all data preserved
`source_info`	string	Source filename

Metadata Fields

Field	Type	Description
`topic`	string	Concise topic title (3–10 words), grounded in the chunk text
`summary`	string	1–2 sentence summary, traceable to the chunk text
`section`	string	Parent section heading. For child chunks in a heading hierarchy, this is the enclosing section title
`sub_topics`	string	Comma-separated list of sub-topics, or empty
`universal`	string	Pipe-separated universal headings (e.g., `"Acme Corp | 2024-03-15 | Confidential"`). These are headings that apply across the entire document
`section_reference`	string	Exact legal or structural citation found in the text (e.g., `"Section 3(c)(ii)"`, `"Article 12.4"`, `"Exhibit G"`), or empty
`content_type`	string	Dominant content type of the chunk. See Content Types
`potential_questions`	array	3–5 specific questions answerable from this chunk's text alone
`customer_specific_tags`	array	Custom tags derived from domain-specific instructions, or empty array

Document-Level Fields

These fields are populated once at the document level and attached to every chunk:

Field	Type	Description
`freshness_date`	string	Most relevant date in the document, normalized to `YYYY-MM-DD`. Supports ISO dates, `"DD Month YYYY"`, `"Month YYYY"`, quarter formats (`Q1 2024`, `2024-Q2`, `1Q24`), and year-only
`freshness_date_unix`	integer	Unix timestamp of `freshness_date`
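
To make the supported date formats concrete, here is a hedged sketch of a normalizer. The table above does not say which day a partial date maps to, so this sketch assumes the first day of the month, quarter, or year, assumes two-digit years in forms like `1Q24` belong to the 2000s, and assumes the Unix timestamp is taken at midnight UTC. The function names are hypothetical.

```python
import re
from datetime import date, datetime, timezone
from typing import Dict, Optional

MONTHS = {name: index for index, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def parse_freshness_date(raw: str) -> Optional[date]:
    """Parse the documented formats; return None when nothing matches."""
    text = raw.strip()
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text):        # ISO date
        return date(int(m[1]), int(m[2]), int(m[3]))
    if (m := re.fullmatch(r"(\d{1,2}) ([A-Za-z]+) (\d{4})", text)) and m[2].lower() in MONTHS:
        return date(int(m[3]), MONTHS[m[2].lower()], int(m[1]))   # DD Month YYYY
    if (m := re.fullmatch(r"([A-Za-z]+) (\d{4})", text)) and m[1].lower() in MONTHS:
        return date(int(m[2]), MONTHS[m[1].lower()], 1)           # Month YYYY
    if m := re.fullmatch(r"Q([1-4]) (\d{4})", text):               # Q1 2024
        return date(int(m[2]), 3 * (int(m[1]) - 1) + 1, 1)
    if m := re.fullmatch(r"(\d{4})-Q([1-4])", text):               # 2024-Q2
        return date(int(m[1]), 3 * (int(m[2]) - 1) + 1, 1)
    if m := re.fullmatch(r"([1-4])Q(\d{2})", text):                # 1Q24
        return date(2000 + int(m[2]), 3 * (int(m[1]) - 1) + 1, 1)
    if re.fullmatch(r"\d{4}", text):                               # year only
        return date(int(text), 1, 1)
    return None


def freshness_fields(day: date) -> Dict[str, object]:
    """Build both document-level fields from one date (midnight UTC assumed)."""
    stamp = datetime(day.year, day.month, day.day, tzinfo=timezone.utc).timestamp()
    return {"freshness_date": day.isoformat(), "freshness_date_unix": int(stamp)}
```

Storing the integer timestamp alongside the string makes range filters (for example, "documents newer than a given date") a simple numeric comparison.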

Structured Data Fields

Field	Type	Description
`structured_tables`	string	JSON-encoded array of extracted tables (see Structured Table Extraction). Empty string when no tables are extracted

Debug Fields

Field	Type	Description
`debug_info`	object	Diagnostic metadata. Contains `chunker_result` (`"Success"` or `"Fail"`), `file_type`, `model`, and `llm_stats` with token counts and page-level statistics

`debug_info.llm_stats`

Field	Type	Description
`input_tokens_per_file`	integer	Total input tokens consumed across all LLM calls for this file
`output_tokens_per_file`	integer	Total output tokens generated
`total_tokens_per_file`	integer	Sum of input and output tokens
`total_llm_calls_per_file`	integer	Number of LLM API calls made
`total_pages`	integer	Total pages in the source document (PDF only)
`pages_with_chunks`	integer	Pages that produced at least one chunk
`pages_with_zero_chunks`	integer	Pages that produced no chunks (e.g., blank pages, headers-only pages)

Content Types

The `content_type` field classifies each chunk's dominant content:

Value	Description
`narrative`	Explanatory prose, background information, descriptions
`metrics`	Quantitative data, tables, numerical figures, financial data
`instructions`	Step-by-step procedures, how-to content
`comparisons`	Parallel structures comparing items, side-by-side analyses
`legal`	Disclaimers, clauses, formal legal language, terms and conditions
`concept`	Term definitions, glossary entries, conceptual explanations
`other`	Content that does not fit the categories above

Metadata Enrichment

Chunks are enriched through a multi-stage pipeline:

Document metadata — The LLM extracts the file title, freshness date, and a document-level summary from the first pages of the document. Custom instructions can declare additional document-level fields
Structure planning — A document plan is generated from a sample of pages, identifying the document type, heading hierarchy depth, table density, figure count, and whether scanned pages are suspected
Chunk-level metadata — After chunking, each chunk is sent through a batch metadata enrichment pass (up to 6 chunks per LLM call) that generates the topic, summary, sub_topics, section_reference, content_type, potential_questions, and customer_specific_tags fields

Groundedness Verification

When enabled, the pipeline verifies that chunk metadata is faithful to the source text. This catches hallucinated summaries, inaccurate topics, or questions that cannot actually be answered from the chunk.

How it works:

Sampling — A deterministic sample of chunks is selected (default: 35% of chunks, minimum 6)
Verification — Each sampled chunk is evaluated by the LLM, which checks whether the summary, topic, content type, and potential questions are grounded in the chunk text. Each chunk receives a severity rating: ok, minor, or major
Drift assessment — The fraction of major failures in the sample is calculated. If drift is below 5%, no retry is needed. If drift exceeds 20% and major failures are present, the retry loop activates
Retry loop — Failed chunks are re-sent for metadata generation with the specific issues flagged. The loop stops when drift improves, or when guardrails are hit:
- Maximum 30 chunks retried per document
- Maximum $0.10 USD cost for the verify-and-retry cycle
- Maximum 180 seconds wall-clock time
- Monotonic improvement required (no improvement = stop)
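
The sampling, drift, and guardrail rules above can be sketched as follows. The text says the sample is deterministic but not how, so ranking chunk IDs by a stable hash is an assumption, as is rounding the 35% target up. The text also does not say what happens when drift falls between 5% and 20%; this sketch retries only in the explicitly stated case. All function names are hypothetical.

```python
import hashlib
import math
from typing import List

SAMPLE_FRACTION = 0.35  # documented default sample fraction
SAMPLE_MINIMUM = 6      # documented minimum sample size


def select_sample(chunk_ids: List[str]) -> List[str]:
    """Pick a repeatable sample: rank IDs by a stable hash, take the first N."""
    target = max(SAMPLE_MINIMUM, math.ceil(len(chunk_ids) * SAMPLE_FRACTION))
    size = min(len(chunk_ids), target)
    ranked = sorted(chunk_ids,
                    key=lambda cid: hashlib.sha256(cid.encode("utf-8")).hexdigest())
    return ranked[:size]


def drift(severities: List[str]) -> float:
    """Fraction of sampled chunks rated 'major'."""
    if not severities:
        return 0.0
    return severities.count("major") / len(severities)


def should_retry(severities: List[str]) -> bool:
    """Retry only when drift exceeds 20% and major failures are present."""
    return drift(severities) > 0.20 and "major" in severities


def within_guardrails(retried_chunks: int, cost_usd: float, elapsed_seconds: float) -> bool:
    """True while the documented retry budget has not been exhausted."""
    return retried_chunks < 30 and cost_usd < 0.10 and elapsed_seconds < 180
```

Because the ranking depends only on the chunk IDs, re-running verification on the same document selects the same chunks, which keeps drift measurements comparable across retries.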

tip

Verification adds latency and cost but significantly improves metadata quality. For high-stakes document collections where answer accuracy is critical, keep it enabled (the default).

Structured Table Extraction

For chunks classified as metrics or comparisons, the pipeline can extract structured table data as machine-readable JSON.

How it works:

Candidate selection — Chunks with content_type of metrics or comparisons are identified
Batch extraction — Candidate chunks are sent to the LLM in batches of 4, which extracts tables as JSON objects with title, headers, rows, and notes
Validation — Each extracted cell value is checked against the source chunk text. Tables where more than 30% of cells are not found in the source are dropped. Tables with 10–30% ungrounded cells are flagged
Cross-chunk bridging — When a table spans multiple chunks, the pipeline matches tables across chunk boundaries by header similarity (threshold: 75%). Matched continuation rows are merged into the parent table with provenance annotations
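
A minimal sketch of the validation thresholds and the header comparison follows. Two details are assumptions: cells are considered grounded when they appear as a substring of the chunk text, and header similarity is measured with Python's `difflib.SequenceMatcher`, since the text does not name a similarity metric. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def ungrounded_fraction(table: Dict, chunk_text: str) -> float:
    """Share of cell values that do not appear in the chunk text."""
    cells = [str(cell) for row in table["rows"] for cell in row]
    if not cells:
        return 0.0
    missing = sum(1 for cell in cells if cell not in chunk_text)
    return missing / len(cells)


def validate_table(table: Dict, chunk_text: str) -> str:
    """Apply the documented thresholds: drop above 30%, flag from 10% to 30%."""
    fraction = ungrounded_fraction(table, chunk_text)
    if fraction > 0.30:
        return "drop"
    if fraction >= 0.10:
        return "flag"
    return "keep"


def headers_match(left: List[str], right: List[str], threshold: float = 0.75) -> bool:
    """Compare two header rows (assumed metric: difflib ratio on joined text)."""
    a = "|".join(h.strip().lower() for h in left)
    b = "|".join(h.strip().lower() for h in right)
    return SequenceMatcher(None, a, b).ratio() >= threshold
```

Substring matching is deliberately lenient: a value such as `8%` also matches inside `18%`, so a production check would likely need token-aware comparison.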

The result is stored as a JSON string in the structured_tables field. Each table object has this shape:

{
  "title": "Q2 2024 Revenue by Region",
  "headers": ["Region", "Revenue", "Growth"],
  "rows": [
    ["West", "$1.8M", "18%"],
    ["East", "$1.2M", "8%"]
  ],
  "notes": "Source: Q2 Quarterly Report, p. 7"
}
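
Because the field holds a JSON string rather than a nested object, consumers need to decode it before use. The sketch below shows one way to do that; `load_structured_tables` and `table_records` are hypothetical helper names, and the chunk is assumed to be a plain dictionary.

```python
import json
from typing import Dict, List


def load_structured_tables(chunk: Dict) -> List[Dict]:
    """Decode the JSON-encoded array; an empty string means no tables."""
    raw = chunk.get("structured_tables", "")
    return json.loads(raw) if raw else []


def table_records(table: Dict) -> List[Dict[str, str]]:
    """Pair each row's cells with the table headers."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]
```

For the table shown above, `table_records` would yield one dictionary per region, keyed by `Region`, `Revenue`, and `Growth`.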

Configuration

The pipeline is configured through environment variables. All settings have sensible defaults.

Strategy and Model

Variable	Type	Default	Description
`GEMINI_MODEL`	string	`gemini-2.5-flash`	Primary Gemini model used for chunking, metadata, and verification
`CHUNKER_USE_HEADING_DRIVEN`	boolean	`true`	Enable heading-driven chunking as the primary strategy. When `false`, all documents use density-driven chunking

Chunk Sizing

Variable	Type	Default	Description
`CHUNK_MAX_CHARS`	integer	`1800`	Sections larger than this (in characters) are split via size-aware carving
`CHUNK_MIN_CHARS`	integer	`400`	Minimum chunk size. Fragments smaller than this are merged with adjacent content

Verification

Variable	Type	Default	Description
`CHUNKER_VERIFY_CHUNKS`	boolean	`true`	Enable groundedness verification and retry loop

Structured Tables

Variable	Type	Default	Description
`CHUNKER_STRUCTURED_TABLES`	boolean	`true`	Enable structured table extraction for `metrics` and `comparisons` chunks

note

Boolean environment variables accept `1`, `true`, or `yes` (case-insensitive) as truthy values. Any other value is treated as false.
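
The parsing rule in this note amounts to a few lines. The sketch below is illustrative only: `env_bool` is a hypothetical helper, and falling back to the documented default when the variable is unset is an assumption.

```python
import os

TRUTHY_VALUES = {"1", "true", "yes"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean setting from the environment."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # assumption: unset variables use the documented default
    return raw.lower() in TRUTHY_VALUES
```

Note that common spellings such as `on` or `enabled` are not in the truthy set, so setting `CHUNKER_VERIFY_CHUNKS=on` would disable verification under this rule.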

Custom Instructions

You can pass domain-specific instructions to customize how documents are chunked and what metadata is extracted. Custom instructions are injected into the LLM prompts at every stage — document metadata extraction, chunking, metadata enrichment, and verification.

What You Can Do

- Guide chunking behavior: Provide context about the document domain (e.g., "These are pharmaceutical trial reports; preserve dosage tables verbatim")
- Declare custom per-chunk fields: Define additional fields to extract by writing each one as `"field_name":` followed by a description of what to extract. Field names must be lowercase snake_case
- Declare custom document-level fields: Add fields to the document metadata stage (title, dates, custom identifiers)

Instruction Precedence

When multiple instruction sources are available, the pipeline uses the first non-empty source:

1. `extra_data["instructions"]` passed in the API call
2. `customInstructions` parameter (legacy)
3. Deployment-level inline instructions (baked into the deployment)
4. No custom instructions (generic mode)
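
The precedence order above can be expressed as a first-non-empty lookup. This is a sketch under assumptions: `resolve_instructions` is a hypothetical name, the three sources are passed in as plain values, and whitespace-only strings are treated as empty.

```python
from typing import Dict, Optional


def resolve_instructions(extra_data: Optional[Dict[str, str]],
                         custom_instructions: Optional[str],
                         deployment_instructions: Optional[str]) -> Optional[str]:
    """Return the first non-empty instruction source, or None for generic mode."""
    candidates = [
        (extra_data or {}).get("instructions"),  # API call payload
        custom_instructions,                     # legacy parameter
        deployment_instructions,                 # baked into the deployment
    ]
    for candidate in candidates:
        # Assumption: whitespace-only text counts as empty.
        if candidate and candidate.strip():
            return candidate
    return None
```

Sources are not merged: once a higher-priority source is non-empty, lower-priority instructions are ignored entirely.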

Example

For example, custom instructions such as:

These documents are commercial real estate leases. Extract the following per-chunk fields:
"lease_clause_type": The type of lease clause (rent, maintenance, termination, renewal, etc.)
"effective_date_range": The date range this clause applies to, if stated.

would produce chunks with the standard fields plus `lease_clause_type` and `effective_date_range` as additional keys.
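
To illustrate how field declarations of this form might be recognized, the sketch below pulls quoted lowercase snake_case names followed by a colon out of an instruction string. The regular expression and the `declared_fields` helper are assumptions for illustration; the pipeline's actual parsing is not described here.

```python
import re
from typing import List

# A quoted lowercase snake_case identifier followed by a colon.
FIELD_PATTERN = re.compile(r'"([a-z][a-z0-9]*(?:_[a-z0-9]+)*)"\s*:')


def declared_fields(instructions: str) -> List[str]:
    """Return declared field names in order of first appearance."""
    names: List[str] = []
    for name in FIELD_PATTERN.findall(instructions):
        if name not in names:
            names.append(name)
    return names
```

Names that are not lowercase snake_case, such as `"LeaseType":`, are skipped by this pattern, which mirrors the naming requirement stated earlier.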
