Document Chunking

Before documents can be searched by the Agentic RAG agent, they pass through a chunking pipeline that splits each file into semantically meaningful chunks, enriches them with metadata, and verifies that the generated metadata is grounded in the source text. Chunk quality directly affects retrieval precision and answer quality.

This page describes how the pipeline processes documents, what each chunk contains, and what you can configure.

Supported File Types

| File Type | Extensions | Processing Approach |
| --- | --- | --- |
| PDF | .pdf | Text extraction via pdfplumber. Pages with sparse text but rich visual content (charts, scanned pages) are rendered to images and processed with Gemini vision OCR |
| CSV | .csv | Parsed into a DataFrame. Rows are grouped into blocks and converted from tabular format into readable prose chunks |
| Excel | .xlsx, .xls | Each sheet is processed independently. Row blocks are chunked with tabular-specific prompts, preserving sheet context |
| Image | .png, .jpg, .jpeg | Sent directly to Gemini vision. Returns a single chunk with the extracted content |

Note: For PDFs, the pipeline profiles each page to determine whether it contains primarily text or visual content (charts, diagrams, scanned text). Pages that fall below a configurable text threshold (default: 200 characters) are processed via vision OCR at 300 DPI, in batches of up to 4 pages per call.
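
The page profiling described above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's code: the function name route_pages is hypothetical, and it assumes the threshold is compared against the stripped character count of each page's extracted text.

```python
from typing import List, Tuple

TEXT_THRESHOLD_CHARS = 200  # default text threshold from the note above
VISION_BATCH_SIZE = 4       # maximum pages per vision OCR call


def route_pages(page_texts: List[str]) -> Tuple[List[int], List[List[int]]]:
    """Split page indices into text pages and batches of vision-OCR pages."""
    text_pages: List[int] = []
    vision_pages: List[int] = []
    for index, text in enumerate(page_texts):
        # Assumption: sparse pages are those with fewer extracted characters
        # than the threshold once surrounding whitespace is removed.
        if len(text.strip()) < TEXT_THRESHOLD_CHARS:
            vision_pages.append(index)
        else:
            text_pages.append(index)
    # Group sparse pages into batches of at most VISION_BATCH_SIZE.
    batches = [
        vision_pages[i:i + VISION_BATCH_SIZE]
        for i in range(0, len(vision_pages), VISION_BATCH_SIZE)
    ]
    return text_pages, batches
```

With one dense page followed by five sparse pages, the sparse pages would be sent for OCR in two calls: one with four pages and one with the remaining page.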

Chunking Strategies

The pipeline supports two chunking strategies. Heading-driven chunking is the default; density-driven chunking activates automatically as a fallback.

Heading-Driven Chunking

The primary strategy. It detects headings throughout the document, classifies them by type and level, and uses heading boundaries to split content into structurally coherent chunks.

How it works:

  1. Heading detection — An LLM pass identifies headings with their text, hierarchy level (1–6), type (section, page, bridge, global, footer), and content type
  2. Section building — Content between heading boundaries becomes a section
  3. Section merging — Single-line sections and sections below the minimum size (default: 400 characters) are merged with adjacent sections
  4. Size carving — Sections exceeding the maximum size (default: 1,800 characters) are split at natural break points. Up to 4 passes handle outliers
  5. Parent-child linking — A heading hierarchy is built so child chunks carry their parent section title for context
  6. Metadata enrichment — Each chunk receives LLM-generated metadata (topic, summary, content type, potential questions)
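
Steps 3 and 4 above can be illustrated with a simplified sketch. It is an assumption-laden approximation: merge_small and carve are hypothetical names, merging always folds into the preceding section, "natural break points" are taken to be paragraph or sentence boundaries, and carving runs in a single greedy pass rather than the multi-pass approach the pipeline uses.

```python
from typing import List

MIN_CHARS = 400   # default minimum section size
MAX_CHARS = 1800  # default maximum section size


def merge_small(sections: List[str], min_chars: int = MIN_CHARS) -> List[str]:
    """Fold undersized sections into a neighbour so no fragment stands alone."""
    merged: List[str] = []
    for section in sections:
        # Merge when this section or the previous one is still too small.
        if merged and (len(section) < min_chars or len(merged[-1]) < min_chars):
            merged[-1] = merged[-1] + "\n\n" + section
        else:
            merged.append(section)
    return merged


def carve(section: str, max_chars: int = MAX_CHARS) -> List[str]:
    """Split an oversized section at the last paragraph or sentence break."""
    pieces: List[str] = []
    rest = section
    while len(rest) > max_chars:
        window = rest[:max_chars]
        # Latest paragraph break, or the position just after a sentence end.
        cut = max(window.rfind("\n\n"), window.rfind(". ") + 1)
        if cut <= 0:
            cut = max_chars  # no natural break found: fall back to a hard cut
        pieces.append(rest[:cut].strip())
        rest = rest[cut:].strip()
    if rest:
        pieces.append(rest)
    return pieces
```

Merging before carving keeps tiny heading-only fragments from becoming standalone chunks, while carving keeps every resulting piece at or below the maximum size.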

Density-Driven Chunking

The fallback strategy. It activates when a document block has fewer than 2 boundary headings (section, page, or bridge type), making structural chunking unreliable.

How it works:

  1. Page grouping — Pages are grouped into blocks by estimated token count, using median absolute deviation (MAD) thresholding to determine target block size
  2. LLM chunking — Each block is sent to the LLM with a chunking prompt that produces semantically coherent chunks
  3. Timeout recovery — If a block times out and exceeds 4,000 characters, it is recursively split and retried (max depth: 3)
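
The sketch below illustrates two pieces of this strategy: the standard definition of median absolute deviation, and a recursive split-and-retry on timeout. How the pipeline turns MAD into a target block size is not specified here, so the sketch stops at computing it. The names mad and chunk_with_recovery, the chunk_block callable, and the choice to split near the midpoint are all assumptions made for illustration.

```python
import statistics
from typing import Callable, List


def mad(values: List[float]) -> float:
    """Median absolute deviation: the median of |x - median(x)|."""
    center = statistics.median(values)
    return statistics.median([abs(v - center) for v in values])


def chunk_with_recovery(
    block: str,
    chunk_block: Callable[[str], List[str]],  # hypothetical LLM chunking call
    depth: int = 0,
    max_depth: int = 3,
    split_threshold: int = 4000,
) -> List[str]:
    """Chunk a block; on timeout, split large blocks in two and retry."""
    try:
        return chunk_block(block)
    except TimeoutError:
        # Small blocks and blocks at the depth limit are not split further.
        if len(block) <= split_threshold or depth >= max_depth:
            raise
        middle = len(block) // 2
        # Assumption: prefer a paragraph break before the midpoint.
        cut = block.rfind("\n\n", 0, middle)
        if cut <= 0:
            cut = middle
        left = chunk_with_recovery(block[:cut], chunk_block, depth + 1, max_depth, split_threshold)
        right = chunk_with_recovery(block[cut:], chunk_block, depth + 1, max_depth, split_threshold)
        return left + right
```

MAD is less sensitive to outliers than standard deviation, which is a plausible reason to use it when a few very long pages would otherwise skew the block size.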

Chunk Schema

Every chunk produced by the pipeline contains these fields:

Core Fields

| Field | Type | Description |
| --- | --- | --- |
| chunk_number | string | Sequential chunk number within the document, starting from "1" |
| page_number | string | Source page number (PDF only). Derived from page markers in the extracted text |
| sheet_name | string | Source sheet name (Excel only) |
| chunk_text | string | The chunk content as well-structured prose. Tables and lists are converted to flowing text with all data preserved |
| source_info | string | Source filename |

Metadata Fields

| Field | Type | Description |
| --- | --- | --- |
| topic | string | Concise topic title (3–10 words), grounded in the chunk text |
| summary | string | 1–2 sentence summary, traceable to the chunk text |
| section | string | Parent section heading. For child chunks in a heading hierarchy, this is the enclosing section title |
| sub_topics | string | Comma-separated list of sub-topics, or empty |
| universal | string | Pipe-separated universal headings (e.g., "Acme Corp \| 2024-03-15 \| Confidential"). These are headings that apply across the entire document |
| section_reference | string | Exact legal or structural citation found in the text (e.g., "Section 3(c)(ii)", "Article 12.4", "Exhibit G"), or empty |
| content_type | string | Dominant content type of the chunk. See Content Types |
| potential_questions | array | 3–5 specific questions answerable from this chunk's text alone |
| customer_specific_tags | array | Custom tags derived from domain-specific instructions, or empty array |

Document-Level Fields

These fields are populated once at the document level and attached to every chunk:

| Field | Type | Description |
| --- | --- | --- |
| freshness_date | string | Most relevant date in the document, normalized to YYYY-MM-DD. Supports ISO dates, "DD Month YYYY", "Month YYYY", quarter formats (Q1 2024, 2024-Q2, 1Q24), and year-only |
| freshness_date_unix | integer | Unix timestamp of freshness_date |
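
To make the supported date formats concrete, here is a hedged sketch of one way such normalization could work. Several choices are assumptions not stated on this page: partial dates (month, quarter, year) resolve to the first day of the period, two-digit years in forms like 1Q24 map to 20YY, and the Unix timestamp is taken at midnight UTC. The function name normalize_freshness is hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import Optional, Tuple

MONTHS = {name: number for number, name in enumerate(
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"], start=1)}


def normalize_freshness(raw: str) -> Optional[Tuple[str, int]]:
    """Return (YYYY-MM-DD, unix timestamp) or None if the format is unknown."""
    text = raw.strip()
    year = month = day = None
    if m := re.fullmatch(r"(\d{4})-(\d{2})-(\d{2})", text):        # ISO date
        year, month, day = map(int, m.groups())
    elif m := re.fullmatch(r"(\d{1,2}) ([A-Za-z]+) (\d{4})", text):  # DD Month YYYY
        day, month, year = int(m[1]), MONTHS.get(m[2].lower()), int(m[3])
    elif m := re.fullmatch(r"([A-Za-z]+) (\d{4})", text):           # Month YYYY
        month, year, day = MONTHS.get(m[1].lower()), int(m[2]), 1
    elif m := re.fullmatch(r"Q([1-4]) (\d{4})", text):              # Q1 2024
        year, month, day = int(m[2]), 3 * int(m[1]) - 2, 1
    elif m := re.fullmatch(r"(\d{4})-Q([1-4])", text):              # 2024-Q2
        year, month, day = int(m[1]), 3 * int(m[2]) - 2, 1
    elif m := re.fullmatch(r"([1-4])Q(\d{2})", text):               # 1Q24
        year, month, day = 2000 + int(m[2]), 3 * int(m[1]) - 2, 1
    elif m := re.fullmatch(r"(\d{4})", text):                       # year only
        year, month, day = int(m[1]), 1, 1
    if not (year and month and day):
        return None
    try:
        moment = datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:  # e.g. "31 February 2024"
        return None
    return moment.strftime("%Y-%m-%d"), int(moment.timestamp())
```

Under these assumptions, "Q1 2024" and "1Q24" both normalize to 2024-01-01, and "2024-Q2" to 2024-04-01.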

Structured Data Fields

| Field | Type | Description |
| --- | --- | --- |
| structured_tables | string | JSON-encoded array of extracted tables (see Structured Table Extraction). Empty string when no tables are extracted |

Debug Fields

| Field | Type | Description |
| --- | --- | --- |
| debug_info | object | Diagnostic metadata. Contains chunker_result ("Success" or "Fail"), file_type, model, and llm_stats with token counts and page-level statistics |

debug_info.llm_stats

| Field | Type | Description |
| --- | --- | --- |
| input_tokens_per_file | integer | Total input tokens consumed across all LLM calls for this file |
| output_tokens_per_file | integer | Total output tokens generated |
| total_tokens_per_file | integer | Sum of input and output tokens |
| total_llm_calls_per_file | integer | Number of LLM API calls made |
| total_pages | integer | Total pages in the source document (PDF only) |
| pages_with_chunks | integer | Pages that produced at least one chunk |
| pages_with_zero_chunks | integer | Pages that produced no chunks (e.g., blank pages, headers-only pages) |

Content Types

The content_type field classifies each chunk's dominant content:

| Value | Description |
| --- | --- |
| narrative | Explanatory prose, background information, descriptions |
| metrics | Quantitative data, tables, numerical figures, financial data |
| instructions | Step-by-step procedures, how-to content |
| comparisons | Parallel structures comparing items, side-by-side analyses |
| legal | Disclaimers, clauses, formal legal language, terms and conditions |
| concept | Term definitions, glossary entries, conceptual explanations |
| other | Content that does not fit the categories above |

Metadata Enrichment

Chunks are enriched through a multi-stage pipeline:

  1. Document metadata — The LLM extracts the file title, freshness date, and a document-level summary from the first pages of the document. Custom instructions can declare additional document-level fields
  2. Structure planning — A document plan is generated from a sample of pages, identifying the document type, heading hierarchy depth, table density, figure count, and whether scanned pages are suspected
  3. Chunk-level metadata — After chunking, each chunk is sent through a batch metadata enrichment pass (up to 6 chunks per LLM call) that generates the topic, summary, sub_topics, section_reference, content_type, potential_questions, and customer_specific_tags fields
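
The batch enrichment in step 3 might look roughly like this. The enrich_batch callable stands in for the LLM call and is purely hypothetical, as is the assumption that it returns one metadata dictionary per input chunk, in order.

```python
from typing import Callable, Dict, List

METADATA_BATCH_SIZE = 6  # up to 6 chunks per LLM call


def enrich_chunks(
    chunks: List[Dict],
    enrich_batch: Callable[[List[str]], List[Dict]],  # hypothetical LLM call
) -> List[Dict]:
    """Attach generated metadata to each chunk, six chunks per call."""
    enriched: List[Dict] = []
    for start in range(0, len(chunks), METADATA_BATCH_SIZE):
        batch = chunks[start:start + METADATA_BATCH_SIZE]
        metadata = enrich_batch([chunk["chunk_text"] for chunk in batch])
        if len(metadata) != len(batch):
            raise ValueError("expected one metadata dict per chunk")
        for chunk, meta in zip(batch, metadata):
            # Metadata keys (topic, summary, ...) are merged into the chunk.
            enriched.append({**chunk, **meta})
    return enriched
```

Batching trades a small amount of per-chunk isolation for fewer LLM calls: 14 chunks need three calls instead of 14.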

Groundedness Verification

When enabled, the pipeline verifies that chunk metadata is faithful to the source text. This catches hallucinated summaries, inaccurate topics, or questions that cannot actually be answered from the chunk.

How it works:

  1. Sampling — A deterministic sample of chunks is selected (default: 35% of chunks, minimum 6)
  2. Verification — Each sampled chunk is evaluated by the LLM, which checks whether the summary, topic, content type, and potential questions are grounded in the chunk text. Each chunk receives a severity rating: ok, minor, or major
  3. Drift assessment — The fraction of major failures in the sample is calculated. If drift is below 5%, no retry is needed. If drift exceeds 20% and major failures are present, the retry loop activates
  4. Retry loop — Failed chunks are re-sent for metadata generation with the specific issues flagged. The loop stops when drift improves, or when guardrails are hit:
    • Maximum 30 chunks retried per document
    • Maximum $0.10 USD cost for the verify-and-retry cycle
    • Maximum 180 seconds wall-clock time
    • Monotonic improvement required (no improvement = stop)

Tip: Verification adds latency and cost in exchange for more reliable metadata. For high-stakes document collections where answer accuracy is critical, keep it enabled (the default).
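
The sampling, drift, and guardrail logic can be sketched as below. This is an illustration under assumptions: the pipeline's actual sampling method is not documented here, so a stable hash ranking is used to make the sample deterministic, the sample size is rounded up, and behavior for drift between 5% and 20% is left out because this page does not define it. All names are hypothetical.

```python
import hashlib
import math
from dataclasses import dataclass
from typing import List


def sample_chunks(chunk_ids: List[str], rate: float = 0.35, minimum: int = 6) -> List[str]:
    """Pick the same chunks on every run by ranking IDs on a stable hash."""
    size = min(len(chunk_ids), max(minimum, math.ceil(rate * len(chunk_ids))))
    ranked = sorted(chunk_ids, key=lambda cid: hashlib.sha256(cid.encode()).hexdigest())
    return ranked[:size]


def drift(severities: List[str]) -> float:
    """Fraction of sampled chunks rated 'major'."""
    return severities.count("major") / len(severities) if severities else 0.0


def should_retry(severities: List[str]) -> bool:
    """Retry only when drift exceeds 20% and major failures are present."""
    return drift(severities) > 0.20 and "major" in severities


@dataclass(frozen=True)
class RetryGuardrails:
    max_chunks: int = 30
    max_cost_usd: float = 0.10
    max_seconds: float = 180.0

    def allows(self, retried: int, cost_usd: float, elapsed_seconds: float,
               previous_drift: float, current_drift: float) -> bool:
        """Another retry round is allowed only within every limit."""
        return (retried < self.max_chunks
                and cost_usd < self.max_cost_usd
                and elapsed_seconds < self.max_seconds
                and current_drift < previous_drift)  # improvement required
```

Hash-based ranking gives the same sample regardless of input order, which keeps repeated runs over the same document comparable.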

Structured Table Extraction

For chunks classified as metrics or comparisons, the pipeline can extract structured table data as machine-readable JSON.

How it works:

  1. Candidate selection — Chunks with content_type of metrics or comparisons are identified
  2. Batch extraction — Candidate chunks are sent to the LLM in batches of 4, which extracts tables as JSON objects with title, headers, rows, and notes
  3. Validation — Each extracted cell value is checked against the source chunk text. Tables where more than 30% of cells are not found in the source are dropped. Tables with 10–30% ungrounded cells are flagged
  4. Cross-chunk bridging — When a table spans multiple chunks, the pipeline matches tables across chunk boundaries by header similarity (threshold: 75%). Matched continuation rows are merged into the parent table with provenance annotations
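
The validation and bridging thresholds above can be illustrated with a small sketch. Two things here are assumptions rather than documented behavior: a cell counts as grounded when it appears as an exact substring of the chunk text, and header similarity is measured with difflib's SequenceMatcher ratio. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def ungrounded_fraction(table: Dict, chunk_text: str) -> float:
    """Share of table cells that do not appear in the source chunk text."""
    cells = [cell for row in table["rows"] for cell in row]
    if not cells:
        return 0.0
    missing = sum(1 for cell in cells if cell not in chunk_text)
    return missing / len(cells)


def classify_table(table: Dict, chunk_text: str) -> str:
    """Apply the thresholds: above 30% dropped, 10-30% flagged, else kept."""
    fraction = ungrounded_fraction(table, chunk_text)
    if fraction > 0.30:
        return "dropped"
    if fraction >= 0.10:
        return "flagged"
    return "kept"


def headers_match(a: List[str], b: List[str], threshold: float = 0.75) -> bool:
    """Treat two tables as one when their header rows are similar enough."""
    left = " | ".join(a).lower()
    right = " | ".join(b).lower()
    return SequenceMatcher(None, left, right).ratio() >= threshold
```

Exact substring matching is strict: a cell reformatted by the LLM (for example "1.8M" rendered as "1,800,000") would count as ungrounded under this sketch.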

The result is stored as a JSON string in the structured_tables field. Each table object has this shape:

{
  "title": "Q2 2024 Revenue by Region",
  "headers": ["Region", "Revenue", "Growth"],
  "rows": [
    ["West", "$1.8M", "18%"],
    ["East", "$1.2M", "8%"]
  ],
  "notes": "Source: Q2 Quarterly Report, p. 7"
}
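
Because structured_tables is a JSON string rather than a nested object, consumers need to decode it before use. A minimal sketch, with hypothetical helper names load_tables and rows_as_records:

```python
import json
from typing import Dict, List


def load_tables(chunk: Dict) -> List[Dict]:
    """Decode structured_tables; an empty string means no tables."""
    raw = chunk.get("structured_tables", "")
    return json.loads(raw) if raw else []


def rows_as_records(table: Dict) -> List[Dict[str, str]]:
    """Pair each row's cells with the table headers."""
    return [dict(zip(table["headers"], row)) for row in table["rows"]]
```

For the example table above, the first record would map Region to "West", Revenue to "$1.8M", and Growth to "18%".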

Configuration

The pipeline is configured through environment variables. All settings have sensible defaults.

Strategy and Model

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| GEMINI_MODEL | string | gemini-2.5-flash | Primary Gemini model used for chunking, metadata, and verification |
| CHUNKER_USE_HEADING_DRIVEN | boolean | true | Enable heading-driven chunking as the primary strategy. When false, all documents use density-driven chunking |

Chunk Sizing

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| CHUNK_MAX_CHARS | integer | 1800 | Sections larger than this (in characters) are split via size-aware carving |
| CHUNK_MIN_CHARS | integer | 400 | Minimum chunk size. Fragments smaller than this are merged with adjacent content |

Verification

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| CHUNKER_VERIFY_CHUNKS | boolean | true | Enable groundedness verification and retry loop |

Structured Tables

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| CHUNKER_STRUCTURED_TABLES | boolean | true | Enable structured table extraction for metrics and comparisons chunks |

Note: Boolean environment variables accept 1, true, or yes (case-insensitive) as truthy values. Any other value is treated as false.
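
The boolean parsing rule can be expressed in a few lines. This sketch assumes an unset variable falls back to the documented default and that values are compared without trimming whitespace; the helper name env_bool is hypothetical.

```python
import os

TRUTHY_VALUES = {"1", "true", "yes"}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean setting; only 1, true, or yes count as true."""
    raw = os.environ.get(name)
    if raw is None:
        return default  # assumption: unset variables use the default
    return raw.lower() in TRUTHY_VALUES
```

Note that under this rule common spellings such as "on" or "enabled" evaluate to false, so a typo silently disables a feature rather than raising an error.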

Custom Instructions

You can pass domain-specific instructions to customize how documents are chunked and what metadata is extracted. Custom instructions are injected into the LLM prompts at every stage — document metadata extraction, chunking, metadata enrichment, and verification.

What You Can Do

  • Guide chunking behavior — Provide context about the document domain (e.g., "These are pharmaceutical trial reports; preserve dosage tables verbatim")
  • Declare custom per-chunk fields — Define additional fields to extract by specifying them with a "field_name": syntax. Field names must be lowercase snake_case
  • Declare custom document-level fields — Add fields to the document metadata stage (title, dates, custom identifiers)

Instruction Precedence

When multiple instruction sources are available, the pipeline uses the first non-empty source:

  1. extra_data["instructions"] passed in the API call
  2. customInstructions parameter (legacy)
  3. Deployment-level inline instructions (baked into the deployment)
  4. No custom instructions (generic mode)
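
The precedence order amounts to a first-non-empty lookup. In this sketch the function and parameter names are hypothetical, and treating whitespace-only strings as empty is an assumption.

```python
from typing import Dict, Optional


def resolve_instructions(
    extra_data: Optional[Dict[str, str]] = None,
    custom_instructions: Optional[str] = None,      # legacy parameter
    deployment_instructions: Optional[str] = None,  # baked into the deployment
) -> str:
    """Return the first non-empty instruction source, or '' for generic mode."""
    candidates = [
        (extra_data or {}).get("instructions"),
        custom_instructions,
        deployment_instructions,
    ]
    for candidate in candidates:
        if candidate and candidate.strip():
            return candidate
    return ""
```

Sources are not merged: if extra_data["instructions"] is set, the legacy parameter and the deployment-level instructions are ignored entirely.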

Example

For example, custom instructions like these:

    These documents are commercial real estate leases. Extract the following per-chunk fields:
    "lease_clause_type": The type of lease clause (rent, maintenance, termination, renewal, etc.)
    "effective_date_range": The date range this clause applies to, if stated.

would produce chunks with the standard fields plus lease_clause_type and effective_date_range as additional keys.
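
To check which custom fields a set of instructions declares, something like the following could be used before sending them. The regular expressions are assumptions about the "field_name": syntax, not the pipeline's actual parser, and declared_fields is a hypothetical helper.

```python
import re
from typing import List

# Assumption: a declaration is a double-quoted name immediately followed by a colon.
DECLARATION = re.compile(r'"([^"\s]+)":')
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(?:_[a-z0-9]+)*")


def declared_fields(instructions: str) -> List[str]:
    """Return declared field names that are valid lowercase snake_case."""
    names = DECLARATION.findall(instructions)
    return [name for name in names if SNAKE_CASE.fullmatch(name)]
```

Running this over the lease instructions in the example finds lease_clause_type and effective_date_range; a name such as "LeaseType" would be skipped because it is not lowercase snake_case.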