Inspect Source Data

Data inspection lets you explore the structure and content of a data source before you configure ingestion. Use these tools to choose configuration settings and to verify data quality.

Inspect Source Content Hierarchy

You can explore the tree structure of file-based and hierarchical data sources to understand their organization. This is particularly useful for sources like S3, FTP, and file systems where data is organized in directories and subdirectories.

Tree Structure Endpoint

Inspect Source Content Hierarchy: Request
POST /data_sources/{source_id}/probe/tree

Example Request Body:

{
  "region": "us-west-1",
  "bucket": "production-data-bucket",
  "prefix": "events/",
  "depth": 3
}
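
As a minimal sketch of how this request might be assembled from Python, the snippet below builds the POST request without sending it. The base URL, the bearer-token header, and the source ID `1234` are assumptions for illustration; only the endpoint path and the body fields come from the example above.

```python
import json
import urllib.request

# Assumed API host; replace with the host of your own deployment.
BASE_URL = "https://api.example.com"


def build_tree_probe_request(source_id, body, token):
    """Build (but do not send) a POST request for the tree probe endpoint."""
    url = f"{BASE_URL}/data_sources/{source_id}/probe/tree"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumed authentication scheme; check your API's requirements.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_tree_probe_request(
    source_id=1234,  # hypothetical source ID
    body={
        "region": "us-west-1",
        "bucket": "production-data-bucket",
        "prefix": "events/",
        "depth": 3,
    },
    token="YOUR_API_TOKEN",
)

# To actually send it: urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```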

Response Structure

The tree endpoint returns a hierarchical view of your data organization:

Inspect Source Content Hierarchy: Response
{
  "status": "ok",
  "output": {
    "events": {
      "2023": {
        "01": {
          "15": {},
          "16": {},
          "17": {}
        },
        "02": {
          "01": {},
          "02": {},
          "03": {}
        }
      },
      "2024": {
        "01": {
          "01": {},
          "02": {}
        }
      }
    }
  }
}
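
One way to work with this response is to flatten the nested object into a list of directory paths, which can then be used as candidate prefixes. The sketch below assumes that an empty object marks a leaf at the requested depth; the function name `leaf_paths` is hypothetical.

```python
def leaf_paths(tree, prefix=""):
    """Yield the path of every leaf (empty object) in a nested tree response."""
    for name, children in tree.items():
        path = f"{prefix}{name}/"
        if children:
            # Non-empty object: descend into the subdirectory.
            yield from leaf_paths(children, path)
        else:
            # Empty object: treated here as a leaf directory.
            yield path


response = {
    "status": "ok",
    "output": {
        "events": {
            "2023": {
                "01": {"15": {}, "16": {}, "17": {}},
                "02": {"01": {}, "02": {}, "03": {}},
            },
            "2024": {"01": {"01": {}, "02": {}}},
        }
    },
}

paths = list(leaf_paths(response["output"]))
for path in paths:
    print(path)
```

For the sample response this produces eight paths, from `events/2023/01/15/` through `events/2024/01/02/`.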

Tree Inspection Benefits

Exploring the tree structure helps you:

  • Understand Organization: See how data is organized in your source
  • Plan Ingestion: Determine optimal prefix and path configurations
  • Estimate Volume: Gauge how much data is available from the number of directories and partitions
  • Identify Patterns: Recognize naming conventions and structures

Inspect Sample File Content

You can examine individual files to understand their format, structure, and content quality. This is essential for configuring proper data processing and transformation.

File Content Endpoint

Inspect File Content: Request
POST /data_sources/{source_id}/probe/files

Example Request Body:

{
  "path": "events/2023/01/15/customer_data.json"
}

File Path Requirements

The file path must:

  • Start from Source Root: Begin at the location specified in the source configuration
  • Be Accessible: Exist and be readable by the source credentials
  • Match Patterns: Conform to any file pattern filters in source_config

Response Structure

Inspect File Content: Response
{
  "status": 200,
  "message": "Ok",
  "output": {
    "format": "json",
    "size": 2048,
    "last_modified": "2023-01-15T10:30:00.000Z",
    "messages": [
      {
        "customer_id": "C001",
        "transaction_date": "2023-01-15",
        "amount": 150.75,
        "product": "Widget A",
        "category": "Electronics"
      },
      {
        "customer_id": "C002",
        "transaction_date": "2023-01-15",
        "amount": 89.99,
        "product": "Widget B",
        "category": "Home & Garden"
      }
    ]
  },
  "connection_type": "s3"
}

Content Analysis Benefits

File inspection provides valuable insights:

  • Data Format: Identify file types and encoding
  • Schema Discovery: Understand data structure and field types
  • Quality Assessment: Evaluate data completeness and consistency
  • Processing Requirements: Determine transformation needs
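
As a rough illustration of schema discovery from a file probe, the sketch below summarizes which Python types appear for each field in the `messages` array of the sample response. The helper name `summarize_fields` is hypothetical, and the types are those produced by JSON parsing, not the source system's own types.

```python
def summarize_fields(messages):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = {}
    for record in messages:
        for key, value in record.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Sample records copied from the file content response above.
messages = [
    {
        "customer_id": "C001",
        "transaction_date": "2023-01-15",
        "amount": 150.75,
        "product": "Widget A",
        "category": "Electronics",
    },
    {
        "customer_id": "C002",
        "transaction_date": "2023-01-15",
        "amount": 89.99,
        "product": "Widget B",
        "category": "Home & Garden",
    },
]

summary = summarize_fields(messages)
for field, types in summary.items():
    print(field, types)
```

Note that `transaction_date` shows up as a string here, which suggests that date fields may need explicit typing or transformation during ingestion. A field that reports more than one type is a candidate for a consistency check.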

Schema Detection

Automatic schema detection helps you understand the structure of your data without manual inspection.

Schema Detection Endpoint

Detect Schema: Request
POST /data_sources/{source_id}/probe/schema

Example Request Body:

{
  "sample_size": 100,
  "file_pattern": "*.csv"
}

Schema Response

Detect Schema: Response
{
  "status": "ok",
  "schema": {
    "fields": [
      {
        "name": "customer_id",
        "type": "string",
        "nullable": false,
        "sample_values": ["C001", "C002", "C003"]
      },
      {
        "name": "transaction_date",
        "type": "date",
        "nullable": false,
        "sample_values": ["2023-01-15", "2023-01-16"]
      },
      {
        "name": "amount",
        "type": "decimal",
        "nullable": true,
        "sample_values": [150.75, 89.99, 200.00]
      }
    ],
    "total_records": 1000,
    "file_count": 15
  }
}

Data Quality Assessment

Evaluate the quality and characteristics of your data sources:

Quality Metrics

  • Completeness: Percentage of non-null values in each field
  • Consistency: Uniformity of data formats and values
  • Accuracy: Validation of data against expected patterns
  • Timeliness: Freshness of data and update frequency
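
To make the completeness metric concrete, the sketch below computes the percentage of non-null values per field over a list of sample records. It is a local illustration only and does not describe how the quality endpoint computes its results; the function name and the sample records are made up for the example.

```python
def completeness(records, fields):
    """Return the percentage of non-null values for each field."""
    total = len(records)
    result = {}
    for field in fields:
        # A missing key and an explicit null both count as incomplete.
        non_null = sum(1 for record in records if record.get(field) is not None)
        result[field] = 100.0 * non_null / total if total else 0.0
    return result


# Illustrative records, not real data.
records = [
    {"customer_id": "C001", "amount": 150.75},
    {"customer_id": "C002", "amount": None},
    {"customer_id": "C003"},
    {"customer_id": "C004", "amount": 200.0},
]

scores = completeness(records, ["customer_id", "amount"])
print(scores)
```

With these four records, `customer_id` is fully populated while `amount` is present in only two of them.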

Quality Endpoint

Assess Data Quality: Request
POST /data_sources/{source_id}/probe/quality

Example Request Body:

{
  "sample_size": 1000,
  "quality_checks": ["completeness", "consistency", "format"]
}

Configuration Optimization

Use inspection results to optimize your source configuration:

Path and Pattern Optimization

Based on tree inspection:

{
  "source_config": {
    "prefix": "events/2023/",
    "file_pattern": "*.json",
    "exclude_patterns": ["*.tmp", "*.backup"]
  }
}
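
Before applying a configuration like this, it can help to preview which files it would select. The sketch below approximates the filters with shell-style globbing from Python's `fnmatch` module, applied to the file name only. The platform's actual matching rules are not specified here, so treat the semantics as an assumption; the sample paths are hypothetical.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def select_files(paths, prefix, file_pattern, exclude_patterns=()):
    """Return the paths under prefix whose file name matches file_pattern
    and does not match any of exclude_patterns."""
    selected = []
    for path in paths:
        name = basename(path)
        if not path.startswith(prefix):
            continue
        if any(fnmatchcase(name, pattern) for pattern in exclude_patterns):
            continue
        if fnmatchcase(name, file_pattern):
            selected.append(path)
    return selected


# Hypothetical file listing for illustration.
paths = [
    "events/2023/01/15/customer_data.json",
    "events/2023/01/15/customer_data.json.tmp",
    "events/2023/01/16/orders.backup",
    "events/2024/01/01/customer_data.json",
]

selected = select_files(
    paths,
    prefix="events/2023/",
    file_pattern="*.json",
    exclude_patterns=["*.tmp", "*.backup"],
)
print(selected)
```

Only the first path survives: the second and third match an exclude pattern, and the fourth falls outside the prefix.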

Schema-based Configuration

Based on schema detection:

{
  "source_config": {
    "schema_detection": "auto",
    "field_mapping": {
      "customer_id": "string",
      "transaction_date": "date",
      "amount": "decimal"
    }
  }
}
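
The field mapping can be derived from a schema detection response instead of being typed by hand. The sketch below assumes the response shape shown in the Schema Response section and builds a configuration in the same shape as the example above; the helper name is hypothetical.

```python
def field_mapping_from_schema(schema_response):
    """Build a source_config fragment from a schema detection response."""
    fields = schema_response["schema"]["fields"]
    return {
        "source_config": {
            "schema_detection": "auto",
            "field_mapping": {field["name"]: field["type"] for field in fields},
        }
    }


# Trimmed copy of the schema detection response shown earlier.
schema_response = {
    "status": "ok",
    "schema": {
        "fields": [
            {"name": "customer_id", "type": "string", "nullable": False},
            {"name": "transaction_date", "type": "date", "nullable": False},
            {"name": "amount", "type": "decimal", "nullable": True},
        ],
        "total_records": 1000,
        "file_count": 15,
    },
}

config = field_mapping_from_schema(schema_response)
print(config)
```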

Best Practices

To maximize the value of data inspection:

  1. Start with Tree View: Understand the overall data organization first
  2. Sample Multiple Files: Examine different file types and sizes
  3. Validate Assumptions: Confirm your understanding of data structure
  4. Document Findings: Keep records of discovered patterns and issues
  5. Test Configurations: Use inspection results to test source configurations
  6. Monitor Changes: Regularly inspect data as sources evolve

Error Handling

Common inspection errors and solutions:

  • Access Denied: Verify credentials and permissions
  • Invalid Paths: Ensure file paths match source configuration
  • Large Files: Expect timeouts on very large files; retry the request or probe a smaller file where possible
  • Unsupported Formats: Check connector support for file types
  • Rate Limiting: Respect API rate limits for external sources
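
Timeouts and rate limiting are often transient, so a client can retry with exponential backoff. The sketch below is a generic pattern, not behavior documented for this API: `RetryableError` is a hypothetical exception that your own probe function would raise on a timeout or rate-limit response, and the `sleep` parameter exists only so the example runs without real delays.

```python
import time


class RetryableError(Exception):
    """Hypothetical error raised by a probe call on a timeout or rate limit."""


def call_with_backoff(probe, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call probe(), retrying on RetryableError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return probe()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the error to the caller.
            sleep(base_delay * 2 ** attempt)


# Demonstration with a fake probe that fails twice, then succeeds.
attempts = []
delays = []


def flaky_probe():
    attempts.append(1)
    if len(attempts) < 3:
        raise RetryableError("rate limited")
    return {"status": "ok"}


# Record the delays instead of sleeping so the demo finishes immediately.
result = call_with_backoff(flaky_probe, sleep=delays.append)
print(result, delays)
```

Access-denied and invalid-path errors are not transient, so they should be fixed in the source configuration and not retried.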

After inspecting your data, you may need to update, validate, or test the source configuration:

Update Source Configuration

PUT /data_sources/{source_id}

Validate Configuration

POST /data_sources/{source_id}/config/validate

Test Connection

PUT /data_sources/{source_id}/test