Inspect Source Data
Data inspection lets you explore the structure and content of a data source before you configure ingestion. Use the inspection endpoints to choose configuration settings and verify data quality.
Inspect Source Content Hierarchy
You can explore the tree structure of file-based and hierarchical data sources to understand their organization. This is particularly useful for sources like S3, FTP, and file systems where data is organized in directories and subdirectories.
Tree Structure Endpoint
POST /data_sources/{source_id}/probe/tree
Example Request Body:
{
  "region": "us-west-1",
  "bucket": "production-data-bucket",
  "prefix": "events/",
  "depth": 3
}
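As a rough sketch of how a client might build this request in Python: the base URL, the bearer-token header, and the source ID 1234 are placeholders assumed for illustration, not values confirmed by this page. The helper only constructs the request; it does not send it.

```python
import json
import urllib.request

# Assumption: placeholder base URL; substitute the one for your Nexla environment.
BASE_URL = "https://nexla.example.com/api"


def build_probe_request(source_id: int, probe: str, body: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST to /data_sources/{source_id}/probe/{probe}."""
    url = f"{BASE_URL}/data_sources/{source_id}/probe/{probe}"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # Assumption: bearer-token authentication.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


tree_request = build_probe_request(
    source_id=1234,  # hypothetical source ID
    probe="tree",
    body={
        "region": "us-west-1",
        "bucket": "production-data-bucket",
        "prefix": "events/",
        "depth": 3,
    },
    token="YOUR_ACCESS_TOKEN",
)
# To send it: urllib.request.urlopen(tree_request), then json.load() the response.
```

The same helper could be reused for the other probe endpoints by changing the `probe` argument and the body.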
Response Structure
The tree endpoint returns a hierarchical view of your data organization:
{
  "status": "ok",
  "output": {
    "events": {
      "2023": {
        "01": {
          "15": {},
          "16": {},
          "17": {}
        },
        "02": {
          "01": {},
          "02": {},
          "03": {}
        }
      },
      "2024": {
        "01": {
          "01": {},
          "02": {}
        }
      }
    }
  }
}
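One way to work with this response is to flatten the nested object into a list of paths. The sketch below assumes that an empty object marks the deepest level returned for the requested depth (it may still contain data further down); `leaf_paths` is a hypothetical helper, not part of the API.

```python
from typing import Iterator


def leaf_paths(tree: dict, parent: str = "") -> Iterator[str]:
    """Yield the slash-joined path of every empty object in a probe tree."""
    for name, children in tree.items():
        path = f"{parent}/{name}" if parent else name
        if children:
            # Non-empty object: descend into the subdirectories.
            yield from leaf_paths(children, path)
        else:
            # Empty object: deepest level returned by the probe.
            yield path


# The "output" object from the example response above.
output = {
    "events": {
        "2023": {
            "01": {"15": {}, "16": {}, "17": {}},
            "02": {"01": {}, "02": {}, "03": {}},
        },
        "2024": {"01": {"01": {}, "02": {}}},
    }
}

paths = list(leaf_paths(output))
print(len(paths))  # 8
print(paths[0])    # events/2023/01/15
```

The flattened list makes it easier to spot the date-based layout (year/month/day) and to pick a prefix for ingestion.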
Tree Inspection Benefits
Exploring the tree structure helps you:
- Understand Organization: See how data is organized in your source
- Plan Ingestion: Determine optimal prefix and path configurations
- Estimate Volume: Assess the amount of data available
- Identify Patterns: Recognize naming conventions and structures
Inspect Sample File Content
You can examine individual files to understand their format, structure, and content quality. This is essential for configuring proper data processing and transformation.
File Content Endpoint
POST /data_sources/{source_id}/probe/files
Example Request Body:
{
  "path": "events/2023/01/15/customer_data.json"
}
File Path Requirements
The file path must:
- Start from Source Root: Begin at the location specified in the source configuration
- Be Accessible: Exist and be readable with the source credentials
- Match Patterns: Conform to any file pattern filters set in source_config
Response Structure
{
  "status": 200,
  "message": "Ok",
  "output": {
    "format": "json",
    "size": 2048,
    "last_modified": "2023-01-15T10:30:00.000Z",
    "messages": [
      {
        "customer_id": "C001",
        "transaction_date": "2023-01-15",
        "amount": 150.75,
        "product": "Widget A",
        "category": "Electronics"
      },
      {
        "customer_id": "C002",
        "transaction_date": "2023-01-15",
        "amount": 89.99,
        "product": "Widget B",
        "category": "Home & Garden"
      }
    ]
  },
  "connection_type": "s3"
}
Content Analysis Benefits
File inspection provides valuable insights:
- Data Format: Identify file types and encoding
- Schema Discovery: Understand data structure and field types
- Quality Assessment: Evaluate data completeness and consistency
- Processing Requirements: Determine transformation needs
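For a quick client-side look at schema and consistency, you could tally the value types seen for each field in the sampled `messages`. This is a minimal sketch; `observed_field_types` is a hypothetical helper, and the type names are Python's, not the types Nexla reports.

```python
from collections import defaultdict


def observed_field_types(messages: list[dict]) -> dict[str, set[str]]:
    """Map each field name to the set of Python type names seen in the sample."""
    types: dict[str, set[str]] = defaultdict(set)
    for record in messages:
        for field, value in record.items():
            types[field].add(type(value).__name__)
    return dict(types)


# The "messages" array from the example response above.
messages = [
    {"customer_id": "C001", "transaction_date": "2023-01-15", "amount": 150.75,
     "product": "Widget A", "category": "Electronics"},
    {"customer_id": "C002", "transaction_date": "2023-01-15", "amount": 89.99,
     "product": "Widget B", "category": "Home & Garden"},
]

field_types = observed_field_types(messages)
# A field that maps to more than one type name is a candidate for cleanup.
mixed = [name for name, seen in field_types.items() if len(seen) > 1]
```

In this sample every field has a single type, so `mixed` is empty; a field that appears as both `str` and `float` across records would show up there.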
Schema Detection
Automatic schema detection helps you understand the structure of your data without manual inspection.
Schema Detection Endpoint
POST /data_sources/{source_id}/probe/schema
Example Request Body:
{
  "sample_size": 100,
  "file_pattern": "*.csv"
}
Schema Response
{
  "status": "ok",
  "schema": {
    "fields": [
      {
        "name": "customer_id",
        "type": "string",
        "nullable": false,
        "sample_values": ["C001", "C002", "C003"]
      },
      {
        "name": "transaction_date",
        "type": "date",
        "nullable": false,
        "sample_values": ["2023-01-15", "2023-01-16"]
      },
      {
        "name": "amount",
        "type": "decimal",
        "nullable": true,
        "sample_values": [150.75, 89.99, 200.00]
      }
    ],
    "total_records": 1000,
    "file_count": 15
  }
}
Data Quality Assessment
Evaluate the quality and characteristics of your data sources:
Quality Metrics
- Completeness: Percentage of non-null values in each field
- Consistency: Uniformity of data formats and values
- Accuracy: Validation of data against expected patterns
- Timeliness: Freshness of data and update frequency
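To make the completeness metric concrete, the sketch below computes it client-side over a handful of sampled records. The records are invented for illustration, and this shows the definition above rather than how the quality endpoint computes its result.

```python
def completeness(records: list[dict]) -> dict[str, float]:
    """Percentage of records in which each field is present and not None."""
    if not records:
        return {}
    fields = sorted({name for record in records for name in record})
    total = len(records)
    return {
        name: 100.0 * sum(record.get(name) is not None for record in records) / total
        for name in fields
    }


# Hypothetical sample: one null amount and one missing amount.
sample = [
    {"customer_id": "C001", "amount": 150.75},
    {"customer_id": "C002", "amount": None},
    {"customer_id": "C003"},
    {"customer_id": "C004", "amount": 10.0},
]

scores = completeness(sample)
print(scores)  # {'amount': 50.0, 'customer_id': 100.0}
```

Here a missing key and an explicit null are both counted as incomplete; whether those should be treated the same depends on your data.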
Quality Endpoint
POST /data_sources/{source_id}/probe/quality
Example Request Body:
{
  "sample_size": 1000,
  "quality_checks": ["completeness", "consistency", "format"]
}
Configuration Optimization
Use inspection results to optimize your source configuration:
Path and Pattern Optimization
Based on tree inspection, narrow the prefix and file patterns to the data you want to ingest:
{
  "source_config": {
    "prefix": "events/2023/",
    "file_pattern": "*.json",
    "exclude_patterns": ["*.tmp", "*.backup"]
  }
}
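Before applying such a configuration, it can help to check locally which known paths it would select. The sketch below assumes shell-style glob semantics applied to the file name only, and a plain string-prefix match on the full path; the connector's actual matching rules may differ, so treat `is_selected` as a hypothetical approximation.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def is_selected(path: str, source_config: dict) -> bool:
    """Approximate whether a path passes the prefix, include, and exclude filters."""
    name = basename(path)
    if not path.startswith(source_config.get("prefix", "")):
        return False
    if not fnmatchcase(name, source_config.get("file_pattern", "*")):
        return False
    # Assumption: exclude patterns are globs matched against the file name.
    return not any(fnmatchcase(name, pattern)
                   for pattern in source_config.get("exclude_patterns", []))


source_config = {
    "prefix": "events/2023/",
    "file_pattern": "*.json",
    "exclude_patterns": ["*.tmp", "*.backup"],
}

candidates = [
    "events/2023/01/15/customer_data.json",
    "events/2023/01/15/customer_data.json.tmp",
    "events/2023/01/15/orders.csv",
    "events/2024/01/01/customer_data.json",
]
selected = [path for path in candidates if is_selected(path, source_config)]
print(selected)  # ['events/2023/01/15/customer_data.json']
```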
Schema-based Configuration
Based on schema detection, map each detected field to its type:
{
  "source_config": {
    "schema_detection": "auto",
    "field_mapping": {
      "customer_id": "string",
      "transaction_date": "date",
      "amount": "decimal"
    }
  }
}
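The mapping above can be derived mechanically from a schema detection response. A minimal sketch, assuming the response has the shape shown under Schema Response; `field_mapping_from_schema` is a hypothetical helper name.

```python
def field_mapping_from_schema(schema_response: dict) -> dict:
    """Build a source_config fragment from the fields of a schema probe response."""
    fields = schema_response["schema"]["fields"]
    return {
        "source_config": {
            "schema_detection": "auto",
            "field_mapping": {field["name"]: field["type"] for field in fields},
        }
    }


# Trimmed copy of the example schema response.
schema_response = {
    "status": "ok",
    "schema": {
        "fields": [
            {"name": "customer_id", "type": "string", "nullable": False},
            {"name": "transaction_date", "type": "date", "nullable": False},
            {"name": "amount", "type": "decimal", "nullable": True},
        ]
    },
}

config_fragment = field_mapping_from_schema(schema_response)
```

Reviewing the generated mapping by hand is still worthwhile, since types detected from a sample may not hold for the full data set.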
Best Practices
To maximize the value of data inspection:
- Start with Tree View: Understand the overall data organization first
- Sample Multiple Files: Examine different file types and sizes
- Validate Assumptions: Confirm your understanding of data structure
- Document Findings: Keep records of discovered patterns and issues
- Test Configurations: Use inspection results to test source configurations
- Monitor Changes: Regularly inspect data as sources evolve
Error Handling
Common inspection errors and solutions:
- Access Denied: Verify credentials and permissions
- Invalid Paths: Ensure file paths match source configuration
- Large Files: Requests for very large files may time out; handle timeouts in your client
- Unsupported Formats: Check connector support for file types
- Rate Limiting: Respect API rate limits for external sources
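For the timeout and rate-limiting cases, a client can retry with exponential backoff. The sketch below is generic Python, not Nexla-specific: treating HTTP 429 and 503 as retryable is an assumption, and `send` stands for any zero-argument callable that performs the request (for example, a wrapper around `urllib.request.urlopen`).

```python
import time
import urllib.error

# Assumption: these status codes signal rate limiting or temporary unavailability.
RETRYABLE_STATUS = {429, 503}


def call_with_backoff(send, max_attempts: int = 4, base_delay: float = 1.0, sleep=time.sleep):
    """Call send(); on a retryable HTTP error or TimeoutError, wait and try again."""
    for attempt in range(max_attempts):
        last_attempt = attempt == max_attempts - 1
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Non-retryable statuses (e.g. 403 Access Denied) are raised immediately.
            if err.code not in RETRYABLE_STATUS or last_attempt:
                raise
        except TimeoutError:
            if last_attempt:
                raise
        # Delay doubles after each failed attempt: 1s, 2s, 4s, ...
        sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the delay schedule can be checked in tests without actually waiting.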
Related Operations
After inspecting your data, you may need to:
Update Source Configuration
PUT /data_sources/{source_id}
Validate Configuration
POST /data_sources/{source_id}/config/validate
Test Connection
PUT /data_sources/{source_id}/test