Data Flows
Data flows describe the path of data through the Nexla platform, from source to destination. The primary resources in any flow are its data sets, which are chained together in acyclic tree structures and associated with resources describing the data's source, sharing, and destinations.
Flow Architecture
The Nexla platform uses a flexible data flow architecture that gives you fine-grained control over how data moves from source to destination. It lets you build complex data processing pipelines while keeping individual flows easy to manage and performant.
Modern Flow Structure
In the current Nexla platform, data flows use a modern architecture that includes:
- Flow Nodes: Each flow is composed of flow nodes, the internal records that represent the flow's tree structure
- Data Sets: Data sets are the primary resources making up data flows
- Flexible Connections: Any number of data sets can be chained together in directed acyclic graphs
- Direct Associations: Data sinks can be directly associated with any data set
- Sharing: Data sets can be shared directly without the need for intermediate publication/subscription resources
Flow Components
A typical data flow consists of:
- Data Source: The origin of data (e.g., S3 bucket, database, API)
- Data Sets: Intermediate data processing stages
- Transforms: Data transformations applied to data sets
- Data Sinks: Destinations where processed data is written
- Flow Nodes: Internal representation of the flow structure
Flow Structure
Flow resources are nested JSON objects. The root object contains a flows array holding one or more complete data flows; a flow normally begins at a data set associated with a data source and terminates in a data set or data sink. Resources referenced within a flow appear in abbreviated form, and their full representations are returned in the accompanying top-level arrays (data_sources, data_sets, data_sinks, and so on), as shown in the example below.
Each data set object in a data flow contains various attributes describing the data set itself (e.g., name, description, etc.) and may contain:
- Data Source Association: Direct link to the source data
- Data Sinks: Direct associations with destination resources
- Sharing Configuration: Access control and sharing settings
- Children: Downstream data sets in the flow
Basic Flow Structure
The following example shows the basic tree structure of a flow:
{
  "flows": [
    {
      "id": 10001,
      "flow_node_id": 10001,
      "origin_node_id": 10001,
      "flow_type": "streaming",
      "status": "ACTIVE",
      "created_at": "2023-01-15T10:30:00.000Z",
      "updated_at": "2023-01-15T10:30:00.000Z",
      "data_source": {
        "id": 5001,
        "name": "Example Data Source",
        "status": "ACTIVE"
      },
      "data_sinks": [],
      "sharers": {
        "sharers": [],
        "external_sharers": []
      },
      "children": [
        {
          "id": 10002,
          "flow_node_id": 10002,
          "parent_flow_node_id": 10001,
          "status": "ACTIVE",
          "data_sinks": [
            {
              "id": 6001,
              "name": "Example Data Sink",
              "status": "ACTIVE"
            }
          ],
          "sharers": {
            "sharers": [],
            "external_sharers": []
          },
          "children": []
        }
      ]
    }
  ],
  "data_sources": [ ... ],
  "data_sets": [ ... ],
  "data_sinks": [ ... ],
  "data_credentials": [ ... ],
  "orgs": [ ... ],
  "users": [ ... ]
}
Flow Types
Nexla supports three main flow types:
1. Streaming (Default)
- Purpose: Standard streaming data processing
- Use Case: Real-time or near-real-time data processing
- Performance: Optimized for continuous data flow
- Resource Usage: Moderate resource consumption
2. In-Memory
- Purpose: High-performance in-memory data processing
- Use Case: Fast data transformations and analytics
- Performance: Highest performance for data processing
- Resource Usage: Higher memory consumption
3. Replication
- Purpose: Data replication and synchronization
- Use Case: Data backup, migration, and synchronization
- Performance: Optimized for data transfer
- Resource Usage: Lower processing overhead
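The flow type is exposed as the flow_type attribute on each flow object, as seen in the Basic Flow Structure example above. The trimmed sketch below shows reading it with the Show Flow endpoint (listed under API Endpoints below); only the streaming value is taken from the example above, and the exact string values used for the in-memory and replication types are not shown here:

GET /flows/10001

{
  "flows": [
    {
      "id": 10001,
      "flow_node_id": 10001,
      "flow_type": "streaming",
      "status": "ACTIVE",
      ...
    }
  ]
}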
Flow Lifecycle
Every data flow in Nexla follows a defined lifecycle that determines how data moves through the system. Understanding these lifecycle stages helps you manage flows effectively and troubleshoot any issues that arise.
Flow States
Data flows can exist in several states:
- INIT: Flow is created but not yet configured
- PAUSED: Flow is paused and not processing data
- ACTIVE: Flow is running and processing data
- ERROR: Flow has encountered an error and stopped
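The operations listed next move a flow between these states. The sketch below is a plausible mapping inferred from the state descriptions above, not an exhaustive state machine; in particular, the transitions out of INIT and ERROR are assumptions:

INIT    --(configure + activate)-->  ACTIVE
ACTIVE  --(pause)-->                 PAUSED
PAUSED  --(activate)-->              ACTIVE
ACTIVE  --(processing failure)-->    ERROR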
Flow Operations
Common flow operations include:
- Create: Set up a new data flow
- Activate: Start data processing
- Pause: Stop data processing temporarily
- Update: Modify flow configuration
- Delete: Remove the flow and its resources
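The activate and pause operations map directly onto the flow endpoints listed under API Endpoints below. Using the example flow from above:

PUT /flows/10001/activate    (start processing; status becomes ACTIVE)
PUT /flows/10001/pause       (stop processing temporarily; status becomes PAUSED)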
Flow Management
Effective flow management involves controlling access, sharing resources, and monitoring performance. These aspects ensure your flows operate securely and efficiently.
Access Control
Flow access can be managed through any node or resource within the flow, but a grant always applies to the entire flow, starting at the origin node. Granting access to only a sub-flow is not supported.
Flow Sharing
Data sets can be shared directly with other users or organizations:
- User Sharing: Share with specific users
- Organization Sharing: Share with entire organizations
- External Sharing: Share with external users via email
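This configuration is carried in the sharers object attached to each flow node, with sharers and external_sharers arrays as shown in the Basic Flow Structure example. The sketch below is hypothetical: the endpoint path and the field names inside the arrays are assumptions for illustration, not the documented API:

PUT /data_sets/10002/share    (hypothetical endpoint)

{
  "sharers": [ { "user_id": 123 } ],
  "external_sharers": [ { "email": "partner@example.com" } ]
}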
Flow Monitoring
Monitor flow performance and health through:
- Metrics: Data processing rates, error rates, latency
- Logs: Detailed processing logs and error information
- Alerts: Notifications for flow issues or performance degradation
Best Practices
Following these best practices helps you create robust, maintainable, and efficient data flows that scale with your needs.
Flow Design
- Keep Flows Simple: Design flows with clear, logical data paths
- Use Descriptive Names: Name data sets and flows meaningfully
- Plan for Scaling: Consider future growth when designing flows
- Document Dependencies: Clearly document relationships between flow components
Performance Optimization
- Choose Appropriate Flow Type: Use streaming for real-time, in-memory for high-performance
- Optimize Data Sets: Minimize unnecessary data transformations
- Monitor Resource Usage: Track memory and processing resource consumption
- Use Efficient Transforms: Select appropriate transformation methods
Maintenance
- Regular Monitoring: Monitor flow health and performance
- Version Control: Use export/import for flow versioning
- Testing: Test flow changes in non-production environments
- Documentation: Keep flow documentation up to date
API Endpoints
The flows API provides comprehensive endpoints for managing data flows:
- List Flows:
GET /flows
- Show Flow:
GET /flows/{flow-node-id}
- Show Flow by Resource:
GET /data_sources/{id}/flow
- Activate Flow:
PUT /flows/{flow-node-id}/activate
- Pause Flow:
PUT /flows/{flow-node-id}/pause
- Delete Flow:
DELETE /flows/{flow-node-id}
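For example, given the flow from the Basic Flow Structure example above, both requests below should return the same complete flow, assuming lookups by resource resolve to the flow's origin node (consistent with the access-control behavior described above):

GET /flows/10001
GET /data_sources/5001/flow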
For detailed information about specific operations, see the individual documentation pages for each flow management task.