Common Setup for File-Based Storage Systems
This article provides general information about connecting to file-based storage systems in Nexla.
1. File-Based Storage Systems & Nexla
File-based data storage systems are one of the most efficient ways to store, organize, and move large volumes of data. In these systems, data is stored in a hierarchical structure consisting of files located inside one or more folders.
Examples of file-based data storage systems include cloud services—such as Amazon S3, Azure Blob Storage, Box, Google Cloud Storage, and Google Drive—as well as FTP, SFTP, and FTPS servers and local hard-drive storage systems.
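As a concrete illustration of this hierarchy, the short Python sketch below lists the files under a folder-like prefix in an Amazon S3 bucket using the boto3 library; the bucket and prefix names are hypothetical placeholders, and the same hierarchical pattern applies to the other systems listed above.

```python
import boto3

# List files (objects) under a folder-like prefix in an S3 bucket.
# "example-bucket" and "sales/2024/" are hypothetical placeholders.
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket="example-bucket", Prefix="sales/2024/"):
    for obj in page.get("Contents", []):
        # Each object key mirrors the folder hierarchy,
        # e.g. "sales/2024/january/orders.csv".
        print(obj["Key"], obj["Size"], obj["LastModified"])
```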
Nexla makes ingesting data from file-based storage systems simple and quick. Data ingested from these systems can be transformed and/or sent to any destination in only a few steps. Data flows originating from file-based storage systems can be constructed to suit any use case, and Nexla's comprehensive governance and troubleshooting tools allow users to monitor flow status, data lineage, and more.
2. Data Sources: Connecting to File Systems
With Nexla's connectors, users can quickly and easily add any file-based storage system as a data source to begin ingesting, transforming, and moving data in any format. This section provides general instructions and information about connecting to file-based storage systems.
Each of Nexla's data flow types includes source configuration options that are specific to the flow type. Click the link corresponding to your flow type in the list below to learn how to configure the data source and complete the source setup process.
FlexFlow:
FlexFlow is a flexible, all-in-one flow type for building streaming and real-time data flows that transform and/or move data. It uses the Kafka engine to provide seamless, high-throughput movement of data from any source to any destination and is the recommended flow type for most workflows.
Spark ETL:
Spark ETL data flows run on Spark infrastructure, such as virtual machines that Nexla spawns in the user's cloud environment, and therefore require a cloud credential in addition to the data credential used to reach the source.
DirectFlow:
DirectFlow data flows are designed for high-throughput, point-to-point data processing where minimizing processing latency is critical. This flow type is ideal for batch processing with non-streaming data sources and destinations.
ELT:
ELT data flows are streamlined for seamless movement of unmodified data from APIs into databases and data warehouses. With minimal configuration required, these flows can be set up rapidly to support ELT workflows.
FlexFlow Data Flows
- After logging into Nexla, navigate to the Integrate section by selecting it from the platform menu on the left side of the screen.
- Click the button at the top of the Integrate toolbar on the left to begin creating a new data flow.

- Select FlexFlow from the list of flow types, and click to proceed to data source creation.

- Select the connector tile that matches the file system from which data will be ingested in this flow. Once a connector is selected, the platform will automatically advance to the next setup step.

- In the Authenticate screen, follow the instructions below to create or select the credential that will be used to connect to the data source.
To create a new credential:
- Select the Add Credential tile.
- Enter and/or select the required information in the Add New Credential pop-up.
- Once all of the required information has been entered, click the button at the bottom of the pop-up to save the new credential, and proceed to Configure the Data Source.
To use a previously added or shared credential:
- Select the credential from the list.
- Click the button in the upper right corner of the screen to continue.
Configure the Data Source
- In the Configure screen, enter a name for the data source in the Source Name or Name field.

- Optional: Enter a brief description of the data source in the Description field (if present).
Resource Descriptions: Resource descriptions should provide information about the resource purpose, data freshness, etc. that can help the owner and other users efficiently understand and utilize the resource.

- The subsections below provide information about additional settings available for file system sources in FlexFlow data flows. Follow the listed instructions to configure each setting for this data source, and then proceed to Save & Activate the Data Source.
Source Folder
In Nexla, data sources in FlexFlow data flows can be configured to ingest all data from any accessible folder or subfolder within the file system.
- Under the Choose Source Folder section, navigate to the folder from which Nexla should read data; then, hover over the listing, and click the icon to select this folder.
- To view/select a nested folder, click the icon next to a listed folder to expand it.

Data Selection
In FlexFlow flows, users can optionally configure Nexla to read only some files within the selected location according to the most recent file modification date.
▷ Include all files in the source location (Default):
- By default, Nexla will read all files within the selected location. To continue with this configuration, no action is needed.
▷ Include only files modified after a specified date:
- When Nexla should only ingest newer or recently modified files from the data source, the platform can be configured to selectively ingest files modified after a specified date and time; a sketch of this selection rule appears after this list. To specify the cutoff, click the calendar icon in the Only read files modified after: field, and select the date from the dropdown calendar.

- Optional: In the field at the bottom of the calendar, enter the time (in 24-hour format) on the selected date that should be referenced when identifying new and/or modified files from the source.

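Conceptually, this setting is a simple modification-time filter. The sketch below shows the same rule applied to a local folder with Python's standard library; the folder path and cutoff are hypothetical, and Nexla's internal implementation may differ.

```python
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical cutoff: only files modified after this moment are read.
cutoff = datetime(2024, 1, 15, 13, 30, tzinfo=timezone.utc)
source = Path("/data/source")  # hypothetical source folder

selected = [
    path
    for path in source.rglob("*")
    if path.is_file()
    and datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc) > cutoff
]
print(f"{len(selected)} file(s) selected for ingestion")
```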
Data Format
By default, Nexla automatically detects the format of files ingested from data sources and parses the data contained in the files accordingly. Automatic file format detection is recommended for most workflows, including when the source contains files in more than one format.
For more information about Nexla's automatic file format detection, see the Automatic File Format Detection section in Supported File Formats.
For specialized use cases, users can designate a specific incoming file format for a data source in a FlexFlow data flow, forcing Nexla to parse all files ingested from the source according to the designated file format.
- To specify the format that will be used to parse files from this source, select the appropriate format from the Choose a Pre-Built File Processor pulldown menu under the Define Processing Logic section.
- Some file formats require additional information, such as delimiter, schema, metadata, and relevant data settings. For instructions on completing these settings, follow the corresponding link below:
Custom Text Files
Compressed ZIP and/or TAR Files
EDI Files
Excel Files
Fixed-Width Files
JSON Files
Log Files
PDF Files
XML Files
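Nexla's detection and parsing logic is internal to the platform, but the idea of format-specific parsing can be pictured as a dispatch table from file type to parser. The following Python sketch is a deliberate simplification based on file extensions alone; real detection can also inspect file contents.

```python
import csv
import json
from pathlib import Path

# Simplified extension-to-parser dispatch; illustrative only.
PARSERS = {
    ".json": json.load,
    ".csv": lambda fh: list(csv.DictReader(fh)),
}

def parse_file(path: str):
    parser = PARSERS.get(Path(path).suffix.lower())
    if parser is None:
        raise ValueError(f"No parser registered for {path!r}")
    with open(path, encoding="utf-8") as fh:
        return parser(fh)
```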
Ingestion Speed
When the data ingestion speed is a critical factor in a FlexFlow data flow, users can optionally assign a higher ingestion speed factor to the flow. When a speed factor greater than 1x is selected, Nexla will execute the data flow on higher-capacity, higher-throughput infrastructure.
Enabling a higher speed factor will result in a significant increase in the billable charges for the associated account.
Please consult your Account Manager before modifying data flow speed factor settings.
- To enable a higher data ingestion speed for this data flow, select the desired multiplier from the Speed Factor pulldown menu under the Pipeline Ingest Speed section.

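As a rough, back-of-the-envelope illustration of what a speed-factor multiplier means for ingestion time (the baseline rate below is a made-up number, not a Nexla benchmark):

```python
# Hypothetical baseline throughput at the default 1x speed factor.
baseline_gb_per_min = 2.0
data_size_gb = 600.0

for factor in (1, 2, 4):
    minutes = data_size_gb / (baseline_gb_per_min * factor)
    print(f"{factor}x speed factor -> ~{minutes:.0f} minutes to ingest")
```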
Scheduling
Scan scheduling options can be used to define the frequency at which the data source will be scanned for data changes in a FlexFlow data flow. Any data changes identified during a scan will then be replicated into the configured destination.
- By default, when a new FlexFlow data flow is created, Nexla is configured to scan the source for data changes once every day. To continue with this setting, no further selections are required. Proceed to Section 2.2.
- To define how often Nexla should scan the data source for new data changes, select an option from the Check for Files pulldown menu under the Scheduling section.
- When an option such as Every N Days or Every N Hours is selected, a secondary pulldown menu appears. Select the appropriate value of N from this menu.

- To specify the time at which Nexla should scan the source for new data changes, use the pulldown menu(s) to the right of the Check For Files menu. These time menus vary according to the selected scan frequency.

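In effect, each schedule option reduces to adding a fixed interval to the previous scan time. A minimal sketch of that arithmetic for a hypothetical Every 6 Hours schedule:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical "Every 6 Hours" schedule.
last_scan = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
interval = timedelta(hours=6)

next_scan = last_scan + interval
print(f"Next scan for data changes at {next_scan.isoformat()}")
```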
Save & Activate the Data Source
After all required settings and any desired additional options are configured, click the button in the top right corner of the screen to save & activate the data source.

Once the data source is created, Nexla will automatically scan it for data according to the configured settings. Identified data will be organized into a Nexset, which is a logical data product that is immediately ready to be sent to a destination.

Spark ETL Data Flows
- After logging into Nexla, navigate to the Integrate section by selecting it from the platform menu on the left side of the screen.
- Click the button at the top of the Integrate toolbar on the left to begin creating a new data flow.

- Select Spark ETL from the list of flow types, and click to proceed to data source creation.

- In the Connect screen, select the connector tile matching the data source type from the list.

- In the Authenticate screen, select or create the credential that will be used to connect to the data source, and click the button to continue. Detailed information about credential creation for specific sources can be found on the Connectors page.

Configure the Data Source
- In the Configure screen, enter a name for the data source in the Name or Source Name field.

- Optional: Enter a brief description of the data source in the Description field.
Resource Descriptions: Resource descriptions should provide information about the resource purpose, data freshness, etc. that can help the owner and other users efficiently understand and utilize the resource.

- The subsections below provide information about additional settings available for file system sources in Spark ETL data flows. Follow the listed instructions to configure each setting for this data source, and then proceed to Save & Activate the Data Source.
Source Folder
In Spark ETL flows, data sources can be configured to ingest all data in the overall location accessible with the selected credential or only data in subfolders/tables within a specific location.
- Under the Source Folder section, navigate to the directory from which Nexla will read files; then, hover over the listing, and click the icon to select this location.
- To view/select a nested location, click the icon next to a listed folder to expand it.

- The selected directory location is displayed at the top of the Source Folder section.

Cloud Credential
Spark ETL data flows require a cloud credential in addition to the data credential selected in the Authenticate step above.
Cloud credentials are used to access the cloud storage provider and allow Nexla to spawn and stop virtual machines, data warehouses, and other similar entities required for Spark ETL flow execution.
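As an illustration of the kind of access such a credential carries, the hedged sketch below uses boto3 to confirm that an AWS credential can at least describe EC2 instances; the profile name is hypothetical, and the exact permissions Nexla requires depend on the connector and cloud provider.

```python
import boto3

# Hypothetical AWS profile standing in for the cloud credential.
session = boto3.Session(profile_name="nexla-cloud-credential")
ec2 = session.client("ec2")

# A credential used to spawn and stop virtual machines must at minimum
# be able to describe them; this call fails fast if permissions are missing.
response = ec2.describe_instances(MaxResults=5)
print(len(response.get("Reservations", [])), "reservation(s) visible")
```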
- Select the type of cloud credential that will be used to connect to the selected data source location from the Cloud Credential Type pulldown menu.

- Add or select the cloud credential that will be used.
- To add a new cloud credential:
  - Select the Add Credential icon.
  - In the Add New Credential window, enter the required information, and click the button to save it.
- To use an existing cloud credential, select the credential from the list.
Apache Hive
- To treat all subdirectories detected within the chosen data source location as Apache Hive partitions, check the box next to Treat all subdirectories as Apache Hive partitions.
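Hive-style partitioning encodes column values in the subdirectory names themselves as key=value segments. The small sketch below shows how such a path decodes; the path is hypothetical.

```python
# Decode Hive-style partition segments (key=value) from a file path,
# e.g. "sales/year=2024/month=01/part-0000.parquet".
def parse_hive_partitions(path: str) -> dict:
    partitions = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            partitions[key] = value
    return partitions

print(parse_hive_partitions("sales/year=2024/month=01/part-0000.parquet"))
# -> {'year': '2024', 'month': '01'}
```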
Data Format
By default, Nexla automatically detects the format of files ingested from data sources and parses the data contained in the files accordingly. Automatic file format detection is recommended for most workflows, including when the source contains files in more than one format.
For more information about Nexla's automatic file format detection, see the Automatic File Format Detection section in Supported File Formats.
For specialized use cases, users can designate a specific incoming file format for a data source in a Spark ETL flow, forcing Nexla to parse all files ingested from the source according to the designated file format.
- To specify the format that will be used to parse files from this source, select the appropriate format from the File Content Format pulldown menu under the Data Format section.
- Some file formats require additional information, such as delimiter, schema, metadata, and relevant data settings. For instructions on completing these settings, follow the corresponding link below:
Custom Text Files
Compressed ZIP and/or TAR Files
EDI Files
Excel Files
Fixed-Width Files
JSON Files
Log Files
PDF Files
XML Files
Data Selection
When setting up the data source in a Spark ETL flow, Nexla provides configuration options for specifying which data should be ingested from the source, allowing users to customize data ingestion to suit various use cases. Data can be selected for ingestion from file-based storage systems according to file modification dates, naming patterns, and/or subfolder paths.
The settings discussed in this section are located under the Data Selection category.
▷ To ingest all files in the selected location:
- To configure Nexla to ingest all files from the data source, regardless of when the files were added or modified, delete the pre-populated date and time from the Only read files modified after: field, and leave this field blank.
▷ To ingest files according to the most recent modification date:
- When Nexla should only ingest newer or recently modified files from the data source, the platform can be configured to selectively ingest files modified after a specified date and time. To specify the cutoff, click the calendar icon in the Only read files modified after: field, and select the date from the dropdown calendar.

- In the field at the bottom of the calendar, enter the time (in 24-hour format) on the selected date that should be referenced when identifying new and/or modified files from the source.

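Of the selection criteria mentioned above, naming patterns behave like shell-style globbing. A minimal sketch with Python's fnmatch module; the file names and pattern are hypothetical.

```python
from fnmatch import fnmatch

# Hypothetical file listing and naming pattern.
files = ["orders_2024-01.csv", "orders_2024-02.csv", "refunds_2024-01.csv"]
pattern = "orders_*.csv"

matched = [name for name in files if fnmatch(name, pattern)]
print(matched)  # ['orders_2024-01.csv', 'orders_2024-02.csv']
```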
Scheduling
Scan scheduling options can be used to define the frequency at which the data source will be scanned for data changes in a Spark ETL flow. Any data changes identified during a scan will then be replicated into the configured destination.
- By default, when a new Spark ETL data flow is created, Nexla is configured to scan the source for data changes once every day. To continue with this setting, no further selections are required. Proceed to Section 2.2.
- To define how often Nexla should scan the data source for new data changes, select an option from the Check for Files pulldown menu under the Scheduling section.
- When an option such as Every N Days or Every N Hours is selected, a secondary pulldown menu appears. Select the appropriate value of N from this menu.

- To specify the time at which Nexla should scan the source for new data changes, use the pulldown menu(s) to the right of the Check For Files menu. These time menus vary according to the selected scan frequency.

Save & Activate the Data Source
After all required settings and any desired additional options are configured, click the button in the top right corner of the screen to save & activate the data source.

Once the data source is created, Nexla will automatically scan it for data according to the configured settings. Identified data will be organized into a Nexset, which is a logical data product that is immediately ready to be sent to a destination.
New Spark ETL Data Flow with Data Source & Detected Nexset

DirectFlow Data Flows
- After logging into Nexla, navigate to the Integrate section by selecting it from the platform menu on the left side of the screen.
- Click the button at the top of the Integrate toolbar on the left to begin creating a new data flow.

- Select DirectFlow from the list of flow types, and click to proceed to data source creation.

- In the Connect screen, select the connector tile matching the file system from which data will be ingested in this flow. Once a connector is selected, the platform will automatically advance to the next setup step.

- In the Authenticate screen, select or create the credential that will be used to connect to the data source, and click the button to continue. Detailed information about credential creation for specific sources can be found on the Connectors page.

Configure the Data Source
- Enter a name for the data source in the Name field.

- Optional: Enter a description of the data source in the Description field.
Resource Descriptions: Resource descriptions should provide information about the resource purpose, data freshness, etc. that can help the owner and other users efficiently understand and utilize the resource.

- The subsections below provide information about additional settings available for file system sources in DirectFlow data flows. Follow the listed instructions to configure each setting for this data source, and then proceed to Save & Activate the Data Source.
Data Selection
When setting up a file system data source in a data flow, Nexla provides configuration options for specifying which data should be ingested from the source location, allowing users to customize data ingestion to suit various use cases. File modification dates, naming patterns, and/or subfolder paths can be used to specify which data should be ingested from the selected file system location.
The settings discussed in this section are located under the Data Selection category.
▷ To ingest all files in the selected location:
- To configure Nexla to ingest all files from the selected location, regardless of when the files were added or modified, leave the Only read files modified after: field blank.
▷ To ingest files according to the most recent modification date:
- When Nexla should only ingest newer or recently modified files from the data source, the platform can be configured to selectively ingest files modified after a specified date and time. To specify the cutoff, click the calendar icon in the Only read files modified after: field, and select the date from the dropdown calendar.

- In the field at the bottom of the calendar, enter the time (in 24-hour format) on the selected date that should be referenced when identifying new and/or modified files from the source.

Data Format
By default, Nexla automatically detects the format of files ingested from data sources and parses the data contained in the files accordingly. Automatic file format detection is recommended for most workflows, including when the source contains files in more than one format.
For more information about Nexla's automatic file format detection, see the Automatic File Format Detection section in Supported File Formats.
For specialized use cases, users can designate a specific incoming file format for a data source in a DirectFlow data flow, forcing Nexla to parse all files ingested from the source according to the designated file format.
- To specify the format that will be used to parse files from this source, select the appropriate format from the Choose a Pre-Built File Processor pulldown menu under the Define Processing Logic section.
- Some file formats require additional information, such as delimiter, schema, metadata, and relevant data settings. For instructions on completing these settings, follow the corresponding link below:
Custom Text Files
Compressed ZIP and/or TAR Files
EDI Files
Excel Files
Fixed-Width Files
JSON Files
Log Files
PDF Files
XML Files
Ingestion Speed
When the data ingestion speed is a critical factor in a DirectFlow data flow, users can optionally assign a higher ingestion speed factor to the flow. When a speed factor greater than 1x is selected, Nexla will execute the data flow on higher-capacity, higher-throughput infrastructure.
Enabling a higher speed factor will result in a significant increase in the billable charges for the associated account.
Please consult your Account Manager before modifying data flow speed factor settings.
- To enable a higher data ingestion speed for this data flow, select the desired multiplier from the Speed Factor pulldown menu under the Pipeline Ingest Speed section.

Scheduling
Scan scheduling options can be used to define the frequency at which the data source will be scanned for data changes in a DirectFlow data flow. Any data changes identified during a scan will then be replicated into the configured destination.
- By default, when a new DirectFlow data flow is created, Nexla is configured to scan the source for data changes once every day. To continue with this setting, no further selections are required. Proceed to Section 2.2.
- To define how often Nexla should scan the data source for new data changes, select an option from the Check for Files pulldown menu under the Scheduling section.
- When an option such as Every N Days or Every N Hours is selected, a secondary pulldown menu appears. Select the appropriate value of N from this menu.

- To specify the time at which Nexla should scan the source for new data changes, use the pulldown menu(s) to the right of the Check For Files menu. These time menus vary according to the selected scan frequency.
