Common Setup for File-Based Storage Systems

info

This article provides general information about working with file-based storage systems in Nexla.

  • Detailed instructions for creating a data source or destination using any of Nexla's file-based storage system connectors are located in Section 2 & Section 3.

1. File-Based Storage Systems & Nexla

File-based data storage systems are one of the most efficient ways to store, organize, and move large volumes of data. In these systems, data is stored in a hierarchical structure consisting of files located inside one or more folders.

Examples of file-based data storage systems include cloud services—such as Amazon S3, Azure Blob Storage, Box, Google Cloud Storage, and Google Drive—as well as FTP, SFTP, and FTPS servers and local hard-drive storage systems.

Nexla makes ingesting data from file-based storage systems a simple and quick process. Data ingested from these systems can be transformed and/or sent to any destination in only a few steps. Data flows originating from file-based storage systems can be constructed to suit any use case, and Nexla's comprehensive governance and troubleshooting tools allow users to monitor every aspect of the flow status, data lineage, and more.

2. Connecting to File-Based Storage Systems (Data Sources)

With Nexla's connectors, users can quickly and easily add any file-based storage system as a data source to begin ingesting, transforming, and moving data in any format. This section provides general instructions and information about connecting to file-based storage systems.

2.1 Create a New File-Based System Data Source

  1. After logging into Nexla, navigate to the Integrate section by selecting IntegrateIcon.png from the platform menu on the left side of the screen.

  2. Click NewDataFlow.png at the top of the Integrate toolbar on the left to open the Select Source Type screen.

NewDataFlow2.png
  3. Select the type of data flow to be created from the menu on the left, and click Create2.png to proceed to data source creation.

    File Systems Data Flow Types

    File system-based data sources can be used to create the FlexFlow, Replication, and Spark ETL data flow types.

    FlexFlow:
    FlexFlow is a flexible all-in-one data flow type that can be used to create both streaming and real-time data flows for transforming data and/or moving it from any source to any destination. This flow type uses the Kafka engine to facilitate seamless high-throughput data movement. FlexFlow is the recommended flow type for most workflows.

    Spark ETL:
    Spark ETL data flows are designed for rapidly modifying large volumes of data stored in cloud databases or Databricks and moving the transformed data into another cloud storage or Databricks location. This flow type uses the Apache Spark engine and is ideal for large-scale data processing wherein minimizing latency in data movement is a critical need.

    Replication:
    Replication data flows are designed for use in workflows that require high-speed movement of unmodified files between storage systems. This flow type is ideal for use when both retaining file structure and transferring data as quickly as possible are critical.

FlowTypes.png
  4. Select the connector tile that matches the file-based storage system from which data will be ingested in this flow. Then, click Next.png in the top right corner of the screen.

    Connector Categories

    To view all of Nexla's currently available file-based storage system connectors, select File Systems from the Categories toolbar on the left side of the screen.

    SelectConnector.png
  5. In the Authenticate screen, follow the instructions below to create or select the credential that will be used to connect to the data source.

    To create a new credential:

    1. Select the Add Credential tile.

      AddCredential.png
    2. Enter and/or select the required information in the Add New Credential pop-up.

      Adding New Credentials

      Credential requirements vary depending on the data source type, with some sources requiring URLs, access keys, and other parameters that must be obtained from the source application/location in addition to account information.

      Detailed information about credential creation for specific file-based sources can be found on the Connectors > File-Based Systems page.

      AddCredential2.png
    3. Once all of the required information has been entered, click SaveCredential.png at the bottom of the pop-up to save the new credential, and proceed to Section 2.2.


    To use a previously added or shared credential:

    1. Select the credential from the list.

      SelectCred.png
    2. Click Next.png in the upper right corner of the screen, and proceed to Section 2.2.


2.2 Configure the Data Source

  1. In the Configure screen, enter a name for the data source in the Source Name or Name field.

  2. Optional: Enter a brief description of the data source in the Description field (if present).

    Resource Descriptions

    Resource descriptions should provide information about the resource purpose, data freshness, etc. that can help the owner and other users efficiently understand and utilize the resource.


  3. The subsections below provide information about additional settings available for file system data sources in Nexla data flows. Follow the listed instructions to configure each setting for this data source, and then proceed to Section 2.3.

Source Directory/Table

In Nexla, file-system data sources can be configured to ingest all files accessible to the selected credential, only files within a specific folder, or a single file.

  • Under the Source Directory or Source Table section, navigate to the directory location from which Nexla will read files for this source; then, hover over the listing, and click the Select.png icon to select this location.

    • To view/select a nested location, click the Expand.png icon next to a listed folder to expand it.
SelectLoc1.png
  • The selected directory location is displayed at the top of the Source Directory/Source Table section.
SelectLoc2.png
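
The three scopes described above (everything the credential can reach, one folder, or one file) can be pictured as a simple path test. The following is a minimal sketch, not Nexla's implementation; the function name in_scope and the example paths are hypothetical and exist only for illustration.

```python
from pathlib import PurePosixPath


def in_scope(file_path: str, selected: str) -> bool:
    """Return True if file_path is the selected location or lies beneath it.

    `selected` may be "/" (all accessible files), a folder, or a single file.
    """
    file_p = PurePosixPath(file_path)
    selected_p = PurePosixPath(selected)
    # A file is in scope when it is the selection itself or when the
    # selection is one of its ancestor directories.
    return file_p == selected_p or selected_p in file_p.parents


# Hypothetical listing of files visible to a credential.
files = ["/sales/2024/jan.csv", "/sales/2024/feb.csv", "/hr/staff.json"]

all_files = [f for f in files if in_scope(f, "/")]
folder_only = [f for f in files if in_scope(f, "/sales/2024")]
single_file = [f for f in files if in_scope(f, "/hr/staff.json")]
```

Comparing whole path components, rather than string prefixes, keeps a selection such as /sales from accidentally matching a sibling folder such as /sales2.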

Data Selection

When setting up a file system data source in a data flow, Nexla provides configuration options for specifying which data should be ingested from the source location, allowing users to customize data ingestion to suit various use cases. File modification dates, naming patterns, and/or subfolder paths can be used to specify which data should be ingested from the selected file system location.

The settings discussed in this section are located under the Data Selection category.

▷   To ingest all files in the selected location:

  • To configure Nexla to ingest all files from the selected location, regardless of when the files were added or modified, leave the Only read files modified after: field blank.

    Blank_AllFiles.png

▷   To ingest files according to the most recent modification date:

  1. When Nexla should ingest only new or recently modified files from the data source, the platform can be configured to selectively ingest files modified after a specified date and time. To specify this date and time, click the Calendar.png icon in the Only read files modified after: field, and select the date from the dropdown calendar.
ModifiedAfter1.png
  2. In the field at the bottom of the calendar, enter the time (in 24-h format) on the selected date that should be referenced when identifying new and/or modified files from the source.
Time.png
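
The effect of the Only read files modified after: setting can be illustrated with a short filter. This is a sketch under stated assumptions rather than Nexla's actual logic: the FileInfo type, the select_files function, and the sample files are hypothetical, and the comparison is assumed to be strictly "after" the cutoff.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional


class FileInfo(NamedTuple):
    """Hypothetical stand-in for a file listing entry."""
    path: str
    modified: datetime


def select_files(files: List[FileInfo],
                 modified_after: Optional[datetime]) -> List[FileInfo]:
    """Apply the modified-after rule; None models a blank field (read all)."""
    if modified_after is None:
        return list(files)
    # Assumption: a file modified exactly at the cutoff is NOT selected.
    return [f for f in files if f.modified > modified_after]


files = [
    FileInfo("/in/a.csv", datetime(2024, 2, 28, 9, 0)),
    FileInfo("/in/b.csv", datetime(2024, 3, 1, 14, 30)),
    FileInfo("/in/c.csv", datetime(2024, 3, 2, 8, 15)),
]

# Cutoff entered as a date plus a 24-hour time, as in the steps above.
cutoff = datetime(2024, 3, 1, 14, 30)
recent = select_files(files, cutoff)
everything = select_files(files, None)
```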

Scheduling

Scan scheduling options can be used to define the frequency at which the selected location will be scanned for new data and/or changes in a data flow. Any new data/changes identified during a scan will then be ingested into the flow.

  • By default, when a new data flow is created, Nexla is configured to scan the source for data changes once every day. To continue with this setting, no further selections are required. Proceed to Section 2.3.

  • To define how often Nexla should scan the data source for new data or changes, select an option from the Check for Files pulldown menu under the Scheduling section.

    • When an option such as Every N Days or Every N Hours is selected, a secondary pulldown menu will be populated. Select the appropriate value of N from this menu.
CheckForFiles.png
  • To specify the time at which Nexla should scan the source for new data or changes, use the pulldown menu(s) to the right of the Check For Files menu. These time menus vary according to the selected scan frequency.
Time.png
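
As a rough illustration of how an Every N Hours or Every N Days setting translates into scan times, the sketch below computes the next scan from the previous one. The function next_scan and its parameters are hypothetical; the once-per-day default mirrors the default described above, but the platform's real scheduler is not documented here.

```python
from datetime import datetime, timedelta


def next_scan(last_scan: datetime, every_n: int = 1, unit: str = "days") -> datetime:
    """Return the time of the next scan for an "Every N <unit>" schedule.

    The defaults model the documented default of one scan per day.
    """
    if every_n < 1:
        raise ValueError("every_n must be at least 1")
    if unit not in ("hours", "days"):
        raise ValueError("unit must be 'hours' or 'days'")
    return last_scan + timedelta(**{unit: every_n})


last = datetime(2024, 1, 1, 9, 0)
daily = next_scan(last)                    # default schedule: once every day
six_hourly = next_scan(last, 6, "hours")   # "Every N Hours" with N = 6
```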

Additional Settings

Additional data source settings are available for file-based storage system sources, depending on the selected flow type. For information about these settings, see the corresponding flow type user guide.

2.3 Save & Activate the Data Source

Once all required settings and any desired additional options are configured, click Continue.png in the top right corner of the screen to save & activate the data source.

Continue2.png

Once the data source is created, Nexla will automatically scan it for data according to the configured settings. Identified data will be organized into a Nexset, which is a logical data product that is immediately ready to be sent to a destination.


2.4 How Nexla Organizes Data

When Nexla ingests data from a source, the platform intelligently analyzes the structure of the data to organize it into one or more Nexsets.

If a location containing multiple files is selected when configuring a data source from a file-based storage system, Nexla will examine the differences between the ingested files. The platform will create Nexsets containing the ingested data based on the level of overlap between records and options selected during data source creation.

After the initial data ingestion cycle, Nexla will compare the structure and composition of data ingested in each subsequent cycle to any existing Nexsets. Similar data will be added to existing Nexsets, while significantly different data will be organized into a new Nexset.

Important Note: File Ingestion

Nexla's comparison of ingested data to existing Nexsets ignores differences in file format.

For example, when a CSV file containing the headers ID and Name and a JSON file with ID and Name object properties are ingested, the data contained in both files will be processed into the same Nexset.
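
The idea that grouping depends on record structure rather than file format can be sketched as follows. This is an illustration only: the overlap measure (shared field names divided by all field names), the THRESHOLD value, and every function name are assumptions made for the example, not Nexla's documented algorithm.

```python
import csv
import io
import json

THRESHOLD = 0.8  # hypothetical cutoff; the platform's real criteria are not documented here


def records_from_csv(text):
    """Parse CSV text into a list of dicts keyed by header name."""
    return list(csv.DictReader(io.StringIO(text)))


def records_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


def field_names(records):
    """Collect every field name that appears in the records."""
    return {key for record in records for key in record}


def overlap(a, b):
    """Share of field names common to both sets (1.0 means identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def assign(nexsets, records):
    """Add records to the first sufficiently similar group, else start a new one."""
    fields = field_names(records)
    for nexset in nexsets:
        if overlap(fields, nexset["fields"]) >= THRESHOLD:
            nexset["fields"] |= fields
            nexset["records"].extend(records)
            return
    nexsets.append({"fields": fields, "records": list(records)})


nexsets = []
assign(nexsets, records_from_csv("ID,Name\n1,Ada\n2,Lin\n"))
assign(nexsets, records_from_json('[{"ID": 3, "Name": "Sam"}]'))
assign(nexsets, records_from_json('[{"SKU": "A1", "Price": 9.5}]'))
```

In this sketch the CSV and JSON inputs with ID and Name fields land in one group because only field names are compared once each file has been parsed into records; the third input shares no fields and starts a second group.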


2.5 Ingestion of New and/or Modified Files

Once a data source has been created in Nexla, whether from a file-based storage system or any other type of service, the platform will scan the source at regular intervals according to the configured scheduling options. When Nexla detects new files during a scan, it will automatically ingest and process the data contained in the new files and mark the files as ingested.

Nexla also tracks the number of rows of data that have been ingested from each file. Therefore, when additional rows of data are added to a previously ingested file, the platform will automatically ingest and process the added data.
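
Row-count tracking of this kind can be pictured with a per-file counter. The sketch below is a simplified model, assuming rows are only ever appended; the names ingested_rows and ingest_new_rows and the sample file are hypothetical.

```python
from typing import Dict, List

# Number of rows already ingested from each file, keyed by path.
ingested_rows: Dict[str, int] = {}


def ingest_new_rows(path: str, rows: List[str]) -> List[str]:
    """Return only the rows not yet ingested from this file, then update the count."""
    already = ingested_rows.get(path, 0)
    new_rows = rows[already:]
    ingested_rows[path] = len(rows)
    return new_rows


first_scan = ingest_new_rows("/in/orders.csv", ["1,apple", "2,pear"])
second_scan = ingest_new_rows("/in/orders.csv", ["1,apple", "2,pear", "3,plum"])
third_scan = ingest_new_rows("/in/orders.csv", ["1,apple", "2,pear", "3,plum"])
```

A counter like this detects appended rows but cannot detect an edit to an existing row, which is why a separate re-ingestion mechanism is needed for files changed in place.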

Important Note: Data Ingestion

Nexla reads and processes data from a source according to the configured schedule, but the platform will wait for a period of inactivity at the source before executing a scan.
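
One way to picture waiting for a period of inactivity is to check that no file at the source has changed recently before a scan proceeds. This is a sketch only: the function source_is_quiet and the five-minute quiet period are hypothetical, and the actual length and detection method used by the platform are not stated in this article.

```python
from datetime import datetime, timedelta
from typing import Iterable


def source_is_quiet(modified_times: Iterable[datetime],
                    now: datetime,
                    quiet_period: timedelta = timedelta(minutes=5)) -> bool:
    """Return True when no file has been modified within the quiet period."""
    return all(now - modified >= quiet_period for modified in modified_times)


now = datetime(2024, 3, 1, 12, 0)
still_writing = source_is_quiet([datetime(2024, 3, 1, 11, 58)], now)
settled = source_is_quiet([datetime(2024, 3, 1, 11, 40)], now)
```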


2.6 Re-ingestion of Files

In some cases, a previously ingested file may need to be modified in a way that affects record values without adding new rows of data. When this occurs, the file should be marked for re-ingestion in the next scan cycle.

To re-ingest a file:

  1. Navigate to the Integrate screen by selecting Integrate.png from the platform menu on the left side of the screen.

  2. In the All Data Flows list, locate the flow origin corresponding to the file that should be re-ingested, and click on it to expand the flow view.

  3. Click the Details.png icon on the data source to open the Data Source information screen.

SourceInfo.png
  4. Select the Read Stats tab to view a list of files previously ingested from this source.
IngestedFiles.png
  5. Click the ReingestIcon.png icon to the right of the file that should be re-ingested, and click the Reingest.png button in the pop-up that appears.

    • When this button is clicked, the selected file will be re-ingested during the next ingestion cycle.
ReingestSeq.png
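
Conceptually, marking a file for re-ingestion tells the next cycle to read the file in full instead of only the rows beyond the last recorded count. The class below is a minimal model of that behavior, not Nexla's implementation; SourceTracker and its methods are hypothetical names.

```python
from typing import Dict, List, Set


class SourceTracker:
    """Hypothetical per-source state: rows ingested per file plus re-ingest flags."""

    def __init__(self) -> None:
        self.rows_ingested: Dict[str, int] = {}
        self.reingest: Set[str] = set()

    def mark_for_reingest(self, path: str) -> None:
        """Flag a file so the next scan reads it from the first row."""
        self.reingest.add(path)

    def scan(self, path: str, rows: List[str]) -> List[str]:
        """Return the rows to ingest from this file in the current cycle."""
        if path in self.reingest:
            self.reingest.discard(path)  # the flag applies to one cycle only
            start = 0
        else:
            start = self.rows_ingested.get(path, 0)
        self.rows_ingested[path] = len(rows)
        return rows[start:]


tracker = SourceTracker()
initial = tracker.scan("/in/prices.csv", ["a,1", "b,2"])
# A value is corrected in place; the row count is unchanged, so nothing is picked up.
unnoticed = tracker.scan("/in/prices.csv", ["a,1", "b,3"])
tracker.mark_for_reingest("/in/prices.csv")
reingested = tracker.scan("/in/prices.csv", ["a,1", "b,3"])
```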

3. Sending Data to File-Based Storage Systems (Data Destinations)

Nexla's bi-directional connectors allow data to flow both to and from any location, making it simple to set up a data flow that sends data to a file-based storage system. This section provides general instructions and information about sending data to file-based storage systems.

3.1 Select the Destination Type & Credential

  1. Locate the Nexset that will be sent to the file-based storage system destination.

    Viewing Accessible Nexsets

    To view all accessible Nexsets within their associated data flows:

    Navigate to the Integrate section, and select All Data Flows from the menu on the left. Then, click on any listed data flow to view all detected and transformed Nexsets that it contains.

    To view a list of all Nexsets accessible to the Nexla user account:

    Navigate to the Integrate section, and select Nexsets from the menu on the left to open the Nexsets screen.


  2. Click the Send.png icon on the Nexset.

SendNexset.png
  3. In the Connect screen, select the connector tile matching the data destination type from the list.

    SelectDest.png
  4. Add or select the credential that should be used to connect to the destination by following the same instructions shown in Section 2.1, Step 5.


3.2 Configure the Destination

  1. In the Configure screen, enter a name for the destination in the Destination Name or Name field.

  2. Optional: Enter a brief description of the destination in the Description field (if present).

    Resource Descriptions

    Resource descriptions should provide information about the resource purpose, data freshness, etc. that can help the owner and other users efficiently understand and utilize the resource.


  3. The subsections below provide information about additional settings available for data destinations in Nexla data flows. Follow the listed instructions to configure each setting for this destination, and then proceed to Section 3.3.


Destination Directory/Table

  • Under the Destination Directory or Destination Table section, navigate to the directory location to which Nexla will send the Nexset data; then, hover over the listing, and click the Select.png icon to select this location.

    • To view/select a nested location, click the Expand.png icon next to a listed folder to expand it.
DestDirectory.png
  • The selected directory location is displayed at the top of the Destination Directory/Destination Table section.
DestDirectory2.png

Additional Settings

Additional destination settings are available for file-based storage system destinations, depending on the selected flow type. For information about these settings, see the corresponding flow type user guide.

3.3 Save and Activate the Destination

  1. Once all required settings and any desired additional options are configured, click Done.png in the top right corner of the screen to save the data destination.

    Important: Data Movement

    Data will not begin to flow into the destination until it is activated by following the instructions below.

Done2.png
  2. To activate the destination, click the Edit.png icon on the destination, and select Activate.png from the dropdown menu.
Activate3.png