Reading Data
Data Sources
Data source resources describe sources of data to be ingested, including details about source type, ingestion schedule, associated data sets, and credentials for accessing the source. No matter where the data to be ingested resides, all information about where, when, and how to ingest it is contained in these Nexla resources.
A data source may have one or more datasets associated with it. These correspond to distinct schemas detected by Nexla in the source.
List All Sources
Both the Nexla API and the Nexla CLI support methods to list all sources in the authenticated user's account. A successful call returns detailed information about every source, including id, owner, type, credentials, activation status, and ingestion configuration.
- Nexla API
- Nexla CLI
GET /data_sources
Example:
curl https://api.nexla.io/data_sources \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
nexla source list
- Nexla API
- Nexla CLI
[
{
"id": 5002,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example data source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5004,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "API",
"source_type": "api_push",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
]
id status source_type name location credentials_name
------ ---------- ------------ -------- ----------------------------- ------------------
8874 PAUSED ftp test ftp://test-regression/test sftp_test
9989 ACTIVE s3 test1 s3://test-nexla.com/test s3_test
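As a sketch, the same list call can be issued from Python's standard library; the endpoint and headers mirror the curl example above, while the access token is a placeholder you would substitute:

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # base URL from the curl example above
ACCESS_TOKEN = "<Access-Token>"    # placeholder; substitute a real token

# Build (but do not send) the GET /data_sources request with the
# Authorization and Accept headers shown in the curl example.
req = urllib.request.Request(
    f"{API_BASE}/data_sources",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/vnd.nexla.api.v1+json",
    },
)

print(req.full_url)
# Sending it would be: response = urllib.request.urlopen(req)
```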
Show One Source
Fetch a specific source accessible to the authenticated user. A successful call returns detailed information about that source, including id, owner, type, credentials, activation status, and ingestion configuration.
When using the Nexla API, add an expand
query param with a truthy value to get more details about the source. With this parameter, full details about the related resources (detected datasets, credentials, etc.) are also returned.
- Nexla API
- Nexla CLI
GET /data_sources/{data_source_id}
Example
curl https://api.nexla.io/data_sources/5003 \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
nexla source get <source_id>
- Nexla API
- Nexla CLI
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
{
"source_type": "ftp",
"source_config": {
"bucket": "/",
"prefix": "demo/source/test",
"start.cron": "0 55 10 1/1 * ? *",
"advanced_settings": "Auto Detect",
"path_exclusions": false,
"schema.detection.once": true,
"allowGrouping": false
},
"data_credentials": "<5794: sftp_cs>",
"name": "test",
"ingest_method": "POLL"
}
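To request expanded details, the expand parameter can be appended to the source URL; a minimal sketch, with the source id taken from the sample response above:

```python
from urllib.parse import urlencode

API_BASE = "https://api.nexla.io"   # as in the curl examples
source_id = 5003                    # id from the sample response

# expand with a truthy value asks the API to inline full details of
# related resources (detected datasets, credentials, etc.).
url = f"{API_BASE}/data_sources/{source_id}?" + urlencode({"expand": 1})
print(url)  # https://api.nexla.io/data_sources/5003?expand=1
```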
Create A Source
Both the Nexla API and the Nexla CLI support methods to create a new data source in the authenticated user's account. The only required attribute in the input object is the data source name; all other attributes are set to default values.
- Nexla API
- Nexla CLI
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3"
}
nexla source create --payload='
{
"name": "Example S3 Data Source",
"source_type": "s3"
}'
- Nexla API
- Nexla CLI
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-12-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Successfully created source, Created Source ID is --> 7794
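The create call can be sketched in Python as well; the endpoint and payload are from the example above, while the placeholder token and the Content-Type header are assumptions of this sketch:

```python
import json
import urllib.request

# Minimal payload from the example request body above.
payload = {"name": "Example S3 Data Source", "source_type": "s3"}

# Build (but do not send) the POST /data_sources request; the body is
# JSON-encoded and labeled as such via Content-Type.
req = urllib.request.Request(
    "https://api.nexla.io/data_sources",
    data=json.dumps(payload).encode("utf-8"),
    method="POST",
    headers={
        "Authorization": "Bearer <Access-Token>",  # placeholder
        "Content-Type": "application/json",
        "Accept": "application/vnd.nexla.api.v1+json",
    },
)

print(req.get_method(), req.full_url)
print(req.data.decode())
```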
Create with Credentials
Data sources usually require credentials for making a connection and ingesting data. You can refer to an existing data_credentials resource or create a new one in the POST call to /data_sources. In this example, an existing credentials object is referenced:
- Nexla API
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3",
"data_credentials": 5001
}
Here, the required attributes for creating a new data_credentials resource are included in the request:
- Nexla API
POST /data_sources
Example Request Body
...
{
"name": "Example FTP Data Source",
"source_type": "ftp",
"data_credentials": {
"name": "FTP CREDS",
"credentials_type": "ftp",
"credentials_version": "1",
"credentials": {
"credentials_type": "ftp",
"account_id": "XYZ",
"password": "123"
}
}
}
In either case, a successful POST
to /data_sources
with credential information returns a response that includes the full data source and the encrypted form of its associated data_credentials resource:
- Nexla API
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Update A Source
The Nexla API supports methods to update any property of an existing source that the authenticated user has access to.
- Nexla API
PUT /data_sources/5023
Example Request Body
...
{
"name": "Updated S3 Data Source",
}
- Nexla API
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Delete A Source
The Nexla API supports methods to delete any source that the authenticated user has administrative or ownership rights to.
If the source is paused and none of its detected datasets have associated downstream resources, Nexla can delete the source safely. A successful request to delete a data source returns Ok (200) with no response body.
If the source is active, or there are downstream resources that would be impacted, Nexla will not delete the source and instead returns a failure message explaining why deletion was denied.
- Nexla API
DELETE /data_sources/{data_source_id}
- Nexla API
Empty response with status 200 for success
Error response with reason if source could not be deleted
Control ingestion
Activate and Pause Source
Trigger Nexla to start ingesting data immediately by calling the activation method on a source. Note that a Nexla source usually contains parameters that schedule automatic ingestion based on cron intervals or on completion of other jobs. The activation method triggers an ingestion in addition to the scheduled automatic ingestion.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/activate
nexla source activate <source_id>
Conversely, call the pause method to immediately stop ingestion on a source. Any subsequent scheduled ingestion intervals are ignored for as long as the source is paused.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/pause
nexla source pause <source_id>
Reingest Files
For file-type sources, Nexla can be configured to reingest an already scanned file. This is useful if the file originally failed ingestion due to file errors and has since been modified.
To re-ingest files for a data source, issue a POST
request to the endpoint /data_sources/<data_source_id>/file/ingest
with the file path as the body. The file path must start at the root of the location that the source points to.
- Nexla API
POST /data_sources/{data_source_id}/file/ingest
...
Example Payload
{"file":"xls-merge/PostLog_TableOnlyXLS.xlsx"}
- Nexla API
{
"status": "ok"
}
Validate Source Configuration
All configuration about where and when to scan data is contained within the source_config
property of a data source.
Because Nexla provides quite a few options to fine-tune exactly which slice of your data location is ingested and how, it is important to ensure the source_config
contains all the parameters required to successfully scan data. To validate the configuration of a given data source, send a POST
request to the endpoint /data_sources/<data_source_id>/config/validate
.
You can send an optional JSON config as the request body; if no config is provided, the stored source_config is used for validation.
- Nexla API
POST /data_sources/{data_source_id}/config/validate
- Nexla API
{
"status": "ok",
"output": [
{
"name": "credsEnc",
"value": null,
"errors": [
"Missing required configuration \"credsEnc\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "credsEncIv",
"value": null,
"errors": [
"Missing required configuration \"credsEncIv\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "source_type",
"value": null,
"errors": [
"Missing required configuration \"source_type\" which has no default value.",
"Invalid value null for configuration source_type: Invalid enumerator"
],
"visible": true,
"recommendedValues": []
}
]
}
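A validation response like the one above can be filtered client-side to surface only the failing parameters; a sketch over an abbreviated copy of the sample output:

```python
# Sample validation output, abbreviated from the response above.
validation = {
    "status": "ok",
    "output": [
        {"name": "credsEnc", "value": None,
         "errors": ['Missing required configuration "credsEnc" which has no default value.'],
         "visible": True, "recommendedValues": []},
        {"name": "source_type", "value": None,
         "errors": ['Missing required configuration "source_type" which has no default value.',
                    "Invalid value null for configuration source_type: Invalid enumerator"],
         "visible": True, "recommendedValues": []},
    ],
}

# Collect parameter -> error list for every entry that reported errors.
failing = {e["name"]: e["errors"] for e in validation["output"] if e["errors"]}

for name, errors in failing.items():
    print(f"{name}: {len(errors)} error(s)")
```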
Inspect Source Data
You can inspect the data that a source points to. These methods can be handy when trying to figure out the exact source_config
properties to set on a data source.
Inspect Source Content Hierarchy
You can inspect the tree structure of file and database sources to a particular depth. Note that not all data source types have a natural tree structure.
The following example shows the required request body structure for a /probe/tree call on an S3 data source.
- Nexla API
POST /data_sources/<source_id>/probe/tree
...
{
"region": "us-west-1",
"bucket": "production-s3-basin",
"prefix": "events_v2/",
"depth": 3
}
- Nexla API
{
"status": "ok",
"output": {
"events_v2": {
"2015": {
"11": {
"1": {},
"2": {},
"3": {}
},
"12": {
"1": {},
"8": {}
}
},
"2017": {
"2": {
"20": {}
}
}
}
}
}
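The nested output can be flattened into path prefixes for easier reading; a sketch that walks the sample tree above (the helper function is illustrative, not part of the API):

```python
def leaf_paths(tree, prefix=""):
    """Recursively flatten probe/tree output into slash-joined paths."""
    paths = []
    for name, subtree in tree.items():
        path = f"{prefix}{name}/"
        if subtree:
            paths.extend(leaf_paths(subtree, path))
        else:
            paths.append(path)  # an empty dict marks a leaf at the probed depth
    return paths

# Sample output from the probe/tree call above.
output = {
    "events_v2": {
        "2015": {"11": {"1": {}, "2": {}, "3": {}}, "12": {"1": {}, "8": {}}},
        "2017": {"2": {"20": {}}},
    }
}

for p in sorted(leaf_paths(output)):
    print(p)
```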
Inspect Sample File Content
You can also get metadata and sample content from a file within a source. Note that the request payload must contain the path of the file, starting from the root of the location that the data source points to.
- Nexla API
POST /data_sources/{data_source_id}/probe/files
...
{
"path" : "demo-in.nexla.com/test/Stock.json"
}
- Nexla API
{
"status": 200,
"message": "Ok",
"output": {
"format": "json",
"messages": [
{
"Stockname": "sociosqu ad",
"Total Debt": 8,
"Return on Assets": 1,
"Sector": "Mauris",
"Quick Ration": "1.2"
},
{
"Stockname": "ornare. In",
"Total Debt": 7,
"Return on Assets": 3,
"Sector": "lectus.",
"Quick Ration": "1.2"
},
{
"Stockname": "nec, diam.",
"Total Debt": 5,
"Return on Assets": 5,
"Sector": "eu",
"Quick Ration": "1.2"
}
]
},
"connection_type": "s3"
}
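One quick use of the sampled records is to approximate the schema Nexla would detect; a sketch over an abbreviated copy of the sample messages above:

```python
# Sample records from the probe/files response above (abbreviated).
messages = [
    {"Stockname": "sociosqu ad", "Total Debt": 8, "Return on Assets": 1,
     "Sector": "Mauris", "Quick Ration": "1.2"},
    {"Stockname": "ornare. In", "Total Debt": 7, "Return on Assets": 3,
     "Sector": "lectus.", "Quick Ration": "1.2"},
]

# The union of keys across sampled records approximates the detected schema.
fields = sorted({key for m in messages for key in m})
print(fields)
```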
Monitor Source
Use the methods listed in this section to monitor all ingestion history for a source.
Lifetime Ingestion Metrics
Lifetime ingestion metrics methods return information about the total data ingested by a source since its creation. Metrics include the number of records ingested as well as the estimated volume of data.
- Nexla API
GET /data_sources/5001/metrics
- Nexla API
{
"status": 200,
"metrics": {
"records": 4,
"size": 582
}
}
Aggregated Ingestion Metrics
Aggregated ingestion metrics methods return information about the total data ingested each day by a source. Metrics include the number of records ingested as well as the estimated volume of data.
Aggregations can be fetched in different aggregation units. Use the method below to fetch reports aggregated daily:
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics?aggregate=1
...
Optional Query Parameters:
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
nexla source metrics 8874
...
Optional Payload Parameters
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": [
{
"time": "2017-02-08",
"record": 53054,
"size": 12476341
},
{
"time": "2017-02-09",
"record": 66618,
"size": 15829589
},
{
"time": "2017-02-10",
"record": 25832,
"size": 6645994
}
]
}
Date (UTC) Records Volume (Bytes) Errors
------------ --------- ---------------- --------
2019-04-25 81577 410348843 0
2019-04-26 97350 460260701 0
2019-04-27 85675 392488855 0
2019-04-28 85646 391447623 0
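Daily aggregates like these can be totaled client-side; a sketch over the sample API response values above:

```python
# Daily metrics from the sample API response above.
metrics = [
    {"time": "2017-02-08", "record": 53054, "size": 12476341},
    {"time": "2017-02-09", "record": 66618, "size": 15829589},
    {"time": "2017-02-10", "record": 25832, "size": 6645994},
]

# Sum records and bytes across all returned days.
total_records = sum(m["record"] for m in metrics)
total_bytes = sum(m["size"] for m in metrics)

print(f"{total_records} records, {total_bytes} bytes over {len(metrics)} days")
```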
Sources can be configured to scan for data at a specific ingestion frequency. Use the methods below to view ingestion metrics per ingestion cycle.
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics/run_summary
...
Optional Query Parameters:
"runId": <starting from unix epoch time of ingestion events>,
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
nexla source metrics 6864
...
Optional Payload Parameters
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": {
"1539970426049": {
"records": 1364,
"size": 971330,
"errors": 0
},
"1539990426049": {
"records": 330,
"size": 235029,
"errors": 0
}
}
}
Date (UTC) Records Volume (Bytes) Errors
------------ --------- ---------------- --------
2020-04-21 9 12598 0
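Each run_summary key appears to be a Unix-epoch timestamp in milliseconds identifying an ingestion cycle; a sketch converting the sample keys to UTC datetimes:

```python
from datetime import datetime, timezone

# Per-run metrics keyed by epoch milliseconds, from the sample response above.
runs = {
    "1539970426049": {"records": 1364, "size": 971330, "errors": 0},
    "1539990426049": {"records": 330, "size": 235029, "errors": 0},
}

for run_id, stats in sorted(runs.items()):
    # Divide by 1000 to convert milliseconds to seconds before parsing.
    started = datetime.fromtimestamp(int(run_id) / 1000, tz=timezone.utc)
    print(started.strftime("%Y-%m-%dT%H:%M:%SZ"), stats["records"], "records")
```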
Granular Ingestion Status Metrics
Apart from the aggregated ingestion metrics methods above, which provide visibility into the total number of records and total volume of data ingested over a period of time, Nexla also provides methods to view granular details about individual ingestion events.
You can retrieve the ingestion status of a file source to find information such as how many files have been read fully, have failed ingestion, or are queued for the next ingestion cycle.
- Nexla API
GET /data_sources/5001/metrics/files_stats
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL"
}
- Nexla API
{
"status": 200,
"metrics": {
"data": {
"COMPLETE": 17
},
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}
}
You can view ingestion status and history for each file of a file source. The methods below return one entry per file by aggregating all ingestion events for that file.
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics/files
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
nexla source read-stats <source_id> [options]
...
options
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": {
"data": [
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 436180,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-000000000000.json",
"id": null,
"lastModified": "2018-06-04T11:31:24Z",
"error": null,
"lastIngested": "2018-06-04T11:43:11Z",
"errorCount": null
},
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 423605,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-0000000000001.json",
"id": null,
"lastModified": "2018-06-04T11:31:27Z",
"error": null,
"lastIngested": "2018-06-04T11:43:04Z",
"errorCount": null
}
],
"meta": {
"currentPage": 2,
"totalCount": 12,
"pageCount": 2
}
}
}
File Name Size Records Status Dataset ID Last Ingested (UTC)
-------------------- ------- --------- -------- ------------ ---------------------
/files/demo-1.csv 501689 792 COMPLETE 7751 2019-05-03T21:57:25Z
/files/demo-2.csv 2789267 4383 COMPLETE 7751 2019-05-03T21:57:22Z
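The meta block can drive client-side pagination; a sketch that checks whether more pages remain, using the sample meta values above (has_next_page is a hypothetical helper, not part of the API):

```python
# Pagination metadata from the sample response above.
meta = {"currentPage": 2, "totalCount": 12, "pageCount": 2}

def has_next_page(meta):
    """True while the current page precedes the last page."""
    return meta["currentPage"] < meta["pageCount"]

print(has_next_page(meta))  # False: page 2 of 2 is the last page
```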
You can also bypass per-file aggregation and fetch the full ingestion history of each file, even if it was scanned multiple times.
- Nexla API
GET /data_sources/5001/metrics/files_raw?from=2017-09-25T02:25:26&to=2017-09-28T02:25:26
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
- Nexla API
{
"status": 200,
"metrics": [
{
"dataSourceId": 1542,
"dataSetId": 4062,
"size": 3692,
"ingestionStatus": "COMPLETE",
"recordCount": 25,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124681,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:36Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124680,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:23Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124679,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:19Z",
"errorCount": null
}
]
}
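The per-file view shown earlier aggregates raw events like these; the same reduction can be sketched client-side by keeping the most recent event per file name (fields abbreviated from the sample response):

```python
# Raw ingestion events from the sample files_raw response (abbreviated).
raw = [
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124681,
     "ingestionStatus": "COMPLETE", "recordCount": 25,
     "lastIngested": "2018-06-28T21:02:36Z"},
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124680,
     "ingestionStatus": "NOT_STARTED", "recordCount": 0,
     "lastIngested": "2018-06-28T21:02:23Z"},
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124679,
     "ingestionStatus": "NOT_STARTED", "recordCount": 0,
     "lastIngested": "2018-06-28T21:02:19Z"},
]

# Keep only the most recent event per file; ISO-8601 UTC strings sort
# chronologically, so comparing lastIngested strings picks the latest.
latest = {}
for event in raw:
    current = latest.get(event["name"])
    if current is None or event["lastIngested"] > current["lastIngested"]:
        latest[event["name"]] = event

for name, event in latest.items():
    print(name, event["ingestionStatus"], event["recordCount"])
```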
You can call the methods below to retrieve source ingestion status per ingestion poll cycle.
- Nexla API
GET /data_sources/5003/metrics/files_cron
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
- Nexla API
{
"data": [
{
"dataSourceId": 5003,
"dataSetId": null,
"size": 2064,
"ingestionStatus": "COMPLETE",
"recordCount": 5,
"name": null,
"id": null,
"lastModified": "2018-09-20T04:56:44Z",
"runId": 1537394123916,
"error": null,
"lastIngested": "2018-09-20T04:57:13Z",
"errorCount": null
}
],
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}
Other Monitoring Events
See the section on Monitoring resources for methods to view source errors, notifications, quarantine samples, and audit logs.