Reading Data
Data Sources
Data source resources describe sources of data to be ingested, including details about source type, ingestion schedule, associated data sets, and credentials for accessing the source. No matter where the data to be ingested resides, all information about where, when, and how to ingest it is contained in these Nexla resources.
A data source may have one or more datasets associated with it. These correspond to distinct schemas detected by Nexla in the source.
List All Sources
Both the Nexla API and the Nexla CLI support methods to list all sources in the authenticated user's account. A successful call returns detailed information about every source, including id, owner, type, credentials, activation status, and ingestion configuration.
- Nexla API
- Nexla CLI
GET /data_sources
Example:
curl https://api.nexla.io/data_sources \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
nexla source list
- Nexla API
- Nexla CLI
[
{
"id": 5002,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example data source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5004,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "API",
"source_type": "api_push",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
]
id status source_type name location credentials_name
------ ---------- ------------ -------- ----------------------------- ------------------
8874 PAUSED ftp test ftp://test-regression/test sftp_test
9989 ACTIVE s3 test1 s3://test-nexla.com/test s3_test
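As a sketch, the same list call can be issued from Python's standard library; the endpoint and headers mirror the curl example above, while the access token is a placeholder you would substitute:

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # base URL from the curl example above
ACCESS_TOKEN = "<Access-Token>"    # placeholder; substitute a real token

# Build (but do not send) the GET /data_sources request with the
# Authorization and Accept headers shown in the curl example.
req = urllib.request.Request(
    f"{API_BASE}/data_sources",
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        "Accept": "application/vnd.nexla.api.v1+json",
    },
)

print(req.full_url)
# Sending it would be: response = urllib.request.urlopen(req)
```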
Show One Source
Fetch a specific source accessible to the authenticated user. A successful call returns detailed information about that source, including id, owner, type, credentials, activation status, and ingestion configuration.
When using the Nexla API, add an expand
query param with a truthy value to get more details about the source. With this parameter, full details about the related resources (detected datasets, credentials, etc.) are also returned.
- Nexla API
- Nexla CLI
GET /data_sources/{data_source_id}
Example
curl https://api.nexla.io/data_sources/5003 \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
nexla source get <source_id>
- Nexla API
- Nexla CLI
{
"id": 5003,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example Lat/Lng source",
"description": null,
"status": "ACTIVE",
"data_sets": [
{
"id": 5011,
"version": 1,
"name": null,
"description": null,
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
],
"ingest_method": "POLL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": {
"id": 5001,
...
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
},
"updated_at": "2016-10-28T21:48:15.000Z",
"created_at": "2016-10-28T21:48:15.000Z"
}
{
"source_type": "ftp",
"source_config": {
"bucket": "/",
"prefix": "demo/source/test",
"start.cron": "0 55 10 1/1 * ? *",
"advanced_settings": "Auto Detect",
"path_exclusions": false,
"schema.detection.once": true,
"allowGrouping": false
},
"data_credentials": "<5794: sftp_cs>",
"name": "test",
"ingest_method": "POLL"
}
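To request expanded details, the expand parameter can be appended to the source URL; a minimal sketch, with the source id taken from the sample response above:

```python
from urllib.parse import urlencode

API_BASE = "https://api.nexla.io"   # as in the curl examples
source_id = 5003                    # id from the sample response

# expand with a truthy value asks the API to inline full details of
# related resources (detected datasets, credentials, etc.).
url = f"{API_BASE}/data_sources/{source_id}?" + urlencode({"expand": 1})
print(url)  # https://api.nexla.io/data_sources/5003?expand=1
```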
Create A Source
Both the Nexla API and the Nexla CLI support methods to create a new data source in the authenticated user's account. The only required attribute in the input object is the data source name; all other attributes are set to default values.
- Nexla API
- Nexla CLI
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3"
}
nexla source create --payload='
{
"name": "Example S3 Data Source",
"source_type": "s3"
}'
- Nexla API
- Nexla CLI
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Example S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-12-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Successfully created source, Created Source ID is --> 7794
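The create call can be sketched in Python as well; the endpoint and payload are from the example above, while the placeholder token and the Content-Type header are assumptions of this sketch:

```python
import json
import urllib.request

# Minimal payload from the example request body above.
payload = {"name": "Example S3 Data Source", "source_type": "s3"}

# Build (but do not send) the POST /data_sources request; the body is
# JSON-encoded and labeled as such via Content-Type.
req = urllib.request.Request(
    "https://api.nexla.io/data_sources",
    data=json.dumps(payload).encode("utf-8"),
    method="POST",
    headers={
        "Authorization": "Bearer <Access-Token>",  # placeholder
        "Content-Type": "application/json",
        "Accept": "application/vnd.nexla.api.v1+json",
    },
)

print(req.get_method(), req.full_url)
print(req.data.decode())
```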
Create with Credentials
Data sources usually require credentials for making a connection and ingesting data. You can refer to an existing data_credentials resource or create a new one in the POST call to /data_sources. In this example, an existing credentials object is referenced:
- Nexla API
POST /data_sources
Example Request Body
...
{
"name": "Example S3 Data Source",
"source_type": "s3",
"data_credentials": 5001
}
Here, the required attributes for creating a new data_credentials resource are included in the request:
- Nexla API
POST /data_sources
Example Request Body
...
{
"name": "Example FTP Data Source",
"source_type": "ftp",
"data_credentials": {
"name": "FTP CREDS",
"credentials_type": "ftp",
"credentials_version": "1",
"credentials": {
"credentials_type": "ftp",
"account_id": "XYZ",
"password": "123"
}
}
}
In either case, a successful POST
to /data_sources
with credential information returns a response that includes the full data source and the encrypted form of its associated data_credentials resource:
- Nexla API
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Update A Source
The Nexla API supports methods to update any property of an existing source that the authenticated user has access to.
- Nexla API
PUT /data_sources/5023
Example Request Body
...
{
"name": "Updated S3 Data Source",
}
- Nexla API
{
"id": 5023,
"owner": {
"id": 2,
"full_name": "Jeff Williams"
},
"org": {
"id": 1,
"name": "Nexla"
},
"access_roles": ["owner"],
"name": "Updated S3 Data Source",
"description": null,
"status": "INIT",
"data_sets": [],
"ingest_method": "PULL",
"source_type": "s3",
"source_format": "JSON",
"source_config": null,
"poll_schedule": null,
"data_credentials": null,
"updated_at": "2016-13-06T20:18:58.662Z",
"created_at": "2016-12-06T20:18:58.662Z"
}
Delete A Source
The Nexla API supports methods to delete any source that the authenticated user has administrative or ownership rights to.
If the source is paused and none of its detected datasets have associated downstream resources, Nexla can delete the source safely. A successful request to delete a data source returns Ok (200) with no response body.
If the source is active, or there are downstream resources that would be impacted, Nexla will not delete the source and instead returns a failure message explaining why deletion was denied.
- Nexla API
DELETE /data_sources/{data_source_id}
- Nexla API
Empty response with status 200 for success
Error response with reason if source could not be deleted
Control ingestion
Activate and Pause Source
Trigger Nexla to start ingesting data immediately by calling the activation method on a source. Note that a Nexla source usually contains parameters that schedule automatic ingestion based on cron intervals or on completion of other jobs. The activation method triggers an ingestion in addition to the scheduled automatic ingestion.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/activate
nexla source activate <source_id>
Conversely, call the pause method to immediately stop ingestion on a source. Any subsequent scheduled ingestion intervals are ignored for as long as the source is paused.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/pause
nexla source pause <source_id>
Reingest Files
For file-type sources, Nexla can be configured to reingest an already scanned file. This is useful if the file originally failed ingestion due to file errors and has since been modified.
To re-ingest files for a data source, issue a POST
request to the endpoint /data_sources/<data_source_id>/file/ingest
with the file path as the body. The file path must start at the root of the location that the source points to.
- Nexla API
POST /data_sources/{data_source_id}/file/ingest
...
Example Payload
{"file":"xls-merge/PostLog_TableOnlyXLS.xlsx"}
- Nexla API
{
"status": "ok"
}
Validate Source Configuration
All configuration about where and when to scan data is contained within the source_config
property of a data source.
Because Nexla provides quite a few options to fine-tune exactly which slice of your data location is ingested and how, it is important to ensure the source_config
contains all the parameters required to successfully scan data. To validate the configuration of a given data source, send a POST
request to the endpoint /data_sources/<data_source_id>/config/validate
.
You can send an optional JSON config as the request body; if no config is provided, the stored source_config is used for validation.
- Nexla API
POST /data_sources/{data_source_id}/config/validate
- Nexla API
{
"status": "ok",
"output": [
{
"name": "credsEnc",
"value": null,
"errors": [
"Missing required configuration \"credsEnc\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "credsEncIv",
"value": null,
"errors": [
"Missing required configuration \"credsEncIv\" which has no default value."
],
"visible": true,
"recommendedValues": []
},
{
"name": "source_type",
"value": null,
"errors": [
"Missing required configuration \"source_type\" which has no default value.",
"Invalid value null for configuration source_type: Invalid enumerator"
],
"visible": true,
"recommendedValues": []
}
]
}
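A validation response like the one above can be filtered client-side to surface only the failing parameters; a sketch over an abbreviated copy of the sample output:

```python
# Sample validation output, abbreviated from the response above.
validation = {
    "status": "ok",
    "output": [
        {"name": "credsEnc", "value": None,
         "errors": ['Missing required configuration "credsEnc" which has no default value.'],
         "visible": True, "recommendedValues": []},
        {"name": "source_type", "value": None,
         "errors": ['Missing required configuration "source_type" which has no default value.',
                    "Invalid value null for configuration source_type: Invalid enumerator"],
         "visible": True, "recommendedValues": []},
    ],
}

# Collect parameter -> error list for every entry that reported errors.
failing = {e["name"]: e["errors"] for e in validation["output"] if e["errors"]}

for name, errors in failing.items():
    print(f"{name}: {len(errors)} error(s)")
```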
Inspect Source Data
You can inspect the data that a source points to. These methods can be handy when trying to figure out the exact source_config
properties to set on a data source.
Inspect Source Content Hierarchy
You can inspect the tree structure of file and database sources to a particular depth. Note that not all data source types have a natural tree structure.
The following example shows the required request body structure for a /probe/tree call on an S3 data source.
- Nexla API
POST /data_sources/<source_id>/probe/tree
...
{
"region": "us-west-1",
"bucket": "production-s3-basin",
"prefix": "events_v2/",
"depth": 3
}
- Nexla API
{
"status": "ok",
"output": {
"events_v2": {
"2015": {
"11": {
"1": {},
"2": {},
"3": {}
},
"12": {
"1": {},
"8": {}
}
},
"2017": {
"2": {
"20": {}
}
}
}
}
}
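The nested output can be flattened into path prefixes for easier reading; a sketch that walks the sample tree above (the helper function is illustrative, not part of the API):

```python
def leaf_paths(tree, prefix=""):
    """Recursively flatten probe/tree output into slash-joined paths."""
    paths = []
    for name, subtree in tree.items():
        path = f"{prefix}{name}/"
        if subtree:
            paths.extend(leaf_paths(subtree, path))
        else:
            paths.append(path)  # an empty dict marks a leaf at the probed depth
    return paths

# Sample output from the probe/tree call above.
output = {
    "events_v2": {
        "2015": {"11": {"1": {}, "2": {}, "3": {}}, "12": {"1": {}, "8": {}}},
        "2017": {"2": {"20": {}}},
    }
}

for p in sorted(leaf_paths(output)):
    print(p)
```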
Inspect Sample File Content
You can also get metadata and sample content from a file within a source. Note that the request payload must contain the path of the file, starting from the root of the location that the data source points to.
- Nexla API
POST /data_sources/{data_source_id}/probe/files
...
{
"path" : "demo-in.nexla.com/test/Stock.json"
}
- Nexla API
{
"status": 200,
"message": "Ok",
"output": {
"format": "json",
"messages": [
{
"Stockname": "sociosqu ad",
"Total Debt": 8,
"Return on Assets": 1,
"Sector": "Mauris",
"Quick Ration": "1.2"
},
{
"Stockname": "ornare. In",
"Total Debt": 7,
"Return on Assets": 3,
"Sector": "lectus.",
"Quick Ration": "1.2"
},
{
"Stockname": "nec, diam.",
"Total Debt": 5,
"Return on Assets": 5,
"Sector": "eu",
"Quick Ration": "1.2"
}
]
},
"connection_type": "s3"
}
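One quick use of the sampled records is to approximate the schema Nexla would detect; a sketch over an abbreviated copy of the sample messages above:

```python
# Sample records from the probe/files response above (abbreviated).
messages = [
    {"Stockname": "sociosqu ad", "Total Debt": 8, "Return on Assets": 1,
     "Sector": "Mauris", "Quick Ration": "1.2"},
    {"Stockname": "ornare. In", "Total Debt": 7, "Return on Assets": 3,
     "Sector": "lectus.", "Quick Ration": "1.2"},
]

# The union of keys across sampled records approximates the detected schema.
fields = sorted({key for m in messages for key in m})
print(fields)
```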
Monitor Source
Use the methods listed in this section to monitor all ingestion history for a source.
Lifetime Ingestion Metrics
Lifetime ingestion metrics methods return information about the total data ingested by a source since its creation. Metrics include the number of records ingested as well as the estimated volume of data.
- Nexla API
GET /data_sources/5001/metrics
- Nexla API
{
"status": 200,
"metrics": {
"records": 4,
"size": 582
}
}
Aggregated Ingestion Metrics
Aggregated ingestion metrics methods return information about the total data ingested each day by a source. Metrics include the number of records ingested as well as the estimated volume of data.
Aggregations can be fetched in different aggregation units. Use the method below to fetch reports aggregated daily:
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics?aggregate=1
...
Optional Query Parameters:
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
nexla source metrics 8874
...
Optional Payload Parameters
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": [
{
"time": "2017-02-08",
"record": 53054,
"size": 12476341
},
{
"time": "2017-02-09",
"record": 66618,
"size": 15829589
},
{
"time": "2017-02-10",
"record": 25832,
"size": 6645994
}
]
}
Date (UTC) Records Volume (Bytes) Errors
------------ --------- ---------------- --------
2019-04-25 81577 410348843 0
2019-04-26 97350 460260701 0
2019-04-27 85675 392488855 0
2019-04-28 85646 391447623 0
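Daily aggregates like these can be totaled client-side; a sketch over the sample API response values above:

```python
# Daily metrics from the sample API response above.
metrics = [
    {"time": "2017-02-08", "record": 53054, "size": 12476341},
    {"time": "2017-02-09", "record": 66618, "size": 15829589},
    {"time": "2017-02-10", "record": 25832, "size": 6645994},
]

# Sum records and bytes across all returned days.
total_records = sum(m["record"] for m in metrics)
total_bytes = sum(m["size"] for m in metrics)

print(f"{total_records} records, {total_bytes} bytes over {len(metrics)} days")
```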
Sources can be configured to scan for data at a specific ingestion frequency. Use the methods below to view ingestion metrics per ingestion cycle.
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics/run_summary
...
Optional Query Parameters:
"runId": <starting from unix epoch time of ingestion events>,
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"page": <integer page number>,
"size": <number of entries in page>
nexla source metrics 6864
...
Optional Payload Parameters
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": {
"1539970426049": {
"records": 1364,
"size": 971330,
"errors": 0
},
"1539990426049": {
"records": 330,
"size": 235029,
"errors": 0
}
}
}
Date (UTC) Records Volume (Bytes) Errors
------------ --------- ---------------- --------
2020-04-21 9 12598 0
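Each run_summary key appears to be a Unix-epoch timestamp in milliseconds identifying an ingestion cycle; a sketch converting the sample keys to UTC datetimes:

```python
from datetime import datetime, timezone

# Per-run metrics keyed by epoch milliseconds, from the sample response above.
runs = {
    "1539970426049": {"records": 1364, "size": 971330, "errors": 0},
    "1539990426049": {"records": 330, "size": 235029, "errors": 0},
}

for run_id, stats in sorted(runs.items()):
    # Divide by 1000 to convert milliseconds to seconds before parsing.
    started = datetime.fromtimestamp(int(run_id) / 1000, tz=timezone.utc)
    print(started.strftime("%Y-%m-%dT%H:%M:%SZ"), stats["records"], "records")
```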
Granular Ingestion Status Metrics
Apart from the aggregated ingestion metrics methods above, which provide visibility into the total number of records and total volume of data ingested over a period of time, Nexla also provides methods to view granular details about individual ingestion events.
You can retrieve the ingestion status of a file source to find information such as how many files have been read fully, have failed ingestion, or are queued for the next ingestion cycle.
- Nexla API
GET /data_sources/5001/metrics/files_stats
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL"
}
- Nexla API
{
"status": 200,
"metrics": {
"data": {
"COMPLETE": 17
},
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}
}
You can view ingestion status and history for each file of a file source. The methods below return one entry per file by aggregating all ingestion events for that file.
- Nexla API
- Nexla CLI
GET /data_sources/5001/metrics/files
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
nexla source read-stats <source_id> [options]
...
options
-d,--days (int) Number of days ago to get the metrics, default is 7
-s,--start (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format
-e,--end (string) UTC datetime in '%Y-%m-%dT%H:%M:%S' format, default is current time. -s/--start required for this option.
- Nexla API
- Nexla CLI
{
"status": 200,
"metrics": {
"data": [
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 436180,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-000000000000.json",
"id": null,
"lastModified": "2018-06-04T11:31:24Z",
"error": null,
"lastIngested": "2018-06-04T11:43:11Z",
"errorCount": null
},
{
"dataSourceId": 1110,
"dataSetId": 2881,
"size": 423605,
"ingestionStatus": "COMPLETE",
"recordCount": 1000,
"name": "/2017/05/05/22/sub-in-5038-00000-0000000000001.json",
"id": null,
"lastModified": "2018-06-04T11:31:27Z",
"error": null,
"lastIngested": "2018-06-04T11:43:04Z",
"errorCount": null
}
],
"meta": {
"currentPage": 2,
"totalCount": 12,
"pageCount": 2
}
}
}
File Name Size Records Status Dataset ID Last Ingested (UTC)
-------------------- ------- --------- -------- ------------ ---------------------
/files/demo-1.csv 501689 792 COMPLETE 7751 2019-05-03T21:57:25Z
/files/demo-2.csv 2789267 4383 COMPLETE 7751 2019-05-03T21:57:22Z
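The meta block can drive client-side pagination; a sketch that checks whether more pages remain, using the sample meta values above (has_next_page is a hypothetical helper, not part of the API):

```python
# Pagination metadata from the sample response above.
meta = {"currentPage": 2, "totalCount": 12, "pageCount": 2}

def has_next_page(meta):
    """True while the current page precedes the last page."""
    return meta["currentPage"] < meta["pageCount"]

print(has_next_page(meta))  # False: page 2 of 2 is the last page
```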
You can also bypass per-file aggregation and fetch the full ingestion history of each file, even if it was scanned multiple times.
- Nexla API
GET /data_sources/5001/metrics/files_raw?from=2017-09-25T02:25:26&to=2017-09-28T02:25:26
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
- Nexla API
{
"status": 200,
"metrics": [
{
"dataSourceId": 1542,
"dataSetId": 4062,
"size": 3692,
"ingestionStatus": "COMPLETE",
"recordCount": 25,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124681,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:36Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124680,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:23Z",
"errorCount": null
},
{
"dataSourceId": 1542,
"dataSetId": null,
"size": 0,
"ingestionStatus": "NOT_STARTED",
"recordCount": 0,
"name": "12-06-2018/Source/D/Books_blank.json",
"id": 6124679,
"lastModified": "2018-06-12T06:56:11Z",
"error": null,
"lastIngested": "2018-06-28T21:02:19Z",
"errorCount": null
}
]
}
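The per-file view shown earlier aggregates raw events like these; the same reduction can be sketched client-side by keeping the most recent event per file name (fields abbreviated from the sample response):

```python
# Raw ingestion events from the sample files_raw response (abbreviated).
raw = [
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124681,
     "ingestionStatus": "COMPLETE", "recordCount": 25,
     "lastIngested": "2018-06-28T21:02:36Z"},
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124680,
     "ingestionStatus": "NOT_STARTED", "recordCount": 0,
     "lastIngested": "2018-06-28T21:02:23Z"},
    {"name": "12-06-2018/Source/D/Books_blank.json", "id": 6124679,
     "ingestionStatus": "NOT_STARTED", "recordCount": 0,
     "lastIngested": "2018-06-28T21:02:19Z"},
]

# Keep only the most recent event per file; ISO-8601 UTC strings sort
# chronologically, so comparing lastIngested strings picks the latest.
latest = {}
for event in raw:
    current = latest.get(event["name"])
    if current is None or event["lastIngested"] > current["lastIngested"]:
        latest[event["name"]] = event

for name, event in latest.items():
    print(name, event["ingestionStatus"], event["recordCount"])
```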
You can call the methods below to retrieve source ingestion status per ingestion poll cycle.
- Nexla API
GET /data_sources/5003/metrics/files_cron
...
Optional Parameters
{
"from": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"to": <UTC datetime in '%Y-%m-%dT%H:%M:%S' format>,
"status": "one of NOT_STARTED/IN_PROGRESS/COMPLETE/ERROR/PARTIAL",
"page": <integer page number>,
"size": <number of entries in page>
}
- Nexla API
{
"data": [
{
"dataSourceId": 5003,
"dataSetId": null,
"size": 2064,
"ingestionStatus": "COMPLETE",
"recordCount": 5,
"name": null,
"id": null,
"lastModified": "2018-09-20T04:56:44Z",
"runId": 1537394123916,
"error": null,
"lastIngested": "2018-09-20T04:57:13Z",
"errorCount": null
}
],
"meta": {
"currentPage": 1,
"totalCount": 1,
"pageCount": 1
}
}
Other Monitoring Events
See the section on Monitoring resources for methods to view source errors, notifications, quarantine samples, and audit logs.