Datasets
A Nexla dataset is a virtual representation of the data model containing schema, samples and metadata inferred from the data. The distinguishing attributes of a data set are its input schema (either detected in the source or from a parent data set) and its set of transformations. The transformations applied to the input schema, or data records matching that schema, define the outgoing schema, which may be associated with a dataset.
List All Datasets
Both the Nexla API and the Nexla CLI provide methods to list all datasets in the authenticated user's account. A successful call returns detailed information such as id, owner, the dataset's parent (a source or another dataset), schema (input and output), and the transform rules used to generate each dataset.
- Nexla API
- Nexla CLI
GET /data_sets
Example:
curl https://api.nexla.io/data_sets \
-H "Authorization: Bearer <Access-Token>" \
-H "Accept: application/vnd.nexla.api.v1+json"
nexla dataset list
- Nexla API
- Nexla CLI
[
{
"id": 8159,
"owner": {
"id": 82,
"full_name": "Kunjal Sharma",
"email": "kunjal@nexla.com",
"email_verified_at": "2018-03-06T22:24:47.000Z"
},
"org": {
"id": 1,
"name": "Nexla",
"email_domain": "nexla.com",
"email": null
},
"version": 19116,
"name": null,
"description": null,
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"parent_data_sets": [
{
"id": 8158,
"name": null,
"description": null,
"updated_at": "2019-08-12T12:59:19.000Z",
"created_at": "2019-08-12T12:54:50.000Z"
}
],
"data_sinks": [
{
"id": 5888,
"name": "test_s3",
"sink_type": "s3"
}
],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-12T12:59:22.000Z",
"created_at": "2019-08-12T12:56:00.000Z",
"tags": []
},
{
"id": 8158,
"owner": {
"id": 82,
"full_name": "Kunjal Sharma",
"email": "kunjal@nexla.com",
"email_verified_at": "2018-03-06T22:24:47.000Z"
},
"org": {
"id": 1,
"name": "Nexla",
"email_domain": "nexla.com",
"email": null
},
"version": 19115,
"name": null,
"description": null,
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"parent_data_sets": [
{
"id": 8157,
"name": "test_s3_1",
"description": "DataSet #1 detected from test_s3",
"updated_at": "2019-08-12T12:54:51.000Z",
"created_at": "2019-08-12T12:25:53.000Z"
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-12T12:59:19.000Z",
"created_at": "2019-08-12T12:54:50.000Z",
"tags": []
}
]
id status name
---- -------- -----------------------------
5081 PAUSED test_dataset
5666 INIT test1_dataset
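The list call above can also be scripted. The following is a minimal Python sketch, not an official client: it reuses the host and headers from the curl example, and the helper names (build_list_request, summarize, list_datasets) are hypothetical. The summarize helper reduces each returned object to the same three columns the CLI prints.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # host taken from the curl example


def build_list_request(access_token: str) -> urllib.request.Request:
    """Build (but do not send) a GET /data_sets request."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="GET",
    )


def summarize(datasets: list) -> list:
    """Reduce each dataset object to (id, status, name), like the CLI table."""
    return [(d["id"], d["status"], d.get("name")) for d in datasets]


def list_datasets(access_token: str) -> list:
    """Send the request and decode the JSON array (needs network access)."""
    with urllib.request.urlopen(build_list_request(access_token)) as resp:
        return json.load(resp)
```

Calling summarize(list_datasets(token)) with a valid access token would give one tuple per dataset in the account.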
List Datasets for Source
You can retrieve a list of all data sets associated with a particular data source by including a data_source_id query parameter. You can limit the list further by including the source_schema_id query parameter in the GET request. The API will return only data sets for the given data source which have the matching source_schema_id attribute.
- Nexla API
GET /data_sets?data_source_id={data_source_id}&expand=1
- Nexla API
[
{
"id": 8085,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #1 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
"type": "object",
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
},
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #2 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
"type": "object",
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
}
]
- Nexla API
GET /data_sets?data_source_id={data_source_id}&source_schema_id={source_schema_id}&expand=1
- Nexla API
[
{
"id": 8085,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18906,
"name": "1 - echo",
"description": "DataSet #1 detected from echo",
"access_roles": ["owner"],
"status": "INIT",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": 5963,
"data_source": {
"id": 5963,
...
},
"source_schema_id": "1072858493",
"source_schema": {
...
},
"parent_data_sets": [],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": null,
"transform": {
"version": 1,
"data_maps": [],
"transforms": []
},
"output_schema": {
"type": "object",
...
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:13:54.000Z",
"created_at": "2019-07-11T10:16:52.000Z",
"tags": []
}
]
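The two filtered requests above differ only in their query string. As a rough sketch, the URL can be assembled in Python as follows; the function name datasets_for_source_url is hypothetical, and the host is assumed to be the same one used for the unfiltered list call.

```python
from urllib.parse import urlencode

API_BASE = "https://api.nexla.io"  # assumed host, same as the list-all example


def datasets_for_source_url(data_source_id, source_schema_id=None, expand=True):
    """Return the GET URL for datasets detected from one data source."""
    params = {"data_source_id": data_source_id}
    # Optional second filter: only datasets with this source_schema_id.
    if source_schema_id is not None:
        params["source_schema_id"] = source_schema_id
    # expand=1 asks for full details of related resources.
    if expand:
        params["expand"] = 1
    return f"{API_BASE}/data_sets?{urlencode(params)}"
```

Using urlencode rather than string concatenation keeps the URL valid if an id ever contains characters that need escaping.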
Show One Dataset
Fetch a specific dataset accessible to the authenticated user. With the Nexla API, add an expand query parameter with a truthy value to get more details about the dataset. With this parameter, full details about the related resources (detected datasets, credentials, etc.) are also returned.
A data set has either a non-null source_schema or parent_data_set. The former refers to a schema detected in data read from the data source itself. The latter refers to the data set which precedes the current one in the pipeline of data processing.
A data set always has a transform attribute, which may be null or an empty object. This transform is applied to data incoming from the source or parent data set to produce outgoing data matching the output_schema.
A data set may have a non-empty data_samples attribute containing one or more objects matching the schema from source_schema or parent_data_set.
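The rules above can be checked on a decoded dataset object. This is a small Python sketch under assumptions: it relies on the field names that appear in the sample responses (parent_data_sets, source_schema_id, source_schema, transform), and the helper names dataset_origin and has_effective_transform are hypothetical.

```python
def dataset_origin(dataset: dict) -> str:
    """Classify a dataset as derived from a parent dataset or from a source."""
    # A non-empty parent list means the dataset follows another dataset.
    if dataset.get("parent_data_sets"):
        return "parent"
    # Otherwise look for a schema detected directly in the source data.
    if dataset.get("source_schema_id") is not None or dataset.get("source_schema"):
        return "source"
    return "unknown"


def has_effective_transform(dataset: dict) -> bool:
    """True when the transform attribute contains at least one rule."""
    transform = dataset.get("transform") or {}  # handles null and {}
    return bool(transform.get("transforms"))
```

A dataset whose transform is null, an empty object, or an object with an empty transforms list is treated here as passing records through unchanged; that interpretation is an assumption, not documented behavior.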
- Nexla API
- Nexla CLI
GET /data_sets/{data_set_id}?expand=1
nexla dataset get <dataset_id>
- Nexla API
- Nexla CLI
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
Create A Dataset
Creating a dataset requires a parent dataset, which defines the input, and a transform, which defines how that input is modified to generate the dataset's output. See the section on transforms for the different ways of creating transforms.
- Nexla API
- Nexla CLI
POST /data_sets
Example Request Body
...
{
"name": "Test Dataset",
"description": "",
"parent_data_set_id": 22186,
"has_custom_transform": true,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [],
"custom": true
}
}
- Nexla API
{
"id": 8086,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "Test Dataset",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
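As an illustration, the create request could be assembled in Python like this. It is a sketch rather than a reference client: build_create_request is a hypothetical helper, the host and the Content-Type header are assumptions, and the optional custom flag shown inside the example transform is left out.

```python
import json
import urllib.request

API_BASE = "https://api.nexla.io"  # assumed host, same as the list example


def build_create_request(access_token, name, parent_data_set_id,
                         transforms=None, custom=False):
    """Build (but do not send) a POST /data_sets request."""
    body = {
        "name": name,
        "description": "",
        "parent_data_set_id": parent_data_set_id,  # input of the new dataset
        "has_custom_transform": custom,
        "transform": {
            "version": 1,
            "data_maps": [],
            "transforms": transforms or [],  # rules applied to the input
        },
    }
    return urllib.request.Request(
        f"{API_BASE}/data_sets",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
            "Content-Type": "application/json",  # assumption
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; the response is expected to be the new dataset object.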
Update a Dataset
Nexla API supports methods to update any property of an existing dataset the authenticated user has access to.
- Nexla API
PUT /data_sets/5023
Example Request Body
...
{
"name": "Test Dataset"
}
- Nexla API
{
"id": 5023,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform": {
"version": 1,
"data_maps": [],
"transforms": [
{
...
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
Update with Custom Transform
Data set transforms are normally constructed through the schema editor in the Nexla UI, which contains logic for translating user actions on the data set attributes into transform rule syntax.
You can set a transform directly on a data set by including it in your POST or PUT input. Note that the has_custom_transform attribute should be omitted or set to false if the transform you're saving is compatible with the schema editor in the Nexla UI. If your transform contains syntax or modifiers that are not supported in the UI, set has_custom_transform to true to disable the transform tools in the schema editor (which might otherwise override or delete your custom modifications).
You can also specify the transform_id of a previously created transform instead of the transform object.
- Nexla API
PUT /data_sets/{dataset_id}
Example Request Body
...
{
"has_custom_transform": true,
"transform" : {
"version" : 1,
"data_maps" : [],
"transforms" : [
{
"operation" : "shift",
"spec" : {
"time": "timestamp",
"userId": "userId",
"eventType": "eventType"
}
}
]
}
}
- Nexla API
{
"id": 5023,
"owner": {
"id": 82,
...
},
"org": {
"id": 1,
...
},
"version": 18914,
"name": "echo",
"description": "",
"access_roles": ["owner"],
"status": "ACTIVE",
"sample_service_id": null,
"source_path": {},
"public": false,
"managed": false,
"data_source_id": null,
"source_schema_id": null,
"source_schema": {},
"parent_data_sets": [
{
"id": 8085,
...
}
],
"data_sinks": [],
"sharers": [],
"external_sharers": [],
"has_custom_transform": false,
"transform_id": 10858,
"transform" : {
"version" : 1,
"data_maps" : [],
"transforms" : [
{
"operation" : "shift",
"spec" : {
"time": "timestamp",
"userId": "userId",
"eventType": "eventType"
}
}
]
},
"output_schema": {
"type": "object",
"properties": {
...
},
"$schema": "http://json-schema.org/draft-04/schema#",
"$schema-id": 734129478
},
"output_validation_schema": {},
"output_schema_validation_enabled": false,
"generate_output_schema": false,
"updated_at": "2019-08-01T12:57:54.000Z",
"created_at": "2019-07-11T10:16:56.000Z",
"tags": []
}
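The choice between an inline transform object and a transform_id, together with the has_custom_transform flag, can be captured in a small body builder. This Python sketch uses a hypothetical helper name (build_transform_update); applying the flag in the transform_id case as well is an assumption.

```python
def build_transform_update(transform=None, transform_id=None, custom=False):
    """Return the JSON body for PUT /data_sets/{dataset_id}."""
    # Exactly one of the two ways of specifying the transform is allowed here.
    if (transform is None) == (transform_id is None):
        raise ValueError("pass exactly one of transform or transform_id")
    body = {}
    # Omit the flag for UI-compatible transforms; set it to true otherwise.
    if custom:
        body["has_custom_transform"] = True
    if transform_id is not None:
        body["transform_id"] = transform_id
    else:
        body["transform"] = transform
    return body
```

For example, calling it with the shift transform from the request body above and custom=True reproduces that body, while build_transform_update(transform_id=10858) reuses an existing transform.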
Delete A Dataset
Nexla API supports methods to delete any dataset that the authenticated user has administrative/ownership rights to.
If the dataset is paused and does not have any associated downstream resources, Nexla can delete the dataset safely. A successful request to delete a dataset returns OK (200) with no response body.
If the dataset is active or there are downstream resources that would be impacted, Nexla does not trigger deletion and instead returns a failure message explaining why the deletion was denied.
- Nexla API
DELETE /data_sets/{data_set_id}
- Nexla API
Empty response with status 200 for success
Error response with reason if dataset could not be deleted
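A caller has to distinguish the two outcomes described above. The sketch below shows one way to do that in Python; the helper names are hypothetical, the host is assumed, and because the error response format is not specified here, the failure body is returned as raw text.

```python
import urllib.error
import urllib.request

API_BASE = "https://api.nexla.io"  # assumed host, same as the list example


def build_delete_request(access_token, data_set_id):
    """Build (but do not send) a DELETE /data_sets/{data_set_id} request."""
    return urllib.request.Request(
        f"{API_BASE}/data_sets/{data_set_id}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="DELETE",
    )


def interpret_delete(status: int, body: bytes):
    """Map a response to (deleted, reason): 200 means success, no body."""
    if status == 200:
        return True, None
    return False, body.decode("utf-8", errors="replace")


def delete_dataset(access_token, data_set_id):
    """Send the request (needs network access) and interpret the result."""
    try:
        with urllib.request.urlopen(build_delete_request(access_token, data_set_id)) as resp:
            return interpret_delete(resp.status, resp.read())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx responses; the body carries the reason.
        return interpret_delete(err.code, err.read())
```

Since deletion is refused for active datasets, a script would typically check the dataset's status before calling delete_dataset.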
Activate and Pause Data Sources
Trigger Nexla to start ingesting data immediately by calling the activation method on a source. Note that a Nexla source usually contains parameters to schedule automatic ingestion based on cron intervals or completion of other jobs. This activation method triggers an ingestion in addition to the scheduled automatic source ingestion.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/activate
nexla source activate <source_id>
On the flip side, call the pause method to immediately stop ingestion on that source. Any subsequent scheduled ingestion intervals will be ignored as long as the source is paused.
- Nexla API
- Nexla CLI
PUT /data_sources/{data_source_id}/pause
nexla source pause <source_id>
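Both calls share the same shape, so a single helper can build either one. This is a minimal Python sketch with a hypothetical function name (build_source_action_request) and an assumed host; it only constructs the request and leaves sending it to the caller.

```python
import urllib.request

API_BASE = "https://api.nexla.io"  # assumed host, same as the list example


def build_source_action_request(access_token, data_source_id, action):
    """Build a PUT /data_sources/{id}/activate or .../pause request."""
    if action not in ("activate", "pause"):
        raise ValueError("action must be 'activate' or 'pause'")
    return urllib.request.Request(
        f"{API_BASE}/data_sources/{data_source_id}/{action}",
        headers={
            "Authorization": f"Bearer {access_token}",
            "Accept": "application/vnd.nexla.api.v1+json",
        },
        method="PUT",
    )
```

Validating the action name up front avoids sending a PUT to an endpoint that is not described here.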
Monitor Dataset
See the section on Monitoring resources for methods to view dataset errors, notifications, quarantine samples, and audit logs.