
Working with Data Files

Structured files that are intended for loading into a DX Graph Collection are referred to as Data Files. Data Files are uploaded into Buckets, where they can be validated for correctness and loaded into a Collection. DX Graph provides multiple types of Jobs for working with Data Files:

  • Validate Data Files
  • Import Data Files
  • Transform Data Files

All of the above Data File jobs perform validation. If validation errors are found during the validation of a Data File, a Validation Errors file is created with the following naming convention: {originalFilename}.{YYYYMMDD_HHmmss}.errors.jsonl, e.g. products.csv.20230926_160054.errors.jsonl. The Validation Errors file is generated alongside the original Data File in the Bucket.

Validation consists of the following steps:

  • Parse Validation: The Data File is parsed according to the specified Parse Options. If the Data File cannot be parsed, the Data File is considered invalid.
  • Source Schema Validation: The parsed Data File is validated against the specified Source Schema. If the Data File does not conform to the Source Schema, the Data File is considered invalid.
  • Transformation: While not strictly a validation step, transformation happens in the same pipeline. The Data File is transformed according to the specified Transformers, a list of transforms that are applied to each record in the Data File. Each transformation may optionally have a schema associated with it; if a schema is specified, the transformed record is validated against it, and if the transformed record does not conform, the Data File is considered invalid.
  • Target Schema Validation: The transformed Data File is validated against the specified Target Schema. A Target Schema can be specified manually or, more typically, taken from a Collection's schema. If the Data File does not conform to the Target Schema, the Data File is considered invalid. A worked example of these steps follows the list.
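
To make the steps concrete, here is a minimal walkthrough of one record passing through the pipeline, assuming a comma-delimited file and the JavaScript transformer shown later in this document (field names and values are illustrative):

Source line from products.csv:

P123,Acme Phone,Acme,299.99

Parsed record, validated against the Source Schema:

{ "product_id": "P123", "name": "Acme Phone", "brand": "Acme", "price": 299.99 }

After the transformer _.set(data, 'category', 'electronics'), validated against the Target Schema:

{ "product_id": "P123", "name": "Acme Phone", "brand": "Acme", "price": 299.99, "category": "electronics" }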


Job Types

Validate Data Files

The Validate Data Files job ensures that a set of Data Files (specified by a filename pattern) is parseable and conforms to a specified schema. Depending on whether the Data Files were successfully validated, they can be moved to different Buckets. A sample configuration is sketched after the parameter table below.

Job Parameters

Parameter | Description
Source Data Bucket | The DX Graph Bucket that contains the Data Files specified by Filename Pattern.
Validated Data Bucket | If specified, successfully validated Data Files are moved to this Bucket.
Invalid Data Bucket | If specified, Data Files that fail validation are moved to this Bucket.
Filename Pattern | Groups files into a set of files to be processed together, e.g. products_*.csv.
Source Schema | The JSON Schema that the Data Files must conform to.
Record Identifier Field | The field that uniquely identifies the records in the Data Files. Validation errors use this to point out erroneous records.
Parse Options | Configures how to parse the Source Data Files, e.g. whether the files are in DELIMITED, EXCEL, or JSONL format.
Collection Code | The Collection to import Data Files into.
Transformers | A list of transformations applied to each validated source record.
Target Schema | JSON Schema that is applied to the transformed records. Default: the schema of the Collection specified by Collection Code.
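
A sketch of how these parameters might be filled in for a product feed. The Bucket-related key names below are hypothetical, for illustration only; the remaining keys mirror the Analyze Data File request shown later in this document:

{
  "sourceDataBucket": "incoming",
  "validatedDataBucket": "validated",
  "invalidDataBucket": "invalid",
  "filenamePattern": "products_*.csv",
  "sourceSchema": {
    "type": "object",
    "properties": { "product_id": { "type": "string" } },
    "required": ["product_id"]
  },
  "recordIdentifierField": "product_id",
  "parseOptions": { "format": "DELIMITED", "delimiter": "," },
  "collectionCode": "product"
}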

Import Data Files

The Import Data Files job validates Data Files (like the Validate Data Files job) and imports them into a Collection. Depending on whether the Data Files were successfully imported, they can be moved to different Buckets.

Job Parameters

Parameter | Description
Incoming Data Bucket | The DX Graph Bucket that contains the Data Files specified by Filename Pattern.
Skipped Bucket | A file that is skipped because Process Last Matched File is true is moved here. It is not validated.
Processed Bucket | A file that is fully imported into a Collection with no validation errors is moved here. A file that has validation errors when Skip Invalid Records is true is also moved here, along with the corresponding error files.
Invalid Data Bucket | Mandatory if Skip Invalid Records is false. A file that has any validation errors when Skip Invalid Records is false is moved here along with the corresponding error files.
Filename Pattern | Groups files into a set of files to be processed together, e.g. products_*.csv.
Source Schema | The JSON Schema that the Data Files must conform to.
Record Identifier Field | The field that uniquely identifies the records in the Data Files. Validation errors use this to point out erroneous records.
Parse Options | Configures how to parse the Source Data Files, e.g. whether the files are in DELIMITED, EXCEL, or JSONL format.
Collection Code | The Collection to import Data Files into.
Transformers | A list of transformations applied to each validated source record.
Target Schema | JSON Schema that is applied to the transformed records. Default: the schema of the Collection specified by Collection Code.
Skip Invalid Records | Default: false. When false, no data is imported if any validation errors occur.
Process Last Matched File | Data Files are scanned in alphabetical order, so you can use filenames to control the processing order. If set to true, only the last matching Data File is processed and earlier files are skipped (see the example after this table).
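
For example, suppose Process Last Matched File is true and the Incoming Data Bucket contains three files matching products_*.csv (filenames are illustrative). Scanning alphabetically, only the last file is processed:

products_20230924.csv  -> moved to the Skipped Bucket (not validated)
products_20230925.csv  -> moved to the Skipped Bucket (not validated)
products_20230926.csv  -> validated and imported; moved to the Processed Bucket on success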

Transform Data Files

The Transform Data Files job validates Data Files (like the Validate Data Files job) and transforms them.

Job Parameters

Parameter | Description
Source Data Bucket | The DX Graph Bucket that contains the Data Files specified by Filename Pattern.
Target Data Bucket | A file with no validation errors, or a file that has validation errors when Skip Invalid Records is true, is moved here along with the transformed file and any corresponding error files. Transformed files are in JSONL format and are named {{sourceFilename}}.YYYYMMDD_HHmmss.transformed.jsonl, where YYYYMMDD_HHmmss is the timestamp of when the file was generated (see the example after this table).
Invalid Data Bucket | Mandatory if Skip Invalid Records is false. A file that has any validation errors when Skip Invalid Records is false is moved here along with the corresponding error files.
Filename Pattern | Groups files into a set of files to be processed together, e.g. products_*.csv.
Source Schema | The JSON Schema that the Data Files must conform to.
Record Identifier Field | The field that uniquely identifies the records in the Data Files. Validation errors use this to point out erroneous records.
Parse Options | Configures how to parse the Source Data Files, e.g. whether the files are in DELIMITED, EXCEL, or JSONL format.
Transformers | A list of transformations applied to each validated source record.
Target Schema | JSON Schema that is applied to the transformed records. Default: the schema of the Collection specified by Collection Code.
Skip Invalid Records | Default: false. When false, no transformed output is produced if any validation errors occur.
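
As an illustration of the output naming, transforming products.csv at 2023-09-26 16:00:54 would produce the following JSONL file in the Target Data Bucket (record contents are illustrative):

products.csv.20230926_160054.transformed.jsonl:

{"product_id": "P123", "name": "Acme Phone", "brand": "Acme", "price": 299.99, "category": "electronics"}
{"product_id": "P124", "name": "Acme Tablet", "brand": "Acme", "price": 499.99, "category": "electronics"}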

The following diagram shows how the Data File jobs fit into the overall Data File processing workflow.

[Diagram: Data File processing workflow]

Analyze Data File Endpoint

The Analyze Data File endpoint allows you to specify many of the parameters above and see the results immediately, which helps when configuring validation and import jobs. This API returns the following information:

Field | Description
nbrIssues | The total number of issues found in the entire source file.
nbrFinalRecords | The total number of valid records in the entire source file.
issues | The first 10 records that have issues.
validRecords | The first 10 records that are valid.

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_analyze
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
  "filename": "products.csv",
  "sourceSchema": {
    "type": "object",
    "properties": {
      "product_id": { "type": "string" },
      "name": { "type": "string" },
      "brand": { "type": "string" },
      "price": { "type": "number" }
    },
    "required": ["product_id"]
  },
  "recordIdentifierField": "product_id",
  "transformers": [
    {
      "type": "javascript",
      "config": {
        "expression": "_.set(data, 'category', 'electronics')"
      }
    }
  ],
  "targetSchema": {
    "type": "object",
    "properties": {
      "product_id": { "type": "string" },
      "name": { "type": "string" },
      "brand": { "type": "string" },
      "price": { "type": "number" },
      "category": { "type": "string" }
    },
    "required": ["product_id", "category"]
  },
  "parseOptions": {
    "format": "DELIMITED",
    "delimiter": ",",
    "quoteChar": "\"",
    "escapeChar": "\""
  },
  "collectionCode": "product"
}
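
An illustrative response for the request above, assuming two of one hundred rows fail validation. The top-level field names come from the table above; the exact shape of each issues entry is an assumption (here it mirrors the Error Files fields described later in this document):

{
  "nbrIssues": 2,
  "nbrFinalRecords": 98,
  "issues": [
    {
      "errorType": "INVALID_ROW_SCHEMA",
      "rowNbr": 12,
      "recordIdentifier": "P456",
      "errorMessage": "must have required property 'product_id'"
    }
  ],
  "validRecords": [
    { "product_id": "P123", "name": "Acme Phone", "brand": "Acme", "price": 299.99, "category": "electronics" }
  ]
}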

Common Job Type Parameters

Parse Options

The following are the available parse options:

Parameter | Description
format | The format of the file. One of DELIMITED, EXCEL, JSONL.
delimiter | The delimiter used to separate fields. Applicable for DELIMITED.
quoteChar | The character used to quote fields. Applicable for DELIMITED.
escapeChar | The character used to escape quotes. Applicable for DELIMITED.
sheetName | The name of the sheet to import. Only applicable for EXCEL.
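
For example, the following Parse Options describe a comma-delimited file and an Excel workbook, respectively (the sheet name is illustrative):

{ "format": "DELIMITED", "delimiter": ",", "quoteChar": "\"", "escapeChar": "\"" }

{ "format": "EXCEL", "sheetName": "Products" }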

Filename Patterns

Filename patterns are used to group files together for processing.

Matching Features

  • Wildcards (**, *.js)
  • Negation ('!a/*.js', '*!(b).js')
  • extglobs (+(x|y), !(a|b))
  • brace expansion (foo/{1..5}.md, bar/{a,b,c}.js)
  • regex character classes (foo-[1-5].js)
  • regex logical "or" (foo_(abc|xyz).js)

Example Pattern | Matches
*.js | All files with the extension .js
(x|y) | All files with the name x or y
!(a|b) | All files except a or b
foo_{1..5}.md | All files with the name foo_1.md, foo_2.md, foo_3.md, foo_4.md, foo_5.md
bar_{a,b,c}.js | All files with the name bar_a.js, bar_b.js, bar_c.js
foo-[1-5].js | All files with the name foo-1.js, foo-2.js, foo-3.js, foo-4.js, foo-5.js
foo_(abc|xyz).js | All files with the name foo_abc.js, foo_xyz.js

Transformers

Transformers transform the records in a Data File. They are specified as a list of transforms that are applied to each record in the Data File. Each transform may optionally have a schema associated with it; if a schema is specified, the transformed record is validated against it, and if the transformed record does not conform, the Data File is considered invalid.

The following are the available transformers:

Javascript

The Javascript transformer allows you to specify a Javascript expression that is applied to each record in the Data File. The expression is evaluated using the following variables:

  • data: The record being transformed.
  • dayjs: The dayjs library.
  • _: The lodash library.
{
  "type": "javascript",
  "config": {
    "expression": "_.set(data, 'category', 'electronics')"
  }
}

The following are the available configuration options:

Parameter | Description
expression | The Javascript expression to evaluate.

Each expression must return a JSON object. The returned object is the transformed record.
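
For instance, here is a sketch of an expression that stamps each record with the current date using the dayjs and lodash variables listed above (the importedAt field name is illustrative):

{
  "type": "javascript",
  "config": {
    "expression": "_.set(data, 'importedAt', dayjs().format('YYYY-MM-DD'))"
  }
}

Because _.set returns the object it modifies, the expression satisfies the requirement of returning a JSON object.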

JSONata

The JSONata transformer allows you to specify a JSONata expression that is applied to each record in the Data File. The expression is evaluated using the following variables:

  • data: The record being transformed.
{
  "type": "jsonata",
  "config": {
    "expression": "$extend(data, { category: 'electronics' })"
  }
}

The following are the available configuration options:

Parameter | Description
expression | The JSONata expression to evaluate.

Each expression must return a JSON object. The returned object is the transformed record.
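
As another sketch, reusing the $extend form from the example above together with a standard JSONata string function (the name_upper field is illustrative):

{
  "type": "jsonata",
  "config": {
    "expression": "$extend(data, { 'name_upper': $uppercase(data.name) })"
  }
}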

Error Files

Error files are generated when a Data File is validated, imported, or transformed. Error files are created as both JSONL (where each error is one JSON object on one line) and CSV files. The CSV files are created for convenience and are not intended for further processing. The error file is named after the original Data File with the following suffix: {{sourceFilename}}.YYYYMMDD_HHmmss.errors.jsonl or .errors.csv, where YYYYMMDD_HHmmss is the timestamp of when the file was generated. A sample error line follows the field table below.

The fields of an error file are as follows:

Field | Description
errorType | String. One of INVALID_HEADER_PARSE (expected headers were not found), INVALID_ROW_PARSE (a line of data could not be parsed), INVALID_ROW_SCHEMA (the parsed record did not conform to the specified schema), TRANSFORMATION_ERROR (an error occurred during transformation).
rowNbr | Number. If available, the row number of the file that caused the error.
recordAsString | String. If available, the line that caused the error.
parsedFields | The list of fields that were parsed.
errorMessage | String. Description of the error.
recordIdentifier | The unique ID of the parsed record.
recordAsJson | Object. If available, the parsed line that caused the error.
transformationIndex | Number. The index of the transformation that caused the error.
preTransformedObject | Object. The record before transformation.
postTransformedObject | Object. The record after transformation.
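
A sketch of a single line from a .errors.jsonl file, using the fields above (values are illustrative):

{"errorType": "INVALID_ROW_SCHEMA", "rowNbr": 12, "recordAsString": "P456,,Acme,not-a-price", "errorMessage": "price must be a number", "recordIdentifier": "P456", "recordAsJson": {"product_id": "P456", "name": "", "brand": "Acme", "price": "not-a-price"}}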