Working with Data Files

Structured files that are intended for loading into a DX Graph Collection are referred to as Data Files. Data Files are uploaded into Buckets where they can be validated for correctness and loaded into a Collection. DX Graph provides multiple types of Jobs for working with data files.

All Data File jobs perform validation. If validation finds errors in a Data File, a Validation Errors file is created with the following naming convention: {originalFilename}.{YYYYMMDD_HHmmss}.errors.jsonl, e.g. products.csv.20230926_160054.errors.jsonl.

The Validation Errors file will be generated alongside the original Data File in the Bucket.
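
As an illustration of the naming convention, the following sketch builds and parses a Validation Errors filename. The function names are hypothetical, and rendering the timestamp in UTC is an assumption, since the time zone is not stated here.

```javascript
// Build the Validation Errors filename for a Data File, following the
// convention {originalFilename}.{YYYYMMDD_HHmmss}.errors.jsonl.
// Assumption: the timestamp is rendered in UTC.
function errorsFilename(originalFilename, when) {
  const pad = (n) => String(n).padStart(2, "0");
  const stamp =
    `${when.getUTCFullYear()}${pad(when.getUTCMonth() + 1)}${pad(when.getUTCDate())}` +
    `_${pad(when.getUTCHours())}${pad(when.getUTCMinutes())}${pad(when.getUTCSeconds())}`;
  return `${originalFilename}.${stamp}.errors.jsonl`;
}

// Recover the original filename and timestamp from an errors filename,
// or return null if the name does not follow the convention.
function parseErrorsFilename(name) {
  const m = /^(.+)\.(\d{8}_\d{6})\.errors\.jsonl$/.exec(name);
  return m ? { originalFilename: m[1], timestamp: m[2] } : null;
}

const example = errorsFilename("products.csv", new Date(Date.UTC(2023, 8, 26, 16, 0, 54)));
console.log(example); // products.csv.20230926_160054.errors.jsonl
console.log(parseErrorsFilename(example));
```

Parsing the name this way can help pair each errors file in a Bucket with the Data File it belongs to.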

Validation consists of the following steps:

  • Parse Validation: The Data File is parsed according to the specified Parse Options. If the Data File cannot be parsed, the Data File is considered invalid.
  • Source Schema Validation: The parsed Data File is validated against the specified Source Schema. If the Data File does not conform to the Source Schema, the Data File is considered invalid. (Optional.)
  • Transformation: Although not strictly a validation step, the Data File is transformed according to the specified Transformers, a list of transforms applied to each record in the Data File. Each transformation may optionally have a schema associated with it. If a schema is specified, the transformed record is validated against it, and the Data File is considered invalid if the record does not conform. (Optional.)
  • Target Schema Validation: The transformed Data File is validated against the specified Target Schema. A Target Schema can be manually specified or, typically, taken from a Collection's schema. If the Data File does not conform to the Target Schema, the Data File is considered invalid. (Optional.)
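
The order of the steps after parsing can be sketched as follows. This is a minimal illustration, not DX Graph's implementation: conforms is a stand-in that checks only "required" and primitive "type", and validateRecord and the apply method on each transformer are hypothetical names introduced for the example.

```javascript
// Stand-in for schema validation: checks only "required" and primitive "type".
// A real validator would implement the full schema language.
function conforms(record, schema) {
  if (!schema) return true; // schema validation is optional
  const props = schema.properties || {};
  const hasRequired = (schema.required || []).every((key) => record[key] !== undefined);
  const typesOk = Object.keys(props).every(
    (key) => record[key] === undefined || typeof record[key] === props[key].type
  );
  return hasRequired && typesOk;
}

// Run one already-parsed record through the post-parse steps.
// Returns { record } on success or { error } naming the step that failed.
function validateRecord(parsed, { sourceSchema, transformers = [], targetSchema } = {}) {
  if (!conforms(parsed, sourceSchema)) return { error: "source schema" };
  let current = parsed;
  for (const [index, t] of transformers.entries()) {
    current = t.apply(current); // hypothetical: each transformer exposes apply(record)
    if (!conforms(current, t.schema)) return { error: `transformer ${index} schema` };
  }
  if (!conforms(current, targetSchema)) return { error: "target schema" };
  return { record: current };
}

const result = validateRecord(
  { product_id: "p1", price: 10 },
  {
    sourceSchema: { properties: { product_id: { type: "string" } }, required: ["product_id"] },
    transformers: [{ apply: (r) => ({ ...r, category: "electronics" }) }],
    targetSchema: { required: ["product_id", "category"] },
  }
);
console.log(result);
```

Each optional step is skipped when its schema or transformer list is absent, mirroring the "(Optional.)" markers in the list.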

Job Types

The various job types related to working with files are documented here.

Analyze Data File Endpoint

The Analyze Data File endpoint allows you to specify many of the parameters above and see the results immediately to help with the configuration of validation and import jobs. This API will return the following information:

Field             Description
nbrIssues         The total number of issues found in the entire source file.
nbrFinalRecords   The total number of valid records in the entire source file.
issues            The first 10 records that have issues.
validRecords      The first 10 records that are valid.

Example request:

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_analyze
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
  "filename": "products.csv",
  "sourceSchema": {
    "type": "object",
    "properties": {
      "product_id": { "type": "string" },
      "name": { "type": "string" },
      "brand": { "type": "string" },
      "price": { "type": "number" }
    },
    "required": ["product_id"]
  },
  "recordIdentifierField": "product_id",
  "transformers": [
    {
      "type": "javascript",
      "config": {
        "expression": "_.set(data, 'category', 'electronics')"
      }
    }
  ],
  "targetSchema": {
    "type": "object",
    "properties": {
      "product_id": { "type": "string" },
      "name": { "type": "string" },
      "brand": { "type": "string" },
      "price": { "type": "number" },
      "category": { "type": "string" }
    },
    "required": ["product_id", "category"]
  },
  "parseOptions": {
    "format": "DELIMITED",
    "delimiter": ",",
    "quoteChar": "\"",
    "escapeChar": "\""
  },
  "collectionCode": "product"
}
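
For callers working from JavaScript, the request above could be assembled roughly as follows. This is a sketch under assumptions: buildAnalyzeRequest and summarize are hypothetical helpers, treating "incoming" in the URL as a substitutable bucket name is a guess, and the sample response values are invented for illustration.

```javascript
// Assemble the Analyze Data File request without sending it.
// apiKey and customerCode are placeholders supplied by the caller.
// Assumption: the "incoming" path segment is the name of the Bucket.
function buildAnalyzeRequest({ apiKey, customerCode, bucket = "incoming", body }) {
  return {
    url: `https://io.conscia.ai/vue/_api/v1/buckets/${encodeURIComponent(bucket)}/files/_analyze`,
    options: {
      method: "POST",
      headers: {
        "content-type": "application/json",
        Authorization: `Bearer ${apiKey}`,
        "X-Customer-Code": customerCode,
      },
      body: JSON.stringify(body),
    },
  };
}

// Condense the documented response fields into a one-line summary.
function summarize(result) {
  return (
    `${result.nbrFinalRecords} valid records, ${result.nbrIssues} issues ` +
    `(${result.issues.length} issue samples returned)`
  );
}

const { url, options } = buildAnalyzeRequest({
  apiKey: "my-api-key",
  customerCode: "my-customer-code",
  body: { filename: "products.csv", collectionCode: "product" },
});
console.log(url, options.method);

// Invented sample response, using only the documented field names.
console.log(summarize({ nbrIssues: 2, nbrFinalRecords: 98, issues: [{}, {}], validRecords: [] }));
```

In Node 18 or later, the request could then be sent with fetch(url, options) and the JSON response passed to summarize.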

Common Data File Job Type Parameters

Parse Options

The following are the available parse options:

Parameter    Description
format       The format of the file. One of DELIMITED, EXCEL, JSONL.
delimiter    The delimiter used to separate fields. Applicable for DELIMITED.
quoteChar    The character used to quote fields. Applicable for DELIMITED.
escapeChar   The character used to escape quotes. Applicable for DELIMITED.
sheetName    The name of the sheet to import. Only applicable for EXCEL.
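
To show how delimiter, quoteChar, and escapeChar interact, here is a minimal sketch of splitting one line of a DELIMITED file. The splitLine function is hypothetical and not DX Graph's parser. It assumes single-character options, no embedded newlines, and that escapeChar only takes effect inside a quoted field directly before a quoteChar.

```javascript
// Split one line of a DELIMITED file into fields.
function splitLine(line, { delimiter = ",", quoteChar = '"', escapeChar = '"' } = {}) {
  const fields = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === escapeChar && line[i + 1] === quoteChar) {
        field += quoteChar; // escaped quote: keep one literal quote
        i++;
      } else if (ch === quoteChar) {
        inQuotes = false; // closing quote
      } else {
        field += ch; // delimiters inside quotes are ordinary characters
      }
    } else if (ch === quoteChar) {
      inQuotes = true; // opening quote
    } else if (ch === delimiter) {
      fields.push(field);
      field = "";
    } else {
      field += ch;
    }
  }
  fields.push(field);
  return fields;
}

console.log(splitLine('p1,"Widget, large","5"" bracket",9.99'));
// [ 'p1', 'Widget, large', '5" bracket', '9.99' ]
```

With the defaults shown, a doubled quote inside a quoted field yields a single literal quote, which matches the quoteChar and escapeChar values used in the Analyze Data File example.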

Filename Patterns

Filename patterns are used to group files together for processing.

Matching Features

  • Wildcards (*, *.js)
  • Negation (!a*.js, *!(b).js)
  • Extglobs (+(x|y), !(a|b))
  • Brace expansion (foo/{1..5}.md, bar/{a,b,c}.js)
  • Regex character classes (foo-[1-5].js)
  • Regex logical "or" (foo_(abc|xyz).js)

Example Pattern     Matches
*.js                All files with the extension .js
(x|y)               All files with the name x or y
!(a|b)              All files except a or b
foo_{1..5}.md       All files with the name foo_1.md, foo_2.md, foo_3.md, foo_4.md, foo_5.md
bar_{a,b,c}.js      All files with the name bar_a.js, bar_b.js, bar_c.js
foo-[1-5].js        All files with the name foo-1.js, foo-2.js, foo-3.js, foo-4.js, foo-5.js
foo_(abc|xyz).js    All files with the name foo_abc.js or foo_xyz.js
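
To make the table concrete, the sketch below implements a small subset of these features: wildcards, brace expansion, character classes, and "or" groups. It is an illustration with hypothetical helper names, not the matcher DX Graph uses. Negation and extglobs are left out, and the rule that * stays within one path segment is an assumption.

```javascript
// Expand the first {a,b,c} or {1..5} group, recursing until none remain.
function expandBraces(pattern) {
  const m = /\{([^{}]+)\}/.exec(pattern);
  if (!m) return [pattern];
  const range = /^(\d+)\.\.(\d+)$/.exec(m[1]);
  let options;
  if (range) {
    options = [];
    for (let n = Number(range[1]); n <= Number(range[2]); n++) options.push(String(n));
  } else {
    options = m[1].split(",");
  }
  const before = pattern.slice(0, m.index);
  const after = pattern.slice(m.index + m[0].length);
  return options.flatMap((opt) => expandBraces(before + opt + after));
}

// Convert a brace-free pattern to a regular expression.
// Character classes like [1-5] and groups like (abc|xyz) pass through unchanged.
function toRegExp(pattern) {
  const source = pattern
    .replace(/\./g, "\\.") // dots are literal in filename patterns
    .replace(/\*/g, "[^/]*"); // assumption: * does not cross a path separator
  return new RegExp(`^(?:${source})$`);
}

function matches(filename, pattern) {
  return expandBraces(pattern).some((p) => toRegExp(p).test(filename));
}

console.log(matches("foo_3.md", "foo_{1..5}.md")); // true
console.log(matches("foo_6.md", "foo_{1..5}.md")); // false
console.log(matches("foo_xyz.js", "foo_(abc|xyz).js")); // true
```

Trying candidate filenames against a pattern in this way is a quick check that a pattern groups the intended files before it is used in a job.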

Transformers

Transformers are used to transform records in a Data File. Transformers are specified as a list of transforms that are applied to each record in the Data File. Transformations may include setting additional data or metadata, concatenating or splitting fields, converting data types, renaming properties, or picking specific fields to import.

Each transformation may, optionally, have a schema associated with it. If a schema is specified, the transformed record is validated against the schema. If the transformed record does not conform to the schema, the Data File is considered invalid.

The following are the available transformers:

JavaScript

The JavaScript transformer allows you to specify a JavaScript expression that is applied to each record in the Data File. The expression is evaluated with the following variables in scope:

  • data: The record being transformed.
  • dayjs: The dayjs library.
  • _: The lodash library.

{
  "type": "javascript",
  "config": {
    "expression": "_.set(data, 'category', 'electronics')"
  }
}

The following are the available configuration options:

Parameter    Description
expression   The JavaScript expression to evaluate.

Each expression must return a JSON object. The returned object is the transformed record.
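
The following sketch shows one way a list of such expressions could be applied to a record in order. It is an assumption-laden illustration rather than the platform's evaluator: compile and applyTransformers are hypothetical names, evaluation through new Function is a guess at the mechanism, and lodashStandIn is a one-function substitute for lodash so that the example runs without dependencies.

```javascript
// Compile a transformer expression into a function of (data, _, dayjs).
// Assumption: the expression's value becomes the transformed record.
function compile(expression) {
  return new Function("data", "_", "dayjs", `return (${expression});`);
}

// Apply each transformer in order, feeding the output of one into the next.
function applyTransformers(record, transformers, libs) {
  return transformers.reduce((current, t, index) => {
    const result = compile(t.config.expression)(current, libs._, libs.dayjs);
    // "Must return a JSON object" is interpreted here as a non-null, non-array object.
    if (result === null || typeof result !== "object" || Array.isArray(result)) {
      throw new Error(`Transformer ${index} did not return a JSON object`);
    }
    return result;
  }, record);
}

// Stand-in for lodash with only the function used below.
// With lodash installed, pass require("lodash") instead.
const lodashStandIn = {
  set(obj, key, value) {
    obj[key] = value; // real _.set also understands nested paths such as "a.b[0]"
    return obj;
  },
};

const transformers = [
  { type: "javascript", config: { expression: "_.set(data, 'category', 'electronics')" } },
  { type: "javascript", config: { expression: "({ ...data, name: data.name.trim() })" } },
];

console.log(
  applyTransformers({ product_id: "p1", name: " Widget " }, transformers, { _: lodashStandIn })
);
```

The second expression is wrapped in parentheses so that the object literal is read as an expression and its value is returned as the transformed record.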

JSONata

The JSONata transformer allows you to specify a JSONata expression that is applied to each record in the Data File. The expression is evaluated using the following variables:

  • data: The record being transformed.

{
  "type": "jsonata",
  "config": {
    "expression": "$extend(data, { category: 'electronics' })"
  }
}

The following are the available configuration options:

Parameter    Description
expression   The JSONata expression to evaluate.

Each expression must return a JSON object. The returned object is the transformed record.

Record Identifier Field

The recordIdentifierField indicates which field uniquely identifies the records in the Data Files. The data in the field that you select to be the unique identifier for your Collection must conform to the following character specifications:

  • String
  • 250 character limit
  • Alpha (lowercase and uppercase)
  • Numeric digit
  • _ - . @ ( ) + , = ; $ ! * ' %
  • No whitespace
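
These rules can be checked before upload with a single regular expression. The sketch below is illustrative, with hypothetical names, and assumes that an identifier must contain at least one character.

```javascript
// Allowed: letters, digits, and _ - . @ ( ) + , = ; $ ! * ' %
// Length: 1 to 250 characters (the lower bound of 1 is an assumption).
const RECORD_ID_PATTERN = /^[A-Za-z0-9_\-.@()+,=;$!*'%]{1,250}$/;

function isValidRecordIdentifier(value) {
  return typeof value === "string" && RECORD_ID_PATTERN.test(value);
}

console.log(isValidRecordIdentifier("SKU-123_a.b@x")); // true
console.log(isValidRecordIdentifier("has space")); // false
console.log(isValidRecordIdentifier("a".repeat(251))); // false
```

Running such a check over the identifier column of a Data File can surface non-conforming values before a job is started.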

Error Files

Error files are generated when a Data File is validated, imported, or transformed. Each error file is created in two formats: JSONL (one JSON object per line, one line per error) and CSV. The CSV files are provided for convenience and are not intended for further processing. The error file is named after the original Data File: {{sourceFilename}}-YYYYMMDD_HHmmss.errors.jsonl or .csv, where YYYYMMDD_HHmmss is the timestamp of when the error file was generated.

The fields of an error file are as follows:

Field                   Description
errorType               String. One of: INVALID_HEADER_PARSE (expected headers were not found), INVALID_ROW_PARSE (could not parse a line of data), INVALID_ROW_SCHEMA (parsed record did not conform to the specified schema), TRANSFORMATION_ERROR (error occurred during transformation).
rowNbr                  Number. If available, the row number of the file that caused the error.
recordAsString          String. If available, the line that caused the error.
parsedFields            The list of fields that were parsed.
errorMessage            String. Description of what the error is.
recordIdentifier        The unique id of the parsed record.
recordAsJson            Object. If available, the parsed line that caused the error.
transformationIndex     Number. The index of the transformation that caused the error.
preTransformedObject    Object. The record before transformation.
postTransformedObject   Object. The record after transformation.
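
As a starting point for working with the JSONL error file, the sketch below counts errors per errorType. The countErrorsByType function is hypothetical, and the sample lines are invented for illustration, using only the field names listed above.

```javascript
// Parse JSONL text (one JSON object per line) and count errors per errorType.
function countErrorsByType(jsonlText) {
  const counts = {};
  for (const line of jsonlText.split(/\r?\n/)) {
    if (line.trim() === "") continue; // skip blank lines, e.g. a trailing newline
    const entry = JSON.parse(line);
    counts[entry.errorType] = (counts[entry.errorType] || 0) + 1;
  }
  return counts;
}

// Invented sample content, not real output.
const sample = [
  '{"errorType":"INVALID_ROW_PARSE","rowNbr":3,"errorMessage":"Unterminated quote"}',
  '{"errorType":"INVALID_ROW_SCHEMA","rowNbr":7,"recordIdentifier":"p7","errorMessage":"price must be number"}',
  '{"errorType":"INVALID_ROW_SCHEMA","rowNbr":9,"recordIdentifier":"p9","errorMessage":"price must be number"}',
].join("\n");

console.log(countErrorsByType(sample));
// { INVALID_ROW_PARSE: 1, INVALID_ROW_SCHEMA: 2 }
```

In Node, the text of a downloaded error file could be obtained with fs.readFileSync(path, "utf8") and passed to the same function.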