Skip to main content

Loading Data into Collections

In this tutorial, section, we will load data into the Collections that we created in Creating Collections.

For a more detailed explanation of the data loading process, see Working with Data Files.

Creating Buckets

We will create the following buckets:

  • Incoming
  • Processed
  • Invalidated

Using the UI

Navigate to the Files page and, in the left-pane, right-click and select Create.

Specify the following in the dialog that appears:

NameUnique Identifier for Folder
Incomingincoming
Processedprocessed
Invalidatedinvalidated

Alt text

Using the API

The following API calls will create the buckets:

PUT https://io.conscia.ai/vue/_api/v1/buckets/incoming
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"dataBucketEntry": {
"name": "Incoming"
}
}

PUT https://io.conscia.ai/vue/_api/v1/buckets/processed
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"dataBucketEntry": {
"name": "Processed"
}
}

PUT https://io.conscia.ai/vue/_api/v1/buckets/invalidated
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"dataBucketEntry": {
"name": "Invalidated"
}
}

Uploading Data

We will upload the movies and actors CSV files to the incoming bucket.

Using the UI

Navigate to the Files page and, in the left-pane, select the Incoming bucket (that was created above) and click Upload in the right-pane. Located and select the movies.csv and actors.csv files from your local machine and click Upload.

Using the API

curl --request POST \
--url https://io.conscia.ai/vue/_api/v1/buckets/incoming/upload \
--header 'authorization: Bearer {{apiKey}}' \
--header 'X-Customer-Code: {{customerCode}}' \
--form 'file[]=@movies.csv' \
--form 'file[]=@actors.csv'

The response be:

[
{
"filename": "movies.csv",
"message": "Successfully uploaded."
},
{
"filename": "actors.csv",
"message": "Successfully uploaded."
}
]

Importing Data Files

When importing a file you specify:

  • how to parse the file (referred to as Parse Options)
  • the expected schema of the file (i.e. how the data should be structured)
  • how to transform the data from the file into the expected schema (if required)

The Parse Options for both the movies.csv and actors.csv files:

{
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}

Import Movies

The following table lists the fields of the movie collections and how the data will be parsed and/or transformed from the movies.csv file. It is based on the movie collection create in the Creating Collections tutorial.

Target FieldSource Field(s)Required Transformation
movie_idmovie_idThis is a required field.
titletitleThis is a required field.
yearyearEnsure that the value is a number.
runtimeruntimeRemove \N values. Ensure that the value is a number.
genresgenresTransform comma-delimited list of genres into an array.
actor_idsactor_idsTransform comma-delimited list of genres into an array.

Source Schema

The Source Schema is a JSON schema that describes the data in the CSV file. It does not have to match the schema of the collection. In fact, it is common for the Source Schema to be different from the schema of the collection. Every row in a delimited file is converted into a JSON object. The Source Schema describes the structure of that JSON object.

Consider the following line of the movies CSV file:

tt10969190,Stick Monster Movie,2020,\N,Fantasy,"nm8762035,nm4027115"

The JSON representation of this line is:

{
"movie_id": "tt10969190",
"title": "Stick Monster Movie",
"year": "2020",
"runtime": "\\N",
"genres": "Fantasy",
"actor_ids": "nm8762035,nm4027115"
}

The Source Schema for the movies.csv file is:

{
"type": "object",
"properties": {
"movie_id": { "type": "string" },
"title": { "type": "string" },
"year": { "type": "number" },
"runtime": { "type": "string" },
"genres": { "type": "string" },
"actor_ids": { "type": "string" }
},
"required": ["movie_id", "title"]
}
note

We can make year a number since we know that every value will be a numeric value. The JSON schema check will coerce the value to a number. If it is unable to coerce the value to a number, it will fail the validation. Otherwise, this field will not need to be transformed to a number since it was already coerced to a number.

note

We specified runtime to be a string even though we want to convert it to a number. We could made this a number if we knew that the value would always be a number. However, we know that this field contains the value \N which is not a number.

Transformers

See this for more details on transformers.

The transformers specification for the movies file is:

[
{ "type": "javascript", "config": { "expression": "_.set(data, 'runtime', data.runtime==='\\\\N' ? undefined : _.toNumber(data.runtime) )" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'genres', _.split(data.genres, ','))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'actor_ids', _.split(data.actor_ids, ','))" } }
]

Remember, the data variable in the expression represents the JSON object that represents the current row in the CSV file. The data variable is a JavaScript object. The _.set function is a lodash function that allows you to set a value in a JavaScript object. The first argument is the object that you want to set the value on. The second argument is the path to the value that you want to set. The third argument is the value that you want to set. The _.split function is a lodash function that allows you to split a string into an array. The first argument is the string that you want to split. The second argument is the delimiter that you want to use to split the string.

note

\N in the CSV file is turned into { "runtime": "\\N" } in JSON format since the backslash is escaped. Above, we check set runtime to undefined (which removes runtime from the JSON object) if it has a value of \\N. We must, again, escape the backslash since it is a special character in JavaScript.

Analyze the configurations

We will use the Data File Analyze endpoint to analyze the configurations used to load the movies.csv file.

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_analyze
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"filename": "movies.csv",
"sourceSchema": {
"type": "object",
"properties": {
"movie_id": { "type": "string" },
"title": { "type": "string" },
"year": { "type": "number" },
"runtime": { "type": "string" },
"genres": { "type": "string" },
"actor_ids": { "type": "string" }
},
"required": ["movie_id", "title"]
},
"recordIdentifierField": "movie_id",
"transformers": [
{ "type": "javascript", "config": { "expression": "_.set(data, 'runtime', data.runtime==='\\N' ? undefined : _.toNumber(data.runtime) )" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'genres', _.split(data.genres, ','))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'actor_ids', _.split(data.actor_ids, ','))" } }
],
"parseOptions": {
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}
}

The result of the above call to the analyze endpoint will look like this:

{
"nbrIssues": 0,
"nbrFinalRecords": 11573,
"issues": [],
"validRecords": [
{
"recordIdentifier": "tt10396038",
"recordAsJson": {
"movie_id": "tt10396038",
"title": "Neither Hero Nor Traitor",
"year": 2020,
"runtime": 73,
"genres": [
"Drama",
"History"
],
"actor_ids": [
"nm6286339",
"nm0261957",
"nm8358700",
"nm7506705"
]
}
},
{
"recordIdentifier": "tt10397306",
"recordAsJson": {
"movie_id": "tt10397306",
"title": "Madison",
"year": 2020,
"runtime": 87,
"genres": [
"Adventure",
"Drama",
"Family"
],
"actor_ids": [
"nm10055076",
"nm0525518",
"nm0913217",
"nm12691322"
]
}
},
{
"recordIdentifier": "tt10397752",
"recordAsJson": {
"movie_id": "tt10397752",
"title": "Disordered",
"year": 2020,
"runtime": 89,
"genres": [
"Animation",
"Documentary",
"Family"
],
"actor_ids": [
"nm11496979"
]
}
},
{
"recordIdentifier": "tt10414756",
"recordAsJson": {
"movie_id": "tt10414756",
"title": "Party Hard",
"year": 2020,
"runtime": 103,
"genres": [
"Comedy",
"Drama"
],
"actor_ids": [
"nm6073478",
"nm10731923",
"nm10731924",
"nm4581810"
]
}
},
{
"recordIdentifier": "tt10423160",
"recordAsJson": {
"movie_id": "tt10423160",
"title": "Kaiji: Final Game",
"year": 2020,
"runtime": 128,
"genres": [
"Adventure"
],
"actor_ids": [
"nm7139946",
"nm0297885",
"nm4445437",
"nm3810403",
"nm5018289",
"nm2968873",
"nm4341140",
"nm10421125"
]
}
},
{
"recordIdentifier": "tt10935362",
"recordAsJson": {
"movie_id": "tt10935362",
"title": "The Demon of Daingerfield",
"year": 2020,
"genres": [
"Horror"
],
"actor_ids": [
"nm4049779"
]
}
},
{
"recordIdentifier": "tt10936214",
"recordAsJson": {
"movie_id": "tt10936214",
"title": "Bring 'em Black",
"year": 2020,
"runtime": 120,
"genres": [
"Western"
],
"actor_ids": [
"nm1253986",
"nm9555778",
"nm8531011",
"nm5274937"
]
}
},
{
"recordIdentifier": "tt10937670",
"recordAsJson": {
"movie_id": "tt10937670",
"title": "Figli",
"year": 2020,
"runtime": 97,
"genres": [
"Comedy"
],
"actor_ids": [
"nm0348347",
"nm0181246",
"nm0557609",
"nm2862095"
]
}
},
{
"recordIdentifier": "tt10943622",
"recordAsJson": {
"movie_id": "tt10943622",
"title": "This Is Not Normal",
"year": 2020,
"genres": [
"Documentary"
],
"actor_ids": [
"nm1171860",
"nm1544202"
]
}
},
{
"recordIdentifier": "tt10944276",
"recordAsJson": {
"movie_id": "tt10944276",
"title": "Death of a Telemarketer",
"year": 2020,
"runtime": 88,
"genres": [
"Comedy",
"Drama",
"Thriller"
],
"actor_ids": [
"nm2031358",
"nm0355097",
"nm0005286",
"nm5131141"
]
}
}
]
}

As explained here, the validRecords field contains the first 10 valid records in the file.

Performing the Import

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_import
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"skippedBucketCode": "skipped",
"processedBucketCode": "processed",
"invalidBucketCode": "invalidated",
"filenamePattern": "movies.csv",
"skipInvalidRecords": false,
"sourceSchema": {
"type": "object",
"properties": {
"movie_id": { "type": "string" },
"title": { "type": "string" },
"year": { "type": "number" },
"runtime": { "type": "string" },
"genres": { "type": "string" },
"actor_ids": { "type": "string" }
},
"required": ["movie_id", "title"]
},
"recordIdentifierField": "movie_id",
"transformers": [
{ "type": "javascript", "config": { "expression": "_.set(data, 'runtime', data.runtime==='\\\\N' ? undefined : _.toNumber(data.runtime) )" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'genres', _.split(data.genres, ','))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'actor_ids', _.split(data.actor_ids, ','))" } }
],
"collectionCode": "movie",
"parseOptions": {
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}
}

The result should look like this:

{
"filenamesToProcess": [
"movies.csv"
],
"filenamesToSkip": [],
"processedFiles": [
{
"filename": "movies.csv",
"nbrValidRecords": 11573,
"nbrValidationIsses": 0,
"nbrInserts": 11573,
"nbrUpdates": 0,
"message": "Successfully loaded file: movies.csv into movie. "
}
],
"hasInvalidRecords": false
}

Using the UI, navigate to the Processed bucket, you should see the movies.csv file. The movies.csv file will no longer be in the Incoming bucket. If there were any invalid records, they would be in the Invalidated bucket.

Upload the movies.csv file to processed bucket, again.

curl --request POST \
--url https://io.conscia.ai/vue/_api/v1/buckets/incoming/upload \
--header 'authorization: Bearer {{apiKey}}' \
--header 'X-Customer-Code: {{customerCode}}' \
--form 'file[]=@movies.csv'

Not let's re-import with with a configuration with known issues. We will convert runtime to a number even though we know that some values are \N. We have to set skipInvalidRecords to true since we know that there are invalid records and we still want to import the valid records.

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_import
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"skippedBucketCode": "skipped",
"processedBucketCode": "processed",
"invalidBucketCode": "invalidated",
"filenamePattern": "movies.csv",
"skipInvalidRecords": true,
"sourceSchema": {
"type": "object",
"properties": {
"movie_id": { "type": "string" },
"title": { "type": "string" },
"year": { "type": "number" },
"runtime": { "type": "string" },
"genres": { "type": "string" },
"actor_ids": { "type": "string" }
},
"required": ["movie_id", "title"]
},
"recordIdentifierField": "movie_id",
"transformers": [
{ "type": "javascript", "config": { "expression": "_.set(data, 'runtime', _.toNumber(data.runtime) )" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'genres', _.split(data.genres, ','))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'actor_ids', _.split(data.actor_ids, ','))" } }
],
"collectionCode": "movie",
"parseOptions": {
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}
}

The result will be:

{
"filenamesToProcess": [
"movies.csv"
],
"filenamesToSkip": [],
"processedFiles": [
{
"filename": "movies.csv",
"nbrValidRecords": 8619,
"nbrValidationIsses": 2954,
"nbrInserts": 0,
"nbrUpdates": 8619,
"message": "Successfully loaded file: movies.csv into movie. There were some invalid records."
}
],
"hasInvalidRecords": true
}

The existing movie records were updated with the 8,619 valid records. The 2,954 errors are detailed in the files: movies.csv-YYYMMMDD-HHmmss.errors.csv and movies.csv-YYYMMMDD-HHmmss.errors.jsonl which were copied to the processed bucket alongside the original movies.csv file. These two files are described here.

Query the movie collection

POST https://io.conscia.ai/vue/_api/v1/collections/movie/records/_query
Content-Type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{}

Importing Actors

The following table lists the fields of the actor collections and how the data will be parsed and/or transformed from the actors.csv file. It is based on the actor collection create in the Creating Collections tutorial.

Target FieldSource Field(s)Required Transformation
actor_idactor_idThis is a required field.
namenameThis is a required field.
birth_yearbirthYearRemove \N values. Ensure that the value is a number. Rename field.
death_yeardeathYearRemove \N values. Ensure that the value is a number. Rename field.
professionsprimaryProfessionRemove \N values. Transform comma-delimited list of professions into an array. Rename field.

Source Schema

The Source Schema for the actors.csv file is:

{
"type": "object",
"properties": {
"actor_id": { "type": "string" },
"name": { "type": "string" },
"birthYear": { "type": "string" },
"deathYear": { "type": "string" },
"primaryProfession": { "type": "string" }
},
"required": ["actor_id", "name"]
}

Transformers

The transformers specification for the actors file is:

[
{ "type": "javascript", "config": { "expression": "_.set(data, 'birth_year', data.birthYear==='\\\\N' ? undefined : _.toNumber(data.birthYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'death_year', data.deathYear==='\\\\N' ? undefined : _.toNumber(data.deathYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'professions', _.split(data.primaryProfession, ','))" } },
{ "type": "javascript", "config": { "expression": "_.omit(data, ['birthYear', 'deathYear', 'primaryProfession'])" } }
]

Analyze the configurations

We will use the Data File Analyze endpoint to analyze the configurations used to load the actors.csv file.

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_analyze
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"filename": "actors.csv",
"sourceSchema": {
"type": "object",
"properties": {
"actor_id": { "type": "string" },
"name": { "type": "string" },
"birthYear": { "type": "string" },
"deathYear": { "type": "string" },
"primaryProfession": { "type": "string" }
},
"required": ["actor_id", "name"]
},
"recordIdentifierField": "actor_id",
"transformers": [
{ "type": "javascript", "config": { "expression": "_.set(data, 'birth_year', data.birthYear==='\\\\N' ? undefined : _.toNumber(data.birthYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'death_year', data.deathYear==='\\\\N' ? undefined : _.toNumber(data.deathYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'professions', _.split(data.primaryProfession, ','))" } },
{ "type": "javascript", "config": { "expression": "_.omit(data, ['birthYear', 'deathYear', 'primaryProfession'])" } }
],
"collectionCode": "actor",
"parseOptions": {
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}
}

The result of the above call to the analyze endpoint will look like this:

{
"nbrIssues": 0,
"nbrFinalRecords": 40151,
"issues": [],
"validRecords": [
{
"actor_id": "nm0130335",
"name": "Manuel Callau",
"birth_year": 1946,
"professions": [
"actor"
]
},
{
"actor_id": "nm0130429",
"name": "Luis Callejo",
"birth_year": 1970,
"professions": [
"actor",
"producer"
]
},
{
"actor_id": "nm0130435",
"name": "María Fernanda Callejón",
"professions": [
"actress"
]
},
{
"actor_id": "nm0130438",
"name": "Christopher Callen",
"professions": [
"actress",
"soundtrack"
]
},
{
"actor_id": "nm0130502",
"name": "Dayton Callie",
"birth_year": 1946,
"professions": [
"actor",
"writer",
"producer"
]
},
{
"actor_id": "nm0130536",
"name": "James Callis",
"birth_year": 1971,
"professions": [
"actor",
"writer",
"producer"
]
},
{
"actor_id": "nm0130736",
"name": "Daniel Caltagirone",
"birth_year": 1972,
"professions": [
"actor"
]
},
{
"actor_id": "nm0130860",
"name": "Mhairi Calvey",
"birth_year": 1988,
"professions": [
"actress",
"producer",
"writer"
]
},
{
"actor_id": "nm0131060",
"name": "Alejandro Camacho",
"birth_year": 1954,
"professions": [
"actor",
"producer",
"director"
]
},
{
"actor_id": "nm0131246",
"name": "Felipe Camargo",
"birth_year": 1960,
"professions": [
"actor",
"assistant_director"
]
}
]
}

Performing the Import

POST https://io.conscia.ai/vue/_api/v1/buckets/incoming/files/_import
content-type: application/json
Authorization: Bearer {{apiKey}}
X-Customer-Code: {{customerCode}}

{
"skippedBucketCode": "skipped",
"processedBucketCode": "processed",
"invalidBucketCode": "invalidated",
"filenamePattern": "actors.csv",
"sourceSchema": {
"type": "object",
"properties": {
"actor_id": { "type": "string" },
"name": { "type": "string" },
"birthYear": { "type": "string" },
"deathYear": { "type": "string" },
"primaryProfession": { "type": "string" }
},
"required": ["actor_id", "name"]
},
"recordIdentifierField": "actor_id",
"transformers": [
{ "type": "javascript", "config": { "expression": "_.set(data, 'birth_year', data.birthYear==='\\\\N' ? undefined : _.toNumber(data.birthYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'death_year', data.deathYear==='\\\\N' ? undefined : _.toNumber(data.deathYear))" } },
{ "type": "javascript", "config": { "expression": "_.set(data, 'professions', _.split(data.primaryProfession, ','))" } },
{ "type": "javascript", "config": { "expression": "_.omit(data, ['birthYear', 'deathYear', 'primaryProfession'])" } }
],
"collectionCode": "actor",
"parseOptions": {
"format": "DELIMITED",
"delimiter": ",",
"quoteChar": "\"",
"escapeChar": "\""
}
}

The result should look like this:

{
"filenamesToProcess": [
"actors.csv"
],
"filenamesToSkip": [],
"processedFiles": [
{
"filename": "actors.csv",
"nbrValidRecords": 40151,
"nbrValidationIsses": 0,
"nbrInserts": 40151,
"nbrUpdates": 0,
"message": "Successfully loaded file: actors.csv into actor. "
}
],
"hasInvalidRecords": false
}