Data Ingestion Preparation: Schema Mapping


Schema Mapping

The preparation we have done thus far gets data ready for ingestion into ADX tables for analysis. Before ingestion can begin, however, the destination table itself must be prepared, and that includes mapping the schema: matching the column datatypes of the incoming data to the columns of the target table in the ADX database.
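For example, a destination table whose columns the incoming data will be mapped to can be created as follows (a minimal sketch; MyTable and its two columns are illustrative names used throughout this post):

.create table MyTable (rownumber: int, rowguid: string)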

When ingesting data from files, it is good practice to map the schema between the source and the destination columns. There are two ways to map the data:

  1. CSV mapping, which is optional and ordinal based
  2. JSON or AVRO mapping, which is mandatory and path or field name based

The mapping can be supplied either in the ingest command itself or via a pre-created mapping on the table. We will see how schema mapping is done in our demo, as well as in the next module, where we will look at data ingestion via Event Hubs and Event Grid. Schema mapping is also called metadata mapping.

[Figure: schema mapping]

To recap, there are two ways to map the data when ingesting from files:

  1. CSV mapping, which is optional and ordinal based – As the name suggests, this mapping is used when the source file is in CSV format and its schema does not match the schema of the database table. One advantage of this mapping is that if the table, or even a column within the table, does not exist, it will be created.
  2. JSON or AVRO mapping, which is mandatory and path or field name based – This mapping is used when the source file is in JSON or AVRO format. It differs from CSV mapping in one small way: if the datatype is not mentioned in the ingest command, the table and the columns within it must already exist. If the datatype is mentioned, the table or column is created when it does not exist.

In both cases, the mapping can either be passed as a parameter of the ingest command or pre-created on the table and then referenced by name from the ingest command.

When the mapping is supplied via the first method, as a parameter of the ingest control command, it is serialized as a JSON string. Because it is ordinal based, CSV mapping applies not only to CSV but to all the delimiter-separated formats, such as TSV, PSV, SCSV, and SOHsv.

.ingest into table Table123 (@"source1", @"source2") with @'{"csvMapping": "[{\"Name\":\"rownumber\",\"Ordinal\":0},{\"Name\":\"rowguid\",\"Ordinal\":1},...]","format":"csv"}'
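Since CSV mapping is ordinal based, the same command shape covers the other delimiter-separated formats by changing only the format property. A sketch for a TSV source, following the same legacy serialized-string style as above (the source path is illustrative):

.ingest into table Table123 (@"source1") with @'{"csvMapping": "[{\"Name\":\"rownumber\",\"Ordinal\":0},{\"Name\":\"rowguid\",\"Ordinal\":1}]","format":"tsv"}'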

As mentioned, the mapping can also be pre-created on the table. This is the second method and uses the create command. The mapping can then be referenced by its name from the ingest command, so we do not need to pass the mapping itself as part of the command.

.create table MyTable ingestion csv mapping "Mapping1" '[{ "Name" : "rownumber", "DataType":"int", "Ordinal" : 0},{ "Name" : "rowguid", "DataType":"string", "Ordinal" : 1 }]'
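With the mapping pre-created, the ingest command only needs to reference it by name. A sketch using the ingestionMappingReference property (older clusters may use the legacy csvMappingReference property instead; the source path is illustrative):

.ingest into table MyTable (@"source1") with (format="csv", ingestionMappingReference="Mapping1")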

Pre-creating a table mapping and referencing it in the ingest command works the same way for the JSON and AVRO formats. The key difference is that instead of "Name", the "column" key is used to identify the destination column.

Also, as noted earlier, CSV mapping maps columns by the ordinal number of the source column, whereas JSON mapping uses the "path" key and AVRO mapping uses the "field" key.

.create table MyTable ingestion json mapping "Mapping1" '[{ "column" : "rownumber", "datatype" : "int", "path" : "$.rownumber"},{ "column" : "rowguid", "path" : "$.rowguid" }]'
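The AVRO counterpart has the same shape, except that the source side is identified by the record's field name rather than a JSON path. A sketch, assuming the AVRO records carry the same two fields:

.create table MyTable ingestion avro mapping "Mapping1" '[{ "column" : "rownumber", "datatype" : "int", "field" : "rownumber"},{ "column" : "rowguid", "field" : "rowguid" }]'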

Apart from the create command, the alter, show, and drop commands also help with schema mapping. The alter command modifies an existing mapping, the show command displays the schema mappings on a table, and the drop command removes an ingestion mapping from the table.

Below are the commands for each of these actions:

  • Create a schema mapping on the table
.create table MyTable ingestion json mapping "Mapping1" '[{ "column" : "rownumber", "datatype" : "int", "path" : "$.rownumber"},{ "column" : "rowguid", "path" : "$.rowguid" }]'
  • Alter a schema mapping of the table
.alter table MyTable ingestion json mapping "Mapping1" '[{ "column" : "rownumber", "path" : "$.rownumber"},{ "column" : "rowguid", "path" : "$.rowguid" }]'
  • Show a schema mapping of the table
.show table MyTable ingestion json mapping "Mapping1"
  • Drop a schema mapping from the table
.drop table MyTable ingestion json mapping "Mapping1"
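After a drop, the plural form of the show command, issued without a mapping name, lists whatever ingestion mappings remain on the table:

.show table MyTable ingestion json mappings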

Part – 1: Data Science Overview

Part – 2: Understanding Azure Data Explorer

Part – 3: Azure Data Explorer Features

Part – 4: Azure Data Explorer Service Capabilities

Part – 5: Creating the ADX Environment

Part – 6: The Kusto Query Language

Part – 7: Data Obfuscation in Kusto Query Language

Part – 9: Overview of data ingestion in Azure Data Explorer

Part – 10: Managing Azure Data Explorer Cluster
