We now come to the most critical part, the one we have been preparing for until now: data ingestion. Azure Data Explorer uses different tools and ingestion methods, each targeting its own category of scenarios. We will uncover each of these categories one at a time.
Data Ingestion Methods
Data ingestion methods are classified into three main categories – ingestion using managed pipelines, ingestion using connectors and plugins, and programmatic ingestion using SDKs.
Ingestion using managed pipelines
The first in the series is ingestion using managed pipelines. ADX offers pipelines using Event Grids, Event Hubs, and IoT Hubs to ingest data, which can be managed from the Azure Portal. This is useful in cases where the organization wishes to have an external service perform management tasks that include data retrieval, monitoring, alerting, throttling, etc. In other words, it is used when you would like another service to manage your data ingestion instead of developing and supporting it yourself.
Before we actually make use of Event Hub, Event Grid, or IoT Hub for data ingestion into ADX, it is important for us to understand what they are and where they actually fit in the overall data analytics scenario.
Azure Event Hubs is a big data pipeline. It facilitates the capture, retention, and replay of telemetry and event-stream data. Multiple concurrent sources are supported, and the telemetry and event data are made available to a variety of stream-processing infrastructures and analytics services.
Event Hubs is capable of receiving and processing millions of events per second and makes them available either as data streams or as bundled event batches. Event Hubs provides a single solution that enables rapid retrieval of data for near-real-time processing.
Event Hubs has the following characteristics –
- Low latency
- Conforms to at-least-once delivery
- Capable of receiving and processing millions of events per second
Event Grid is event-driven and enables reactive programming. It is an event routing service that uses a publish-subscribe (pub-sub) model, where publishers emit events but have no expectation about how those events are handled; subscribers, on the other hand, decide which events they want to handle. It simplifies the delivery of events between a publisher and a subscriber.
Event Grid is not a data pipeline like Event Hubs and does not deliver the actual object – in other words, it does not include the data transfer, only a notification that an event has occurred on the publisher. These notifications are then consumed by a subscriber, such as an Event Hub, which then routes the events to the ADX cluster.
It can be integrated with third-party services as well. It simplifies event consumption and lowers costs by eliminating the need for constant polling.
If there is a need to ingest blobs from your storage account into Azure Data Explorer, you can create an Event Grid data connection, which sets up an Azure Event Grid subscription that routes the events from your storage account to ADX using Event Hubs. In this way you can also chain Event Grid and Event Hub together.
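On the ADX side, such a data connection needs a target table and, typically, an ingestion mapping to reference by name. A minimal sketch, assuming illustrative table, column, and mapping names:

```kusto
// Target table for the blobs arriving via the Event Grid data connection
// (table, column, and mapping names below are illustrative)
.create table RawEvents (Timestamp: datetime, DeviceId: string, Reading: real)

// CSV ingestion mapping that the data connection can reference by name
.create table RawEvents ingestion csv mapping 'RawEventsCsvMapping'
'[{"column":"Timestamp","Properties":{"Ordinal":"0"}},{"column":"DeviceId","Properties":{"Ordinal":"1"}},{"column":"Reading","Properties":{"Ordinal":"2"}}]'
```

With the table and mapping in place, the Event Grid data connection created in the portal can route new blobs into `RawEvents` automatically.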
For more information on how to set up the ingestion pipeline to ingest blobs from a storage account into ADX using Event Grid, routed via Event Hub, click here.
Event Grid has the following characteristics:
- Dynamically scalable
- Low cost
- Conforms to at-least-once delivery
- Supports dead-lettering
You must be wondering what dead-lettering is.
The purpose of the dead-letter queue (abbreviated as DLQ) is to hold messages that cannot be delivered to any receiver, or messages that could not be processed. Messages can then be removed from the DLQ and inspected separately. The DLQ cannot be created, deleted, or managed separately from the main entity.
IoT Hubs are used as pipelines for data transfer from approved IoT devices to ADX. IoT Hub is similar in functionality to Event Hubs: it acts as a central message hub for communication between the IoT devices sending telemetry data and the IoT applications. This data is captured and sent to ADX for data analysis in near real time (NRT).
Ingestion using connectors and plugins
Then comes ingestion using connectors and plugins. Different types of connectors are available today – Power Automate (formerly known as Microsoft Flow), Kafka, and Apache Spark – as well as plugins like Logstash.
Power Automate (MS Flow)
This can be used to perform multiple actions when used with Azure Data Explorer. We can use the ADX commands (Kusto Commands) to perform tasks. Some of the tasks that can be performed are:
- Trigger ingestion of data from other databases to Azure and vice-versa
- Create and share ADX reports with tables and charts via emails
- Share ADX query results as notifications based on preset conditions
- Schedule actions to be performed on the ADX cluster
- Push data to Power BI datasets for creating dashboards
One point worth mentioning here is that this is currently in Preview and is soon going to be generally available.
For more information on Power Automate connector to ADX, please check this link to Microsoft docs
Azure Data Factory (ADF)
Azure Data Factory is an Azure integration service for orchestrating the entire process of converting humongous amounts of raw data into meaningful and actionable business insights. It has been built to perform ETL, Hybrid ETL, and Data Integrations. ADF can connect with more than 90 sources for data transfers.
ADF can be integrated with Azure Data Explorer to copy bulk data from various sources into ADX. We can also use the Azure Data Factory command activity to run ADX control commands within an Azure Data Factory data-driven workflow.
Using Azure Data Factory with ADX brings in a lot of benefits.
- It is easy to set up
- It can be used to work with many on-premises and cloud-based data stores
- The data transfer is secured, as it is transferred over HTTPS
- It offers high performance
Azure Data Factory, however, does not support streaming and works either periodically or based on triggers.
For more information on how to run ADX control commands within ADF, see the step-by-step guide to integrating Azure Data Explorer with Azure Data Factory.
Kafka Connector for ADX
Apache Kafka, being a distributed streaming platform, helps in setting up ingestion pipelines for real-time streaming data systems securely and reliably. It is also simple to use, which helps in quickly setting up the connectors. The Kafka connector for Azure Data Explorer is the ADX Kafka sink.
For more information on the Kafka connector for ADX, please check this link to Microsoft docs.
Azure Data Explorer Connector for Apache Spark
This is an open-source connector that can run on any Spark cluster. It helps in scenarios, where you want to quickly build scalable applications, which can be used to move large volumes of data. The powerful combination of Azure Data Explorer and Apache Spark allows you to build fast and scalable analytics and ETL solutions.
Additionally, when using the ADX connector for Apache Spark, you can work in both batch and streaming modes. In this setup, ADX becomes a data store for standard Spark source and sink operations, thereby allowing read, write, and writeStream operations.
Now, the obvious question is: what are a Spark source and sink? Azure Data Explorer acts as a Spark source when data is read from it, and as a Spark sink when data is written to it.
Programmatic Ingestion using SDKs
The third category is programmatic ingestion, where Azure provides SDKs in different languages that can be used to query and ingest data into ADX. One of the best parts is that these SDKs are pre-optimized to reduce ingestion costs by minimizing storage transactions during and after the ingestion process. This is useful when you wish to set up the pipeline to exactly match your requirements and simultaneously lower the cost, but it comes with the overhead of managing and supporting your own ingestion pipeline.
The available SDKs and open-source projects cover .NET, Python, Java, Node.js, Go, and a REST API. For programmatic ingestion, different techniques are used in different scenarios, depending on the ingestion needs. These can be through the ADX data management services or batch ingestion using the SDK. Ingestion can also be done using control commands – from blobs, from a query, or inline.
When working with programmatic ingestion, you need to work with ingestion properties, which affect the way the data is ingested and mapped to the existing table columns. Apart from mapping, you can also use ingestion properties for tagging and for setting the creation time. To learn more about the ingestion properties, click here.
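The same properties also appear in KQL ingestion commands, passed through a `with (...)` clause. A hedged sketch – the table, mapping name, blob URL, tag, and timestamp below are all illustrative:

```kusto
// Illustrative example: ingest a blob while setting the format, the
// ingestion mapping, a tag, and the creation time via ingestion properties
.ingest into table RawEvents ('https://mystorageaccount.blob.core.windows.net/container/data.csv')
  with (
    format = 'csv',
    ingestionMappingReference = 'RawEventsCsvMapping',
    tags = '["ingest-by:demo-batch"]',
    creationTime = '2020-01-01T00:00:00Z'
  )
```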
Tools for Data Ingestion
Apart from ingestion using pipelines, connectors, plugins, and APIs, there are certain tools available to perform ingestion tasks. This can be done using tools like –
One-Click Ingestion
One-click ingestion is another way to quickly ingest data into Azure Data Explorer. It performs automated tasks behind the scenes, like creating tables and mapping structures in ADX based on the source of the data.
One-click ingestion can be scheduled to trigger at a specific event, as a continuous ingestion process, or as a one-time process. The best part is that it is a quick, intuitive, wizard-based process that can ingest data from different kinds of sources in a variety of formats, like CSV, JSON, TSV, SCSV, SOHSV, TSVE, and PSV.
Click the links below for a step-by-step guide on –
Ingesting into a new table from a container in CSV format
Ingesting into an existing table from a local file in JSON format
LightIngest
LightIngest is a command-line utility used for ad-hoc data ingestion into ADX with limited functionality. To use the tool, you need to download it, either as a NuGet package or as an archive you can extract with WinRAR. Links to both are given below.
As mentioned earlier, LightIngest can be used for ad-hoc data ingestion with limited functionality; it can pull data from a local folder or from an Azure Blob Storage container.
Click on how to install and use LightIngest for more details.
KQL Ingest Control Command
Apart from the above ingestion methods and tools, there are a number of ways in which data can be ingested into the Azure Data Explorer engine using KQL commands.
The query language for Azure Data Explorer (Kusto) has a number of commands that can be used to ingest data directly into ADX. Kusto Query Language ingestion commands are appropriate only for data exploration and prototyping because they bypass the data management services. This method must therefore not be used for production purposes or for scenarios involving high volumes of data.
There are three types of ingestion commands that can be used. They are –
Inline Ingestion
In this method, the data to be ingested is sent to the processing engine as part of the command itself. For this, you need to use the .ingest inline command.
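A minimal sketch of inline ingestion, with an illustrative table name and a couple of made-up CSV records (remember, this is for exploration and prototyping only):

```kusto
// Inline ingestion: the records travel inside the command text itself
// (table name and values below are illustrative)
.ingest inline into table RawEvents <|
2020-01-01T00:00:00Z,device-001,21.5
2020-01-01T00:01:00Z,device-002,22.1
```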
Ingest From Query
In this method, the data to be ingested is sent indirectly to the ADX processing engine, as the result of a query or command, using KQL control commands. To learn more, click here.
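For example, `.set-or-append` appends the rows produced by the query on the right-hand side of `<|` to the target table, creating it if it does not exist (the table and column names here are illustrative):

```kusto
// Ingest-from-query: the query result, not raw data, is ingested
// (table and column names below are illustrative)
.set-or-append HourlyReadings <|
    RawEvents
    | summarize AvgReading = avg(Reading) by DeviceId, bin(Timestamp, 1h)
```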
Ingest from Storage
In this method, the data to be ingested into the ADX engine for processing resides in external storage, and the KQL command .ingest into points the engine to it.
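A hedged sketch of ingest-from-storage, with a placeholder blob URL and credential (the command carries only a pointer to the blob, not the data itself):

```kusto
// Ingest-from-storage: the engine pulls the blob referenced by the URL
// (the URL and the <storage-account-key-or-sas> token are placeholders)
.ingest into table RawEvents (
    'https://mystorageaccount.blob.core.windows.net/container/data.csv;<storage-account-key-or-sas>'
  )
  with (format = 'csv')
```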
Part – 1: Data Science Overview
Part – 2: Understanding Azure Data Explorer
Part – 3: Azure Data Explorer Features
Part – 4: Azure Data Explorer Service Capabilities
Part – 5: Creating the ADX Environment
Part – 6: The Kusto Query Language
Part – 7: Data Obfuscation in Kusto Query Language
Part – 10: Managing Azure Data Explorer Cluster