Azure Data Explorer – Productivity and Scalability


Now that we understand what Azure Data Explorer (ADX) is, it is time to dig a little deeper into two characteristics that set it apart: productivity (defined here by performance and cost) and scalability.

  • Performance and Cost – Performance in ADX comes from fast indexing of all columns, including free-text and dynamic columns. We will cover data types in more detail in upcoming modules; for now, treat these column types as data types. This breadth of indexing is a large part of what sets ADX apart from other data analysis tools. ADX also uses an inverted term index, and how that index is built depends on the cardinality of the column. The next contributor to productivity is column compression: the data analysis engine keeps the data compressed even when it is loaded into the cache.

    ADX integrates with Data Lake Storage (Gen1 and Gen2) and with Blob Storage. These services are built for big data analytics: they provide a high-performance file system that scales while keeping cost low, which helps reduce your time to insight. Best of all, data in Data Lake Storage can be queried and analyzed before it is ingested into ADX, although this approach should be reserved for historical or rarely used data.
    [Figure: data ingestion overview. Courtesy: ADX – Data Ingestion Overview]

    Additionally, when data is pulled from external sources such as Data Lake Storage, it is either batched or streamed to the Data Manager. The data management service in ADX can optimize ingestion throughput for batched data headed to the same database and table. It also validates the data and converts its format where needed.

    The data management service in Azure Data Explorer also provides advanced data manipulation services. Beyond organizing the data, it performs schema matching, indexing, encoding, and compression. This automatic indexing and compression makes the data quick and easy to access and also reduces the amount of storage and SSD used, which in turn saves cost.

    The Microsoft Azure Data Explorer team also recently launched a query results cache. It is enabled by setting the “query_results_cache_max_age” option as part of the query, which improves both the experience and performance.

    "set query_results_cache_max_age = time(1m);"

    With this option set, if a later query needs the same results and the cached results are no older than the specified maximum age, the results are returned from the cache. The feature therefore lowers resource consumption and, with it, cost.
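
The max-age behaviour described above can be pictured with a small toy model. This is only a sketch of the idea and not how ADX implements its cache: the class name `ResultsCache`, the `run` method, and the `fake_execute` helper are all hypothetical, and the real service may apply further conditions that are not modelled here.

```python
from datetime import datetime, timedelta


class ResultsCache:
    """Toy model of a max-age results cache (hypothetical, not the ADX implementation)."""

    def __init__(self):
        # Maps query text -> (result, time the result was stored).
        self._entries = {}

    def run(self, query, max_age, execute, now):
        """Return (result, served_from_cache) for the given query text."""
        entry = self._entries.get(query)
        if entry is not None and now - entry[1] <= max_age:
            # Cached result is still within the allowed age: reuse it.
            return entry[0], True
        # No entry, or the entry is too old: execute and store a fresh result.
        result = execute(query)
        self._entries[query] = (result, now)
        return result, False


calls = []


def fake_execute(query):
    """Stand-in for the query engine; records each real execution."""
    calls.append(query)
    return len(calls)


cache = ResultsCache()
t0 = datetime(2021, 1, 1, 12, 0, 0)
one_minute = timedelta(minutes=1)  # plays the role of time(1m)

first = cache.run("Q", one_minute, fake_execute, t0)
second = cache.run("Q", one_minute, fake_execute, t0 + timedelta(seconds=30))
third = cache.run("Q", one_minute, fake_execute, t0 + timedelta(minutes=2))
```

In this sketch the second call arrives 30 seconds after the first and is served from the cache, while the third arrives after the one-minute limit and triggers a fresh execution. Only the fresh executions consume engine resources, which is where the cost saving comes from.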
  • Scalability – The second characteristic is scalability. Azure Data Explorer scales quickly to handle growing data volumes and increasing ingestion and query load.

    Since cluster utilization is very difficult to predict, a statically sized cluster is rarely a good fit. Instead, ADX can respond to changing demand by scaling. There are two ways in which scaling can happen in Azure Data Explorer. They are –
    • Scaling Up – This is also known as vertical scaling. Here you change the cluster's SKU in the Azure portal, under the Scale Up option of the ADX cluster.
    • Scaling Out – This is also known as horizontal scaling. Here the number of instances handling the workload is adjusted to match demand. It can be further classified as –
      • Manual Scaling – This is the default setting, defined when the cluster is created. The instance count can only be changed manually, and it stays fixed until it is increased or reduced again.
      • Optimized Autoscale – This is the recommended method. Optimized autoscale is fully managed scaling logic that optimizes both performance and cost. It scales the number of instances out or in based on the ingestion, data, and query load on the cluster, using predefined rules.
      • Custom Autoscale – This is another way to scale the cluster, but here users must configure their own scale-out and scale-in logic with metric-based rules in order to balance cost and performance.
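
To make the idea of metric-based rules more concrete, here is a minimal sketch of how a custom scale-out / scale-in decision could be evaluated. Everything in it is an assumption for illustration: the `ScaleRule` type, the `apply_rules` function, the `cpu_percent` metric name, and the thresholds are hypothetical and do not reflect the actual Azure autoscale rule format or API.

```python
from dataclasses import dataclass


@dataclass
class ScaleRule:
    """One hypothetical metric-based rule."""
    metric: str       # name of the metric to inspect (assumed name)
    threshold: float  # value that triggers the rule
    direction: str    # "out" adds instances, "in" removes them
    step: int         # how many instances to add or remove


def apply_rules(instance_count, metrics, rules, min_instances, max_instances):
    """Return the new instance count; the first matching rule wins."""
    for rule in rules:
        value = metrics[rule.metric]
        if rule.direction == "out" and value > rule.threshold:
            # Scale out, but never beyond the configured maximum.
            return min(instance_count + rule.step, max_instances)
        if rule.direction == "in" and value < rule.threshold:
            # Scale in, but never below the configured minimum.
            return max(instance_count - rule.step, min_instances)
    # No rule matched: keep the current size.
    return instance_count


rules = [
    ScaleRule(metric="cpu_percent", threshold=80, direction="out", step=1),
    ScaleRule(metric="cpu_percent", threshold=30, direction="in", step=1),
]

busy = apply_rules(4, {"cpu_percent": 90}, rules, 2, 8)      # 5
idle = apply_rules(4, {"cpu_percent": 20}, rules, 2, 8)      # 3
steady = apply_rules(4, {"cpu_percent": 50}, rules, 2, 8)    # 4
at_limit = apply_rules(8, {"cpu_percent": 90}, rules, 2, 8)  # 8
```

The minimum and maximum bounds are what keep cost and performance in check: the cluster never grows past the budgeted size and never shrinks below what the baseline workload needs. Manual scaling corresponds to having no rules at all, so the count stays wherever it was last set.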

Part – 1: Data Science Overview

Part – 2: Understanding Azure Data Explorer

Part – 4: Azure Data Explorer Service Capabilities

Part – 5: Creating the ADX Environment

Part – 6: The Kusto Query Language

Part – 7: Data Obfuscation in Kusto Query Language

Part – 8: Data Ingestion Preparation: Schema Mapping

Part – 9: Overview of data ingestion in Azure Data Explorer

Part – 10: Managing Azure Data Explorer Cluster
