Azure Data Explorer (ADX) has six service capabilities. We will discuss them one at a time to understand the framework of ADX and how it works behind the scenes.
ADX Service Capabilities
- Data Storage: There are two ways to store the data –
- Column Store (default) – Data is organized by column rather than by record, which is considered optimal for analytical queries. Storing the data in a sharded column store allows very large data sets to be stored and also allows for better compression.
- Row Store – Data is organized as records, without indexing. The main benefit is fast writes. It is optimal for use cases that need near-real-time access and where the data volume is small to medium.
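The difference between the two layouts can be illustrated with a minimal Python sketch. This is not how ADX implements its stores; the records and field names (`timestamp`, `level`, `latency_ms`) are hypothetical and only show why a row layout favors writes and a column layout favors analytical reads.

```python
# Hypothetical records; the field names are illustrative only.
records = [
    {"timestamp": 1, "level": "INFO", "latency_ms": 12},
    {"timestamp": 2, "level": "WARN", "latency_ms": 48},
    {"timestamp": 3, "level": "INFO", "latency_ms": 15},
]

# Row store: each record is kept whole, so a write is a single append.
row_store = []
for record in records:
    row_store.append(record)

# Column store: one list per field, so a write touches every column...
column_store = {"timestamp": [], "level": [], "latency_ms": []}
for record in records:
    for field, value in record.items():
        column_store[field].append(value)

# ...but an analytical query reads only the column it needs.
latencies = column_store["latency_ms"]
avg_latency_columnar = sum(latencies) / len(latencies)

# The same query on the row store has to walk every full record.
avg_latency_rows = sum(r["latency_ms"] for r in row_store) / len(row_store)

print(avg_latency_columnar, avg_latency_rows)
```

Both queries return the same average; the column layout simply gets there without touching the `timestamp` and `level` values.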
- Indexing: This is one of the things that sets ADX apart from other exploration tools. The index helps ADX decide which data extents a query should target, so that only a small subset of the data is scanned, which in turn improves performance. The underlying database has a unique inverted index design based on cardinality. The index does not exist only at the shard level: there are also finer-grained indexes within each shard. When one of those indexes is hit, only the part of the shard/extent relevant to the query is scanned.
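As a rough illustration of the idea, the sketch below builds a simple term-to-extent inverted index in Python. It is a toy model under stated assumptions, not ADX's actual index design: the extent names, the sample rows, and the `search` helper are all hypothetical.

```python
from collections import defaultdict

# Hypothetical extents (shards), each holding a few text rows.
extents = {
    "extent-a": ["login ok", "login failed"],
    "extent-b": ["disk full", "disk ok"],
    "extent-c": ["login ok", "disk ok"],
}

# Inverted index: term -> set of extents that contain the term.
index = defaultdict(set)
for name, rows in extents.items():
    for row in rows:
        for term in row.split():
            index[term].add(name)


def search(term):
    """Return the extents worth scanning and the matching rows."""
    candidates = sorted(index.get(term, set()))
    hits = [
        row
        for name in candidates
        for row in extents[name]
        if term in row.split()
    ]
    return candidates, hits


candidates, hits = search("failed")
print(candidates, hits)
```

A query for `failed` consults the index first and scans a single extent instead of all three.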
Behind the scenes, small shards are also merged together, which improves both compression and indexing. This is needed because data is ingested continuously from the sources while queries must still return with low latency.
Apart from the inverted index, ADX stores metadata and statistics for each column. These statistics work together with the data sharding described above. For example, for a numeric, datetime, or timespan column, ADX also stores the minimum and maximum values found in each extent. When a query filters on that column, the condition is compared against these values, and only the extents that can contain matching data are scanned.
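The min/max pruning described above can be sketched in a few lines of Python. This is a simplified model, not ADX code: the `Extent` class, the extent names, and the `query_greater_than` helper are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Extent:
    """A hypothetical extent with min/max statistics for one numeric column."""
    name: str
    values: list
    min_value: int = field(init=False)
    max_value: int = field(init=False)

    def __post_init__(self):
        # Statistics are computed once, when the extent is created.
        self.min_value = min(self.values)
        self.max_value = max(self.values)


def query_greater_than(extents, threshold):
    """Return matching values and the names of the extents actually scanned."""
    results, scanned = [], []
    for extent in extents:
        if extent.max_value <= threshold:
            continue  # no value in this extent can match, so skip it
        scanned.append(extent.name)
        results.extend(v for v in extent.values if v > threshold)
    return results, scanned


extents = [
    Extent("extent-a", [3, 7, 9]),
    Extent("extent-b", [15, 22, 18]),
    Extent("extent-c", [40, 35, 51]),
]
results, scanned = query_greater_than(extents, 20)
print(results, scanned)
```

Here `extent-a` is skipped without reading any of its rows, because its stored maximum (9) already rules it out.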
- Column Compression: As noted above, ADX compresses column data. LZ4 is one of the compression algorithms used; it offers excellent performance with a reasonable compression ratio. The data is always held in compressed form for faster data transfers, even when it is loaded into RAM.
- Metadata Storage: Apart from the data itself, ADX also stores different kinds of metadata that describe the data. A few of them are mentioned below –
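To get a feel for why a column of similar values compresses well, here is a small Python sketch. Note the assumption: Python's standard library has no LZ4 binding, so `zlib` is used purely as a stand-in; the algorithm, ratios, and speed differ from what ADX uses. The column contents are hypothetical.

```python
import zlib

# A hypothetical low-cardinality column, as it might appear in a column store.
column = ["INFO", "WARN", "INFO", "ERROR"] * 2500
raw = "\n".join(column).encode("utf-8")

# Keep the column compressed while it is stored or transferred.
# zlib stands in for LZ4 here; it is NOT the algorithm ADX uses.
compressed = zlib.compress(raw)

# Decompress only at the point where a query needs the values.
restored = zlib.decompress(compressed).decode("utf-8").split("\n")

print(len(raw), len(compressed))
```

The compressed buffer is much smaller than the raw bytes because the same few values repeat throughout the column.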
- Table schema
- Policy objects
- Security policies
- External table details
- Function declarations
We will discuss policy objects in more detail in later modules, when we work with the Kusto Query Language.
- Compute/Storage/Network Isolation: In Azure, the compute, storage, and networking services are isolated from one another to reduce interdependencies and improve performance. This separation of concerns also makes it possible to diagnose and fix issues in each service individually. ADX follows the same design. If the compute needs to be scaled up because of CPU load, that can be done without touching storage, and multiple compute nodes can access the same storage.
- Compute Data Caching: We discussed earlier that the data is held in RAM in compressed form and is decompressed only when a query actually needs it. ADX also makes full use of SSD storage: the most relevant data is cached on SSD so that it stays close to the CPU for faster retrieval. There are three different tiers:
- Azure Blob Storage
- Azure Compute SSD
- Azure Compute RAM
One other point worth mentioning is that caching is defined at the database level and inherited by the tables in the database, but the cache time span can also be configured for an individual table, overriding the database-level setting. Data in the hot cache is accessible to queries much faster, as there is no need to fetch it again from the storage. Which data stays in the hot cache is determined by the cache policy, typically the most recent data within the configured time span.
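The inheritance rule can be modeled with a short Python sketch. It is only an illustration under stated assumptions: the table names, the time spans, and the helper functions are hypothetical, and the sketch assumes the hot cache window is measured from the time the data was ingested.

```python
from datetime import datetime, timedelta

# Hypothetical database-level hot cache span, inherited by every table.
database_hot_cache = timedelta(days=31)

# Hypothetical table-level overrides; tables not listed inherit the default.
table_hot_cache = {"Telemetry": timedelta(days=7)}


def effective_hot_cache(table):
    """A table-level setting overrides the database-level one."""
    return table_hot_cache.get(table, database_hot_cache)


def is_hot(table, ingested_at, now):
    """Assumption: data counts as hot while its age is within the span."""
    return now - ingested_at <= effective_hot_cache(table)


now = datetime(2024, 1, 31)
ingested_at = datetime(2024, 1, 11)  # 20 days before `now`

print(is_hot("Telemetry", ingested_at, now))  # 7-day table override applies
print(is_hot("AuditLogs", ingested_at, now))  # 31-day database default applies
```

With these example values, the 20-day-old data falls outside the `Telemetry` override but inside the database default used by `AuditLogs`.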