Lessons Learned from Unpacking VertiPaq: A Developer’s Journey

Igor Cotruta
May 15, 2024

VertiPaq is the engine behind Microsoft’s in-memory column store technologies, powering everything from PowerPivot in Excel and Power BI data models to columnstore indexes and Analysis Services tabular models in SQL Server. Over the past few months, I’ve been on a mission to decode the VertiPaq protocol and integrate it into a DuckDB extension. Given the rich DuckDB ecosystem, this opens the technology to cross-platform and cross-language data integration. I’m a one-man band, so I started with a reduced scope: parse PBIX files and assess the extent of unsolvable challenges. Here are a few lessons I’ve learned along the way.

Understanding VertiPaq’s Compression Layers

[Figure: sketch visualising XPRESS9 decompression and ABF decoding]

There are up to four (five, if you count the ZIP container*) layers of compression to peel back when reconstructing data from VertiPaq:

  1. ZIP compression (compression=0)*
  2. XPRESS9 compression
  3. Huffman compression
  4. Run-Length Encoding
  5. Bit Packing

That’s all in addition to deserialising and decoding various metadata-specific structures.

Let’s examine these layers. The first relates to the PBIX file itself, which is a ZIP archive. The file we’re after is called DataModel. I wouldn’t call this a true compression layer because the file is stored as is: compression=0.
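
A minimal sketch of that first step in Python, using only the standard library (the PBIX file name is illustrative):

    import zipfile

    def extract_datamodel(pbix_path: str) -> bytes:
        """Pull the DataModel part out of a PBIX archive."""
        with zipfile.ZipFile(pbix_path) as archive:
            info = archive.getinfo("DataModel")
            # Stored rather than deflated: compress_type is ZIP_STORED (0)
            assert info.compress_type == zipfile.ZIP_STORED
            return archive.read("DataModel")

    raw = extract_datamodel("report.pbix")
    print(f"DataModel is {len(raw):,} bytes")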

Next, we have XPRESS9, a compute-optimised compression scheme commonly used across various Microsoft products, including hibernation files and Microsoft Exchange. Decompressing the XPRESS9 container reveals an Analysis Services Backup File (ABF).
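
For orientation, here’s a hedged sketch of walking the XPRESS9 container. The leading header size and the chunk layout (little-endian uint32 pairs of uncompressed and compressed sizes) are my own assumptions from poking at the stream, and xpress9_decompress stands in for a real Xpress9 binding; Microsoft ships the reference implementation in C.

    import struct

    HEADER_SIZE = 102  # assumed size of the leading backup-log header

    def iter_xpress9_chunks(data: bytes):
        # Assumed chunk layout: [uncompressed_size:u4][compressed_size:u4][payload]
        offset = HEADER_SIZE
        while offset + 8 <= len(data):
            uncompressed_size, compressed_size = struct.unpack_from("<II", data, offset)
            offset += 8
            payload = data[offset : offset + compressed_size]
            offset += compressed_size
            # xpress9_decompress is a placeholder for an actual binding
            yield xpress9_decompress(payload, uncompressed_size)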

ABF is the backup format for SQL Server Analysis Services (SSAS), and in the context of tabular models, its size reflects the memory needed to load the data model. The format itself is a byte stream plus some additional metadata describing its contents. While a PBIX file is open, you can browse the extracted contents of the ABF under C:\Users\{Username}\Microsoft\Power BI Desktop Store App\AnalysisServicesWorkspaces\AnalysisServicesWorkspace_{GUID}\Data\{GUID}.db

Every column in the model has a column data storage file (*.IDF) per segment. The underlying data is compressed using a hybrid of Run-Length Encoding (RLE) and Bit-Packing.
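
To make the hybrid concrete, here’s a generic sketch of the two sub-steps. It illustrates the idea rather than VertiPaq’s exact on-disk layout; the bit width, bit order and run encoding details all come from the segment metadata.

    def unpack_bits(buf: bytes, bit_width: int, count: int) -> list[int]:
        """Unpack fixed-width integers from a bit-packed buffer (LSB-first)."""
        values, acc, acc_bits = [], 0, 0
        mask = (1 << bit_width) - 1
        for byte in buf:
            acc |= byte << acc_bits
            acc_bits += 8
            while acc_bits >= bit_width and len(values) < count:
                values.append(acc & mask)
                acc >>= bit_width
                acc_bits -= bit_width
        return values

    def expand_rle(runs: list[tuple[int, int]]) -> list[int]:
        """Expand (value, run_length) pairs into the full vector."""
        out: list[int] = []
        for value, run_length in runs:
            out.extend([value] * run_length)
        return out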

Depending on the column encoding (Hash/Dictionary), it may have an associated *.dictionary file, which can contain an additional compression layer for all the unique text values. The compression used here is classic Huffman.
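
As a refresher, classic Huffman decoding walks the bit stream and emits a symbol whenever the accumulated bits match a code. This generic sketch assumes the code table has already been rebuilt from the dictionary file; the actual table layout is defined in the Microsoft specs discussed below.

    def huffman_decode(bits: str, code_table: dict[str, str]) -> str:
        """Greedily match prefix-free codes against a bit string."""
        out, current = [], ""
        for bit in bits:
            current += bit
            if current in code_table:
                out.append(code_table[current])
                current = ""
        return "".join(out)

    # huffman_decode("010111", {"0": "a", "10": "b", "111": "c"}) -> "abc"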

Navigating VertiPaq File Structures

Let’s return to the previous picture and discuss the different file-based data structures that make up the VertiPaq store.

Metadata Database

The metadata.sqlitedb file is a SQLite database that contains all the relevant metadata. This database holds extensive information beyond the metrics available from SSAS DMVs, which is critical for reconstructing all column-based data structures into a cohesive tabular data model.
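
Here’s a minimal sketch of peeking inside it with Python’s built-in sqlite3 module; only standard SQLite is assumed, and the table names you’ll see correspond to tabular model metadata entities.

    import sqlite3

    # List the metadata tables shipped inside the extracted model
    conn = sqlite3.connect("metadata.sqlitedb")
    for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ):
        print(name)
    conn.close()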

Root-Level Folders

At the root level, there are four distinct folder types, visible in the picture above.

Partition Subfolders

Each folder has subfolders mapped to partitions, following the pattern {PartitionID}.prt. Within these partition subfolders, every column has a pair of files per segment: a data file (*.idf) and a metadata file (*.idfmeta), corresponding to the encoded data and the metadata that describes it.

Hash/Dictionary Encoding Files

At the level of partition subfolders, depending on the column’s encoding (Hash/Dictionary), you’ll find the corresponding dictionary and hash index files.

It was a relief that this folder structure perfectly matched a conceptual model of VertiPaq I built a while ago.

Decoding VertiPaq Files

Great! Now that we have all these VertiPaq files, what’s next? This is where we dive into the Microsoft Open Specifications documents and match the binary file layouts to the protocol specifications.

I chose two key documents for this task: [MS-XLDM]: Spreadsheet Data Model File Format and [MS-SSAS-T]: SQL Server Analysis Services Tabular Protocol. [MS-SSAS-T] helped me understand the data in metadata.sqlitedb, while [MS-XLDM] was crucial for making sense of ABF and VertiPaq files.

These documents were invaluable for reconstructing VertiPaq structures. However, translating the specifications into working code took several thorough readings and many sleepless nights. The specs make much more sense once you see the code in action.

Another tool that proved essential in parsing VertiPaq structures is Kaitai Struct. This language-neutral toolkit for writing binary parsers includes a DSL and various tools for visualisation and compilation. Writing DuckDB extensions in C++ is no small feat, so I validated some of the parsers and assembly logic in Python and JavaScript before tackling the DuckDB scanner. Kaitai Struct made this process much easier and was pivotal in developing the DuckDB PBIX extension.
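
To give a flavour, a .ksy spec compiled to Python yields a parser class you can drive like this; ColumnData, its module and the file name are hypothetical stand-ins for a spec describing IDF files.

    # column_data.py would be generated by kaitai-struct-compiler from a
    # .ksy spec; the kaitaistruct runtime package supplies from_file.
    from column_data import ColumnData  # hypothetical generated module

    idf = ColumnData.from_file("0.Sales.Amount.0.idf")  # hypothetical name
    print(idf)  # inspect the parsed segment structure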

Assembling a VertiPaq Table

Now that we have the metadata and the decoded VertiPaq structures, let’s build a table. Assembling a table from columnar constructs involves the following five steps (steps 4 and 5 are sketched in code after the list):

  1. Read Metadata: Query metadata related to the table from metadata.sqlitedb.
  2. Parse IDFmeta Files: Enhance the metadata with additional information from the idfmeta files.
  3. Unpack the IDF Files: Use the metadata to unpack the bits and expand the run-length encoding (RLE) to reconstruct the vector of internal IDs.
  4. Parse Dictionary Files or Scale Internal IDs: Decode dictionary files for columns using dictionary-based encoding. For hash-based columns, map the internal IDs using the metadata’s BaseId and Magnitude.
  5. Assemble Columns: Map internal IDs to dictionary indexes and combine column data one column at a time, following the order specified in the metadata.
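
To ground steps 4 and 5, here’s a hedged sketch that would sit downstream of the unpack_bits and expand_rle helpers above; the direction of the BaseId/Magnitude transform shown is my reading and worth double-checking against the spec.

    def materialise_column(internal_ids, dictionary=None, base_id=0, magnitude=1.0):
        # Dictionary encoding: internal IDs index into the decoded
        # dictionary of unique values
        if dictionary is not None:
            return [dictionary[i] for i in internal_ids]
        # Value (hash) encoding: reverse the BaseId/Magnitude scaling;
        # the exact formula here is an assumption
        return [(i + base_id) / magnitude for i in internal_ids]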

Other structures in VertiPaq serve specific purposes during data queries rather than table reconstruction:

  • Hashes: Used to filter IDF data without unpacking.
  • Relationship Structures: Facilitate filter propagation across related tables.
  • Hierarchy Data Structures: Enhance MDX queries or add functionality to existing columnar structures.

DuckDB-PBIX-Extension: Bridging VertiPaq and DuckDB

The DuckDB-PBIX-extension is the culmination of my efforts to decode VertiPaq and leverage DuckDB’s powerful, cross-platform capabilities. DuckDB was chosen for its incredible support across various platforms and multiple programming languages, making it an ideal foundation for this project.

This open-source extension demonstrates how to parse PBIX files using DuckDB, essentially doing a reverse direct query and writing SQL on top of VertiPaq.

The extension introduces two table-valued functions: pbix_meta and pbix_read. These functions take the PBIX file name and the name of the metadata or data model table as arguments. In my tests, I could parse the metadata of enormous PBIX files in a fraction of a second.
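
For example, from DuckDB’s Python API (the file and table names below are illustrative; the function signatures are as described above):

    import duckdb

    con = duckdb.connect()
    # assumes the pbix extension has been built and loaded, e.g.
    # con.load_extension("pbix")

    # Browse model metadata, then read a model table
    con.sql("SELECT * FROM pbix_meta('report.pbix', 'table')").show()
    con.sql("SELECT * FROM pbix_read('report.pbix', 'Sales')").show()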

The project is in its infancy; however, given DuckDB’s rich ecosystem, I’ve already been able to parse PBIX files from my phone, showcasing the true potential of “write once, run everywhere.”

Give it a try right in your browser.

Originally published at http://querypower.com on May 15, 2024.
