Embedding User-Defined Indexes in Apache Parquet Files: No More External Indexes!

2025-07-15

It's a common misconception that Apache Parquet is limited to basic statistics and Bloom filters. This post shows how to embed custom indexes directly in Parquet files without breaking compatibility with existing readers. By storing the index bytes inside the file and recording their offset in the footer metadata, you can add indexes such as a list of distinct values for a specific column, which can improve query performance considerably, especially for highly selective predicates. The authors explain the mechanism and walk through a practical example in Apache DataFusion that serializes, stores, and reads such an index, avoiding the complexity and risks of maintaining external indexes.
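
To make the idea of offset-based addressing concrete, here is a minimal sketch that uses only the Rust standard library. It does not use the real Parquet or DataFusion APIs: the file is modeled as a plain byte buffer and the footer as a string map. The magic marker `IDX1`, the key `distinct_index_offset`, the length-prefixed layout, and all function names are assumptions made for illustration, not necessarily what the post uses.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical magic marker that identifies the index blob.
const INDEX_MAGIC: &[u8; 4] = b"IDX1";
/// Hypothetical footer metadata key that stores the blob's byte offset.
const OFFSET_KEY: &str = "distinct_index_offset";
/// Header size: 4 magic bytes plus an 8-byte little-endian payload length.
const HEADER_LEN: usize = 12;

/// Serialize distinct values as: magic, payload length, newline-joined values.
fn serialize_index(values: &BTreeSet<String>) -> Vec<u8> {
    let payload = values
        .iter()
        .map(String::as_str)
        .collect::<Vec<_>>()
        .join("\n");
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len());
    out.extend_from_slice(INDEX_MAGIC);
    out.extend_from_slice(&(payload.len() as u64).to_le_bytes());
    out.extend_from_slice(payload.as_bytes());
    out
}

/// Append the index blob after the existing bytes and record where it starts.
fn embed_index(
    file: &mut Vec<u8>,
    footer: &mut HashMap<String, String>,
    values: &BTreeSet<String>,
) {
    let offset = file.len();
    file.extend_from_slice(&serialize_index(values));
    footer.insert(OFFSET_KEY.to_string(), offset.to_string());
}

/// Locate the index through the footer offset; None if absent or malformed.
fn read_index(file: &[u8], footer: &HashMap<String, String>) -> Option<BTreeSet<String>> {
    let offset: usize = footer.get(OFFSET_KEY)?.parse().ok()?;
    let header = file.get(offset..offset.checked_add(HEADER_LEN)?)?;
    if header[..4] != INDEX_MAGIC[..] {
        return None;
    }
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&header[4..HEADER_LEN]);
    let len = u64::from_le_bytes(len_bytes) as usize;
    let start = offset + HEADER_LEN;
    let payload = file.get(start..start.checked_add(len)?)?;
    let text = std::str::from_utf8(payload).ok()?;
    Some(
        text.split('\n')
            .filter(|s| !s.is_empty())
            .map(str::to_string)
            .collect(),
    )
}

/// A file can be skipped when the value the predicate asks for is not in the index.
fn can_skip(index: &BTreeSet<String>, wanted: &str) -> bool {
    !index.contains(wanted)
}
```

In this sketch, a reader that knows the key calls `read_index` and then `can_skip` for an equality predicate, while a reader that does not know the key never looks up the offset and simply ignores the extra bytes. That is the property the compatibility claim relies on.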

Development User-Defined Indexes

Apache DataFusion: A Powerful and Extensible Query Engine in Rust

2025-01-16

Apache DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. It offers SQL and DataFrame APIs, excellent performance, and built-in support for CSV, Parquet, JSON, and Avro. DataFusion includes a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and support for partitioned data sources. It is highly customizable: users can add data sources, query languages, functions, custom operators, and more. Related subprojects include DataFusion Python (Python bindings), DataFusion Ray (a distributed version), and DataFusion Comet (an Apache Spark accelerator).

Development Query Engine