SmallPond: A Lightweight Data Processing Framework

2025-03-02

SmallPond is a lightweight, high-performance data processing framework built on DuckDB and 3FS. It handles petabyte-scale datasets without requiring long-running services, supports Python 3.8 through 3.12, and offers a simple API for loading, processing, and saving data. In a GraySort benchmark on a cluster of 50 compute nodes and 25 storage nodes running 3FS, SmallPond sorted 110.5 TiB of data in 30 minutes and 14 seconds, an average throughput of 3.66 TiB/min.
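
As a quick sanity check on the benchmark figure, the average throughput can be recomputed from the two numbers quoted above. This is only a sketch: the constant and function names are illustrative, and it assumes the published inputs (110.5 TiB, 30 minutes 14 seconds) are themselves rounded, so the result may differ from the quoted 3.66 TiB/min in the last digit.

```python
# Figures quoted in the benchmark summary (assumed to be rounded values).
DATA_SORTED_TIB = 110.5
ELAPSED_MINUTES = 30 + 14 / 60  # 30 minutes 14 seconds


def average_throughput(data_tib: float, minutes: float) -> float:
    """Return the average throughput in TiB per minute."""
    return data_tib / minutes


throughput = average_throughput(DATA_SORTED_TIB, ELAPSED_MINUTES)

# Per-compute-node share, assuming the 50 compute nodes contribute evenly
# (an illustrative simplification, not a claim from the benchmark itself).
per_node = throughput / 50

print(f"cluster: {throughput:.3f} TiB/min")
print(f"per compute node: {per_node:.4f} TiB/min")
```

The recomputed value lands within about 0.01 TiB/min of the quoted figure, which is consistent with the inputs having been rounded before publication.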

Development