Webtagr - Technology News Summarizer

Extracting MRR from Stripe Data: Pitfalls and SQL Implementation

2025-05-16

This article details how to extract data from the Stripe API and calculate Monthly Recurring Revenue (MRR). The author warns that Stripe's `subscriptions` object is unreliable for this purpose, because it only contains the latest state of each subscription and not its history. The correct approach uses `invoice line items`, which requires handling discounts and varying billing cycles (monthly, quarterly, annual), among other details. The article provides detailed SQL code covering data cleaning, cycle normalization, and the final MRR metric calculations, including new MRR, churned MRR, expansion MRR, and reactivation MRR. The article emphasizes that the method is adaptable and customizable, and recommends an application to simplify MRR calculations.
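
The cycle normalization and MRR categories described above can be illustrated with a minimal Python sketch. This is not the article's SQL: the `LineItem` fields, the interval names, and the `contraction` category are assumptions made for illustration, and they do not mirror Stripe's actual schema.

```python
from dataclasses import dataclass

# Months covered by each billing interval.
# The interval names are assumptions for this sketch, not Stripe field values.
MONTHS_PER_INTERVAL = {"month": 1, "quarter": 3, "year": 12}


@dataclass
class LineItem:
    """Hypothetical, simplified stand-in for an invoice line item."""
    customer: str
    amount_cents: int    # price before discount
    discount_cents: int  # discount applied to this line
    interval: str        # "month", "quarter", or "year"


def monthly_amount(item: LineItem) -> float:
    """Spread the discounted amount evenly over the months it covers."""
    net = item.amount_cents - item.discount_cents
    return net / MONTHS_PER_INTERVAL[item.interval]


def classify(previous: float, current: float, ever_paid_before: bool) -> str:
    """Label a customer's month-over-month MRR movement."""
    if previous == 0 and current > 0:
        # A customer who paid in some earlier month is returning, not new.
        return "reactivation" if ever_paid_before else "new"
    if previous > 0 and current == 0:
        return "churn"
    if current > previous:
        return "expansion"
    if current < previous:
        return "contraction"  # assumption: not listed in the summary above
    return "unchanged"
```

For example, an annual line item of 12000 cents with no discount contributes 1000 cents to each month, and a customer moving from 1000 to 1500 cents of monthly revenue is counted as expansion.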

DeepSeek's smallpond and 3FS: Scaling DuckDB to Petabytes

2025-03-02

DeepSeek AI has released smallpond and 3FS, designed to extend the DuckDB database to handle petabyte-scale datasets. smallpond is a lightweight distributed data processing framework enabling DuckDB to process data in parallel across multiple nodes, while 3FS is a high-performance parallel file system leveraging SSDs and RDMA networking for extreme throughput. However, deploying and using these tools is complex, requiring specialized hardware and DevOps expertise. For datasets under 10 TB, a single-node DuckDB instance or simpler solutions are more efficient. Only when dealing with massive datasets do smallpond and 3FS show their advantages.

Streaming Data in DuckDB: Conquering Concurrency Limits with Arrow Flight

2025-01-29

Definite's blog post showcases a clever way to work around DuckDB's concurrency limitations using Apache Arrow Flight. While DuckDB excels at single-machine analytics, a database file cannot be written by one process while other processes read it, which restricts its use in real-time streaming scenarios. The 'Duck Takes Flight' Python script builds an Arrow Flight server that accepts concurrent writes and reads on behalf of DuckDB. At about 200 lines, the solution requires no complex cluster setup and delivers high-performance stream processing, offering a fresh approach for applications that need fast data movement and on-the-fly querying.