Sampling Big Data: Small Samples, Big Answers

Hadley Wickham's recent interview highlighted that many big data problems are actually small data problems, once you find the right subset, sample, or summary. This post explores efficient sampling for big data analysis. Using the example of Goatly, a company serving narcoleptic goats, the author demonstrates how to calculate an appropriate sample size for logistic regression, concluding that approximately 2,345 samples are needed to accurately represent 100,000 farms. The post also covers Python scripts and online tools for sample size calculation, and briefly touches on the concept of statistical power.
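As a rough illustration of where a figure like 2,345 can come from: Cochran's sample size formula with a finite population correction reproduces it exactly if one assumes a 95% confidence level, a ±2% margin of error, and the maximally conservative proportion p = 0.5. These parameters are assumptions for this sketch, not necessarily the ones used in the original post.

```python
import math
from statistics import NormalDist

def sample_size(population: int, margin: float = 0.02,
                confidence: float = 0.95, p: float = 0.5) -> int:
    """Cochran's formula with finite population correction.

    Assumed defaults: 95% confidence, +/-2% margin of error,
    and p = 0.5 (the most conservative choice).
    """
    # z-score for the chosen two-sided confidence level
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Infinite-population sample size
    n0 = z**2 * p * (1 - p) / margin**2
    # Correct for the finite population of farms
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)

print(sample_size(100_000))  # → 2345
```

With these assumptions, 100,000 farms yields a required sample of 2,345 — tiny relative to the full population, which is the post's central point.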