Embedding Dimensions: From 300 to 4096, and Beyond

2025-09-08

A few years ago, embeddings with 200-300 dimensions were the norm. With the rise of deep learning models like BERT and GPT, and advances in GPU computing, embedding dimensionality has exploded: from BERT's 768 dimensions, to the 1536 dimensions of OpenAI's text-embedding-ada-002, to recent models with 4096 dimensions or more. The growth is driven by architectural changes (Transformers), larger training datasets, the rise of platforms like Hugging Face, and advances in vector databases. While higher dimensionality can improve performance, it also raises storage and inference costs. Recent research explores more efficient representations, such as Matryoshka Representation Learning, aiming for a better balance between performance and efficiency.
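
The Matryoshka idea is straightforward to apply at inference time: for a model trained with the Matryoshka objective, the leading coordinates of the embedding carry the coarsest information, so you can keep a prefix of the vector and re-normalize. A minimal sketch (the 4096-dimension vector and the 256-dimension target are illustrative assumptions, and the trick only helps for models actually trained this way):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates and re-normalize to unit length."""
    prefix = embedding[:dim]
    return prefix / np.linalg.norm(prefix)

# Hypothetical 4096-d vector; real embeddings would come from a
# Matryoshka-trained model, where the leading coordinates matter most.
rng = np.random.default_rng(0)
full = rng.normal(size=4096)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # 16x less storage per vector
print(small.shape)                     # (256,)
```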

Sampling Big Data: Small Samples, Big Answers

2025-05-31

In a recent interview, Hadley Wickham highlighted that many big data problems are really small data problems, once you find the right subset, sample, or summary. This post digs into efficient sampling for big data analysis. Using the example of Goatly, a fictional company serving narcoleptic goats, the author shows how to calculate an appropriate sample size for a logistic regression, concluding that roughly 2,345 sampled farms are enough to represent a population of 100,000. The post also covers Python scripts and online tools for sample size calculation, and briefly touches on statistical power.
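
The post's own scripts aren't reproduced here, but the 2,345 figure is consistent with Cochran's sample-size formula plus a finite-population correction, assuming a 95% confidence level, a 2% margin of error, and maximum variance (p = 0.5); those parameters are inferred from the result, not quoted from the post. A minimal sketch:

```python
import math

def cochran_sample_size(population: int,
                        margin_of_error: float = 0.02,
                        z: float = 1.96,    # z-score for 95% confidence
                        p: float = 0.5) -> int:
    """Cochran's formula with a finite-population correction."""
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2  # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)                # correct for finite N
    return math.ceil(n)

print(cochran_sample_size(100_000))  # -> 2345
```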

Hacker News: A Decade of Tech Growth

2025-03-18

The author began reading Hacker News in 2011, initially understanding very little of the technical jargon or the companies being discussed. Through daily reading and deep dives into unfamiliar concepts, the author went from data analyst to an engineer confidently shipping code to millions of users. Hacker News offered not just learning resources but a supportive community that helped the author improve both technical skills and writing, ultimately enabling a significant career leap.

LLMs: Exploring Arithmetic Capabilities in the Pursuit of AGI

2024-12-24

This article explores why large language models (LLMs) are being asked to do calculation at all. While LLMs excel at natural language processing, researchers keep pushing them to perform mathematical operations, from simple addition to complex theorem proving. The goal isn't to replace calculators but to probe the reasoning capabilities of LLMs on the path toward artificial general intelligence (AGI). The article notes that humans have always tried to repurpose new technology for computation, and that testing the mathematical abilities of LLMs is a way of testing their reasoning. But the way an LLM computes is drastically different from a calculator: the former relies on vast training data and probabilistic next-token prediction, while the latter runs deterministic algorithms. As a result, LLM calculation results are not always accurate or reliable, underscoring that their arithmetic is more a research probe than a practical tool.
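
The article doesn't include code, but the deterministic-versus-probabilistic contrast is easy to caricature. The sketch below is a toy illustration, not how any real model works: the "LLM" here is just a hand-built output distribution that puts most, but not all, of its probability mass on the correct answer, which is enough to show why sampled answers can't be trusted the way a calculator's can.

```python
import random

def calculator_add(a: int, b: int) -> int:
    """Deterministic: exact integer arithmetic, correct every time."""
    return a + b

def toy_llm_add(a: int, b: int) -> str:
    """A stand-in for LLM decoding: the answer is *sampled* from a
    distribution over strings, so occasional errors are built in.
    The candidates and weights are invented for illustration."""
    answer = a + b
    candidates = [str(answer), str(answer + 10), str(answer - 1)]
    weights = [0.95, 0.03, 0.02]  # hypothetical output distribution
    return random.choices(candidates, weights=weights, k=1)[0]

if __name__ == "__main__":
    print(calculator_add(1234, 5678))             # always 6912
    samples = [toy_llm_add(1234, 5678) for _ in range(1_000)]
    accuracy = samples.count("6912") / len(samples)
    print(f"toy 'LLM' accuracy: {accuracy:.1%}")  # about 95%, never guaranteed
```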
