Don't Use Cosine Similarity Carelessly!
2025-01-14

This article examines the risks of over-relying on cosine similarity when comparing vectors in data science. The author argues that although cosine similarity is computationally simple, it often fails to capture semantic similarity and is easily misled by superficial patterns such as writing style and typos. The article illustrates the problem with examples and proposes several alternatives: using LLMs directly for comparison, creating task-specific embeddings through fine-tuning or transfer learning, pre-prompt engineering, and text preprocessing. The author stresses choosing a similarity metric that fits the specific task rather than defaulting to cosine similarity.
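
To make the failure mode concrete, here is a minimal sketch of how cosine similarity can reward surface overlap over meaning. It is not taken from the article: `toy_embed` is a hypothetical stand-in that uses plain word counts instead of a real embedding model, and the three sentences are invented for illustration. Real embeddings are far richer, but the same kind of mismatch can still appear.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like maps."""
    # Only keys present in both vectors contribute to the dot product.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def toy_embed(text):
    """Hypothetical 'embedding': lowercase word counts (an assumption, not a real model)."""
    return Counter(text.lower().split())


original = toy_embed("I like this movie")
negation = toy_embed("I do not like this movie")  # opposite meaning, similar wording
paraphrase = toy_embed("this film is great")      # similar meaning, different wording

sim_negation = cosine_similarity(original, negation)
sim_paraphrase = cosine_similarity(original, paraphrase)

print(f"original vs negation:   {sim_negation:.3f}")
print(f"original vs paraphrase: {sim_paraphrase:.3f}")
```

Under this toy representation the negated sentence scores higher than the paraphrase, because the metric only measures how much the vectors point in the same direction, and shared words dominate that direction. Whether a given embedding model shows the same behavior depends on how it was trained, which is why the choice of embedding and metric should follow the task.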