Faster Than memcpy: A Benchmark of Custom Memory Copying Methods

2025-08-11

While profiling, the author found that `memcpy` was a bottleneck for large binary messages. Several custom memory copy methods were implemented and benchmarked against it, including variations using `REP MOVSB` and AVX instructions (aligned, stream aligned, and stream aligned with prefetching). For small to medium-sized messages, the unrolled AVX version performed best. For large messages (> 1 MB), the stream aligned AVX version with prefetching was fastest, but it performed very poorly on small messages. The conclusion? `std::memcpy` offers a good balance of performance and adaptability; custom methods are unnecessary unless performance is paramount.
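To make the "stream aligned with prefetching" idea concrete, here is a minimal sketch of what such a copy might look like. It is not the author's benchmarked code. The names `stream_copy_aligned`, `copy_message`, `align32`, and `round_trip_ok` are hypothetical, the prefetch distance of 8 vectors is an arbitrary choice, and the 1 MB cutoff is simply borrowed from the "> 1 MB" figure above. When the compiler is not targeting AVX, the sketch falls back to `std::memcpy`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: copy with non-temporal (streaming) AVX stores.
// Assumes dst and src are 32-byte aligned and size is a multiple of 32.
inline void stream_copy_aligned(void* dst, const void* src, std::size_t size) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 32 == 0);
    assert(reinterpret_cast<std::uintptr_t>(src) % 32 == 0);
    assert(size % 32 == 0);
#if defined(__AVX__)
    auto* d = static_cast<__m256i*>(dst);
    const auto* s = static_cast<const __m256i*>(src);
    const std::size_t n = size / 32;
    constexpr std::size_t kPrefetchAhead = 8;  // assumed distance, untuned
    for (std::size_t i = 0; i < n; ++i) {
        if (i + kPrefetchAhead < n) {
            // Hint that source data a few vectors ahead will be read soon.
            _mm_prefetch(reinterpret_cast<const char*>(s + i + kPrefetchAhead),
                         _MM_HINT_NTA);
        }
        // Aligned load, then a store that bypasses the cache.
        _mm256_stream_si256(d + i, _mm256_load_si256(s + i));
    }
    _mm_sfence();  // order the streaming stores before later stores
#else
    std::memcpy(dst, src, size);  // fallback when AVX is not enabled
#endif
}

// Hypothetical dispatcher: stream only large, suitably aligned copies and
// leave everything else to std::memcpy.
inline void copy_message(void* dst, const void* src, std::size_t size) {
    constexpr std::size_t kStreamThreshold = 1u << 20;  // assumed 1 MB cutoff
    const bool aligned = reinterpret_cast<std::uintptr_t>(dst) % 32 == 0 &&
                         reinterpret_cast<std::uintptr_t>(src) % 32 == 0 &&
                         size % 32 == 0;
    if (size > kStreamThreshold && aligned) {
        stream_copy_aligned(dst, src, size);
    } else {
        std::memcpy(dst, src, size);
    }
}

// Round a pointer up to the next 32-byte boundary.
inline unsigned char* align32(unsigned char* p) {
    const auto addr = reinterpret_cast<std::uintptr_t>(p);
    return p + ((32 - addr % 32) % 32);
}

// Copy a patterned buffer of `size` bytes and check the result byte for byte.
inline bool round_trip_ok(std::size_t size) {
    std::vector<unsigned char> src_buf(size + 32);
    std::vector<unsigned char> dst_buf(size + 32);
    unsigned char* src = align32(src_buf.data());
    unsigned char* dst = align32(dst_buf.data());
    for (std::size_t i = 0; i < size; ++i) {
        src[i] = static_cast<unsigned char>(i * 31 + 7);
    }
    copy_message(dst, src, size);
    return std::memcmp(dst, src, size) == 0;
}
```

With GCC or Clang, the intrinsic path is compiled only when AVX is enabled (for example with `-mavx`). Routing small or unaligned copies to `std::memcpy` reflects the trade-off described above: the streaming variant is reserved for the large messages where it was reported to win. Any real threshold would need to be measured on the target hardware.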