Improving LLM Fine-tuning Through Iterative Data Curation

Researchers significantly improved the performance of large language models (LLMs) by iteratively curating their fine-tuning data. Experiments used two LLMs of different sizes (Gemini Nano-1 and Nano-2) on tasks of varying complexity, starting from ~100K crowdsourced annotations that suffered from severe class imbalance (95% benign). Through repeated rounds of expert curation and model fine-tuning, the curated dataset reached approximately 40% positive examples, and model-expert agreement rose to a Cohen's Kappa of ~0.81 on the lower-complexity task and ~0.78 on the higher-complexity one, approaching expert-level performance and highlighting the crucial role of high-quality data in LLM training.
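For context, Cohen's Kappa (the agreement metric reported above) measures how often two raters, such as a model and a human expert, assign the same label, corrected for the agreement expected by chance. A minimal sketch of the standard formula (this is generic illustration code, not taken from the paper):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if the raters labeled independently,
    # each according to their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((counts_a[c] / n) * (counts_b.get(c, 0) / n) for c in counts_a)
    return (p_o - p_e) / (1 - p_e)

# Perfect agreement yields 1.0; chance-level agreement yields 0.0.
print(cohens_kappa([1, 1, 0, 0], [1, 1, 0, 0]))  # → 1.0
print(cohens_kappa([1, 0, 1, 0], [1, 0, 0, 1]))  # → 0.0
```

Values of ~0.8, as reported here, are conventionally read as strong agreement, which is why the paper treats them as near expert-level.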