"State of AI Report 2024" (1) - AlphaGeometry, Synthetic Data, RAG

The report contains a wealth of detail; I mainly focus on topics I hadn't encountered before or understand less well. Today I will translate three sections first:

AlphaGeometry

Automated proof of geometry problems had long been held back by insufficient reasoning capabilities and a lack of training data. However, AlphaGeometry introduces a symbolic deduction engine that effectively addresses this shortfall.

The team from Google DeepMind and New York University (NYU) used a symbolic engine to generate millions of synthetic theorems and proofs, and trained a language model from scratch using this data.
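As a toy illustration of that recipe (my own simplification, not DeepMind's actual engine), a minimal symbolic engine can forward-chain from random premises using a single rule (transitivity of "parallel"), then trace each derived fact back to produce a (statement, proof) training example:

```python
import random

# Toy sketch of synthetic theorem/proof generation: forward-chain from
# random premises with one deduction rule, then trace derived facts back
# to their premises to recover a proof.

def closure(premises):
    """Forward-chain to the deductive closure; remember each fact's parents."""
    known = {p: None for p in premises}          # fact -> (parent1, parent2)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(known):
            for (c, d) in list(known):
                # transitivity: parallel(a, b) and parallel(b, d) -> parallel(a, d)
                if b == c and a != d and (a, d) not in known:
                    known[(a, d)] = ((a, b), (c, d))
                    changed = True
    return known

def trace(fact, known):
    """Unwind a derived fact into an ordered list of proof steps."""
    parents = known[fact]
    if parents is None:                          # a premise: nothing to prove
        return []
    p1, p2 = parents
    return trace(p1, known) + trace(p2, known) + [(p1, p2, fact)]

def sample_example(lines="abcdef", n_premises=4, seed=0):
    """One synthetic training example: premises, a derived goal, its proof."""
    rng = random.Random(seed)
    premises = set()
    while len(premises) < n_premises:
        premises.add(tuple(rng.sample(lines, 2)))
    known = closure(premises)
    derived = [f for f, p in known.items() if p is not None]
    if not derived:
        return None                              # these premises derived nothing
    goal = rng.choice(derived)
    return {"premises": sorted(premises), "goal": goal,
            "proof": trace(goal, known)}
```

Running this in a loop over random seeds yields the kind of (premises, goal, proof) triples a model could be trained on, albeit at a vastly smaller scale and expressiveness than the real system.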

On a benchmark of 30 Olympiad geometry problems, AlphaGeometry solved 25, a performance close to that of gold medalists at the International Mathematical Olympiad. The best previous AI systems solved only 10.

In addition, AlphaGeometry demonstrated good generalization ability. For example, it discovered that a detail in a 2004 International Mathematical Olympiad question was not necessary for the proof.

Synthetic Data

When I previously shared the HAI report, I mentioned the discussion on synthetic data:

Supporters

Nowadays, synthetic data is being increasingly adopted.

In last year's report, there were differing views on synthetic data: some found it very useful, while others worried that accumulating errors could lead to model collapse. Now, perspectives seem to be gradually shifting:

  • Synthetic data is not only the main source of training data for the Phi series models, but Anthropic also used synthetic data when training Claude 3, helping to make up for possible missing scenarios in the training data.
  • Hugging Face generated over 30 million files and 25 billion tokens of synthetic text (including textbooks, blog posts, and stories) using Mixtral-8x7B Instruct to recreate the training dataset of Phi-1.5, a dataset they call Cosmopedia.

  • NVIDIA released Nemotron-4 340B, a series of models specifically designed for generating synthetic data, with permissive usage licenses. Meta's Llama can also be used to generate synthetic data.

  • Furthermore, high-quality instruction data can be extracted directly from aligned LLMs using techniques like Magpie. Models fine-tuned on such data sometimes rival the performance of Llama-3-8B-Instruct.
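A rough sketch of the Magpie idea, assuming a chat model exposed as a plain `generate(prompt) -> str` function; the template strings and the `generate` stub below are assumptions for illustration, not Magpie's actual format:

```python
# Magpie-style extraction: an aligned chat model given only the pre-query
# part of its chat template will auto-complete a plausible user
# instruction; feeding the full template back then yields the response.

PRE_QUERY = "<|user|>\n"           # template up to where a user would type
POST_QUERY = "\n<|assistant|>\n"   # template between query and response

def magpie_pair(generate):
    """Extract one (instruction, response) pair from an aligned model."""
    # Step 1: the model invents an instruction by continuing the bare template.
    instruction = generate(PRE_QUERY).strip()
    # Step 2: the model answers its own instruction under the full template.
    response = generate(PRE_QUERY + instruction + POST_QUERY).strip()
    return {"instruction": instruction, "response": response}
```

Repeating this with sampling temperature turned up produces a diverse instruction-tuning dataset without any human-written prompts.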

Opponents

Although model developers are moving quickly, researchers are evaluating whether there is a tipping point where synthetic data causes model collapse and whether any mitigation measures are effective.

A study published in Nature found that model collapse occurs across a variety of AI architectures, including fine-tuned language models. This challenges the view that pre-training or periodic exposure to small amounts of real data can prevent model degradation (measured by perplexity).

The authors argue that early movers will have an advantage, because continuous access to diverse, human-generated data will become increasingly critical to maintaining model quality. However, these results primarily cover a scenario where real data is replaced by synthetic data generation after generation. In practice, real and synthetic data are typically accumulated together.

Other studies indicate that as long as the proportion of synthetic data is not too high, model collapse can generally be avoided.
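Both findings can be illustrated with a toy simulation: repeatedly fit a one-dimensional "model" (a Gaussian) to data sampled from the previous generation's fit. The tail-truncation step below is my own stand-in for why collapse happens (generative models under-sampling the tails of their training distribution), not the cited papers' experimental setup:

```python
import random
import statistics

# Toy model-collapse simulation: each generation fits (mu, sigma) to data
# partly sampled from the previous generation's fit. Purely synthetic
# training drives sigma toward zero; mixing in real data stabilizes it.

def run(generations=50, n=500, real_fraction=0.0, seed=0):
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]   # fixed human data
    mu, sigma = 0.0, 1.0                             # generation-0 "model"
    for _ in range(generations):
        # The "model" under-samples its own tails (|x - mu| > 1.5 * sigma),
        # the collapse mechanism in this toy setup.
        pool = [rng.gauss(mu, sigma) for _ in range(2 * n)]
        synthetic = [x for x in pool if abs(x - mu) <= 1.5 * sigma][:n]
        k = int(n * real_fraction)                   # real data retained
        data = rng.sample(real, k) + synthetic[: n - k]
        mu = statistics.fmean(data)
        sigma = statistics.stdev(data)
    return sigma
```

With `real_fraction=0.0` the fitted spread shrinks toward zero over the generations, while keeping half the training mix real (`real_fraction=0.5`) holds it near its equilibrium, mirroring the "not too high a proportion" finding above.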

RAG

While retrieval and embeddings are not new concepts, interest in them has grown significantly with the rise of Retrieval-Augmented Generation (RAG), driving improvements in the quality of embedding models.

Embedding models are following the familiar path of large language models (LLMs): scale brings significant performance gains (for example, GritLM has approximately 47 billion parameters, where earlier embedding models typically had only around 110 million).

ColPali is a vision-language embedding model that improves retrieval not only by using text embeddings but also by leveraging the visual structure of documents.

On visual document retrieval benchmarks, it sits at the top.

Context is a key driver of performance.

Traditional RAG solutions often generate text fragments every 256 tokens using a sliding window (with 128 tokens overlapping the previous section). While this method improves retrieval efficiency, it significantly reduces accuracy.
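The sliding-window scheme described above can be sketched in a few lines (assuming the text has already been tokenized into a list):

```python
def sliding_chunks(tokens, size=256, overlap=128):
    """Split a token list into fixed-size windows; each window starts
    size - overlap tokens after the previous one (the 256-token chunks
    with 128 tokens of overlap described above)."""
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        chunks.append(tokens[start:start + size])
    return chunks
```

For a 600-token document this yields four chunks starting at tokens 0, 128, 256, and 384, each boundary sentence appearing in two chunks so it is never split without context.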

Anthropic's contextual retrieval addresses this by prepending a short, chunk-specific piece of situating context to each fragment before embedding, cutting the retrieval failure rate (from 5.7% to 3.7%), and it can be run affordably at scale via Anthropic's prompt caching technology.
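The idea of enriching chunks with document-level context before embedding can be sketched as follows; the `situate` callback stands in for an LLM call that writes the situating note, and both it and the fallback format are assumptions for illustration, not Anthropic's actual prompt:

```python
# Contextual retrieval sketch: prefix each chunk with a short note that
# situates it within the source document, then embed the prefixed text.

def contextualize(doc_title, chunks, situate=None):
    """Return chunks with a situating prefix, ready for embedding."""
    if situate is None:
        # Trivial fallback: name the source document and chunk position.
        # A real pipeline would have an LLM write this note per chunk.
        situate = lambda i, chunk: f"From '{doc_title}', part {i + 1}:"
    return [f"{situate(i, c)} {c}" for i, c in enumerate(chunks)]
```

The embedding model then sees "From 'Q2 Report', part 1: Revenue grew 3%." instead of the bare, ambiguous fragment, which is what recovers the lost accuracy.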

Other research has evaluated different chunking strategies' impact on retrieval performance (measured by recall).
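Recall, the metric mentioned above, is simply the fraction of relevant chunks that make it into the retriever's top-k results:

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant items appearing in the top-k retrieved list."""
    if not relevant:
        return 0.0                      # no relevant items: define recall as 0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)
```

For example, if two chunks are relevant and only one appears in the top-2 results, recall@2 is 0.5.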

RAG evaluation remains an unsolved problem.

Many commonly used RAG benchmarks are actually repurposed retrieval or question-answering datasets, which cannot effectively evaluate the accuracy of citations, the importance of each piece of text to the overall answer, or the impact of handling conflicting information.

A newly proposed benchmark addresses this: it provides a large-scale set of complex, multi-faceted questions drawn from real user queries, which require in-depth research and analysis to answer.