"2024 Artificial Intelligence Index Report" - 1.1 Model Data Exhaustion Risk

This chapter is somewhat different from my previous understanding, which came from 《…》, an article that discussed strategies for using lower-version GPT models to train higher-version models.

In the section "Will models run out of data?", the report argues that low-quality language data, high-quality language data, and even image data will eventually be insufficient to support the training of ever-larger models.

To address this challenge, many researchers train one large language model (LLM) on the outputs of another LLM, supplementing real data with synthetic data. However, studies indicate that this approach has a significant flaw: models can forget the true underlying data distribution and begin producing outputs drawn from an increasingly narrow range.
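To make the collapse mechanism concrete, here is a minimal sketch (my own toy example, not an experiment from the report): each "generation" is just a Gaussian fitted to samples drawn from the previous generation's model rather than from the real data. Variance lost to sampling noise is never recovered, so the fitted spread tends to narrow over generations.

```python
# Toy self-consuming loop: each generation fits a Gaussian to samples produced
# by the previous generation instead of the real data. Because a refit can lose
# tail mass to sampling noise but never regain it, the fitted std tends to
# shrink over generations, mirroring the narrowing outputs described above.
import numpy as np

rng = np.random.default_rng(0)

real_data = rng.normal(loc=0.0, scale=1.0, size=50)  # the "true" training distribution
mu, sigma = real_data.mean(), real_data.std()

for generation in range(1, 201):
    # Each generation is trained only on synthetic samples from its predecessor.
    synthetic = rng.normal(mu, sigma, size=50)
    mu, sigma = synthetic.mean(), synthetic.std()
    if generation % 40 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")
```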

The figure below illustrates the trend: models trained primarily on synthetic data produce increasingly less diverse outputs as generations progress, and their output distributions narrow.
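Output diversity of this kind can be quantified with simple metrics. The sketch below (a hypothetical measurement of my own, not the one used in the figure) computes a "distinct-2" ratio: the fraction of unique bigrams across a batch of generated texts, which drops as outputs collapse toward a few repeated patterns.

```python
# Simple diversity metric for a batch of generated texts: the fraction of
# unique n-grams among all n-grams. Lower values indicate more repetitive,
# less diverse outputs.
from collections import Counter

def distinct_n(outputs: list[str], n: int = 2) -> float:
    """Fraction of unique n-grams across a batch of generated texts."""
    ngrams = Counter()
    total = 0
    for text in outputs:
        tokens = text.split()
        for i in range(len(tokens) - n + 1):
            ngrams[tuple(tokens[i:i + n])] += 1
            total += 1
    return len(ngrams) / total if total else 0.0

diverse_batch = ["the cat sat on the mat", "a dog ran across the park"]
collapsed_batch = ["the cat sat on the mat", "the cat sat on the mat"]
print(distinct_n(diverse_batch))    # 1.0: every bigram is unique
print(distinct_n(collapsed_batch))  # 0.5: the same bigrams repeated
```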

Another study compared two training setups:

  • Fully synthetic: the model is trained entirely on synthetic data.
  • Synthetic data augmentation: the model is trained on a mixture of synthetic data and real data.

In both cases, the quality of the generated images decreases as the number of training epochs increases.

However, while the synthetic data augmentation loop (which keeps some real data in the mix) degrades more slowly, both setups show diminishing returns with further training.
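Under the same toy Gaussian setup as the earlier sketch (again my own illustration, not the study's actual image-model experiment), the two loops can be compared directly: the fully synthetic loop refits only on its own samples, while the augmentation loop mixes freshly drawn real data into every generation's training set.

```python
# Compare a fully synthetic self-consuming loop with an augmentation loop that
# mixes fresh real data into every generation. In this toy setup, keeping real
# data in the loop anchors the fitted spread near the true value, while the
# fully synthetic loop typically drifts toward a much narrower one.
import numpy as np

rng = np.random.default_rng(1)

def run_loop(real_fraction: float, generations: int = 300, n: int = 50) -> float:
    """Final fitted std of a self-consuming Gaussian loop after `generations` refits."""
    mu, sigma = 0.0, 1.0
    for _ in range(generations):
        n_real = int(n * real_fraction)
        real = rng.normal(0.0, 1.0, size=n_real)           # freshly drawn real data
        synthetic = rng.normal(mu, sigma, size=n - n_real)  # previous model's outputs
        train = np.concatenate([real, synthetic])
        mu, sigma = train.mean(), train.std()
    return float(sigma)

print("fully synthetic loop, final std:", round(run_loop(real_fraction=0.0), 3))
print("augmented loop (50% real), final std:", round(run_loop(real_fraction=0.5), 3))
```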