Introduction to LLama-3

Today, I'll branch off and talk about LLama-3.

First, let’s take a look at the rankings from the LMSYS Chatbot Arena Leaderboard on April 19, 2024 🏆, where LLama-3 ranks fifth.

LLama-3 is currently available in 8B and 70B parameter versions, and Meta plans to release models with even more parameters in the future. The largest model, with 400B parameters, is still in training. In the coming months, Meta will launch several models with new capabilities, including multi-modality, the ability to converse in multiple languages, longer context windows, and stronger overall capabilities.

It can currently be tried on Meta AI's official website (https://www.meta.ai), or you can download the model weights and deploy them yourself (https://llama.meta.com/docs/get-started).
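If you go the self-hosted route, a common option is the Hugging Face transformers library. The sketch below is a minimal example, assuming you have been granted access to the gated repo `meta-llama/Meta-Llama-3-8B-Instruct` and have a GPU with enough memory; it is not the only way to run the model.

```python
# Minimal sketch: running the instruction-tuned 8B model locally via Hugging Face transformers.
# Assumes you have accepted Meta's license, been granted access to the gated repo,
# and have a GPU with roughly 16 GB of memory for bf16 weights.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed repo name on the Hugging Face Hub

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# The instruct models ship with a chat template baked into the tokenizer.
messages = [{"role": "user", "content": "Explain grouped query attention in one paragraph."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```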

Next, let's take a look at the officially released benchmark results.

Model Architecture

LLama-3 adopts a relatively standard decoder-only Transformer architecture. Compared to LLama-2, several key improvements have been made: LLama-3 uses a tokenizer with a vocabulary of 128K tokens, which encodes language more efficiently and significantly improves model performance. To improve inference efficiency, both the 8B and 70B versions adopt grouped query attention (GQA). In addition, the models are trained on sequences of up to 8,192 tokens, with masking used to ensure that self-attention does not cross document boundaries.
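To make the grouped-query idea concrete, here is a minimal, self-contained PyTorch sketch of GQA: several query heads share a smaller set of key/value heads, which shrinks the KV cache and speeds up inference. The head counts and dimensions below are illustrative choices for the example, not LLama-3's actual configuration.

```python
# Illustrative grouped query attention (GQA): many query heads share fewer key/value heads.
# Head counts and dimensions are made up for the example, not LLama-3's real configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.head_dim = d_model // n_q_heads
        self.wq = nn.Linear(d_model, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)  # fewer K heads
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)  # fewer V heads
        self.wo = nn.Linear(n_q_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_q_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads attends to the same shared K/V head.
        repeat = self.n_q_heads // self.n_kv_heads
        k = k.repeat_interleave(repeat, dim=1)
        v = v.repeat_interleave(repeat, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))

x = torch.randn(1, 16, 512)
print(GroupedQueryAttention()(x).shape)  # torch.Size([1, 16, 512])
```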

Training Data

LLama-3 was pre-trained on over 15 trillion tokens, all sourced from publicly available data. This dataset is seven times larger than that of LLama-2 and contains four times as much code. To accommodate multilingual use, more than 5% of the data is non-English text covering over 30 languages, though performance in these languages is expected to be somewhat lower than in English. To ensure data quality, a series of filtering pipelines was developed, including heuristic filters, adult-content filters, semantic deduplication, and text-quality classifiers. These techniques help LLama-3 maintain good performance across a wide range of application scenarios.
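Meta has not published the exact filters it used, but the general shape of such a pipeline is easy to sketch. Everything below is a hypothetical stand-in: the thresholds, the hash-based deduplication, and the stub quality classifier are illustrative only.

```python
# Hypothetical sketch of a pre-training data filtering pipeline.
# Meta has not released its actual filters; the heuristics, thresholds, and the
# stub quality classifier below are illustrative stand-ins only.
import hashlib

def heuristic_filter(doc: str) -> bool:
    """Cheap rule-based checks: drop very short docs and docs that are mostly symbols."""
    if len(doc.split()) < 50:
        return False
    alpha_ratio = sum(c.isalpha() for c in doc) / max(len(doc), 1)
    return alpha_ratio > 0.6  # made-up threshold

def quality_score(doc: str) -> float:
    """Stand-in for a learned text-quality classifier."""
    words = doc.split()
    return min(len(set(words)) / max(len(words), 1) + 0.5, 1.0)

def dedup_key(doc: str) -> str:
    """Exact-duplicate key; a real pipeline would use MinHash or embedding similarity."""
    return hashlib.md5(" ".join(doc.lower().split()).encode()).hexdigest()

def filter_corpus(docs):
    seen, kept = set(), []
    for doc in docs:
        if not heuristic_filter(doc):
            continue
        if quality_score(doc) < 0.7:  # made-up cutoff
            continue
        key = dedup_key(doc)
        if key in seen:               # deduplication
            continue
        seen.add(key)
        kept.append(doc)
    return kept
```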

Scaling Up Pre-training

During the pre-training of LLama-3, detailed scaling laws were developed to make efficient use of the pre-training data, guiding the choice of data mix and the allocation of training compute. These scaling laws also make it possible to predict the performance of the largest models on key tasks, such as code generation as measured by the HumanEval benchmark, before those models are actually trained.
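The kind of extrapolation described above can be illustrated with a tiny curve-fitting sketch: fit a power law relating training compute to benchmark error on smaller runs, then extrapolate to a larger compute budget. The data points and the fitted relationship below are synthetic placeholders for demonstration only, not Meta's measurements.

```python
# Illustrative scaling-law extrapolation: fit error ≈ a · C^b (with b < 0) on
# small-scale runs and extrapolate to a larger compute budget.
# The (compute, error) points below are synthetic placeholders, not Meta's data.
import numpy as np

compute = np.array([1e19, 3e19, 1e20, 3e20, 1e21])  # training FLOPs (synthetic)
error   = np.array([0.62, 0.55, 0.48, 0.43, 0.38])  # benchmark error rate (synthetic)

# Fit log(error) = log(a) + b * log(compute), i.e. a power law in compute.
b, log_a = np.polyfit(np.log(compute), np.log(error), 1)
a = np.exp(log_a)

target_compute = 1e23  # hypothetical budget for a much larger model
predicted_error = a * target_compute ** b
print(f"predicted error at {target_compute:.0e} FLOPs: {predicted_error:.2f}")
```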

During the development of LLama-3, new observations were made about model scaling behavior. For example, although the compute-optimal amount of training data for an 8B-parameter model corresponds to roughly 200B tokens, the Meta team found that model performance keeps improving even when the model is trained on two orders of magnitude more data. After being trained on up to 15 trillion tokens, both the 8B and 70B parameter models continue to improve log-linearly.
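A quick back-of-the-envelope calculation shows how far beyond the compute-optimal point the 8B model was trained. The C ≈ 6·N·D approximation for dense-Transformer training FLOPs is a standard rule of thumb, not a figure from Meta's post.

```python
# Back-of-the-envelope comparison of compute-optimal vs. actual training data for
# the 8B model, using the common C ≈ 6·N·D approximation for training FLOPs.
# (The approximation is a rule of thumb, not something stated in Meta's post.)
N = 8e9            # parameters
D_optimal = 200e9  # ~compute-optimal token count cited for an 8B model
D_actual = 15e12   # tokens actually used for LLama-3

flops_optimal = 6 * N * D_optimal
flops_actual = 6 * N * D_actual

print(f"compute-optimal: {flops_optimal:.2e} FLOPs")  # ~9.6e21
print(f"actual:          {flops_actual:.2e} FLOPs")   # ~7.2e23
print(f"ratio:           {D_actual / D_optimal:.0f}x more tokens than compute-optimal")
```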

To train its largest LLama-3 models, Meta combined three types of parallelization: data parallelization, model parallelization, and pipeline parallelization. The most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when training on 16K GPUs simultaneously. Training runs were carried out on two custom-built 24K-GPU clusters, together with a new training stack that automates error detection and handling, significantly increasing effective GPU training time. Hardware reliability and silent-data-corruption detection were also improved, and a new scalable storage system was developed to reduce the overhead of checkpointing and rollback. Taken together, these improvements make LLama-3 training about three times more efficient than LLama-2 training.
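To give a feel for what "over 400 TFLOPS per GPU" means in practice, the sketch below inverts the same 6·N FLOPs-per-token rule of thumb to estimate the token throughput such a figure would imply for the 70B model. The rule of thumb and the resulting throughput are approximations, not numbers published by Meta.

```python
# Rough sketch: what "over 400 TFLOPS per GPU on 16K GPUs" would imply for training
# throughput of a dense 70B model, using the ~6·N FLOPs-per-token rule of thumb.
# (The rule of thumb and the derived throughput are approximations, not Meta's figures.)
def implied_tokens_per_sec(tflops_per_gpu: float, n_gpus: int, n_params: float) -> float:
    total_flops_per_sec = tflops_per_gpu * 1e12 * n_gpus
    return total_flops_per_sec / (6 * n_params)

# For the 70B model: ~400 TFLOPS/GPU on 16,384 GPUs.
print(f"~{implied_tokens_per_sec(400, 16_384, 70e9):.2e} tokens/s")  # roughly 1.6e7
```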

Fine-Tuning

To maximize the potential of the pre-trained models in chat use cases, the team also innovated its approach to instruction fine-tuning. The approach combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct preference optimization (DPO). The quality of the prompts used in SFT and of the preference rankings used in PPO and DPO has an outsized impact on model performance. Some of the biggest gains in model quality came from carefully curating this data and performing multiple rounds of quality assurance on annotations provided by human annotators. Learning from preference rankings via PPO and DPO also greatly improved LLama-3's performance on reasoning and coding tasks. Interestingly, even when the model struggles to answer a reasoning question, it sometimes produces the correct reasoning trace: the model knows how to produce the right answer, but it does not know how to select it. Training on preference rankings teaches the model how to make that selection.
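As a concrete illustration of how preference rankings become a training signal, here is a minimal sketch of the DPO loss in the standard published formulation (Rafailov et al., 2023). This is not Meta's internal implementation, and the β value and log-probabilities below are made-up toy inputs.

```python
# Minimal sketch of the Direct Preference Optimization (DPO) loss as published by
# Rafailov et al. (2023); this is the generic formulation, not Meta's internal code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model; beta controls deviation from the reference."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the margin between preferred and rejected responses to be positive.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for two preference pairs.
loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -9.0]),
                torch.tensor([-12.5, -10.0]), torch.tensor([-13.5, -9.2]))
print(loss.item())
```

The intuition matches the paragraph above: the loss rewards the policy for assigning a higher relative likelihood to the response humans preferred, which is exactly the "learning to select the right answer" behavior described.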