This is a summary of the first chapter of Leopold Aschenbrenner's Situational Awareness, titled "From GPT-4 to AGI: Counting the OOMs".
Achieving Artificial General Intelligence (AGI) by 2027 is highly feasible.
GPT-4 is merely a continuation of the rapid progress in deep learning over the past decade.
Ten years ago, models couldn't even recognize simple pictures of cats and dogs; four years ago, GPT-2 could barely piece together semi-plausible sentences. Now, we are rapidly blowing past all sorts of benchmarks.
And this dramatic progress is actually just the result of the continued scaling of deep learning. These models, all they want to do is learn; you scale them up, and they learn more. By 2027, it's very reasonable to predict that models will be able to do the work of AI researchers or engineers.
The past four years
~ Preschooler
At that time, people were amazed that it could coherently string together a few sentences that seemed to make sense.
For example: GPT-2 performed reasonably well on very basic reading comprehension questions, and in carefully cherry-picked examples (the best of 10 attempts), it generated a semi-coherent passage discussing aspects of the American Civil War.
What was striking about GPT-2 was its facility with language: it could occasionally generate a semi-coherent paragraph, or occasionally answer simple factual questions correctly. Like a preschooler's performance, it was very surprising at the time.
~ Elementary school student
By the time we get to GPT-3, the capabilities of AI have taken another step forward:
For example: given simple instructions, GPT-3 could take a made-up word and use it in new sentences; it could engage in creative story interactions and generate rich narrative content; and it could generate some very simple code.
Although the analogy is still imperfect, what surprised people about GPT-3 was, roughly, its elementary-school-level performance: it could write simple poetry, tell richer and more coherent stories, attempt basic coding, and fairly reliably learn from simple instructions and demonstrations.
~ Smart high school student
GPT-4 further breaks the original boundaries, demonstrating even more outstanding capabilities:
For example: GPT-4 can write very complex code (the original essay shows charts it generated) and reason through non-trivial mathematical problems, such as solving an AP math problem or a relatively complex programming problem.
Nevertheless, GPT-4 still shows some imbalance in performance; for some tasks, it far surpasses a smart high school student, while for others, its performance falls short of expectations.
Trends in deep learning
In the past, it would usually take decades to break widely used benchmarks, but now, the breakthrough time for these tasks seems to be only a few months.
One perspective worth keeping in mind: so-called emergence is not a sudden leap in new capabilities or simple linear growth; rather, our benchmarks cannot keep up and fail to measure such advanced intelligence.
From GPT-3.5 to GPT-4, the leap in model performance at human percentile levels is very significant. In many tests, GPT-4's scores have rapidly risen from far below the human median to the top of the range of human capabilities.
The GPT-3.5 discussed here is actually a relatively new model, released less than a year before GPT-4, and its performance far surpasses that of the earlier GPT-3 (the older version that could only handle elementary-level conversations).
Consider the MATH benchmark, which consists of difficult problems taken from high school math competitions. When this benchmark was introduced in 2021, state-of-the-art models could only correctly answer about 5% of the questions. The original paper noted: "Moreover, we find that scaling up along current trends in compute and parameters alone is unlikely to yield strong mathematical reasoning abilities... To achieve greater success on mathematical problem-solving, we may need algorithmic advances from a broader research community." In other words, experts at the time believed that solving MATH problems would require fundamental breakthroughs. However, by mid-2022—just one year later—the accuracy of state-of-the-art models jumped from around 5% to 50%, and now MATH has been almost fully conquered, with recent performances exceeding 90%.
Today, one of the hardest benchmarks left to crack may be GPQA, which consists of PhD-level questions in biology, chemistry, and physics. Many of the questions read like gibberish to me, and even 30-plus minutes of Googling doesn't help; PhDs in unrelated fields perform no better than random guessing. Claude 3 Opus currently scores about 60%, while PhDs in the relevant fields score around 80%. I expect this benchmark, too, to fall as new generations of models arrive.
This is reminiscent of Liu Cixin's The Three-Body Problem, in which two sophons are sent to Earth to interfere with particle accelerators and prevent humanity from breaking through its technological barriers.
Calculating OOM: The Magic and Accelerated Progress of Deep Learning
OOM is the abbreviation for order of magnitude, representing a power of ten; each increase by 10 times equals one order of magnitude.
In the case of OpenAI Sora, we can observe that with each "OOM" (order of magnitude growth) in effective computation, the model's performance improves steadily and predictably. If we can calculate these increases in OOMs, we can roughly estimate the improvement in capabilities. This is also why some forward-thinking individuals anticipated the emergence of GPT-4.
We can measure these improvements through "OOM (Order of Magnitude) calculations":
3x improvement = 0.5 OOM
10x improvement = 1 OOM
30x improvement = 1.5 OOM
100x improvement = 2 OOM
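The conversion table above is just a base-10 logarithm. A minimal sketch:

```python
import math

def ooms(factor: float) -> float:
    """Convert a multiplicative improvement factor to orders of magnitude (OOMs)."""
    return math.log10(factor)

for factor in (3, 10, 30, 100):
    print(f"{factor}x improvement = {ooms(factor):.1f} OOM")
```

Because OOMs are logarithms, multiplying two improvements together means simply adding their OOMs.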
By the same logic, we can project the further development from GPT-4 in 2023 out to 2027.
Three main ways of expansion
We can break down the four-year progress from GPT-2 to GPT-4 into three main ways of expansion:
The expansion of computational resources:
This leap is not solely due to technical improvements but also to massive financial investment.
In the past, spending millions of dollars to train a single model was unimaginably extravagant. But today, such an investment has become trivial in comparison. This level of investment has driven a significant expansion in computing resources, enabling the scale of training to jump from small experimental levels to industrial-scale operations. By continuously increasing computational resources, we are rapidly breaking through the performance limitations of deep learning. The exponential growth in computing power provides reliable enhancements to model performance and continues to expand the application scope of deep learning.
Silicon Valley is rife with rumors of massive GPU orders, a sign that the relevant investments are being deployed rapidly. Enormous as they are, these investments are already underway and will further expand the compute available for training models.
Even larger clusters also seem quite possible; there are rumors that Microsoft and OpenAI are planning the construction of such an ultra-large computing cluster.
Improvements in algorithmic efficiency:
Better algorithms let models reach the same performance with a fraction of the compute; a 10-fold reduction in required compute is equivalent to a 10-fold increase in effective compute.
If we take a long-term perspective, the progress of algorithms seems quite stable. A single discovery may be accidental, and each step appears to face insurmountable obstacles, but looking at the long-term trend, the entire process is predictable, like a straight line. We can trust this trend line.
Comparing successive model versions, we have seen computational efficiency improve substantially year over year.
Architectural innovations such as Mixture-of-Experts (MoE) further accelerate gains in computational efficiency; multiple studies show that MoE brings significant compute savings.
Efficiency boosts like these further demonstrate that large improvements in computational efficiency can be achieved through better algorithms alone.
We can expect additional efficiency improvements going forward. In more optimistic scenarios, we might even see an architectural breakthrough on the scale of the Transformer, bringing still greater efficiency gains and performance leaps.
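Physical compute and algorithmic efficiency multiply into "effective compute", so their OOMs add. A toy calculation illustrates this; the annual growth rates below are hypothetical placeholders for the sake of arithmetic, not figures from the text:

```python
import math

# Hypothetical annual gains, chosen only to illustrate how OOMs combine:
compute_per_year = 10 ** 0.5   # ~0.5 OOM/year of physical compute growth
algo_per_year = 10 ** 0.5      # ~0.5 OOM/year of algorithmic efficiency

years = 4
# Gains multiply year over year, so OOMs simply add: 0.5 + 0.5 = 1 OOM/year.
effective = (compute_per_year * algo_per_year) ** years
print(f"Effective compute over {years} years: {effective:,.0f}x "
      f"= {math.log10(effective):.1f} OOMs")
```

Under these illustrative rates, four years of 1 OOM/year compounds to roughly 10,000x, i.e. 4 OOMs of effective compute.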
"unlock" benefits:
Techniques for unlocking latent model capabilities have made tremendous progress. These algorithmic improvements are not just about training better base models; they often require minimal pre-training compute, yet they unleash the powerful capabilities already latent in the models:
Reinforcement Learning from Human Feedback (RLHF):
With RLHF, a small model can match the performance of a much larger base model without it.
Chain of Thought (CoT):
CoT acts like a substantial boost in effective compute on reasoning tasks. It only came into wide use about two years ago, and it greatly enhances model capabilities on math and logic problems.
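As a concrete illustration of the technique (the question below is a made-up example, not from the text), chain-of-thought prompting amounts to little more than asking the model to show its intermediate reasoning before answering:

```python
question = "A train leaves at 9:40 and arrives at 11:10. How long is the trip?"

# Direct prompting: the model must produce the answer in one shot.
direct_prompt = f"Q: {question}\nA:"

# Chain-of-thought prompting: the model is nudged to reason step by step
# first, which markedly improves accuracy on multi-step reasoning tasks.
cot_prompt = f"Q: {question}\nA: Let's think step by step."

print(cot_prompt)
```

The payoff comes entirely from the extra tokens of visible reasoning the model generates before committing to an answer; no retraining is involved.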
Scaffolding:
On its own, the model can solve only a small fraction of the questions, but with the agent scaffolding provided by Devin, the proportion rises to 14-23%. Unlocking models' "agency" is still in its early stages; I will discuss it in more detail later.
Tool usage:
Models can now use tools such as web browsers and code execution, greatly enhancing their practical capabilities.
Context length:
Context windows have grown dramatically, and models can now make use of far more context. This can be seen as a significant boost in effective compute: more context unlocks more potential across applications, such as writing code against a whole codebase or drafting long documents.
Post-training improvements:
The current GPT-4 is substantially better than the version originally released; one rough gauge of the improvement is the effective-compute gain that the price difference corresponds to.
Starting from the same base model and layering on "potential-unlocking" techniques, performance on agentic tasks improved dramatically: from 5% with the base model alone, to 20% at release, to nearly 40% today thanks to better post-training, tools, and agent scaffolding.
The annual improvement in computational efficiency has already been quite significant, but these improvements are only part of the story; combined with "unlocking potential" technologies, they may account for much of the progress in current trends.
Today's models still have many limitations! For example:
They lack long-term memory. They cannot use a computer (only very limited tools are available today). When asked to write an article, they behave like a human forced to write it in a single stream-of-consciousness pass. They can (mostly) only hold short conversations; they cannot, as a human would, mull over a problem, research different solutions, talk things through with others, and then produce a long report or pull request. And most of the time they are not personalized to your application: just a generic chatbot given a brief prompt, lacking the context of your company or your work.
Future models will feel less like chatbots and more like remote workers, colleague-like presences. Raw capability gains may also arrive faster than the "unlocking" progress needed to deploy them: by the time remote-work automation becomes possible, businesses may not yet have mastered and integrated the intermediate models. If so, the leap in economic value may arrive as a sudden jump rather than a smooth ramp.
The next four years
Looking ahead four years beyond GPT-4, we anticipate seeing similar advancements, with an expected growth in computing power of 3 to 6 orders of magnitude (OOMs), perhaps the best guess being around 5 OOMs. In addition, there will be breakthroughs in practicality and application as models evolve from chatbots into agents capable of executing complex tasks, such as remote work.
To help understand this progress, let's make a hypothetical assumption: suppose training GPT-4 took three months. By 2027, leading AI labs might only need one minute to train a model equivalent to GPT-4. The increase in computational capability will be extremely significant.
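The arithmetic behind that hypothetical is easy to check: three months is roughly 130,000 minutes, so shrinking training from three months to one minute is about five orders of magnitude, in line with the best-guess estimate above.

```python
import math

# Approximate three months (30-day months) in minutes.
minutes_in_3_months = 3 * 30 * 24 * 60   # 129,600 minutes
speedup = minutes_in_3_months / 1        # 3 months down to 1 minute
print(f"{speedup:,.0f}x speedup = {math.log10(speedup):.1f} OOMs")
```

So "GPT-4 in a minute" is just another way of saying ~5 OOMs of effective compute growth.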
Orders of magnitude in progress
The leap from GPT-2 to GPT-4 can be likened to the growth from a preschooler to a high school student, from outputting simple sentences to handling college entrance exams with ease and becoming an effective programming assistant. This is an astonishing advancement. If we achieve similar progress in the future, the results could be very surprising, potentially even producing intelligence that surpasses that of PhDs and domain experts.
There’s an interesting analogy: the current pace of AI advancement is roughly three times the speed of child development. Imagine your "triple-speed child" just graduated from high school, and soon after, it might take over your job!
It is worth noting that this is not just about imagining an exceptionally smart ChatGPT. As the "unlocking" progresses, future AI will be more than just a chatbot; it will be an intelligent agent capable of independent reasoning, planning, error correction, and in-depth understanding of you and your company, able to work independently for weeks. It will become a true "remote worker" on whom you can rely to complete complex tasks.
The practical implications of AGI
There has been much recent discussion watering down the definition of AGI to something weaker, such as merely an excellent chatbot. The author's view is that AGI should be a system capable of fully automating his job and the jobs of his friends: for example, completely replacing the work of an AI researcher or engineer. Fields like robotics may take longer to crack, but once AI models can automate AI research itself, that alone is enough to kick off an intense feedback loop and drive extremely rapid progress; an automated AI researcher might compress a decade of algorithmic progress into a single year.
AGI will only be a small harbinger of superintelligence, with the era of superintelligence following closely behind. In the future, don't be surprised by the speed of progress—each new generation of AI models will astonish observers. When these models can easily solve scientific problems, write millions of lines of code, or generate several times the economic value within a few years, we will know that AGI is no longer a distant fantasy. Simple expansions of deep learning technology have already proven effective, and the models themselves are eager to learn; by 2027, we may see progress 100,000 times greater.
By then, their intelligence may already surpass that of humans.