Roughly 540 to 485 million years ago, Earth went through the Cambrian period.
At this time, vertebrates began to appear, and vertebrate brains shared the same basic framework: forebrain, midbrain, and hindbrain. The forebrain further differentiated into the cortex and basal ganglia, and the thalamus and hypothalamus, an early prototype of subunits, hierarchy, and specialized processing systems.
Reinforcement Learning and Curiosity
Thorndike's puzzle-box experiments showed that cats can learn through trial and error, a form of learning now called reinforcement learning, and this ability first appeared in vertebrates.
Marvin Minsky designed an algorithm that mimics animal learning, called SNARC (Stochastic Neural Analog Reinforcement Calculator). It consisted of about 40 interconnected artificial neuron-like units, and every time the system successfully navigated out of a maze, it reinforced the most recently activated synapses. The algorithm performed poorly, however, because it could not determine which step deserved the credit: simply reinforcing the most recent action, or all actions, fails for lack of a sound mechanism for temporal credit assignment.
Richard Sutton proposed a new strategy to solve this problem: shifting from actual rewards to expected rewards. The method learns from temporal-difference (TD) errors, the changes in reward predictions between successive points in time. Building on this principle, Gerald Tesauro developed TD-Gammon, a backgammon-playing system that achieved significant success and validated the practicality of TD learning.
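The TD idea can be sketched in a few lines: a value estimate for each state is nudged toward the reward received plus the discounted value of the next state. This is an illustrative sketch (the chain of states, the reward, and the learning rate are invented for the example), not Sutton's original code.

```python
# Minimal TD(0) value learning on a 5-state chain: walking from state 0
# to state 4 yields a reward of 1 on arrival at the final state.
def td0(episodes=2000, alpha=0.1, gamma=0.9, n_states=5):
    V = [0.0] * n_states  # value estimates, one per state
    for _ in range(episodes):
        s = 0
        while s < n_states - 1:
            s_next = s + 1  # deterministic step toward the goal
            r = 1.0 if s_next == n_states - 1 else 0.0
            # TD error: bootstrapped target minus current prediction
            delta = r + gamma * V[s_next] - V[s]
            V[s] += alpha * delta
            s = s_next
    return V

values = td0()
# States closer to the reward end up with higher predicted value.
```

Note that credit propagates backward one state per episode at first, which is exactly how TD sidesteps the temporal credit assignment problem that defeated SNARC.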
In 2018, Google DeepMind developed a new algorithm that successfully passed the first level of the game "Montezuma's Revenge." The algorithm added "curiosity" to Sutton's TD learning, rewarding the exploration of new behaviors. As Skinner's operant-conditioning boxes showed, variable reward schedules reinforce behavior more strongly than fixed ones. Some time ago our company invited Professor Wang Fei from Tsinghua University to lecture on psychology, and he mentioned that modern humans still retain parts of the "primitive brain." Looking back now, many psychological phenomena can be traced to the evolutionary history of the nervous system: from the nerve nets of radially symmetric animals, to the steering brains of bilaterally symmetric animals, to the reinforcement learning of vertebrate brains.
Curiosity and reinforcement learning co-evolved because curiosity is a necessary condition for reinforcement learning. With the ability to recognize patterns, remember locations, and flexibly adjust behavior based on past rewards and punishments, the earliest vertebrates gained a new opportunity: learning itself became an extremely valuable activity. The more patterns a vertebrate recognizes and the more locations it remembers, the better its chances of survival; the more new things it tries, the more likely it is to discover accidental links between behavior and outcome, and thereby learn the correct response.
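One simple way to formalize a curiosity signal, sketched here with an invented count-based novelty bonus rather than DeepMind's actual method, is to grant a larger intrinsic reward for states that have been visited less often:

```python
from collections import defaultdict

# Count-based novelty bonus (illustrative assumption): the intrinsic
# reward for a state shrinks as that state becomes familiar.
def intrinsic_reward(counts, state, scale=1.0):
    counts[state] += 1
    return scale / (counts[state] ** 0.5)  # novelty decays with visits

counts = defaultdict(int)
first = intrinsic_reward(counts, "room_A")  # first visit: large bonus
later = intrinsic_reward(counts, "room_A")  # repeat visit: smaller bonus
```

Adding this bonus to the extrinsic reward inside a TD learner biases the agent toward trying new things, the "curiosity" described above.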
Dopamine
Deep within the vertebrate midbrain lies a small cluster of dopamine neurons that sends signals to multiple regions of the brain. Dopamine is popularly described as the brain's pleasure signal, but it is better understood as a reinforcement signal: dopamine activity increases when an unexpected reward appears and decreases when an expected reward fails to materialize.
Experiments found that a cue predicting food in 4 seconds triggers more dopamine release than one predicting food in 16 seconds, a phenomenon known as temporal discounting. This principle was later incorporated into TD learning, driving AI systems to prefer actions that obtain rewards sooner.
Additionally, signals indicating a 75% probability of food trigger more dopamine release than those indicating a 25% probability, a mechanism also introduced into TD learning.
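Both effects map directly onto the discounted expected value used in TD learning. The numbers below (the discount factor, delays, and probabilities) are illustrative:

```python
# A reward t steps away is worth gamma**t now; an uncertain reward is
# further scaled by its probability.
gamma = 0.9

def discounted_value(reward, delay_steps, probability=1.0):
    # expected, discounted value of a future reward
    return probability * (gamma ** delay_steps) * reward

soon = discounted_value(1.0, delay_steps=4)            # food in 4 steps
late = discounted_value(1.0, delay_steps=16)           # food in 16 steps
likely = discounted_value(1.0, 4, probability=0.75)    # 75% chance
unlikely = discounted_value(1.0, 4, probability=0.25)  # 25% chance
```

Sooner and more probable rewards yield higher predicted value, matching the stronger dopamine responses observed in the experiments.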
It is important to note that dopamine is not a reward signal but a reinforcement signal; reinforcement and reward must be decoupled for reinforcement learning to work effectively. To solve the temporal credit assignment problem, the brain must reinforce behavior based on predicted changes in future reward rather than on actual rewards. This capability began to evolve gradually in vertebrates.
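In TD terms, the dopamine-like signal is the prediction error delta = r + gamma * V(s') - V(s). The value estimates below are hypothetical, chosen only to show the three characteristic cases:

```python
# Reward prediction error as a sketch of the dopamine response.
def td_error(r, v_next, v_current, gamma=0.9):
    return r + gamma * v_next - v_current

surprise = td_error(r=1.0, v_next=0.0, v_current=0.0)  # unexpected reward
omission = td_error(r=0.0, v_next=0.0, v_current=0.9)  # expected reward missing
expected = td_error(r=1.0, v_next=0.0, v_current=1.0)  # fully predicted reward
```

A positive delta (surprise) reinforces, a negative delta (omission) suppresses, and a fully predicted reward produces no error at all, matching the dopamine recordings described above.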
Basal Ganglia and Hypothalamus
The mechanism of reinforcement learning originates from the ancient interaction between the basal ganglia and hypothalamus. The specific process is as follows:
1. Reward evaluation was initially controlled by the hypothalamus, which retains ancestral dopamine-sensitive neurons that categorize external stimuli as good or bad and trigger the corresponding reflexive reactions.
2. The hypothalamus responds only to actual rewards and is not excited by predictive signals, so it can react only when a real reward arrives.
3. The hypothalamus's reward neurons control dopamine release through their connections to the dopamine neuron clusters of the midbrain: when the hypothalamus senses something pleasurable, it drives a large release of dopamine into the basal ganglia; when it senses discomfort, it inhibits dopamine release.
4. The basal ganglia contain two parallel circuits. One connects to the motor system, controlling body movements and reinforcing whichever actions repeatedly trigger dopamine release. The other connects to the dopamine neurons themselves and specializes in predicting future rewards, actively triggering dopamine activation.
5. Initially, the basal ganglia relied on feedback from the hypothalamus to learn. Over time, they learned to judge for themselves, detecting their own prediction errors before the hypothalamus delivered feedback. This is why dopamine neurons at first respond to the reward itself but gradually shift their response to the cue that predicts it.
6. The basal ganglia repeat behaviors that maximize dopamine release, consistent with the "actor" in Sutton's actor-critic architecture: the system reinforces behaviors that lead to good outcomes and suppresses those that lead to punishment.
Through this mechanism, the basal ganglia and hypothalamus together construct the reinforcement learning system of vertebrates.
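This division of labor resembles an actor-critic architecture, sketched here on an invented two-armed bandit (the payoff probabilities and learning rate are assumptions): the "critic" predicts reward, like the reward-predicting circuit, and the "actor" adjusts action preferences, like the motor circuit, both driven by the same prediction error.

```python
import math
import random

def actor_critic(steps=5000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    prefs = [0.0, 0.0]  # actor: preference for each arm
    value = 0.0         # critic: running estimate of expected reward
    for _ in range(steps):
        # softmax policy over the actor's preferences
        exps = [math.exp(p) for p in prefs]
        a = 0 if rng.random() < exps[0] / sum(exps) else 1
        # arm 1 pays off 80% of the time, arm 0 only 20% (made-up rates)
        r = 1.0 if rng.random() < (0.8 if a == 1 else 0.2) else 0.0
        delta = r - value          # prediction error (the "dopamine" signal)
        value += alpha * delta     # critic learns to predict reward
        prefs[a] += alpha * delta  # actor reinforces actions that beat it
    return prefs, value

prefs, value = actor_critic()
# The actor ends up preferring the better-paying arm.
```

Note that one shared error signal trains both components, just as a single dopamine signal is described as reinforcing both circuits.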
Pattern Recognition
As invertebrates evolved into vertebrates, animals acquired brain structures capable of decoding patterns of neural activity to recognize objects, greatly expanding their perceptual range. Even a system with only fifty olfactory neurons can identify an enormous number of smells from the pattern of which neurons are active: just fifty cells can represent up to 2^50, more than a quadrillion, distinct patterns.
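The combinatorial arithmetic is easy to check:

```python
# Each of 50 neurons is either active or silent, so the number of
# distinct on/off patterns is 2**50.
n_neurons = 50
n_patterns = 2 ** n_neurons  # 1,125,899,906,842,624 possible patterns
```

That is over a thousand times more patterns than there are stars in the Milky Way has digits to spare past a quadrillion, all from fifty cells.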
Pattern recognition faces two main challenges:
Discrimination: how to distinguish overlapping patterns as distinct patterns.
Generalization: how to extend already learned patterns to recognize similar but not identical new ones.
In the AI field, supervised learning and the backpropagation algorithm are applied to image recognition, natural language processing, speech recognition, and autonomous driving, effectively addressing both challenges.
However, the brain uses unsupervised learning and does not rely on backpropagation; it addresses pattern recognition challenges through other mechanisms.
For example, olfactory neurons send signals to pyramidal neurons in the cerebral cortex, involving the following two interesting characteristics:
Expansion: a small number of olfactory neurons connect to a much larger number of cortical neurons, greatly expanding the space available for representing information.
Sparse connectivity: each olfactory neuron connects to only a subset of cortical cells, not to all of them.
These two seemingly simple wiring features may effectively solve the discrimination problem: the cerebral cortex can keep similar but different patterns distinct.
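A toy simulation suggests how these two wiring features could help. All numbers here (50 inputs, 1000 cortical cells, 5 connections per cell, a firing threshold of 4) are invented for illustration: expanding onto many sparsely connected cells tends to make overlapping input patterns far less overlapping in the cortical layer, a form of pattern separation.

```python
import random

# Each "cortical" cell samples a small random subset of the 50 "olfactory"
# inputs and fires only if enough of its sampled inputs are active.
def expand(pattern, connections, threshold=4):
    return [1 if sum(pattern[i] for i in conn) >= threshold else 0
            for conn in connections]

def overlap(x, y):
    # Jaccard overlap between two binary activity patterns
    both = sum(1 for u, v in zip(x, y) if u and v)
    either = sum(1 for u, v in zip(x, y) if u or v)
    return both / either if either else 0.0

rng = random.Random(0)
connections = [rng.sample(range(50), 5) for _ in range(1000)]

a = [1 if i < 25 else 0 for i in range(50)]        # inputs 0..24 active
b = [1 if 15 <= i < 40 else 0 for i in range(50)]  # overlaps a on 15..24

input_overlap = overlap(a, b)
cortical_overlap = overlap(expand(a, connections), expand(b, connections))
# The cortical representations overlap much less than the inputs did.
```

The high threshold keeps cortical activity sparse, which is what pulls the two representations apart.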
However, just like the learning process in vertebrate brains, when a neural network learns new knowledge it may overwrite old knowledge; learning new patterns can interfere with previously learned ones, a problem known as catastrophic forgetting. This is why some AI models are trained on everything at once and then stop learning, with all parameters locked.
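The interference can be seen even in a one-parameter toy model (entirely hypothetical: one weight, two conflicting target outputs). Training on a second task erases the first unless the parameter is frozen after the first task:

```python
# Gradient descent on squared error for a single weight.
def train(w, target, steps=100, lr=0.1):
    for _ in range(steps):
        w += lr * (target - w)  # step toward the current task's target
    return w

w = train(0.0, target=1.0)   # learn task A: w moves to ~1.0
error_a_before = abs(1.0 - w)
w = train(w, target=0.0)     # learn task B: w is dragged back to ~0.0
error_a_after = abs(1.0 - w)
# Task A is now almost completely forgotten; freezing w after task A
# (skipping the second call) would have preserved it.
```

Real networks have many weights, but the same tug-of-war happens wherever two tasks reuse the same parameters.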
CNN
Visual objects activate different neuron patterns when rotated, moved closer or farther, or shifted in position, creating the so-called "invariance problem": how to recognize the same object despite changes in the input. The brain somehow solves this problem.
David Hubel and Torsten Wiesel discovered the hierarchical mechanism of visual processing by showing different visual stimuli to cats and recording neuronal activity. The first area to receive visual signals is V1 (the primary visual area). They found that neurons in V1 are very sensitive to specific line orientations and positions. For example, some neurons only react to vertical lines, while others react to horizontal lines or 45-degree angle lines. V1 acts as a map of the entire visual field of the cat, with different neurons corresponding to different positions and directions of lines.
The visual system starts from V1, breaking down complex visual patterns into simple lines and edges. Then, the output from V1 is passed to higher-level areas such as V2, V4, and finally IT. In this hierarchy, as the processing level rises, neurons become increasingly sensitive to more complex features—V1 handles basic lines, V2 and V4 handle more complex shapes, and IT identifies whole objects such as faces. V1 is only sensitive to inputs in specific regions of the visual field, while IT can recognize objects throughout the entire visual field. This process of integrating simple features into complex objects solves the visual invariance problem.
Hubel and Wiesel's two major discoveries:
Visual processing is hierarchical, with low-level neurons identifying simple features and high-level neurons identifying complex objects. Neurons at the same level are sensitive to the same features but responsible for different input positions.
Inspired by these findings, Kunihiko Fukushima proposed the Neocognitron, the forerunner of convolutional neural networks (CNNs). Like V1, a CNN first breaks the input image down into feature maps, each showing where a specific feature (such as a vertical or horizontal line) occurs in the image. This operation is called convolution.
Fukushima's innovation lay in introducing "inductive bias," assumptions built into the system during design. CNNs assume that the same features in different positions should be treated identically, solving the translation invariance problem. By encoding this rule directly into the network architecture, CNNs can efficiently learn and process visual information without requiring large amounts of data and time to manually learn this rule.
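Weight sharing can be illustrated with a one-dimensional convolution (the signal and the edge-detecting kernel are made-up examples): the same kernel slides across every position, so a feature produces the same response wherever it occurs, which is exactly the translation-invariance bias built into CNNs.

```python
# Slide a shared kernel over the signal, producing one feature map.
def convolve(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

edge_kernel = [-1, 0, 1]           # responds to rising edges
signal = [0, 0, 0, 1, 1, 1, 0, 0]  # a step up, then a step down
feature_map = convolve(signal, edge_kernel)
# Positive entries mark rising edges, negative entries falling edges,
# using the same three weights at every position.
```

Because the weights are shared, the network learns the edge detector once instead of relearning it for every location.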
Comparative psychologist Caroline DeLong trained goldfish to click on pictures to obtain food in order to study their cognitive abilities. She showed the goldfish two pictures, and whenever they clicked on the frog picture they were rewarded with food. Soon the goldfish learned to swim toward the frog picture whenever they saw it. DeLong then changed the experiment, showing a picture of the same frog from an angle the goldfish had never seen. Remarkably, the goldfish swam toward the new frog picture, evidently recognizing the frog immediately. In this respect the fish brain outperforms even our most advanced computer vision systems: CNNs need large amounts of data to handle object rotation and 3D viewpoint changes, but fish seem to recognize new angles of an object instantly.
World Models
The semicircular canals evolved in early vertebrates, appearing almost simultaneously with reinforcement learning and the ability to build spatial maps. Vestibular sensation is crucial for building such maps. In the vertebrate hindbrain, from fish to mice, there are "head direction neurons" that fire only when the animal faces a specific direction. These neurons integrate visual and vestibular inputs into a neural compass, allowing vertebrate brains to simulate and navigate three-dimensional space.
The medial cortex is a part of the cerebral cortex that later evolved into the hippocampus in mammals. If you record the neural activity of the medial cortex while a fish swims around, you will find that some neurons activate only when the fish is in a specific spatial position; others activate when the fish approaches the edge of the tank or faces a certain direction. Visual, vestibular, and head direction signals converge in the medial cortex, where they mix and are transformed into a spatial map.
The most important breakthrough in this process is the brain's construction of an internal model—a representation of the external world. Initially, this model might have only helped the brain identify arbitrary positions in space and calculate the correct direction from any starting point to the target. But the construction of this internal model laid the foundation for the brain's future evolution. It evolved from an initial tool for remembering spatial positions into more complex functions.