11 Feb 2026
Welcoming Mass to the Land of Error
We have resumed our reading of The Little Learner. We are in the midst of an explanation of ADAM, or “Adaptive Moment Estimation,” which is a technique for speeding up gradient descent.
This technique extends the central metaphor of deep learning: gradient descent. In this metaphor, the model is a seeker, searching across the “error surface” for the place of lowest altitude. On the error surface, the high points are erroneous. The low points are correct. The error surface is treacherous terrain. It is pitted throughout with “local optima,” apparently deep clefts where the model may get stuck, preventing it from searching the true depths of correctness. One of the tricks of gradient descent is to set up the learning phase so that the model is more likely to end up in a profound crevasse of correct behaviour, rather than a shallow depression of meagre accuracy.

ADAM extends this metaphor by adding the idea of “momentum” into the gradient descent. In the version of gradient descent we have explored so far, the model has position and velocity, but no mass and therefore no momentum. The model’s “position” on the error surface is given by its parameters: depending on how its parameters are set, it will perform better or worse on the training data according to the loss function. Depending on its parameters, it will be higher (wronger) or lower (righter) on the error-surface. The model’s “velocity” so far has been given by the gradient of the loss function. When the loss is calculated, some fancy calculus is applied to determine how the parameters should be changed to make it a little more accurate next time. Changing the parameters changes its position on the error-surface. Since a change of position is “velocity,” the update function effectively sets the “velocity” of the model in this extended metaphor.
Basically, the way that momentum-based optimisations like ADAM work is this: each time you update the parameters of the model, remember how you updated them. Then, when you calculate the velocity next time, you can mix in how you changed it last time. As the model moves about the error-surface, it effectively “remembers” its prior movements in the same way that a freight train “remembers” it is moving forwards. By retaining some of the previous velocity, the model effectively becomes more “massive,” acquires more “momentum,” and moves more quickly in the direction it is already moving. Accordingly, like a freight train, it will barrel over humps and obstacles on the error-surface without being jittered about.
The variable that stores this “momentum” is called $\mu$ (“mu”) in The Little Learner. $\mu$ is basically $m$, which is the first letter of “momentum” (obviously). The way the learning algorithm “remembers” $\mu$ is by averaging the model’s previous velocities. Each time a new training step commences, the new velocity is combined with the average of the old velocities using a “decay rate” that determines how much of the old velocity to mix with the new. The higher the decay rate, the less old velocity is remembered, i.e. the less mass and the less momentum the model has. The lower the decay rate, the more old velocity is remembered, i.e. the more mass and the more momentum the model has.
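To make the mechanism concrete, here is a minimal Python sketch (not malt’s actual code) of gradient descent with momentum on a one-parameter toy problem. The name `memory` is my own: it controls how much of the old velocity is retained at each step.

```python
def loss_gradient(theta):
    # gradient of the toy loss (theta - 3)^2: the "slope" at theta
    return 2.0 * (theta - 3.0)

def descend_with_momentum(theta, alpha=0.1, memory=0.5, revs=100):
    mu = 0.0  # the remembered velocity; the model starts at rest
    for _ in range(revs):
        # blend the remembered velocity with the new gradient
        mu = memory * mu + loss_gradient(theta)
        # step using the smoothed velocity, scaled by the learning rate
        theta = theta - alpha * mu
    return theta

print(descend_with_momentum(0.0))  # converges towards 3.0
```

Because `mu` carries over between revisions, a run of gradients pointing the same way builds up speed, just as a freight train does.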
Simply incorporating $\mu$ doesn’t quite extend the metaphor properly. One aspect of the error-surface, shown in the diagram above, is that the gradient gets less and less—that is, the slope gets shallower and shallower—the closer you get to the bottom of the descent. Since we use the gradient—the slope—to determine the velocity of the model’s updates, gradient descent tends to slow down the closer the model gets to the end of its descent. This is not very similar to a massive object with momentum. So to perfect the metaphor, we change the gradient descent algorithm again. Previously, there was a fixed “learning rate”, $\alpha$. If the model is a car, then $\alpha$ is the brakes. As the model rolls down the slope, you multiply the slope by a small number, e.g. $0.001$, so that the model doesn’t go careering all over the error-surface like a drunk driver. In the new version of gradient descent, you gradually increase the learning rate. As alpha goes up, it is like taking your foot gradually off the brakes. Thus as the slope gets shallower, the model “brakes” less hard, allowing it to still move quickly towards the beautiful deep valley of intelligence to which it hies.
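ADAM’s “adaptive” mechanism can be sketched in Python as follows. This is a hedged simplification of the actual update: the step is divided by a running average of squared gradients, so that when the slope flattens, the divisor shrinks and the effective learning rate grows, like easing off the brakes. The names are illustrative, not malt’s code.

```python
def adaptive_descent(theta, grad, alpha=0.01, decay=0.9, revs=2000, eps=1e-8):
    v = 0.0  # running average of the squared gradient
    for _ in range(revs):
        g = grad(theta)
        v = decay * v + (1 - decay) * g * g
        # dividing by sqrt(v) eases off the "brakes" as the slope flattens
        theta = theta - alpha * g / (v ** 0.5 + eps)
    return theta

# toy example: minimise (theta - 3)^2, starting from theta = 0
print(adaptive_descent(0.0, lambda t: 2 * (t - 3)))
```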
ADAM means “ADAptive Moment estimation”—or, if you like, “ADAptive Momentum”. $\mu$ is the “Momentum” part. The gradual increase of the learning rate is the “ADAptive” part.
Precedents
The idea of error as a landscape, through which the earnest seeker travels to find the truth, is a very old idea. It puts me in mind of The Faerie Queene. At the beginning of Book I, the Redcrosse Knight and Una lose themselves in the forest of Error:
Led with delight, they thus beguile the way,
Untill the blustring storme is overblowne;
When weening to returne, whence they did stray,
They cannot finde that path, which first was showne,
But wander too and fro in wayes unknowne,
Furthest from end then, when they neerest weene,
That makes them doubt their wits be not their owne:
So many pathes, so many turnings seene,
That which of them to take, in diverse doubt they been.
I’m also put in mind of this image, from a sonnet of Charles Harpur’s:
The river that, like a pure mind beguiled,
Grows purer for its errors
In a forthcoming article for AI & Society, I argue that the abstractions of computer programming should be understood as the “virtualisation of metaphor.” What I’m getting at is that computer programs make metaphors real. Not real in a material sense, but virtually real in the way that they shape the behaviour of computers. There is a strain in critical theory that denies any reality to these abstractions (e.g. Galloway, Kittler, Tenen), preferring to see computers materially, as the dance of electrons in silicon. But to me, the metaphorical and virtual dimension is vitally important.
To many people in Silicon Valley today, the “error-surface” is the world, and the Holy Grail of AI lies somewhere deep down, in the farthest and most profound hollow of its mountains and valleys. They take this metaphor extremely seriously, refining and implementing it in the hope of finally creating an AI. But even if it is a fruitful and suggestive metaphor—as Spenser and Harpur prove—it is not the only or even necessarily the best metaphor. The true path of the quest for intelligence may not lie on the rocky multidimensional landscape of a predefined loss function conditioned on a pre-defined training set!
Notes
11 Dec 2025
We had our first in-person Anticodians meeting at DHA25 last week in Canberra/Ngambri. About 20 anticodians, some old, some new, assembled at the Australian National University to read a Python program slowly.
And read slowly we did. We managed to read about ⅓ of Peter Norvig’s famed lis.py program.
It is difficult for me to recall our discussion a week later, but the workshop overall demonstrated to me the vital importance of Critical Code Studies, and the suitability of the “slow reading” approach in CCS.
Simplicity is complexity, or something
Much of our discussion focussed on Norvig’s aesthetic principles. Norvig makes two main claims in his lis.py essay: (1) that to understand programming, you need to understand the way programming languages are implemented; and (2) LISP (or rather, Scheme) is the most beautiful programming language. The two claims are interrelated. A key reason for LISP’s beauty, in Norvig’s account, is the fact that LISP is very easy to implement.
One of the key terms in Norvig’s aesthetic vocabulary is “simple.” We debated the “simplicity” of LISP/Scheme vigorously in the group. Many in the group found the “complex” syntax of Java “simpler” to understand than the “simple” syntax of LISP. This elementary contradiction (which Norvig would surely anticipate) unleashed a host of questions and replies, centred on the core question: simple for whom?
Simplicity can seem like a starting point. Things start simple and become complex. But perhaps simplicity is best understood as an achievement. Things start complex and become simple. First chaos, then order.
Simplicity may also have arisen as a fetish among computer programmers for good, practical reasons. In the early days of computing, pointed out one participant, computers were extremely short on working memory, and it helped if a program could be wedged effectively into a small number of CPU registers. The “simplicity” of LISP is ideal for this.
In fact, there was once a reasonable market for computers that were custom built for LISP, so that the LISP interpreter was implemented in hardware rather than software! The creators of the Scheme dialect of LISP called such a hardware-implemented LISP a “dielectric LISP”, which I think is quite cute.
The craft of programming
During the session, four of the anticodians presented lightning talks on GenAI and coding. Critical Code Studies (CCS) rests on the premise that code is a form of human expression, and deserves the same kind of loving attention that we give to literature. If coding agents live up to their manufacturers’ hype, and largely replace human programmers, will there still be a role for CCS?
Dylan Chng addressed the question by considering the broad context of computer programming. The actual writing of code is contingent on many other systems, people and processes. If we want to assess the impacts of AI on the human writing of code, then coding agents need to be understood in this broader context.
Emily Fitzgerald described her experience as a self-taught programmer, relying on coding agents to help her understand new aspects of programming. In her practical view, coding agents can provide hints and tips, but not much more.
Kath Bode considered similar questions in her talk on teaching programming using chatbots. Kath’s main argument was that it is impossible to prompt a chatbot effectively, or understand its output, unless you already have a strong foundation in “programmatic thinking.” Her arguments chime with familiar ideas in programming education. Mastering the formal notation of a programming language is a superficial acquirement. The true craft of programming is analytic and imaginative—understanding what you want from the machine, and how the machine might provide it. There is precious little evidence that any AI system will be helpful in this regard anytime soon.
Leah Hendrickson approached the problem of AI writing from a starkly different angle. In a study that will soon appear in AI & Society, she has revisited a survey she conducted in 2021. How do people react to AI-generated text today? Who do they think is the author of the text: the generator, the prompter, the developer, the company that owns the generator, or someone else? Her surprising finding is that the release of ChatGPT does not seem to have shifted people’s attitudes.
Whither CCS?
CCS will have to grapple with AI, as will all the writerly disciplines. But if our little session at DHA25 is any guide, CCS is going nowhere fast. It was a joy to linger over Norvig’s lines of code, and his explanations of them, and to discuss the fundamentals of programming with a diverse group.
In our final meeting for 2025, we’ll see if we can finish lis.py.
08 Oct 2025
This week we read up to frame 17 on page 124 of The Little Learner. Despite my laxity with the blog, we have been progressing!
Recap
At some point in the last few weeks, we put all the pieces together for machine learning via gradient descent. A complete program looks as follows:
(with-hypers                         ; 1. Choose hyperparameters
  ((revs 1000)                       ; 2. Hyperparameter: number of revisions
   (alpha 0.01))                     ; 3. Hyperparameter: learning rate
  (gradient-descent                  ; 4. Learning algorithm: gradient descent
   ((l2-loss line) line-xs line-ys)  ; 5. Objective function
   (list 0.0 0.0)))                  ; 6. Initial guess
If you run this code, using the data provided in the book, the output will be:
(list 1.0499993623489503 1.8747718457656533e-6) ; 7. Learning!
To begin with, let’s recap this example.
1. Choose hyperparameters
In machine learning, there is a crucial distinction between “parameters” and “hyperparameters.” In a nutshell, “hyperparameters” are supplied by the human programmer, while “parameters” are learned by the program. In malt, the MAchine Learning Toolkit included with The Little Learner, the with-hypers function is provided to help manage the setting of hyperparameters. The function has the following structure:
(with-hypers *list-of-hyperparameters* *learning-problem*)
In this case, the list of hyperparameters contains two items:
((revs 1000) (alpha 0.01))
And for the learning problem, we are trying to fit a line model to the data in line-xs and line-ys using gradient descent.
(gradient-descent
 ((l2-loss line) line-xs line-ys)
 (list 0.0 0.0))
2. Hyperparameter: revisions
The first hyperparameter, revs, tells the program how many revisions to perform during training. As we have seen in previous weeks, machine learning proceeds via guesswork. Guess the parameters. Check the guess. Revise the guess. Guess again. Check again. Revise again. And so on. The revs hyperparameter tells the program how many times to repeat the cycle.
In this case, revs is set to 1000, so the program will revise the guess $1000$ times before it stops.
In fact this is not the only way to set up a learning problem. In practice, it is also common to set up tolerances for the revisions. For example, there is no point doing all $1000$ revisions if something has gone wrong, and the guess is getting worse each time. Likewise, if each guess is only making a tiny improvement on the previous guess, it may be worth stopping the training process early.
But there is a deep reason why it is common to set an arbitrary number of revisions in machine learning. In machine learning, there is often no single correct answer for the problem. There is no single “correct” sentence that ChatGPT should produce in response to your prompt. Instead, there is a wide range of “good enough” answers, and the point is to try and find one of them. If the machine learning is successful, then the program should find a good enough answer in a finite time, and we can just set a certain number of revisions.
In practice, also, training models can be extremely time- and resource-intensive. Large models might require many computers, many hours and many kilojoules to train. The revs parameter may therefore be determined by economic or environmental factors, rather than scientific ones.
3. Hyperparameter: learning rate
The second hyperparameter, alpha, sets the learning rate of the gradient descent algorithm. The learning rate is a tricky concept, which I explained in a previous post.
The clever idea behind gradient descent is to learn from error. But if you try to do this naively, the results will be terrible. The learning rate suppresses the learning process, to ensure the guesses are revised incrementally, and the model doesn’t go haywire during training. It should therefore be no surprise that alpha is a small number, in this case $0.01$. When you multiply a number by $0.01$, it makes the number $100 \times$ smaller. In this way, alpha ensures that the model improves only a little bit each revision.
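In Python rather than Scheme, the braking role of alpha can be sketched like this (a toy one-parameter example, not malt’s implementation):

```python
def descend(theta, grad, alpha=0.01, revs=1000):
    for _ in range(revs):
        # the gradient says which way is uphill; alpha brakes the step
        theta = theta - alpha * grad(theta)
    return theta

# toy loss (theta - 3)^2, whose gradient is 2 * (theta - 3)
print(descend(0.0, lambda t: 2 * (t - 3)))  # approaches 3.0
```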
4. Learning algorithm
Now we move on to the second part of the expression. We have given the hyperparameters. Now we need to give the learning problem. In malt, the first thing you specify is the learning algorithm, in this case, gradient-descent. I have already recapped this algorithm, in the discussion of alpha. The gradient-descent function in malt requires two inputs:
(gradient-descent *objective-function* *initial-guess*)
In this case, the objective function is ((l2-loss line) line-xs line-ys), and the initial guess is (list 0.0 0.0).
5. Objective function
I explained the objective function in a previous post. It is called the objective function because it defines the training objective. In other words, the objective function tells the program what it should try to learn. The objective function itself has three parts:
- The loss function, in this case l2-loss: The program will use this function to test each guess and measure how inaccurate it is, i.e. to measure the “loss.” The l2-loss is so called because to calculate it, you square the error, i.e. you raise it to the power of 2.
- The target function, in this case line: This is the template for the model that the program will learn.
- The training data, in this case line-xs and line-ys: The program will try to find the right parameters for the target function, so that if you plug an x-value into the target function, a quantity close to the corresponding y-value will come out.
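The way these three parts compose can be sketched in Python, loosely mirroring malt’s curried style. The names and the toy data are illustrative, not malt’s API:

```python
def line(x):
    def with_params(theta):
        w, b = theta
        return w * x + b
    return with_params

def l2_loss(target):
    def expect(xs, ys):
        def objective(theta):
            # squared error, summed over the training data
            return sum((y - target(x)(theta)) ** 2 for x, y in zip(xs, ys))
        return objective
    return expect

# toy training data of the same shape as line-xs and line-ys
line_xs = [2.0, 1.0, 4.0, 3.0]
line_ys = [1.8, 1.2, 4.2, 3.3]

objective = l2_loss(line)(line_xs, line_ys)
print(objective([0.0, 0.0]))  # the loss of the initial guess
```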
In this case, the target function is line. This is the target function that the authors of The Little Learner use for most of the early examples. It is an appropriate target function when you have two variables—a predictor variable (x) and an output variable (y)—and you think there is a straightforward linear relationship between them. For example, perhaps you predict that the height of a flower (y) is determined by how many seconds have elapsed since the seedling burst through the soil (x). If you think that flowers grow at a constant rate, then the line function would be appropriate.
The line function looks like this:
$$
y = wx + b
$$
It has two parameters, w and b. The aim in machine learning is to find the correct w and b using the training data. Let’s say, for example, you are trying to learn the flower model above. You have some data where lab scientists have measured how tall some flowers are at certain times, and you know when the flowers first appeared as seedlings. Let’s say that the seedlings have a minimum height of 1mm when they are first perceived by the human scientist, and they grow roughly at a constant rate of 0.01mm per second. Ideally, if you ran the above code with these data points, the program should learn the following function:
$$
y = 0.01x + 1
$$
In Scheme, the output would look something like this:
(list 0.01 1.0)
Using these learned parameters, you could predict how tall a flower would be from the number of seconds that have elapsed. E.g. if the flower is 120 seconds old, you could predict that it will be…
$$
y = 0.01(120) + 1
$$
$$
y = 2.2
$$
millimetres tall. In the machine learning world, such a prediction is called “inference.”
Of course, before we do the machine learning, we don’t know what the parameters are. In this case, we want the computer to work out what w and b should be. That leads us to the penultimate part of the example code…
6. Initial guess
The initial guess is the starting point for the learning process. In this case, we set w to $0$ and b to $0$:
(list 0.0 0.0)
The initial guess is called “little theta” ($\theta$) in The Little Learner. The parameters learned by the gradient descent algorithm are called “big theta” ($\Theta$).
When you run the gradient descent algorithm, this initial guess will be improved revs times, hopefully producing much better values for w and b—or for whatever parameters are required by your target function.
7. Parameters!
If all goes well, the machine learning program should spit out some new parameters, which can be combined with the target function to start making predictions or inferences. In the example code, using the provided line-xs and line-ys, the learned parameters ($\Theta$) are:
(list 1.0499993623489503 1.8747718457656533e-6)
If you combine these with the line function, you get the following linear equation (rounded to two decimal places):
$$
y = 1.05x + 0.00
$$
Well, if x is 72, what would y be?
$$
y = 1.05(72) + 0.00
$$
$$
y = 75.6
$$
Or in Scheme, using malt:
((line 72)                                         ; x = 72
 (list 1.0499993623489503 1.8747718457656533e-6))  ; w and b
; 75.59995596389626
A plethora of models
There is an obvious limitation with the example above: the target function. line is an extremely simple function. It takes exactly one input, and relates it in a very simple way to the output. It is very easy to think of situations where this target function would be woefully inadequate. Imagine, for example, that you wanted to predict the price of a house. y in this case would be an amount in dollars. What x-values might you have available? You might know the square footage of the house, the number of rooms, the distance from the CBD, the age of the house, the state of repair, and so on. All of these things might separately affect the price.
line would be hopeless, because you would need a separate model for each predictor. One line that predicts the price based on the number of rooms, another line that predicts the price based on the square footage of the house, and so on. Each model would be extremely inaccurate because it only accounts for a single feature of the house. What might work better is a target function like this:
$$
y = w_0 x_0 + w_1 x_1 + w_2 x_2 + … + w_n x_n + b
$$
In this case, $x_0$ might be the distance from the CBD in kilometres, $x_1$ might be the amount of tree cover in the street, and so on. The model could learn to combine all these factors to predict the overall price, $y$.
In our last session, we saw a target function that could do this, though it was presented differently. The function is called plane. It is identical to the function given above, but uses a more compact notation:
$$
y = \begin{Bmatrix} w_0 & w_1 & w_2 & … & w_n \end{Bmatrix} \cdot \begin{Bmatrix} x_0 & x_1 & x_2 & … & x_n \end{Bmatrix} + b
$$
This is identical to the somewhat simpler notation above. Each of the sets of numbers is a tensor of rank 1, also known as a vector. The symbol $\cdot$ means “dot product.” When you take the “dot product” of two vectors, you multiply the corresponding elements (e.g. $w_1 \times x_1$), and then add up all the products. This is of course exactly what we do in the more familiar representation of a linear equation above.
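A Python sketch of a plane-style target function (illustrative, not malt’s plane):

```python
def dot(ws, xs):
    # multiply corresponding elements, then add up the products
    return sum(w * x for w, x in zip(ws, xs))

def plane(xs):
    def with_params(theta):
        ws, b = theta
        return dot(ws, xs) + b
    return with_params

# two made-up features, weights and bias: 2.0*3.0 + 4.0*0.5 + 1.0
print(plane([3.0, 0.5])([[2.0, 4.0], 1.0]))  # 9.0
```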
Another example where line is an inadequate target function is when the relationship between two variables is non-linear. In visual terms, this means that a graph of the two variables is curved rather than straight. In this case, the problem is that all the x values are raised to the power of $1$. We don’t normally write the power when it is a power of $1$, but we can do so to make the point:
$$
y = w_0 x_0^1 + w_1 x_1^1 + w_2 x_2^1 + … + w_n x_n^1 + b
$$
To allow this line to curve, one or more of these $x$-values can be raised to a power other than $1$, e.g.
$$
y = w_0 x_0^1 + w_1 x_1^2 + w_2 x_2^1 + … + w_n x_n^k + b
$$
In the example code, we looked at the target function quad. This is a very simple function, similar to line:
$$
y = w_0 x^2 + w_1 x + w_2
$$
Because it contains an $x^2$, it can learn a curved, non-linear relationship between x and y.
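As a Python sketch (again illustrative, with made-up parameter values):

```python
def quad(x):
    def with_params(theta):
        w0, w1, w2 = theta
        # w0*x^2 gives the curve; w1*x + w2 is the familiar line
        return w0 * x ** 2 + w1 * x + w2
    return with_params

# evaluate at x = 3.0 with parameters (4.5, 2.1, 7.8)
print(quad(3.0)([4.5, 2.1, 7.8]))
```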
Target Functions in the Press
In 2017, a new target function was discovered, which triggered the current phase of AI hype: the “transformer,” or the T in GPT. We do not yet have the tools to describe the transformer, but we do now know what a target function is, and can appreciate why the discovery of a new target function might be significant. The transformer was initially designed for language modelling: predicting which words should come next based on which have come before. None of the target functions we have considered so far, line, quad and plane, are up to this task—though plane is getting us closer.
The example of the transformer target function is instructive. It has revolutionised language modelling, allowing engineers to build much more powerful and persuasive models of human speech and writing. It has also revolutionised image and sound generation. But this algorithm did not arise from linguistic research. It is not based on any insight into the nature of language. It was not chosen because it seems like the ideal target function for capturing how language works. Rather, it was invented primarily to address practical problems with training Large Language Models (LLMs). Basically, when language models get very large, they can become unwieldy. One of the biggest problems is that there can be relationships between words that stretch a very long way through a text, e.g. if the word “Elizabeth” appears at the start of a novel, this increases the chance that the word “Elizabeth” appears at the end of the novel—unless of course the phrase “Elizabeth died” appears halfway through, and so on. In this context, the gradient descent algorithm could grind to a halt, or could be subject to a fiendish issue called “exploding gradients,” where the model overcorrects even if you set the learning rate to a very low number. Through some elegant tricks, the transformer model fixed these mathematical issues.
If this is all that the transformer target function did, then how did it enable such progress in LLMs? It did so by speeding up training (through “parallelisation”) and eliminating the pernicious “exploding gradients.” These changes allowed LLMs to get much, much larger. Their size increased in three main ways: they could have many, many more parameters (billions or trillions of ws); they could be trained efficiently on much larger training sets; and they could look at much wider “context windows” (e.g. predicting a word based on the prior million words, instead of the prior 100 or 1000 words). All these changes, in turn, allowed LLMs to produce much more coherent and plausible text.
If we understand the role of the target function, and the criteria by which it is assessed, then we may be in a better position to evaluate claims about progress in AI. Was the discovery of the transformer architecture a “step on the way” to AI? Or was it just an architectural change to some commercial systems that let them run more efficiently…?
LISP as an expression-language
I love LISP, but are we LISPers wrong to find it elegant? Consider again the key example from the first hundred pages of The Little Learner:
(with-hypers                         ; 1. Choose hyperparameters
  ((revs 1000)                       ; 2. Hyperparameter: number of revisions
   (alpha 0.01))                     ; 3. Hyperparameter: learning rate
  (gradient-descent                  ; 4. Learning algorithm: gradient descent
   ((l2-loss line) line-xs line-ys)  ; 5. Objective function
   (list 0.0 0.0)))                  ; 6. Initial guess
Wouldn’t this be easier in Python? (This code won’t work, though it could be made to work)
hyperparameters = {
    "revs": 1000,
    "alpha": 0.01
}

objective_function = make_objective_function(
    target=line,
    loss=l2_loss,
    training_x=line_xs,
    training_y=line_ys
)

learned_parameters = gradient_descent(
    objective_function=objective_function,
    hyperparameters=hyperparameters,
    initial_guess=[0.0, 0.0]
)
The LISP/Scheme/Racket/malt code is extremely concise, and the structure of the code also directly reflects the structure of the program. But is this kind of poetry really the right way to impart knowledge?
It does make me think of some of the great Sanskrit philosophers, such as Nagarjuna and Gaṅgeśa, who composed their works in extremely terse couplets (or shlokas). The books are almost impossible to read (I’ve only tried in English translation!), and can only be understood through copious commentary on each highly compressed couplet. One theory is that these poems were used as university curricula. Philosophers would memorise their own poems, or those of their teachers, and recite them to students in class. Each shloka would be recited individually, and extensively discussed, as a way to explore the problem. To me, The Little Learner hearkens back to this mode of instruction. The code is gnomic and compressed, but it is also fantastically well structured and poetic. The two “voices” in the text discuss the code poetry, and unpack it for the reader. We, the Anticodians, discuss and unpack it further. The only missing piece is the memorisation… and I don’t think any of us is up for that!
30 Jul 2025
In this meeting, we reached frame 37 on page 86 of The Little Learner.
Artificial Enlightenment
This algorithm is known as
optimization by gradient descent†
†Thanks, Augustin-Louis Cauchy (1789-1857)
—The Little Learner, p. 86
At the end of our reading last week, we learned the name of the algorithm whose parts have been slowly introduced to us: optimisation by gradient descent. The authors of The Little Learner attribute this algorithm to a nineteenth-century mathematician, Cauchy.
As I write this, I am on a train without Internet. I can’t look up Cauchy and verify the claim that he invented optimisation by gradient descent. But the attribution of the algorithm to Cauchy is intriguing. Cauchy, presumably, never studied anything like an Artificial Neural Network (ANN), the kind of model discussed in The Little Learner. Cauchy was not at Cornell in the late 1950s, when Frank Rosenblatt and others developed the first ANNs, then called “perceptrons.” Cauchy therefore cannot have described how to use gradient descent to set the parameters of an ANN. That work was carried out predominantly by Hinton, Bengio and LeCun, who share a Turing Award (Hinton also earned a Nobel Prize for related work).
What are the authors of The Little Learner trying to tell us, when they attribute this algorithm to Cauchy? This is one of many such attributions in The Little Learner, and most of the attributions are to long-dead mathematicians like Cauchy. Presumably all the theorems presented in the book were developed by someone, but the authors seem to take especial delight in attributing old discoveries.
To me, these sardonic references seem to have two implications: machine learning is not new; and it has little to do with technology. Machine learning is a collection of mathematical ideas that have been developed mostly for other reasons over the course of centuries. Implicitly, this history is slow, incremental, and intellectual. Machine learning is not a field of wild Promethean inventions, but a field of patient, humble scholarship. If this is the implication that the authors The Little Learner were trying to convey, then they have touched a sympathetic string in the heart of this reader. If not, well, I claim my right as a “strong poet” to “misread” their work.
Revving the Machine
In our session this week, the main content was an implementation of the revise function. This function takes three inputs:
- f: the objective function for the model we want to “fit” to the data. As we saw in a previous post, the objective function is a compound of three things: (a) the x- and y-values of the training data; (b) the loss function that will be used to judge the accuracy of the model when applied to the x and y values; and (c) an empty skeleton of the model, which basically knows how many parameters the model has and how they all fit together. What f is waiting for is the parameters of the model.
- revs: the number of revisions that the revise function should attempt. The authors assure us that “a fixed number of revs is usually a good approach” with large, complex models such as GPT-4. Basically, with a large model like GPT-4, training is very expensive, and there is no well-defined best answer. So you simply let the training loop run a fixed number of times and hope that the resulting model is good enough.
- theta: the current guess for the parameters of the model.
In a nutshell, the revise function tries to come up with a good set of parameters for the model. First it plugs theta into f to calculate the loss. Then it uses the loss to improve theta, and goes back to the start. It plugs the new theta into f to calculate the loss. Then it uses the (new) loss to improve the (new) theta, and goes back to the start. It once again plugs the new theta into f to calculate the loss, and continues until it has improved the guess revs times. For the first example of the revise function, the authors suggest running it for $1000$ revs.
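The shape of this loop is easy to sketch in Python (malt’s actual revise is a recursive Scheme function; here `improve` stands in for the whole loss-and-update step):

```python
def revise(improve, revs, theta):
    # apply the improvement step revs times, carrying the guess forward
    for _ in range(revs):
        theta = improve(theta)
    return theta

# toy use: each revision halves the distance between each parameter and 1.0
print(revise(lambda th: [t + 0.5 * (1.0 - t) for t in th], 10, [0.0, 0.0]))
# [0.9990234375, 0.9990234375]
```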
LISP for Learning
In order to show the revise function, the authors have to take a detour into the syntax of Scheme/Racket. They introduce the map function, which is needed in order to perform the update on all the parameters in theta. As is so often the case, this detour is both elegant and counter-intuitive. The authors ask readers to develop their understanding of recursive functions and their understanding of machine learning at the same time.
Moderating the group, I have found this kind of thing frustrating. I want everyone in the group to enjoy themselves, but the book is periodically quite demanding because of the way it interleaves multiple strands of understanding at once. The subtitle of the book is “A straight line into deep learning,” but the book is more like a gradient descent down a bumpy, swerving error curve.
Of course, all technical fields are difficult, because they require learners to master new abstractions. Overall, I find that the authors of The Little Learner do a good job of ordering and prioritising their material, and the book certainly appeals strongly to my own imagination. But I wonder if it would be worth hopping off the “straight line” of The Little Learner for a session or two, so we could explore the deep links between recursive function theory and AI in general. A detour into the delightful writing of Douglas Hofstadter might be on the cards…
Notes
29 Jul 2025
It’s been a while since we have met, but we have made progress through The Little Learner since our detour into HESPI. We are steadily moving towards the back-propagation algorithm, the fundamental learning algorithm for all artificial neural networks. We will resume our quest next time on page 78.
The back-propagation algorithm was popularised in the 1980s by David Rumelhart, Geoffrey Hinton and Ronald Williams. Hinton, together with Yoshua Bengio and Yann LeCun, went on to receive the 2018 Turing Award from the Association for Computing Machinery for their work on deep learning, of which back-propagation is a cornerstone. It is called back-propagation because it propagates the loss back through the neural network. What does this mean? I break it down below.
Why do we need back-propagation?
As is their wont, the authors of The Little Learner do not straightforwardly explain why the back-propagation algorithm is necessary. Instead, they try to reveal this necessity gradually, through their Socratic style.
They begin with a bad learning algorithm: simply pick a parameter of the model, adjust it by an arbitrary amount, and then see if the guess improves. If it improves, do the same thing again. If it doesn’t improve, stop—you’ve reached the best answer you can find. (I covered this algorithm in a previous post.)
There are two big drawbacks to this algorithm:
- You can only adjust one parameter at a time. If you are training GPT-4, with hundreds of billions of parameters, this will be an excruciatingly slow process.
- The improvement is arbitrary. There is no guarantee that you are changing the right parameter by the right amount in the right direction (i.e. should that parameter be increased or decreased?)
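The bad algorithm can be sketched in Python (a hypothetical reconstruction of the procedure described above, not the book's Scheme code):

```python
def naive_search(loss, theta, step=0.1):
    """The 'bad' algorithm: adjust one parameter at a time by a fixed,
    arbitrary step, and keep going only while the loss improves."""
    for i in range(len(theta)):
        while True:
            candidate = list(theta)
            candidate[i] += step      # arbitrary nudge, fixed direction
            if loss(candidate) < loss(theta):
                theta = candidate     # improved: do the same thing again
            else:
                break                 # no improvement: stop on this parameter
    return theta

# Toy loss with its minimum at theta = [2.0, -1.0].
toy_loss = lambda t: (t[0] - 2.0) ** 2 + (t[1] + 1.0) ** 2
print(naive_search(toy_loss, [0.0, 0.0]))
```

With this toy loss, the second parameter never moves at all: the fixed +0.1 step happens to point the wrong way, so the search gives up on it immediately. That is drawback 2 in action.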
Back-propagation remarkably solves both of these issues. It allows you to adjust all the parameters at once, and it allows you to calculate how much you should increase or decrease each parameter to (more or less) ensure that the model actually improves.
So far we are focussing on drawback 2: working out how much to increase or decrease each parameter. It turns out that you can use the objective function to determine both the magnitude and direction of change. You measure the error of the model, then adjust its parameters by a multiple of the error. That this works may seem spooky, but can be explained through calculus.
Three pieces: loss, learning rate, propagation
The loss is a measurement of how well the network performs. We have already seen that to train a neural network, you need to define an objective function that the system can use to test itself against the training data. In our reading last session, we found out that you can use this loss to help you improve the network’s parameters.
To do this, we were told, you should first calculate the loss, and then multiply it by a new number, the learning rate or rate of change. The learning rate is necessary because the loss can be very large. If you naively subtract the loss from a parameter, then you are likely to overcorrect the model, and make it worse. Thus you multiply the loss by a small number, e.g. $0.0099$, and then use this much smaller number to improve the relevant parameter.
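The overcorrection problem is easy to see with numbers (a toy illustration, assuming a one-parameter model whose loss is $(w - 3)^2$):

```python
# Toy loss for a single parameter w, minimised at w = 3.0.
loss = lambda w: (w - 3.0) ** 2

w = 10.0
print(loss(w))                      # 49.0: the loss is large
print(loss(w - loss(w)))            # naively subtracting the raw loss overshoots
print(loss(w - 0.0099 * loss(w)))   # a small multiple of the loss improves w
```

Subtracting the full loss hurls w from 10.0 all the way to -39.0, far past the target and much worse than before; scaling the loss by 0.0099 instead takes a gentle step towards 3.0.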
When you take the loss, and use it to adjust the parameters of the model, this is called propagation. At the moment, we are working with an extremely simple model, with only two parameters in a single layer. Later, when we encounter more complex models that have many layers of parameters, this idea of propagation will make more sense. First you adjust the last layer, then the second last layer, then the third last layer, and so on. Because this process begins in the final layer of the network, and moves back towards the first layer, it is called back-propagation.
Of course this implies that there is something called “forward-propagation,” and indeed there is such a thing. This is when you push data into the model to produce the output. The data begins in the first layer, then the output of the first layer is passed to the second layer, and so on, until it reaches the output layer.
Thus during training, there is a loop of forward- and back-propagation. First a batch of training data is put into the model. It is propagated forward to produce the output, which is then evaluated using the objective function to get the loss. During back-propagation, the loss is multiplied by the learning rate, and then fed back through the network to adjust all the weights. This is repeated over and over until the network reaches a desired state.
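That loop can be sketched for a one-weight linear model, y ≈ w·x (a toy Python illustration, not the book's code; the gradient of the squared-error loss is worked out by hand):

```python
# Training data drawn from y = 2x, so the correct weight is w = 2.0.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0                 # initial guess for the single weight
learning_rate = 0.01

for _ in range(500):
    # Forward propagation: push the data through the model.
    preds = [w * x for x in xs]
    # Objective function: mean squared error between preds and ys.
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # Back-propagation (one layer): the gradient of the loss with
    # respect to w, scaled by the learning rate, adjusts the weight.
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w = w - learning_rate * grad

print(w)  # close to 2.0
```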
How humans learn
The Little Learner is not only a book about how machines learn, but a book about how humans learn. We have discussed the book’s dialogic style and use of Scheme/LISP many times in the Anticodians, and it is safe to say that the authors’ view of human learning is controversial. The discussion proceeds logically, and the student’s ignorance is (for the most part) modelled by the voice in the second column. But the authors’ decision to avoid all foreshadowing (or to use the wanky word, prolepsis) does sometimes make the presentation a bit confusing. We are always heading in a direction, but they never tell us what that direction is.
Perhaps they could have taken some inspiration from artificial neural networks and the back-propagation algorithm. When a neural network is trained via back-propagation, it is given all the answers in advance, and then is allowed to find a path towards the answers in small incremental steps. The Little Learner might be easier for beginners if they, too, were provided with a picture of the answer before commencing their incremental journey towards it!