21 May 2025
Today we read up to Chapter 3, Frame 37, on page 67 of The Little Learner. You can find the code for the session in the GitHub repo.
Three steps to learning
So far we have only seen the data structures for deep learning. Today, we started to see how to apply learning algorithms to the data.
At a high level, we found that deep learning involves three steps:
- Guess the function parameters
- Measure the ‘loss’
- Improve the guess
You keep repeating these three steps until finally the ‘loss’ is nearly $0.0$. At this point, further improvement makes little difference, and you stop.
What a guess looks like
The first step is to “guess the function parameters.” How do we do this?
For the sake of this chapter, a guess is a list of parameters. Let’s say we are trying to fit a straight line to some data. A straight line can be represented by the following function:
$$
f(x) = wx + b
$$
In a machine learning situation, we already know $x$ and $y$ for a large number of entities. What we don’t know is what $w$ and $b$ should be. So, let’s make a guess! We can put $w$ and $b$ together into a list, and call that list $\theta$. In the book, the initial guess was $w = 0.0$ and $b = 0.0$. That results in the following $\theta$:
$$
\theta = (0.0, 0.0)
$$
or, in Scheme/Racket:
(define theta (list 0.0 0.0))
That’s it: a guess is a list of numbers. Now the line function is very simple. There are only two parameters, $w$ and $b$. A more complex function may have many parameters, and those parameters may have a very complex structure. In the future, theta may be a very complex variable: a whole collection of tensors all linked together. But that is, of course, the future.
How to calculate the loss
So far we have encountered just one “loss function,” which allows us to measure how good our latest guess is. The loss function works as follows:
- Use the current guess for the parameters to produce some predicted y-values for the x-values in the training set
- Subtract these predicted y-values from the real y-values that we already know
- Square the differences
- Take the sum of the squares
This is called the “l2 loss”. The “2” comes from the fact that we square the errors, which is the same as raising them to the power of 2.
The l2-loss is defined as follows in the text:
(define l2-loss
  (lambda (target)    ; the function we are trying to 'learn'
    (lambda (xs ys)   ; the data: the x-values and y-values we have
      (lambda (theta) ; our current guess for the function parameters
        (let ((pred-ys ((target xs) theta)))
          (sum
            (sqr
              (- ys pred-ys))))))))
Since the target function and xs and ys will stay the same throughout the whole training process, we can fix these in place, producing an “objective function.” The objective function already knows which function we would like to learn parameters for. It already knows what the x-values and y-values are in the training data. It is just waiting for some guesses that it can try out. In the session, we tried the following guesses for $\theta$, using line as the target function, and some simple points as the xs and ys:
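For reference, here is roughly what the session code below assumes. This is a sketch rather than the repo’s exact code: line-xs and line-ys are the data points used in the book, and expectant-function is, we presume, just l2-loss applied to line.

(define line-xs (tensor 2.0 1.0 4.0 3.0))
(define line-ys (tensor 1.8 1.2 4.2 3.3))
(define expectant-function (l2-loss line))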
; define objective function
(define objective-function (expectant-function line-xs line-ys))
; try first guess
(objective-function (list 0.0 0.0))
; = 33.21
; try second guess
(objective-function (list 0.0099 0.0))
; = 32.5892403
We could keep applying this simple algorithm, increasing $\theta_0$ by $0.0099$ each time:
(objective-function (list 0.0198 0.0))
; = 31.9743612
(objective-function (list 0.0297 0.0))
; = 31.3653627
If you perform this operation 106 times, you get the lowest loss this procedure can reach:
(objective-function (list (* 106 0.0099) 0.0))
; = 0.13501079999999996
If you keep going, the loss starts to go up again. That is, the guesses get worse:
(objective-function (list (* 107 0.0099) 0.0))
; = 0.13759470000000001
As the authors of The Little Learner would say, after 107 iterations, we have “overshot” the best possible answer.
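Rather than typing each guess by hand, we could let Racket do the counting. Here is a rough sketch of our own (not the book’s code) that keeps adding $0.0099$ to $\theta_0$ until the loss stops improving, using the objective-function defined above:

(define naive-learn
  (lambda (theta0 steps)
    (let ((loss-now  (objective-function (list theta0 0.0)))
          (loss-next (objective-function (list (+ theta0 0.0099) 0.0))))
      (if (>= loss-next loss-now)
          (list theta0 loss-now steps)       ; stop: the next step is no better
          (naive-learn (+ theta0 0.0099) (+ steps 1))))))

(naive-learn 0.0 0)
; = roughly (1.0494 0.135 106): the loss stops improving after 106 steps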
How to improve the guess
Clearly, just increasing $\theta_0$ by $0.0099$ is not a great algorithm for learning the function. It takes many iterations to learn an extremely simple function. Looking at the graph on page 58, the correct function is evidently something very close to $y = (1)x + 0$. Shouldn’t this be pretty easy to find? And what if we need to learn $b$ as well as $w$? This algorithm assumes that $b = 0$! (One crude workaround is to search over both parameters by brute force, as in the sketch below.)
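Here is such a brute-force sketch, our own rather than the book’s, assuming #lang racket and the objective-function defined above. It tries a coarse grid of guesses for both w and b and keeps the pair with the smallest loss:

(define best-guess
  (argmin objective-function
          (for*/list ((w (in-range 0.0 2.0 0.05))
                      (b (in-range -1.0 1.0 0.05)))
            (list w b))))

best-guess
; = roughly (1.05 0.0), the least-squares best fit for these points

This works for two parameters, but it clearly will not scale once theta becomes the “whole collection of tensors” mentioned above.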
Well, we’ve not seen a better solution yet, but there are many chapters to go, and many layers left in this onion. Assuredly by the end of all this, we will know the arcana of the field.
First interlude
Next meeting, we take a break from The Little Learner and will look at the LLM code for HESPI, a fascinating AI-for-GLAM project.
07 May 2025
The blog has been irregular of late due to my teaching schedule. But we are ploughing on through The Little Learner. This week we reached the end of “Interlude I: The More We Extend, the Less Tensor We Get.”
The gist
Interlude I concludes our basic introduction to the tensor, the fundamental data structure of deep learning. Chapter 2 covered how to create tensors, and how to use them with the line function to represent parameterised linear equations. “Interlude I” shows how to do arithmetic with tensors.
The joke in the title, “The More We Extend, the Less Tensor We Get,” has to do with the way a tensor’s rank changes when you perform arithmetic on it. The basic arithmetical operations, such as + and /, are defined only for scalar values such as 3 and 708.4. To compute the “sum” or “quotient” of a tensor or tensors, the concepts of “sum” and “quotient” need to be extended. That’s the first part of the title: “The More We Extend.” The second part of the title refers to the way that a tensor’s rank is affected by arithmetic. For example, if you take the sum of a tensor, its rank is reduced by one. Consider the following example:
$$
sum(
\begin{Bmatrix}
3 & 22 \\ 6 & 5 \\ 9 & 4
\end{Bmatrix}
)
$$
or, in Scheme (using malt):
(sum (tensor (tensor 3 22)
             (tensor 6 5)
             (tensor 9 4)))
This is a tensor². It has rank 2. Intuitively, it is two-dimensional. The two dimensions are the rows and columns. You can see the number of dimensions in the Scheme/Racket code, because there are two layers of tensor: an outer tensor, and the inner tensors, one per row.
When you compute the sum of this tensor, the rank drops by one, and you get a tensor¹ as the answer:
$$
\begin{Bmatrix}
25 \\ 11 \\ 13
\end{Bmatrix}
$$
(tensor 25 11 13) ; the answer
If you take the sum of this tensor, the rank falls by one again, and you get a tensor⁰, that is, a scalar:
$$
sum(
\begin{Bmatrix}
25 \\ 11 \\ 13
\end{Bmatrix}
) = 49
$$
(sum (tensor 25 11 13))
; = 49
The more we extend (the more “extended” arithmetical operations we apply to a tensor), the less tensor we get (the lower the tensor’s rank becomes).
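To watch the rank drop in code, here is a small sketch (rank, sum and tensor all come from malt):

(define rows (tensor (tensor 3 22)
                     (tensor 6 5)
                     (tensor 9 4)))
(rank rows)             ; = 2
(rank (sum rows))       ; = 1
(rank (sum (sum rows))) ; = 0, a scalar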
ASMR
As a group, we have been slowly working out how to actually recite the code in the book. It is remarkably difficult to read code in a natural way. One of the main problems is how literally to read the symbols. For example, try to read this out in English:
; a tensor³
(tensor (tensor (tensor 3 4)
                (tensor 7 8))
        (tensor (tensor 21 9)
                (tensor 601 1)))
One way is to read the symbols very literally:
Open parentheses tensor. Open parentheses tensor. Open parentheses tensor, 3, 4, close parentheses. Open parentheses tensor, 7, 8, close parentheses. Close parentheses. Open parentheses tensor. Open parentheses tensor, 21, 9, close parentheses. Open parentheses tensor, 601, 1, close parentheses. Close parentheses. Close parentheses.
The problems with this approach are obvious. If you managed to read that blockquote without your eyes glazing over, well done! Your powers of concentration are supreme. The approach that we have been nutting out as a group involves interpreting the symbols a bit more:
A tensor³, consisting of two tensors², the first tensor² containing the tensors¹ 3-4 and 7-8. The second tensor² contains the tensors¹ 21-9 and 601-1.
This form of recitation is superior for human comprehension. Why are there all those parentheses, words and digits? Why, to create a tensor³, consisting of two tensors² … etc. When we read this way, we are essentially “evaluating” the code in the way that the Scheme interpreter evaluates it. We work out what data structure is represented by all the parentheses and other symbols, and try to represent that data structure in a way that works with our hardware: our minds!
One waggish member of the group suggested that code-reading is a kind of ASMR. It is indeed hypnotic to read out so many of these tensors. When we recite a “same-as chart,” which shows the evaluation of a complex expression, the effect is similarly hypnotic. At the beginning of every line of a “same-as chart,” the reader says “which is the same as.” It is not the most intrinsically poetic refrain, but its repetition a dozen times in the space of a few minutes does lull the mind.
Does this “ASMR” quality tell us something about the nature of code as a form of writing? I think it might, but I’m not sure exactly what it tells us. I wonder if there is some useful analogy to be made between code and minimalism in art and music. Like minimalist art and music, code is repetitious, and it is this repetition more than anything that creates the ASMR effect noted by the group.
Where are the ANNs?
One member of the group expressed some (mild and good-natured) frustration at our progress through the book. “I don’t feel like I’ve learned anything about deep learning yet.”
I can understand this point of view, even though it is “wrong” in an obvious way. Another member of the group observed that she has already learned many things about deep learning that she wished she had known earlier. Learning about the data structures used to build artificial neural networks (ANNs) is clearly important, if you want to learn how ANNs are built!
But it is easy to see why a reader at this point in the book might be frustrated, and not feel that they have learned anything yet about ANNs. I think there are two good reasons for this feeling:
- The dialogic style of the book does not lend itself to summary and signposting. The authors do not explain why they are introducing certain topics at certain times. The book tries to maintain the Socratic framework in which the teacher (left column) has the entire subject mapped out in advance, and the student (right column) is just happy to follow the reasoning step-by-step without trying to glance at what’s coming next. This is an interesting decision on the authors’ part. Why not just have their student character ask the teacher from time to time why a given topic is important to the overall progress of the book?
- So far we have learned about data structures only. The line function allows a linear equation to be represented in a parameterised form, so its parameters can be learned. The tensor structure allows the parameters for many interlinked line functions to be stored in a convenient way, so that lines can be built up into complex networks that model the relationships between observed values in the training data. While data structures are very important, they somehow seem less essential than algorithms in computing. The authors decided it was important to learn the data structures before the algorithms, perhaps because it is impossible to demonstrate the algorithms without having data for the algorithms to work on. This decision to teach the data structures first makes good sense, but it is bound to cause frustration in certain readers, when they expect to be shown the algorithms that make an ANN work.
The Little Learner is a great book. We are making good progress through it, and it is a lot of fun to read aloud. It is an enlivening challenge to learn how to recite the code. It is an enlivening challenge to try to see the bigger picture as it is gradually sketched out piece-by-piece. Next week we do learn an algorithm, one of the most important algorithms of our time: gradient descent. We will have to see where that slippery slope lands us.
09 Apr 2025
This week we read up to page 41 of The Little Learner, and “learned” a bit about the tensor, the “fundamental data structure in deep learning”.
Kurzgesagt
In a nutshell, a tensor is a multidimensional array. You can have one-dimensional tensors (also called vectors):
$$
\begin{Bmatrix}
1.0 & 27.1 & x^2 & -0.7
\end{Bmatrix}
$$
You can have two-dimensional tensors (also called matrices):
$$
\begin{Bmatrix}
42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix}
$$
You can have three-dimensional tensors. This tensor could represent a (very small) bitmap image, with four pixels, each of which has a colour specified by three values for red, green and blue:
$$
\begin{Bmatrix}
\begin{Bmatrix}7 \\ 3 \\ 255\end{Bmatrix} & \begin{Bmatrix}62 \\ 107 \\ 7\end{Bmatrix} \\ \begin{Bmatrix}200 \\ 200 \\ 100\end{Bmatrix} & \begin{Bmatrix}62 \\ 34 \\ 254\end{Bmatrix}
\end{Bmatrix}
$$
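In Scheme/Racket (using malt), the same tiny image might be written as follows. This is just a sketch, with each innermost tensor holding one pixel’s red, green and blue values:

(define tiny-image
  (tensor (tensor (tensor 7 3 255)     ; first row of pixels
                  (tensor 62 107 7))
          (tensor (tensor 200 200 100) ; second row of pixels
                  (tensor 62 34 254))))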
Each of the numbers in a tensor is called a scalar. Presumably this name comes about because of the way scalar multiplication works. If you multiply a tensor by a single number, then you simply multiply all the members of the tensor by that number:
$$
\begin{Bmatrix}
42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix} \times 6 = \begin{Bmatrix} 42 \times 6 & 42 \times 6 & 42 \times 6 \\ 42 \times 6 & y \times 6 & 42 \times 6 \\ 42 \times 6& 42 \times 6 & 42 \times 6
\end{Bmatrix}
$$
Thus the number $6$ “scales” the tensor, and can be called a scalar.
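A quick sketch of the same idea in Scheme/Racket, assuming malt’s extended arithmetic is loaded:

(* 6 (tensor 42 42 42))
; = (tensor 252 252 252), every element scaled by 6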
(I have just read the relevant Wikipedia page to verify my conjecture about the etymology of “scalar,” and I’m not totally wrong. The term originally derives from the Latin for “ladder.”)
(The derivation of terms in linear algebra is always interesting. As mentioned above, a two-dimensional tensor is normally called a “matrix.” Believe it or not, this word derives from the Latin for “womb”!)
Scheme doesn’t natively support tensors. It only natively supports one-dimensional arrays. The malt package devised for The Little Learner adds in support for tensor operations. The key rule of tensors is that they need to have a consistent shape. This is not a valid tensor, for example:
$$
\begin{Bmatrix}
1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & & 9 \\ 10 & &
\end{Bmatrix}
$$
The “second” or “outer” dimension is okay: there are three columns. But the “first” or “inner” dimension is no good: each column has a different number of rows:
$$
\begin{Bmatrix}
1 \\ 4 \\ 7 \\ 10
\end{Bmatrix}
\begin{Bmatrix}
2 \\ 5
\end{Bmatrix}
\begin{Bmatrix}
3 \\ 6 \\ 9
\end{Bmatrix}
$$
The malt package enforces this rule. If you try to create this faulty tensor with mismatched columns, you will get an error:
(tensor (tensor 1 4 7 10)
        (tensor 2 5)
        (tensor 3 6 9))
; tensor: Mismatched shapes: (#(1 4 7 10) #(2 5) #(3 6 9))
Some functions
We learned a few useful malt functions in the first part of the chapter.
- scalar?: is this object a scalar?
- tensor?: is this object a tensor?
- shape: what is the shape of this tensor? How many rows, columns, aisles etc. does it have? What is the length of each of its dimensions? For example, if you have a full-colour bitmap image in 1800 x 1169 resolution, and asked for (shape my-big-image), you would get '(1800 1169 3) as the answer: 1800 columns, 1169 rows and 3 colour values.
- tlen: what is the size of the outer dimension of this tensor? E.g. if you have a matrix with 6 columns and 3 rows, the tlen of the matrix will be 6. You can use tlen to implement the shape function.
- rank: how many dimensions a tensor has. For instance, your typical spreadsheet is a two-dimensional tensor, made up of rows and columns. If you had your spreadsheet inside a Racket program, and typed (rank my-spreadsheet), out would pop the answer 2.
- tref: how to get items out of a tensor. For instance, if you wanted the third column of a tensor called my-spreadsheet, you could write (tref my-spreadsheet 2).
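To make these concrete, here is a short sketch that exercises the functions on a small tensor (assuming malt is loaded):

(define t (tensor (tensor 1 2 3)
                  (tensor 4 5 6)))
(scalar? 6)  ; = #t
(tensor? t)  ; = #t
(shape t)    ; = '(2 3)
(tlen t)     ; = 2
(rank t)     ; = 2
(tref t 1)   ; = (tensor 4 5 6)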
Our upward course…
Sometimes scalars and tensors can be hard to tell apart. For example, here is a scalar:
$$
6
$$
In Scheme/Racket, that would be:
6
Here is a one-tensor:
$$
\begin{Bmatrix}6\end{Bmatrix}
$$
Or in Scheme/Racket:
(tensor 6)
Now here is a two-tensor, or matrix, with one row and one column:
$$
\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}
$$
(tensor (tensor 6))
And a three-tensor…
$$
\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}
$$
(tensor (tensor (tensor 6)))
And a four-tensor…
$$
\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}
$$
(tensor (tensor (tensor (tensor 6))))
And a five-tensor…
$$
\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}
$$
(tensor (tensor (tensor (tensor (tensor 6)))))
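Each extra layer of wrapping adds one to the rank, which we can check with the rank and shape functions from above:

(rank (tensor (tensor (tensor (tensor (tensor 6))))))  ; = 5
(shape (tensor (tensor (tensor (tensor (tensor 6)))))) ; = '(1 1 1 1 1)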
What is a dimension anyway? We may leave the last words to A. Square:
And once there, shall we stay our upward course? In that blessed region of Four Dimensions, shall we linger on the threshold of the Fifth, and not enter therein? Ah, no! Let us rather resolve that our ambition shall soar with our corporal ascent. Then, yielding to our intellectual onset, the gates of the Sixth Dimension shall fly open; after that a Seventh, and then an Eighth— How long I should have continued I know not. In vain did the Sphere, in his voice of thunder, reiterate his command of silence, and threaten me with the direst penalties if I persisted. Nothing could stem the flood of my ecstatic aspirations.
26 Mar 2025
Today we read up to frame 29 on page 27 of The Little Learner. You can read/run the code from the chapter in our GitHub repo for this reading.
$y = wx + b$
In this session, we learned how to represent this familiar function in Racket. The form in which we eventually cast the function was, at first sight, rather strange:
(define line
  (lambda (x)
    (lambda (theta)
      (+ (* (zeroth-member theta) x) ; wx
         (first-member theta)))))    ; b
There are three main ways that this representation differs from the $y = mx + b$ we all learned at school:
First difference: $m$ is $w$
The authors don’t tell us why they use $w$ for the coefficient of $x$, rather than the more familiar $m$, but it is presumably because the parameters which are coefficients are typically called the “weights” of a model. They left $b$ as is, presumably because the non-coefficient parameters of a model are typically called the “biases.” Thus $y = wx + b$ can be read as “$y$ equals $x$ times its weight plus the bias.”
Second difference: $w$ and $b$ are stored in a “tensor”
We’ve not been introduced to “tensors” yet, but they are implicit in the notation used in the chapter (see page xxiii of the book). For now, a “tensor” is just an array of numbers. In the final version of line, w and b are contained in a single value theta, which looks like this: (tensor w b). Since x needs to be multiplied by w, we need to get the “zeroth” member of theta: (zeroth-member theta). Since b needs to be added to this, we need to get the “first” member of theta: (first-member theta).
Third difference: line is not one function, but two!
We normally think of $y = mx + b$ as a single function. We could rewrite it as $f(x) = mx + b$. But there are two lambda expressions in line, meaning there are two functions. What is the deal? The first lambda is a “function maker.” As its input, it takes some x. It then gives us a new function, where x is fixed, and theta (i.e. w and b) aren’t known yet. If you plug theta into this new function, out will pop the answer.
((line 10.0)        ; new function, where x=10
 (tensor 4.0 2.0))  ; now plug in w=4.0, b=2.0
;; answer = 42
This is called a “parameterized” function, the authors explain. In an “unparameterized” function, w and b would be fixed. But now we allow the function to change according to some parameters that we give it. It could be $y=7x+3$ or $y=22x-9$ or $y=(z+58)x - (31z)$. Because it is a “parameterized” function, we can try out different “parameters,” and let the computer choose the right ones.
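For example, here is a small sketch using the line function defined above, trying two of those parameterisations at $x = 2.0$:

((line 2.0) (tensor 7.0 3.0))   ; y = 7x + 3, gives 17.0
((line 2.0) (tensor 22.0 -9.0)) ; y = 22x - 9, gives 35.0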
(define machine-learning
  (find-the-right-parameters
   the-machine))
The reason for all this is explained in the book. In machine learning, you usually know $x$ and $y$: this is your ‘training data’. The problem is to work out what $w$ and $b$ should be. Hence our line function is constructed so that it is given x to begin with, and only later is supplied with w and b. In principle, we could try many different values for w and b, and see which combination of w and b produces the correct y for a given x. If the computer tries out the combinations, and chooses the w and b for us, then this is called “machine learning.”
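As a toy illustration of our own (not the book’s): suppose we know that x = 3.0 should give y = 5.0. We can test a few candidate thetas and keep the ones that reproduce y:

(filter (lambda (theta) (= ((line 3.0) theta) 5.0))
        (list (tensor 0.0 0.0)
              (tensor 1.0 2.0)   ; w = 1.0, b = 2.0
              (tensor 2.0 1.0))) ; w = 2.0, b = 1.0
; only (tensor 1.0 2.0) survives, since 1.0 * 3.0 + 2.0 = 5.0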
We had a wide-ranging discussion about the ‘learning’ metaphor in this context. Do linear functions provide a good model of learning? Is it possible that humans “learn” by adjusting activations of neurons in the same manner that a machine learning algorithm adjusts the parameters of a large and complex linear function? The discussion was very wide-ranging, my notes are inadequate, and we have only just scratched the outermost surface of this topic into whose deeps we are about to plunge.
Computer programming is a highly metaphorical pursuit, as the wisest heads in the trade are quick to admit. Programmers’ source code is a textual model of the world they are trying to enact in their software. Programmers have different conceptualisations of what they are doing, and the code they write reflects the world-picture that tells them how to write it. Well—these are the ambits of our group, in any case, and the aptness of the “learning” metaphor will be a topic of conversation for many meetings to come…
12 Mar 2025
This week we read up to frame 53, on page 16 of The Little Learner.
The aesthetics of recursion
This week the authors treated us to an extremely concise introduction to the theory of recursion. They presented recursion as a strategy of extreme parsimony. Recursion allows us to write “interesting programs” with no loops. Recursion allows us to implement integer addition without using +. One member of the group marvelled at the resulting programs. He found them elegant. Did anyone else?
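For anyone who missed the session, here is a sketch of the kind of definition in question (our reconstruction, not necessarily the book’s exact code), building addition out of nothing but add1, sub1 and zero?:

(define add
  (lambda (n m)
    (if (zero? m)
        n                          ; adding zero changes nothing
        (add (add1 n) (sub1 m))))) ; move one unit from m to n

(add 3 4) ; = 7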
We discussed more largely the value of parsimony in reasoning. One member of the group observed that parsimony—or elegance—is an aesthetic value that transcends disciplinary boundaries. In a recent article for Critical Inquiry, Imogen Forbes-Macphail observes that
Like mathematicians, literary scholars find beauty in the pursuit of their work; in the products of that work (critical arguments or scholarship); and in the objects of that scholarship, literary artifacts themselves. (2025, p. 481)
Forbes-Macphail is not the first to identify a significant aesthetic dimension in scientific thought, though it is interesting to argue that the aesthetics of mathematical and literary inquiry are similar. What does elegance mean in literary criticism? Do literary critics have the same relish for parsimony as LISP hackers like the authors of The Little Learner? Is an interpretation of The Rover more powerful if it can be made using fewer concepts? The old debate among creative writing instructors, about the merits of ‘minimalist’ and ‘maximalist’ style, rears its head again.
This discussion hearkens back to Knuth’s theory of ‘psychological correctness’, which we discussed last year. I also note Douglas Hofstadter and Melanie Mitchell’s brilliant argument about the vitality of ‘aesthetic perception’ in science, in their jointly-authored chapters for Fluid Concepts and Creative Analogies.
Now we have this insight
Several times in the exposition, the authors claim that something “gives us an insight,” or that we now “have this insight.” “Do we?” quipped one member of the group.
The central conceit of the book is that we, the readers, are identical with the voice in the second column. The second voice models our own experience. Of course, this whole literary structure implies that we are not the voice in the second column. The book constantly entreats us to compare ourselves to this model student, and respond to the teacherly voice in the first column in our own way. The irony of the form finds its counterpart in the irony of the reader’s response.
Why do we ‘invoke’?
The idea of recursion is that a function “invokes itself.” We paused for a while on the idea of “invocation.” Why is it that we “call” or “invoke” functions? What is the underlying metaphor?
The discussion recalled to me these famous sentences from the beginning of The Structure and Interpretation of Computer Programs:
The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells. (1996, p. 2)
One member of the group preferred “invoke” to “call,” for the very reason that it implies that the “invoker” has command over the function they summon to their bidding. On the final verge of computation, at the very brink of the machine, when symbols have lost their meaning and the stack trace has buried itself in silicon, the programmer may find herself in the position of Byron’s Manfred:
I have no choice; there is no form on earth
Hideous or beautiful to me. Let him,
Who is most powerful of ye, take such aspect
As unto him may seem most fitting.—Come!
“Come, add1!” cries the wizard in his misery. “Come, add1, and increment my integer!”
The efficiency of Scheme
One member of the group asked how it is possible to write efficient programs in Scheme, when recursion is required for all looping.
The answer: tail-call optimisation. A topic slightly off the main track of our Critical Code Studies group! But a fascinating one regardless… parsimony strikes again.
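A minimal sketch of the idea: in the loop below, the recursive call is the very last thing the function does, so Racket can reuse the current stack frame instead of growing the stack, just like an ordinary loop:

(define count-down
  (lambda (n)
    (if (zero? n)
        'done
        (count-down (sub1 n))))) ; tail call: nothing left to do afterwards

(count-down 1000000) ; = 'done, with no stack overflow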
References
Forbes-Macphail, Imogen. “The Four-Color Theorem and the Aesthetics of Computational Proof.” Critical Inquiry 51, no. 3 (March 2025): 470–91. doi:10.1086/734121.
Hofstadter, Douglas R., Daniel Defays, David Chalmers, Robert French, Melanie Mitchell, and Gary McGraw. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. New York: Basic Books, 1995.
Sussman, Gerald J., Julie Sussman, and Harold Abelson. Structure and Interpretation of Computer Programs. Second edition. Cambridge: MIT Press, 1996.