Coding in Many Dimensions

This week we read up to page 41 of The Little Learner, and “learned” a bit about the tensor, the “fundamental data structure in deep learning”.

Kurzgesagt

In a nutshell, a tensor is a multidimensional array. You can have one-dimensional tensors (also called vectors):

$$ \begin{Bmatrix} 1.0 & 27.1 & x^2 & -0.7
\end{Bmatrix} $$

You can have two-dimensional tensors (also called matrices):

$$ \begin{Bmatrix} 42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix} $$

You can have three-dimensional tensors. This tensor could represent a (very small) bitmap image, with four pixels, each of which has a colour specified by three values for red, green and blue:

$$ \begin{Bmatrix} \begin{Bmatrix}7 \\ 3 \\ 255\end{Bmatrix} & \begin{Bmatrix}62 \\ 107 \\ 7\end{Bmatrix} \\ \begin{Bmatrix}200 \\ 200 \\ 100\end{Bmatrix} & \begin{Bmatrix}62 \\ 34 \\ 254\end{Bmatrix}
\end{Bmatrix} $$

Each of the numbers in a tensor is called a scalar. Presumably this name comes about because of the way scalar multiplication works. If you multiply a tensor by a single number, then you simply multiply all the members of the tensor by that number:

$$ \begin{Bmatrix} 42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix} \times 6 = \begin{Bmatrix} 42 \times 6 & 42 \times 6 & 42 \times 6 \\ 42 \times 6 & y \times 6 & 42 \times 6 \\ 42 \times 6& 42 \times 6 & 42 \times 6
\end{Bmatrix} $$

Thus the number $6$ “scales” the tensor, and can be called a scalar.

(I have just read the relevant Wikipedia page to verify my conjecture about the etymology of “scalar,” and I’m not totally wrong. The term originally derives from the Latin for “ladder.”)

(The derivation of terms in linear algebra is always interesting. As mentioned above, a two-dimensional tensor is normally called a “matrix.” Believe it or not, this word derives from the Latin for “womb”!)
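Scaling is simple enough to sketch by hand. Here is a toy version in plain Racket (my own sketch, using nested lists rather than malt’s actual representation):

(define (scale s t)
  (if (number? t)
      (* s t)                            ; a scalar: just multiply it
      (map (lambda (e) (scale s e)) t))) ; a tensor: scale each member

(scale 6 '((42 42 42) (42 42 42) (42 42 42)))
;; => '((252 252 252) (252 252 252) (252 252 252))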

Scheme doesn’t natively support tensors. It only natively supports one-dimensional arrays. The malt package devised for The Little Learner adds in support for tensor operations. The key rule of tensors is that they need to have a consistent shape. This is not a valid tensor, for example:

$$ \begin{Bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & & 9 \\ 10 & & \end{Bmatrix} $$

The “first” or “outer” dimension is okay: there are three columns. But the “second” or “inner” dimension is no good: each column has a different number of rows:

$$ \begin{Bmatrix} 1 \\ 4 \\ 7 \\ 10 \end{Bmatrix} \begin{Bmatrix} 2 \\ 5 \end{Bmatrix} \begin{Bmatrix} 3 \\ 6 \\ 9 \end{Bmatrix} $$

The malt package enforces this rule. If you try to create this faulty tensor with mismatched columns, you will get an error:

(tensor (tensor 1 4 7 10)
        (tensor 2 5)
        (tensor 3 6 9))

; tensor: Mismatched shapes: (#(1 4 7 10) #(2 5) #(3 6 9))
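For contrast, padding the ragged columns out to the same length (the filler values 8, 11 and 12 are my own arbitrary choices) gives a tensor that malt will happily accept:

(tensor (tensor 1 4 7 10)
        (tensor 2 5 8 11)
        (tensor 3 6 9 12))
; a well-formed tensor: three members, each of length four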

Some functions

We learned a few useful malt functions in the first part of the chapter.

  • scalar?: is this object a scalar?
  • tensor?: is this object a tensor?
  • shape: what is the shape of this tensor? How many rows, columns, aisles etc. does it have? What is the length of each of its dimensions? For example, if you have a full-colour bitmap image at 1800 × 1169 resolution and ask for (shape my-big-image), you would get '(1800 1169 3) as the answer: 1800 columns, 1169 rows and 3 colour values.
  • tlen: what is the size of the outer dimension of this tensor? E.g. if you have a matrix with 6 columns and 3 rows, the tlen of the matrix will be 6. You can use tlen to implement the shape function, as the sketch after this list shows.
  • rank: how many dimensions a tensor has. For instance, your typical spreadsheet is a two-dimensional tensor, made up of rows and columns. If you had your spreadsheet inside a Racket program, and typed (rank my-spreadsheet), out would pop the answer 2.
  • tref: how to get items out of a tensor. For instance, if you wanted the third column of a tensor called my-spreadsheet, you could write (tref my-spreadsheet 2).
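As promised in the list above, here is a minimal sketch of how shape might be written with tlen and tref. It assumes, as malt’s shape rule guarantees, that every member of a tensor has the same shape, so inspecting the zeroth member is enough. (In a real session this definition would shadow malt’s own shape.)

(define shape
  (lambda (t)
    (if (scalar? t)
        '()                          ; a scalar has no dimensions
        (cons (tlen t)               ; the outer dimension first...
              (shape (tref t 0)))))) ; ...then the shape of any member

(shape (tensor (tensor 1 2 3) (tensor 4 5 6)))
;; => '(2 3)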

Our upward course…

Sometimes scalars and tensors can be hard to tell apart. For example, here is a scalar:

$$ 6 $$

In Scheme/Racket, that would be:

6

Here is a one-tensor:

$$ \begin{Bmatrix}6\end{Bmatrix} $$

Or in Scheme/Racket:

(tensor 6)

Now here is a two-tensor, or matrix, with one row and one column:

$$ \begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor 6))

And a three-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor 6)))

And a four-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor (tensor 6))))

And a five-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor (tensor (tensor 6)))))
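They are hard to tell apart by eye, but easy to tell apart in code. Assuming shape and rank behave as described above, a quick session distinguishes the lookalikes:

(shape 6)                            ;; => '()     a lone scalar
(shape (tensor 6))                   ;; => '(1)
(shape (tensor (tensor 6)))          ;; => '(1 1)
(shape (tensor (tensor (tensor 6)))) ;; => '(1 1 1)
(rank (tensor (tensor (tensor 6))))  ;; => 3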

What is a dimension anyway? We may leave the last words to A. Square:

And once there, shall we stay our upward course? In that blessed region of Four Dimensions, shall we linger on the threshold of the Fifth, and not enter therein? Ah, no! Let us rather resolve that our ambition shall soar with our corporal ascent. Then, yielding to our intellectual onset, the gates of the Sixth Dimension shall fly open; after that a Seventh, and then an Eighth— How long I should have continued I know not. In vain did the Sphere, in his voice of thunder, reiterate his command of silence, and threaten me with the direst penalties if I persisted. Nothing could stem the flood of my ecstatic aspirations.

Is a neuron a line?

Today we read up to frame 29 on page 27 of The Little Learner.1 You can read/run the code from the chapter in our GitHub repo for this reading.

$y = wx + b$

In this session, we learned how to represent this familiar function in Racket. The form in which we eventually cast the function was, at first sight, rather strange:

(define line
  (lambda (x)
    (lambda (theta)
      (+ (* (zeroth-member theta) x) ; wx
         (first-member theta)))))    ; b

There are three main ways that this representation differs from the $y = mx + b$ we all learned at school:

First difference: $m$ is $w$

The authors don’t tell us why they use $w$ for the coefficient of $x$, rather than the more familiar $m$, but it is presumably because the parameters which are coefficients are typically called the “weights” of a model. They left $b$ as is, presumably because the non-coefficient parameters of a model are typically called the “biases.” Thus $y = wx + b$ can be read as “$y$ equals $x$ times its weight plus the bias.”

Second difference: $w$ and $b$ are stored in a “tensor”

We’ve not been introduced to “tensors” yet, but they are implicit in the notation used in the chapter (see page xxiii of the book). For now, a “tensor” is just an array of numbers. In the final version of line, w and b are contained in a single value theta, which looks like this: (tensor w b). Since x needs to be multiplied by w, we need to get the “zeroth” member of theta: (zeroth-member theta). Since b needs to be added to this, we need to get the “first” member of theta: (first-member theta).
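Assuming the accessors do what their names suggest, indexing theta’s members from zero, the unpacking looks like this:

(define theta (tensor 4.0 2.0))
(zeroth-member theta) ;; => 4.0, the weight w
(first-member theta)  ;; => 2.0, the bias b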

Third difference: line is not one function, but two!

We normally think of $y = mx + b$ as a single function. We could rewrite it as $f(x) = mx + b$. But there are two lambda expressions in line, meaning there are two functions. What is the deal? The first lambda is a “function maker.” As its input, it takes some x. It then gives us a new function, where x is fixed and theta (i.e. w and b) is not yet known. If you plug theta into this new function, out will pop the answer.

((line 10.0)       ; new function, where x = 10.0
 (tensor 4.0 2.0)) ; now plug in w = 4.0, b = 2.0
;; answer = 42.0

This is called a “parameterized” function, the authors explain. In an “unparameterized” function, w and b would be fixed. But now we allow the function to change according to some parameters that we give it. It could be $y = 7x + 3$ or $y = 22x - 9$ or $y = (z + 58)x - 31z$. Because it is a “parameterized” function, we can try out different “parameters,” and let the computer choose the right ones.

(define machine-learning
  (find-the-right-parameters
    the-machine))

The reason for all this is explained in the book. In machine learning, you usually know $x$ and $y$: this is your ‘training data’. The problem is to work out what $w$ and $b$ should be. Hence our line function is constructed so that it is given x to begin with, and only later is supplied with w and b. In principle, we could try many different values for w and b, and see which combination of w and b produces the correct y for a given x. If the computer tries out the combinations, and chooses the w and b for us, then this is called “machine learning.”
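To make that concrete, here is a toy brute-force “learner” of my own devising (not the book’s method, which will be far cleverer): given one training pair (x, y) and a handful of candidate thetas, keep whichever candidate brings the line closest to y.

; Pick the candidate theta whose line comes closest to y at x.
; argmin comes from racket/list, which #lang racket includes.
(define (closest-theta x y candidates)
  (argmin (lambda (theta)
            (abs (- y ((line x) theta))))
          candidates))

(closest-theta 10.0 42.0
               (list (tensor 7.0 3.0)
                     (tensor 4.0 2.0)
                     (tensor 22.0 -9.0)))
;; => (tensor 4.0 2.0), since 4.0 * 10.0 + 2.0 = 42.0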

We had a wide-ranging discussion about the ‘learning’ metaphor in this context. Do linear functions provide a good model of learning? Is it possible that humans “learn” by adjusting the strengths of connections between neurons, in the same manner that a machine learning algorithm adjusts the parameters of a large and complex linear function? The discussion outran my inadequate notes, and we have only just scratched the outermost surface of this topic into whose deeps we are about to plunge.

Computer programming is a highly metaphorical pursuit, as the wisest heads in the trade are quick to admit. Programmers’ source code is a textual model of the world they are trying to enact in their software. Programmers have different conceptualisations of what they are doing, and the code they write reflects the world-picture that tells them how to write it. Well—these are the ambits of our group, in any case, and the aptness of the “learning” metaphor will be a topic of conversation for many meetings to come…

  1. Once again, the moderator of the group didn’t look ahead and realise that the chapter ended on the following page… but the discussion that waylaid us was good in any case. 

Now we have this insight

This week we read up to frame 53, on page 16 of The Little Learner.1

The aesthetics of recursion

This week the authors treated us to an extremely concise introduction to the theory of recursion. They presented recursion as a strategy of extreme parsimony. Recursion allows us to write “interesting programs” with no loops. Recursion allows us to implement integer addition without using +. One member of the group marvelled at the resulting programs. He found them elegant. Did anyone else?
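For anyone who wants to marvel along, here is the flavour of it, reconstructed from memory (so treat it as a sketch rather than the book’s exact text): peel one off the second number until it reaches zero, re-adding with add1 on the way back out. It is named add here to avoid shadowing Racket’s built-in +.

(define add
  (lambda (n m)
    (cond
      ((zero? m) n)                     ; adding zero changes nothing
      (else (add1 (add n (sub1 m))))))) ; n + m = 1 + (n + (m - 1))

(add 3 4) ;; => 7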

We discussed more largely the value of parsimony in reasoning. One member of the group observed that parsimony—or elegance—is an aesthetic value that transcends disciplinary boundaries. In a recent article for Critical Inquiry, Imogen Forbes-Macphail observes that

Like mathematicians, literary scholars find beauty in the pursuit of their work; in the products of that work (critical arguments or scholarship); and in the objects of that scholarship, literary artifacts themselves. (2025, p. 481)

Forbes-Macphail is not the first to identify a significant aesthetic dimension in scientific thought, though it is interesting to argue that the aesthetics of mathematical and literary inquiry are similar. What does elegance mean in literary criticism? Do literary critics have the same relish for parsimony as LISP hackers like the authors of The Little Learner? Is an interpretation of The Rover more powerful if it can be made using fewer concepts? The old debate among creative writing instructors, about the merits of ‘minimalist’ and ‘maximalist’ style, rears its head again.

This discussion hearkens back to Knuth’s theory of ‘psychological correctness’, which we discussed last year. I also note Douglas Hofstadter and Melanie Mitchell’s brilliant argument about the vitality of ‘aesthetic perception’ in science, in their jointly-authored chapters for Fluid Concepts and Creative Analogies.

Now we have this insight

Several times in the exposition, the authors claim that something “gives us an insight,” or that we now “have this insight.” “Do we?” quipped one member of the group.

The central conceit of the book is that we, the readers, are identical with the voice in the second column. The second voice models our own experience. Of course, this whole literary structure implies that we are not the voice in the second column. The book constantly entreats us to compare ourselves to this model student, and respond to the teacherly voice in the first column in our own way. The irony of the form finds its counterpart in the irony of the reader’s response.

Why do we ‘invoke’?

The idea of recursion is that a function “invokes itself.” We paused for a while on the idea of “invocation.” Why is it that we “call” or “invoke” functions? What is the underlying metaphor?

The discussion recalled to me these famous sentences from the beginning of Structure and Interpretation of Computer Programs:

The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells. (1996, p. 2)

One member of the group preferred “invoke” to “call,” for the very reason that it implies that the “invoker” has command over the function they summon to their bidding. On the final verge of computation, at the very brink of the machine, when symbols have lost their meaning and the stack trace has buried itself in silicon, the programmer may find herself in the position of Byron’s Manfred:

I have no choice; there is no form on earth
Hideous or beautiful to me. Let him,
Who is most powerful of ye, take such aspect
As unto him may seem most fitting.—Come!

“Come, add1!” cries the wizard in his misery. “Come, add1, and increment my integer!”

The efficiency of Scheme

One member of the group asked how it is possible to write efficient programs in Scheme, when recursion is required for all looping.

The answer: tail-call optimisation. A topic slightly off the main track of our Critical Code Studies group! But a fascinating one regardless… parsimony strikes again.
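In brief (my own illustration, not from the book): when the recursive call is the very last thing a function does, Scheme is required to reuse the current stack frame, so a recursive “loop” runs in constant stack space no matter how many iterations it makes.

; Sum the integers from n down to 0, accumulating as we go.
(define (sum-to n acc)
  (if (zero? n)
      acc
      (sum-to (sub1 n) (+ acc n)))) ; tail call: nothing left to do afterwards

(sum-to 1000000 0) ;; => 500000500000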

References

Forbes-Macphail, Imogen. “The Four-Color Theorem and the Aesthetics of Computational Proof.” Critical Inquiry 51, no. 3 (March 2025): 470–91. doi:10.1086/734121.

Hofstadter, Douglas R., Daniel Defays, David Chalmers, Robert French, Melanie Mitchell, and Gary McGraw. Fluid Concepts and Creative Analogies: Computer Models of the Fundamental Mechanisms of Thought. New York: Basic Books, 1995.

Sussman, Gerald J., Julie Sussman, and Harold Abelson. Structure and Interpretation of Computer Programs. 2nd ed. Cambridge, MA: MIT Press, 1996.

Notes

  1. Due to a cognitive deficiency of the group leader, we did not actually complete Chapter 0, which we would have had ample time to do… 

What is, was and shall be

In our session this week, we continued to learn the basics of the Scheme/Racket programming language, working through pages 4–8 of The Little Learner.

As so often happens in close reading, our attention was arrested by an apparently innocuous word: “is.” The word “is” has a peculiar meaning in the language of the book. The authors frequently write that something “is” or “is the same as” something else. For example, they pose the question

What is

(area-of-rectangle 3.0)

?

The answer:

(λ (height)
  (* 3.0 height))

Or again later:

The expression

(add3 4)

is the same as

((λ (x)
  (+ 3 x))
 4)

There is a curious inversion in their presentation. First they present a series of these examples in which s-expressions are evaluated, some involving closures, where a higher-order function returns a new function that “remembers” values passed to the outer function. Only then do they admit that their use of “is” and “is the same as” is not entirely idiomatic:

This way of remembering arguments passed in for formals of outer functions inside inner functions is known as β-substitution.

In other words, the word “is” actually means “can be transformed into via β-substitution.” Two s-expressions “are the same expression” when they can be transformed in this way. But what does this transformation entail? It entails taking the name of something, e.g. add3, height, area-of-rectangle, and replacing it with its value. Is “is” the right word for this? Is the name of a thing “the same as” the thing?
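Spelled out step by step (my own unrolling, not the book’s), the transformation runs:

(add3 4)            ; replace the name add3 with its value...
((λ (x) (+ 3 x)) 4) ; ...substitute 4 for the formal x (β-substitution)...
(+ 3 4)             ; ...and arithmetic finishes the job
7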

This use of “is” conflicted with intuitions we had in the group. It seems paradoxical to say that the name of something “is the same as” the thing itself. Of course, it is reasonable in the context of evaluating Scheme code. Whenever the code is run, it will be evaluated, so there is a sense in which the code simply is what it evaluates to. This sense of “is” also makes sense in an intellectual culture dominated by mathematics. In everyday algebra, there is no real distinction between equality and identity.

$$3 + 2$$

really is

$$1 + 4$$

Isn’t it? They are equal. Who cares how they are written?1 Seeing this “is”, of course, can take some work. How many people can really remember why $a^2$ really is $b^2 + c^2$ in a right triangle?

In everyday life, we are quite happy to “dereference” or “substitute” names for the things themselves. When I ask you to “pass the pepper,” I’m quite happy when you hand me the pepper grinder. You ask, “Is this what you wanted?” I reply, “Yes, it is!”

But nonetheless there is something alarming in being told that two things “are” one another when you haven’t internalised the substitution process that allows you to move between them. And names do have a reality of which we are sometimes reminded. If I ask a Canadian to “pass me the pepper,” and they give me a capsicum, I may be disappointed.

The whole discussion reminded me of a piece by Lewis Carroll, in which a person’s name has a name, which itself has a name, which itself has a name, and so on. I was sure that this infinite regress featured in Gödel, Escher, Bach, but I have tried and failed to find either the Carroll story or the Hofstadter variation on it! Is an intimation of a thing the same as the name of a thing? Or is the intimation the thing? Or is the name the intimation? Can a vague recollection be substituted for a textual authority? Or only for a vague apprehension…?

We recommence next week on frame 24, at the top of page 9.

  1. I’m sure there are varieties of algebra where identity matters—but that is way beyond my knowledge! 

Psst! Psst! Psst!

Today we commenced The Little Learner, the text that will occupy the group for many months to come. We read the Preface and the first page of Chapter 0.

Our discussion focussed mainly on the book’s authorial persona and implied reader. For those of us in the group who have a mainly adversarial attitude towards AI, the book presented a challenge. Isn’t deep learning interesting and fun? Aren’t the algorithms elegant and surprisingly simple? Shouldn’t everyone dive into this fresh and exciting area of research, and learn how to do it?

To invite the reader into the text, Friedman and Mendhekar carefully establish the reader as a novice, and themselves as kind, avuncular teachers. The reader need only know “high-school maths” and have a minimum of “programming experience.” The book proceeds from these foundations in a strict order, to build up from simple pieces the whole complex machinery of modern deep learning.

As some in the group observed, this “novice” reader was already expected to know some terms of art. Concepts such as “problem domain,” “equalities,” “invariants,” “superset” and “subset” were introduced as though they were the general coinage of the realm. Of course, all textbook writers face the problem that their students need to somehow learn the language that even makes it possible to express knowledge of their subject. How can you learn anything about a topic without having the words to describe the topic? But we as a group are intrigued to see precisely who or what these writers assume an interested and relatively ignorant reader to be as the book progresses.

We discussed the possible ideological implications of the book. Is this a book that subtly asserts a “tech-bro” persona? Or does its goofy and academic tone bespeak a different attitude? In our disciplines, we worry endlessly about surveillance capitalism, about the power of tech billionaires, about the algorithmic mediation of human interaction. The writers of The Little Learner sidestep such issues. Deep learning is fun. It’s for categorising cat photos, not for empowering intelligence agencies to more rapidly scan citizens’ text messages. It’s something anyone can do as a hobby, rather than a tool used by rich and powerful people to make themselves richer and more powerful.

Everyone agreed the book is fun, and the topic is interesting. We will see in coming months how we can reconcile the fun with the cultural critique.

Any code written in the sessions can be found in the GitHub repository for this reading.