The uses of error

It’s been a while since we have met, but we have made progress through The Little Learner since our detour into HESPI. We are steadily moving towards the back-propagation algorithm, the fundamental learning algorithm for all artificial neural networks. We will resume our quest next time on page 78.

The modern form of the back-propagation algorithm was popularised for neural networks in the 1980s by David Rumelhart, Geoffrey Hinton and Ronald Williams. Hinton, together with Yoshua Bengio and Yann LeCun, went on to receive the 2018 Turing Award from the Association for Computing Machinery for their work on deep learning. The algorithm is called back-propagation because it propagates the loss back through the neural network. What does this mean? I break it down below.

Why do we need back-propagation?

As is their wont, the authors of The Little Learner do not straightforwardly explain why the back-propagation algorithm is necessary. Instead, they try to reveal this necessity gradually, through their Socratic style.

They begin with a bad learning algorithm: simply pick a parameter of the model, adjust it by an arbitrary amount, and then see if the guess improves. If it improves, do the same thing again. If it doesn’t improve, stop—you’ve reached the best answer you can find. (I covered this algorithm in a previous post.)

There are two big drawbacks to this algorithm:

  1. You can only adjust one parameter at a time. If you are training GPT-4, with hundreds of billions of parameters, this will be an excruciatingly slow process.
  2. The adjustment is arbitrary. There is no guarantee that you are changing the right parameter by the right amount in the right direction (i.e. should that parameter be made bigger, or smaller?).

Remarkably, back-propagation solves both of these issues. It allows you to adjust all the parameters at once, and it allows you to calculate how much you should increase or decrease each parameter to (more or less) ensure that the model actually improves.

So far we are focussing on drawback 2: working out how much to increase or decrease each parameter. It turns out that you can use the loss function to work this out.

Three pieces

The loss is a measurement of how well the network performs: the lower the loss, the better. We have already seen that to train a neural network, you need to define an objective function that the system can use to test itself against the training data. In our reading last session, we found out that you can use this loss to help you improve the network’s parameters.

To do this, we were told, you should first calculate the loss, and then multiply it by a new number, the learning rate. The learning rate is necessary because the loss can be very large. If you naively subtract the loss from a parameter, then you are likely to overcorrect the model, and make it worse. Thus you multiply the loss by a small number, e.g. $0.0099$, and then use this much smaller number to adjust the relevant parameter.
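
In Scheme, the idea looks roughly like this. This is only a minimal sketch of what is described above (nudge is a made-up helper for illustration, not the book's actual revision code):

; scale the loss by the learning rate, then use the scaled value to nudge a parameter
; (nudge is a hypothetical helper, not from The Little Learner)
(define learning-rate 0.0099)

(define nudge
  (lambda (parameter loss)
    (- parameter (* learning-rate loss))))

(nudge 0.0 33.21)
; ≈ -0.3288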

When you take the loss, and use it to adjust the parameters of the model, this is called propagation. At the moment, we are working with an extremely simple model, with only two parameters in a single layer. Later, when we encounter more complex models that have many layers of parameters, this idea of propagation will make more sense. First you adjust the last layer, then the second last layer, then the third last layer, and so on. Because this process begins in the final layer of the network, and moves back towards the first layer, it is called back-propagation.

Of course this implies that there is something called “forward-propagation,” and indeed there is such a thing. This is when you push data into the model to produce the output. The data begins in the first layer, then the output of the first layer is passed to the second layer, and so on, until it reaches the output layer.

Thus during training, there is a loop of forward- and back-propagation. First a batch of training data is put into the model. It is propagated forward to produce the output, which is then evaluated using the objective function to get the loss. During back-propagation, the loss is multiplied by the learning rate, and then fed back through the network to adjust all the weights. This is repeated over and over until the network reaches a desired state.

How humans learn

The Little Learner is not only a book about how machines learn, but a book about how humans learn. We have discussed the book’s dialogic style and use of Scheme/LISP many times in the Anticodians, and it is safe to say that the authors’ view of human learning is controversial. The discussion proceeds logically, and the student’s ignorance is (for the most part) modelled by the voice in the second column. But the authors’ decision to avoid all foreshadowing (or to use the wanky word, prolepsis) does sometimes make the presentation a bit confusing. We are always heading in a direction, but they never tell us what that direction is.

Perhaps they could have taken some inspiration from artificial neural networks and the back-propagation algorithm. When a neural network is trained via back-propagation, it is given all the answers in advance, and then is allowed to find a path towards the answers in small incremental steps. The Little Learner might be easier for beginners if they too are provided a picture of the answer before commencing their incremental journey towards it!

ChatGPT sees the herbarium

This week we paused our reading of The Little Learner to entertain a guest: Dr Robert Turnbull, a Senior Research Data Specialist at the Melbourne Data Analytics Platform. He talked us through the HESPI project, which discovers structured data in handwritten herbarium specimen sheets. We then read through llm.py, the part of the software which interfaces with an online LLM to perform data cleaning at the end of the pipeline.

We resume The Little Learner on page 67 in our next session.

AI for GLAM

In The Little Learner, we are reading the code for building neural networks (the technology underlying LLMs) from the ground up. We have been learning about basic data structures and algorithms, such as tensors, linear functions and loss functions, which we will eventually assemble like Lego bricks into an Artificial Neural Network. This week we examined an LLM from the top down. Once you have an LLM available, what use is it to critical and historical scholars like ourselves?

HESPI solves a problem that is common for Galleries, Libraries, Archives and Museums (GLAM). GLAM institutions often possess an enormous amount of data, but in a hard-to-use form. In this case, the data is in the form of botanical specimen sheets. A specimen sheet includes a dried specimen of a plant, and some handwritten information about the specimen, such as the scientific name and the location where it was collected. It is easy to photograph a specimen sheet, or to allow a person to look at it in a cabinet, but what if you wanted to estimate the historic range of a plant? What if you wanted to compare the collections of multiple museums? What if you wanted to validate a climate model against the location data available in the sheets? Such analysis is extremely difficult if the sheets are only available as photographs or physical objects.

HESPI discovers structured data in the specimen sheets. It segments each part of the photograph, distinguishing the specimen, the data card and the colour swatch. It interprets the text on the sheet, and organises it for entry into a database. As part of this process, an LLM plays a crucial role. The specimen sheet first passes through a number of specialised subsystems that discover certain parts of the data. At the final stage, the data is passed as a prompt to the LLM, which attempts to remove any errors that have crept in from earlier stages in the pipeline. It cleans up typos, OCR glitches and so on, and according to Rob and the HESPI team, improves the overall accuracy of the system by about 10 percentage points.

We are still on the crest of an AI hype wave, and the CEOs of self-proclaimed “AI Companies” continue to tout their products as revolutionary devices that make computers more creative and conversational. HESPI presents a more realistic model of what LLMs can do: boring, routine writing tasks that require more-or-less mechanical manipulation of text. When used in this way, LLMs are genuinely useful, and can enable cultural institutions to achieve projects that were hitherto out of reach.

Describing chats in code

In our close reading, we focussed on llm.py, the part of HESPI that hooks into an online LLM for the final data-cleaning step. The main purpose of the file is to construct a prompt that describes the data-cleaning task for an individual specimen sheet, and then send this prompt to either Claude or ChatGPT for data cleaning.

First we need the building-blocks for a prompt. These are imported at the top of the file:

from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain.prompts import ChatPromptTemplate

langchain is a Python package that provides classes for interacting with online chatbots. Each class represents an important part of a prompt. In this case, the HumanMessage will be the question that HESPI asks the system: this is analogous to the prompt that you might type into the online chat interface of ChatGPT. The AIMessage is the answer given by the system itself: this is analogous to the answers you see when you ‘chat’ with ChatGPT online. We will see shortly that HESPI actually tells the AI how to begin its response, in order to discipline its output. The SystemMessage is a hidden message which the chatbot considers before interpreting the HumanMessage or generating its own AIMessage. Typically, chatbots are designed to give extra weight to SystemMessages, so that developers of AI systems have the power to control the output of the model, without necessarily knowing what users will type. You are generally unable to provide such system messages in the public interface for ChatGPT and similar services. The ChatPromptTemplate is the glue that holds the different kinds of message together. Essentially, llm.py receives the output from the prior stages of HESPI, generates a HumanMessage, AIMessage and SystemMessage for the current specimen sheet, and then combines them into a single ChatPromptTemplate that is sent to the relevant chatbot.

Which chatbot, I hear you ask?

from langchain.chat_models.base import BaseChatModel
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

The BaseChatModel is an “abstract class,” or a general template for any LLM that you wish to use with langchain. At present, HESPI supports two concrete models: ChatOpenAI (i.e. ChatGPT) and ChatAnthropic (i.e. Anthropic’s Claude models, such as Sonnet). As we discussed in the group, this is the first time we have seen brand names in a source file. This is not unusual: it is common enough for source files to contain information about the company that owns them. It is also common for code that interacts with third-party software to include other companies’ brands in it: there are thousands of packages out there with google, SQLite, MySQL, excel or aws scattered among the function names and class definitions. The presence of brand names in HESPI does raise the question, however: as more tasks are offloaded to cloud-based language models such as Claude and GPT-4, how much more frequent will corporate vocabulary become in source code? How will this affect programmers’ relationship to the code they write? Perhaps not at all.

This is the key function of llm.py, which provides the main “entry point” for the script:

def build_template(institutional_label_image:Path, detection_results:dict) -> ChatPromptTemplate:

The function takes two inputs:

  • institutional_label_image: The image file of the “institutional label” on the specimen sheet, which is the textual part of the document with the species name, etc.
  • detection_results: A Python dict containing the output from the prior steps in the HESPI pipeline. This is the data that the LLM will try to clean up, by inspecting the image of the institutional label for itself, and scanning through the language of the detection results.

The output of the function is a ChatPromptTemplate, a composite object containing the various kinds of messages described above. This template can be sent to either OpenAI’s or Anthropic’s system, and the corrections will be returned in a (hopefully) well-structured and usable form.

Prompt-English as a programming language

At the heart of the build_template function is a Python f-string that collects information from HESPI and massages it into a textual prompt for the LLM. This then forms the HumanMessage for the system.

main_prompt = f"""
        We have a pipeline for automatically reading the institutional labels and extracting the following fields:\n{', '.join(label_fields)}.
        
        You need to inspect an image and see if the fields have been extracted correctly. 
        If there are errors, then print out the field name with a colon and then the correct value. Each correction is on a new line.
        If the values provided are correct, then don't output anything for that field.
        When you are finished, print out 5 hyphens '-----' to indicate the end of the text.

        For example, if the 'genus' field and the 'species' field were extracted incorrectly, then you would print:
        genus: Abies
        species: alba
        -----

        Here are the following fields that we have extracted from the institutional label:
        {values_string}

        {ocr_results}

        Here is the image of the institutional label:
    """

The three parts in curly braces ({}) contain data that is spliced into the prompt. These are specific to the specimen sheet whose data is to be corrected. The rest of the prompt is fixed, and is fed to the LLM every time.

The companies that own LLMs often present them as “natural,” “intuitive” and “conversational.” But it is apparent from this example that they are nothing of the sort. In order to get an LLM to provide sensible answers to a question, you need to craft the prompt to carefully and repeatedly instruct it what to do. In this case, the relatively simple instructions are given in extremely clear detail. First the task is described in words, e.g. "If there are errors, then print out the field name with a colon and then the correct value. Each correction is on a new line.". Then exactly this instruction is shown in an explicit example: "genus: Abies". Clearly some effort went into the engineering of this prompt, to ensure that the LLM gave sensible answers. In addition, a SystemMessage is provided to determine the LLM’s output:

system_message = SystemMessage("You are an expert curator of a herbarium with vast knowledge of plant species.")

And the human programmer also begins the AI’s response for it:

ai_message = AIMessage("Certainly, here are the corrections:")

Not only this, but these prompts are provided to the LLM every single time. If 1,000 specimen sheets are corrected, then the LLM is instructed 1,000 times that it is an “expert curator.” It is instructed 1,000 times to put each correction on a new line, with a colon in between the field and the corrected value. It is instructed 1,000 times to begin its answer with the phrase “Certainly, here are the corrections:”. If it corrects 10,000 specimen sheets, it receives these instructions 10,000 times. If it corrects 100,000 specimen sheets, it receives these instructions 100,000 times.

This activity of interacting with a chatbot is nothing like conversing with a research assistant. What it is like is programming a computer. The “English” used in the prompt is not spoken or written, but engineered. It is written in a clear, technical way to elicit a predictable response from a mechanical system. What LLMs have started to enable is the old dream of “natural language programming,” where programmers can become (partly) free of the strictures of formal language. What LLMs have not even come close to doing is replacing programming as a human activity: it is still necessary for the human to design the program, to determine the requirements, to structure the system—even if they write some English as they do so.

Are abstractions high or low?

As an aside, we considered how abstraction is represented in source code. The BaseChatModel is the most abstract kind of chat model in langchain. All chat models are instances of BaseChatModel. Here abstraction is metaphorically low. The more abstract something is, the more basic it is. There is a different metaphor for abstraction in ChatPromptTemplate. A Template is more abstract than a simple Prompt, because all Prompts fit into the template. In this case, abstraction is metaphorically empty. The more abstract something is, the less filled it is with content.

These are not the only metaphors for abstraction. As we discussed in the group, abstraction can sometimes also be cloudy or high. The metaphors of emptiness and lowness don’t really sit well together: the lower and more basic something is, the closer it is to the solid, filled-in, all-too-real ground.

Abstraction is a fundamental concept of computer science, and of the art of programming. Even in relatively straightforward, practical code like llm.py, programmers are obliged to grapple with what abstraction is, and express ideas about its nature, even if they don’t mean to.

How to make a guess

Today we read up to Chapter 3, Frame 37, on page 67 of The Little Learner. You can find the code for the session in the GitHub repo.

Three steps to learning

So far we have only seen the data structures for deep learning. Today, we started to see how to apply learning algorithms to the data.

At a high level, we found that deep learning involves three steps:

  1. Guess the function parameters
  2. Measure the ‘loss’
  3. Improve the guess

You keep doing 1., 2. and 3. until finally the ‘loss’ is nearly $0.0$. At this point, further improvement becomes impossible, and you stop.

What a guess looks like

The first step is to “guess the function parameters.” How do we do this?

For the sake of this chapter, a guess is a list of parameters. Let’s say we are trying to fit a straight line to some data. A straight line can be represented by the following function:

$$ f(x) = wx + b $$

In a machine learning situation, we already know $x$ and $y$ for a large number of entities. What we don’t know is what $w$ and $b$ should be. So, let’s make a guess! We can put $w$ and $b$ together into a list, and call that list $\theta$. In the book, the initial guess was $w = 0.0$ and $b = 0.0$. That results in the following $\theta$:

$$ \theta = (0.0, 0.0) $$

or, in Scheme/Racket:

(define theta (list 0.0 0.0))

That’s it—a guess is a list of numbers. Now the line function is very simple. There are only two parameters, $w$ and $b$. A more complex function may have many parameters, and those parameters may have a very complex structure. In the future, theta may be a very complex variable—a whole collection of tensors all linked together. But that is, of course, the future.
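
For reference, the book’s line function looks roughly like this (paraphrased from memory, so the notation in the text may differ slightly), where ref picks an element out of the parameter list:

; a parameterised line: given an x-value, return a function of theta
; that computes wx + b, with w = (ref theta 0) and b = (ref theta 1)
(define line
  (lambda (x)
    (lambda (theta)
      (+ (* (ref theta 0) x)
         (ref theta 1)))))

; e.g. ((line 2.0) (list 0.0 0.0)) evaluates to 0.0, since w and b are both zero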

How to calculate the loss

So far we have encountered just one “loss function,” which allows us to measure how good our latest guess is. The loss function works as follows:

  1. Use the current guess for the parameters to produce some predicted y-values for the x-values in the training set
  2. Subtract these predicted y-values from the real y-values that we already know
  3. Square the differences
  4. Take the sum of the squares

This is called the “l2 loss”. The “2” comes from the fact that we square the errors, which is the same as raising them to the power of 2.

The l2-loss is defined as follows in the text:

(define l2-loss
  (lambda (target) ; the function we are trying to 'learn'
    (lambda (xs ys) ; the data: the x-values and y-values we have
      (lambda (theta) ; our current guess for the function parameters
        (let ((pred-ys ((target xs) theta)))
          (sum
           (sqr
            (- ys pred-ys))))))))

Since the target function and xs and ys will stay the same throughout the whole training process, we can fix these in place, producing an “objective function.” The objective function already knows which function we would like to learn parameters for. It already knows what the x-values and y-values are in the training data. It is just waiting for some guesses that it can try out. In the session, we tried the following guesses for $\theta$, using line as the target function, and some simple points as the xs and ys:
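
(To make the snippet below self-contained: line-xs and line-ys are the data points from the book, which reproduce the losses shown here, and expectant-function is what you get by handing the line target to l2-loss. The exact names we used in the session may have differed.)

; the book's data points
(define line-xs (tensor 2.0 1.0 4.0 3.0))
(define line-ys (tensor 1.8 1.2 4.2 3.3))

; fixing the target function produces the "expectant function"
(define expectant-function (l2-loss line))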

; define objective function
(define objective-function (expectant-function line-xs line-ys))

; try first guess
(objective-function (list 0.0 0.0))
; = 33.21

; try second guess
(objective-function (list 0.0099 0.0))
; = 32.5892403

We could keep on trying the algorithm of increasing $\theta_0$ by $0.0099$:

(objective-function (list 0.0198 0.0))
; = 31.9743612

(objective-function (list 0.0297 0.0))
; = 31.3653627

If you perform this operation 106 times, you get the best possible answer:

(objective-function (list (* 106 0.0099) 0.0))
; = 0.13501079999999996

If you keep going, the loss starts to go up again. That is, the guesses get worse:

(objective-function (list (* 107 0.0099) 0.0))
; = 0.13759470000000001

As the authors of The Little Learner would say, after 107 iterations, we have “overshot” the best possible answer.

How to improve the guess

Clearly, just increasing $\theta_0$ by $0.0099$ is not a great algorithm for learning the function. It takes many iterations to learn an extremely simple function. Looking at the graph on page 58, the correct function is clearly something very close to $y = 1.0x + 0.0$. Shouldn’t this be pretty easy to find? And what if we need to learn $b$ as well as $w$? This algorithm assumes that $b = 0$!

Well, we’ve not seen a better solution yet, but there are many chapters to go, and many layers left in this onion. Assuredly by the end of all this, we will know the arcana of the field.

First interlude

Next meeting, we take a break from The Little Learner and will look at the LLM code for HESPI, a fascinating AI-for-GLAM project.

Code recitation as ASMR

The blog has been irregular of late due to my teaching schedule. But we are ploughing on through The Little Learner. This week we reached the end of “Interlude I: The More We Extend, the Less Tensor We Get.”

The gist

Interlude I concludes our basic introduction to the tensor, the fundamental data structure of deep learning. Chapter 2 covered how to create tensors, and how to use them with the line function to represent parameterised linear equations. “Interlude I” shows how to do arithmetic with tensors.

The joke in the title—“The More We Extend, the Less Tensor We Get”—has to do with the way a tensor’s rank changes when you perform arithmetic on it. The basic arithmetical operations, such as + and /, are defined only for scalar values such as 3 and 708.4. To compute the “sum” or “quotient” of a tensor or tensors, the concepts of “sum” and “quotient” need to be extended. That’s the first part of the title—“The More We Extend.” The second part of the title refers to the way that a tensor’s rank is affected by arithmetic. For example, if you take the sum of a tensor, its rank is reduced by one. Consider the following example:

$$ sum( \begin{Bmatrix} 3 & 22 \\ 6 & 5 \\ 9 & 4 \end{Bmatrix} ) $$

or, in Scheme (using malt):

(sum (tensor (tensor 3 22)
             (tensor 6 5)
             (tensor 9 4)))

This is a tensor². It has rank 2. Intuitively, it is two-dimensional. The two dimensions are the rows and columns. You can see the number of dimensions in the Scheme/Racket code, because there are two layers of tensor: an outer tensor, and the three inner tensors.

When you compute the sum of this tensor, the rank drops by one, and you get a tensor¹ as the answer:

$$ \begin{Bmatrix} 25 \\ 11 \\ 13 \end{Bmatrix} $$

(tensor 25 11 13) ; the answer

If you take the sum of this tensor, the rank falls by 1 again, and you get a tensor⁰, that is, a scalar:

$$ sum( \begin{Bmatrix} 25 \\ 11 \\ 13 \end{Bmatrix} ) = 49 $$

(sum (tensor 25 11 13))
; =
49

The more we extend—the more “extended” arithmetical operations we apply to a tensor—the less tensor we get: the lower the tensor’s rank becomes.

ASMR

As a group, we have been slowly working out how to actually recite the code in the book. It is remarkably difficult to read code in a natural way. One of the main problems is how literally to read the symbols. For example, try to read this out in English:

; a tensor-3
(tensor (tensor (tensor 3 4)
                (tensor 7 8))
        (tensor (tensor 21 9)
                (tensor 601 1)))

One way is to read the symbols very literally:

Open parentheses tensor. Open parentheses tensor. Open parentheses tensor, 3, 4, close parentheses. Open parentheses tensor, 7, 8, close parentheses. Close parentheses. Open parentheses tensor. Open parentheses tensor, 21, 9, close parentheses. Open parentheses tensor, 601, 1, close parentheses. Close parentheses. Close parentheses.

The problems with this approach are obvious. If you managed to read that blockquote without your eyes glazing over, well done! Your powers of concentration are supreme. The approach that we have been nutting out as a group involves interpreting the symbols a bit more:

A tensor³, consisting of two tensors², the first tensor² containing the tensors¹ 3-4 and 7-8. The second tensor² contains the tensors¹ 21-9 and 601-1.

This form of recitation is superior for human comprehension. Why are there all those parentheses, words and digits? Why, to create a tensor³, consisting of two tensors² … etc. When we read this way, we are essentially “evaluating” the code in the way that the Scheme interpreter evaluates it. We work out what data structure is represented by all the parentheses and other symbols, and try to represent that data structure in a way that works with our hardware—our minds!

One waggish member of the group suggested that code-reading is a kind of ASMR. It is indeed hypnotic to read out so many of these tensors. When we recite a “same-as chart,” which shows the evaluation of a complex expression, the effect is similarly hypnotic. At the beginning of every line of a “same-as chart,” the reader says “which is the same as.” It is not the most intrinsically poetic refrain, but its repetition a dozen times in the space of a few minutes does lull the mind.

Does this “ASMR” quality tell us something about the nature of code as a form of writing? I think it might, but I’m not sure exactly what it tells us. I wonder if there is some useful analogy to be made between code and minimalism in art and music. Like minimalist art and music, code is repetitious, and it is this repetition more than anything that creates the ASMR effect noted by the group.

Where are the ANNs?

One member of the group expressed some—mild and good-natured—frustration at our progress through the book. “I don’t feel like I’ve learned anything about deep learning yet.”

I can understand this point of view, even though it is “wrong” in an obvious way. Another member of the group observed that she has already learned many things about deep learning that she wished she had known earlier. Learning about the data structures used to build artificial neural networks (ANNs) is clearly important, if you want to learn how ANNs are built!

But it is easy to see why a reader at this point in the book might be frustrated, and not feel that they have learned anything yet about ANNs. I think there are two good reasons for this feeling:

  1. The dialogic style of the book does not lend itself to summary and signposting. The authors do not explain why they are introducing certain topics at certain times. The book tries to maintain the Socratic framework in which the teacher (left column) has the entire subject mapped out in advance, and the student (right column) is just happy to follow the reasoning step-by-step without trying to glance at what’s coming next. This is an interesting decision on the authors’ part. Why not just have their student character ask the teacher from time to time why a given topic is important to the overall progress of the book?
  2. So far we have learned about data structures only. The line function allows a linear equation to be represented in a parameterised form, so its parameters can be learned. The tensor structure allows the parameters for many interlinked line functions to be stored in a convenient way, so that lines can be built up into complex networks that model the relationships between observed values in the training data. While data structures are very important, they somehow seem less essential than algorithms in computing. The authors decided it was important to learn the data structures before the algorithms, perhaps because it is impossible to demonstrate the algorithms without having data for the algorithms to work on. This decision to teach the data structures first makes good sense, but it is bound to cause frustration in certain readers, when they expect to be shown the algorithms that make an ANN work.

The Little Learner is a great book. We are making good progress through it, and it is a lot of fun to read aloud. It is an enlivening challenge to learn how to recite the code. It is an enlivening challenge to try to see the bigger picture as it is gradually sketched out piece-by-piece. Next week we do learn an algorithm, one of the most important algorithms of our time: gradient descent. We will have to see where that slippery slope lands us.

Coding in Many Dimensions

This week we read up to page 41 of The Little Learner, and “learned” a bit about the tensor, the “fundamental data structure in deep learning”.

Kurzgesagt

In a nutshell, a tensor is a multidimensional array. You can have one-dimensional tensors (also called vectors):

$$ \begin{Bmatrix} 1.0 & 27.1 & x^2 & -0.7
\end{Bmatrix} $$

You can have two-dimensional tensors (also called matrices):

$$ \begin{Bmatrix} 42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix} $$

You can have three-dimensional tensors. This tensor could represent a (very small) bitmap image, with four pixels, each of which has a colour specified by three values for red, green and blue:

$$ \begin{Bmatrix} \begin{Bmatrix}7 \\ 3 \\ 255\end{Bmatrix} & \begin{Bmatrix}62 \\ 107 \\ 7\end{Bmatrix} \\ \begin{Bmatrix}200 \\ 200 \\ 100\end{Bmatrix} & \begin{Bmatrix}62 \\ 34 \\ 254\end{Bmatrix}
\end{Bmatrix} $$

Each of the numbers in a tensor is called a scalar. Presumably this name comes about because of the way scalar multiplication works. If you multiply a tensor by a single number, then you simply multiply all the members of the tensor by that number:

$$ \begin{Bmatrix} 42 & 42 & 42 \\ 42 & y & 42 \\ 42 & 42 & 42
\end{Bmatrix} \times 6 = \begin{Bmatrix} 42 \times 6 & 42 \times 6 & 42 \times 6 \\ 42 \times 6 & y \times 6 & 42 \times 6 \\ 42 \times 6& 42 \times 6 & 42 \times 6
\end{Bmatrix} $$

Thus the number $6$ “scales” the tensor, and can be called a scalar.
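
In malt, the ordinary arithmetic operators are extended to work on tensors (that extension is the subject of “Interlude I”), so this scaling can be written directly. A small sketch, with the placeholder $y$ from the example above replaced by a plain number:

; scaling a whole tensor by the scalar 6 with malt's extended *
(* 6 (tensor (tensor 42 42 42)
             (tensor 42 42 42)
             (tensor 42 42 42)))
; = (tensor (tensor 252 252 252)
;           (tensor 252 252 252)
;           (tensor 252 252 252))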

(I have just read the relevant Wikipedia page to verify my conjecture about the etymology of “scalar,” and I’m not totally wrong. The term originally derives from the Latin for “ladder.”)

(The derivation of terms in linear algebra is always interesting. As mentioned above, a two-dimensional tensor is normally called a “matrix.” Believe it or not, this word derives from the Latin for “womb”!)

Scheme doesn’t natively support tensors. It only natively supports one-dimensional arrays. The malt package devised for The Little Learner adds in support for tensor operations. The key rule of tensors is that they need to have a consistent shape. This is not a valid tensor, for example:

$$ \begin{Bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & & 9 \\ 10 & & \end{Bmatrix} $$

The “second” or “outer” dimension is okay: there are three columns. But the “first” or “inner” dimension is no good: each column has a different number of rows:

$$ \begin{Bmatrix} 1 \\ 4 \\ 7 \\ 10 \end{Bmatrix} \begin{Bmatrix} 2 \\ 5 \end{Bmatrix} \begin{Bmatrix} 3 \\ 6 \\ 9 \end{Bmatrix} $$

The malt package enforces this rule. If you try to create this faulty tensor with mismatched columns, you will get an error:

(tensor (tensor 1 4 7 10)
        (tensor 2 5)
        (tensor 3 6 9))

; tensor: Mismatched shapes: (#(1 4 7 10) #(2 5) #(3 6 9))

Some functions

We learned a few useful malt functions in the first part of the chapter.

  • scalar?: is this object a scalar?
  • tensor?: is this object a tensor?
  • shape: what is the shape of this tensor? How many rows, columns, aisles etc. does it have? What is the length of each of its dimensions? For example, if you had a full-colour bitmap image in 1800 x 1169 resolution, and asked for (shape my-big-image), you would get '(1800 1169 3) as the answer: 1800 columns, 1169 rows and 3 colour values.
  • tlen: what is the size of the outer dimension of this tensor? E.g. if you have a matrix with 6 columns and 3 rows, the tlen of the matrix will be 6. You can use tlen to implement the shape function.
  • rank: how many dimensions a tensor has. For instance, your typical spreadsheet is a two-dimensional tensor, made up of rows and columns. If you had your spreadsheet inside a Racket program, and typed (rank my-spreadsheet), out would pop the answer 2.
  • tref: how to get items out of a tensor. For instance, if you wanted the third column of a tensor called my-spreadsheet, you could write (tref my-spreadsheet 2).
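
Here is a small worked example pulling these functions together (a sketch, with the results shown as comments, assuming malt behaves as described above):

; a small tensor to poke at: two inner tensors of three scalars each
(define my-tensor (tensor (tensor 1 2 3)
                          (tensor 4 5 6)))

(scalar? 6)          ; #t
(tensor? my-tensor)  ; #t
(shape my-tensor)    ; '(2 3)
(tlen my-tensor)     ; 2 -- the size of the outer dimension
(rank my-tensor)     ; 2 -- two dimensions
(tref my-tensor 0)   ; (tensor 1 2 3) -- the first element of the outer dimension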

Our upward course…

Sometimes scalars and tensors can be hard to tell apart. For example, here is a scalar:

$$ 6 $$

In Scheme/Racket, that would be:

6

Here is a one-tensor:

$$ \begin{Bmatrix}6\end{Bmatrix} $$

Or in Scheme/Racket:

(tensor 6)

Now here is a two-tensor, or matrix, with one row and one column:

$$ \begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor 6))

And a three-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor 6)))

And a four-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor (tensor 6))))

And a five-tensor…

$$ \begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}\begin{Bmatrix}6\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix}\end{Bmatrix} $$

(tensor (tensor (tensor (tensor (tensor 6)))))
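
One way to tell all of these apart is to ask for their rank and shape (a brief sketch, with results as comments, assuming rank and shape handle scalars the way the book describes):

(rank 6)                              ; 0 -- a scalar
(rank (tensor 6))                     ; 1
(rank (tensor (tensor (tensor 6))))   ; 3
(shape (tensor (tensor (tensor 6))))  ; '(1 1 1)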

What is a dimension anyway? We may leave the last words to A. Square:

And once there, shall we stay our upward course? In that blessed region of Four Dimensions, shall we linger on the threshold of the Fifth, and not enter therein? Ah, no! Let us rather resolve that our ambition shall soar with our corporal ascent. Then, yielding to our intellectual onset, the gates of the Sixth Dimension shall fly open; after that a Seventh, and then an Eighth— How long I should have continued I know not. In vain did the Sphere, in his voice of thunder, reiterate his command of silence, and threaten me with the direst penalties if I persisted. Nothing could stem the flood of my ecstatic aspirations.