Everything you ever wanted to know about LLMs but were afraid to ask

Introduction

There is certainly an intimidating aura around AI and LLMs, with many viewing tools like ChatGPT as a magical black box that even PhD-wielding computational linguists are unable to crack.

I am here to dispel the notion that you need an advanced degree–or even any degree–to understand roughly how this in-vogue technology works.

That may sound weird coming from me, someone who actually has an advanced degree in the related field of Philosophy of Language, who has studied disciplines like Linguistics, Psycholinguistics (where neural node models of language understanding originated), and Mathematical Logic, and who has even dabbled in natural language processing projects using Python’s NLTK module (for example, using WordNet to identify semantically anomalous instances of words in text in order to identify metaphors and determine their meaning).

But it shouldn’t sound too weird, because I will be the first to admit that over the past few years I have–probably because of latent bitterness at never having made it in the language industry–slept on learning about this technology in favor of working on other projects. But the curiosity–and guilt at having ignored something that is both core to my interests and at the center of the most important technological revolution in human history–has mounted to a point that I can no longer ignore it.

I never bothered to learn how a television works. Even as a child I always felt guilt about this. Before me sits the television, a seemingly magical thing core to my identity and lifestyle which has brought me immense joy and profoundly shaped my personality–and it’s a literal black box. I doubt I’ll ever bother learning about the TV, but I can no longer delay learning about LLMs. Let’s dive in.

Tokenization

Let’s suppose we feed a natural language processing system some bit of language (for example, a sentence).

“The dog barked”

(sentence)

The first thing the system must do is break the input up into its constituent parts, its atomic elements. This process is called tokenization and the individual elements are called tokens.

We naturally think of these atomic elements as the individual words within the sentence, but there are sub-lexical units (for example, prefixes and suffixes, and conjugation and declension morphemes) and non-lexical units (for example, punctuation and white space) as well. For the sake of simplicity, we will pretend that all atomic elements are words–just remember that there’s a little bit more to the story than that.

Alright, so we input our sentence into the system and it breaks it up into a list of the elements.

“The dog barked” → [“the”, “dog”, “barked”]

(sentence → tokens)

A simple tokenization algorithm might just split a piece of text on whitespace.
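
Here is what that naive approach looks like in Python (just a toy sketch; real systems like ChatGPT use subword schemes such as byte pair encoding, which is how the sub-lexical units mentioned above get handled):

def tokenize(text: str) -> list[str]:
    # Naive tokenizer: lowercase the text and split on whitespace.
    # Real tokenizers use subword vocabularies and handle punctuation.
    return text.lower().split()

print(tokenize("The dog barked"))  # ['the', 'dog', 'barked']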

Wow, NLP is a breeze so far!

You might think at this point that it is important for the system to do syntactic classification on the tokens, that is, to identify the so-called part of speech of each token in the context of the sentence: for example, tagging “the” as a definite article, “dog” as a noun, and “barked” as a verb. Indeed, this is a fun problem, and old-school NLP toolkits like Python’s NLTK have cool tools for it, but as far as I can tell, systems like ChatGPT perform no explicit syntactic classification step. Syntax–which is essentially one half of classical linguistics–can be ignored wholesale. Sorry, syntax lovers.
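
(If you're curious what that kind of tagging looks like, NLTK can do it in a couple of lines. The downloads below fetch NLTK's tokenizer and tagger data the first time you run this, and the exact resource names can vary a bit between NLTK versions.)

import nltk

# One-time downloads of NLTK's tokenizer and part-of-speech tagger data.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("The dog barked")
print(nltk.pos_tag(tokens))
# Something like [('The', 'DT'), ('dog', 'NN'), ('barked', 'VBD')]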

Static embeddings

The good news for syntax lovers is that they are not alone. The other half of linguistics is called semantics—it concerns itself with meaning—and it too can largely be ignored. I’ll explain this below.

Our system now has our input sentence tokenized. What the hell happens next?

Well, algorithms prefer working with numbers rather than words. Numbers have fun properties like being greater than, less than, or equal to one another. So the system converts the tokens to vectors, which are just ordered lists of numbers. These vectors are called embeddings.

Don’t get scared by the language; it’s just fancy mumbo jumbo. What’s important to know is that these vectors can be as long or as short as the engineer who built the system wants, but generally a longer vector means a more powerful system. The length of the vector is called the dimension of the vector. Different systems use different sized vectors (for example 512, 1536, or 3072), and the size will usually be divisible by 8 because, well, computers.

How does the system convert the tokens to the vectors? Well, the system has a list of all possible tokens, called the vocabulary, and an associated vector for each token, called the static embedding of that token.

For example, somewhere in the system’s memory you have a lookup table like this:

Vocabulary Static Embedding

“the” <0.1, 0.8, 0.2, 3.7, 0.3>

“a” <0.4, 5.1, 0.2, 0.5, 0.3>

“dog” <0.2, 0.1, 0.2, 0.5, 0.8>

“cat” <0.4, 0.9, 1.2, 0.5, 8.8>

“barks” <5.5, 2.1, 0.2, 0.0, 1.3>

“meows” <0.1, 0.1, 0.2, 6.0, 1.3>

“bat” <0.3, 0.6, 1.9, 0.4, 0.8>

You can think of these static embeddings as the “default meaning” of each token.
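
In code, that lookup table is literally just a dictionary. Here is a toy version using the made-up five-dimensional vectors from the table above:

# Toy vocabulary of static embeddings (the numbers are made up).
static_embeddings = {
    "the":   [0.1, 0.8, 0.2, 3.7, 0.3],
    "a":     [0.4, 5.1, 0.2, 0.5, 0.3],
    "dog":   [0.2, 0.1, 0.2, 0.5, 0.8],
    "cat":   [0.4, 0.9, 1.2, 0.5, 8.8],
    "barks": [5.5, 2.1, 0.2, 0.0, 1.3],
    "meows": [0.1, 0.1, 0.2, 6.0, 1.3],
    "bat":   [0.3, 0.6, 1.9, 0.4, 0.8],
}

def embed(tokens):
    # Embedding a token is just a lookup (real systems also have a way of
    # handling tokens that are not in the vocabulary).
    return [static_embeddings[token] for token in tokens]

print(embed(["the", "dog"]))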

What happens when a model is “trained” is that lots of data is analyzed and used to refine the numbers in these vectors such that words with similar meanings end up being close to each other in the vector space. Just like in high school math class where you learned about the lengths of line segments on the Cartesian coordinate plane (for example, what is the length of the line segment formed by the points (0, 0) and (3, 4)?), there is an analogous notion of distance between points in the N-dimensional vector space used by the model.

If vectors represent meaning, we’d expect the distance between the vector representing “dog” and the vector representing “cat” (both common household animals) to be shorter than the distance between the vector representing “chair” and the vector representing “joy” (a piece of furniture and an emotion).
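
To make the idea of distance concrete, here is the high-school formula generalized to N dimensions, applied to some made-up vectors (real systems often measure closeness with cosine similarity instead, but the intuition is the same):

import math

def distance(v, w):
    # Euclidean distance, generalized from 2 dimensions to N dimensions.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w)))

print(distance((0, 0), (3, 4)))  # 5.0, the classroom example

# Made-up 3-dimensional "meanings", just to show the expected pattern:
dog, cat = (0.2, 0.1, 0.8), (0.4, 0.9, 0.9)
chair, joy = (7.0, 0.3, 0.1), (0.2, 6.5, 9.0)
print(distance(dog, cat) < distance(chair, joy))  # True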

If vectors represent meaning, why did I claim that semantics gets ignored by LLMs? Only because the conventional picture of meaning coming out of linguistics and philosophy of language does not conceptualize meaning like this. For LLMs, meaning is something like a vector of weights in an N-dimensional space, with no actual interpretation of what each of those dimensions represents, nor even an answer to the question of how many dimensions is correct or best.

Rather, the classical picture of meaning thinks of meaning as something like a mapping to objects or properties in the world, or to concepts or semantic primitives in the mind. Further, classical semantics typically tries to capture and explain the notion of truth, providing a picture of how our words and other representations (for example, images) map onto reality and provide an accurate picture of it. We get none of that with LLMs. In some sense, LLMs are underpinned by a conception of the meaning of a token as its relation to all other tokens across the entirety of language (or across the dataset it was trained on). That’s not a completely revolutionary idea, however, as some philosophers of language, psycholinguists, and academics in related fields proposed similar ideas long before the advent of LLMs. In fact, the approach to NLP underlying systems like ChatGPT is called the “neural approach” precisely because it got its start as a theory of how the human brain actually processes language.

Before we move on I want to underline the fact that the lookup table of tokens and static embeddings is completely fixed. It is not something that changes when a user enters a prompt. You can think of it as existing in the system’s read-only memory (ROM), not its random access memory (RAM). Once a model is trained and deemed sufficient, it is frozen, so to speak, and the vocabulary and static embeddings are locked. They become the starting point for everything else that happens downstream when clients enter prompts and interact with the system.

Transformers and contextualized embeddings

Our system has now converted our input sentence into a list of tokens and mapped each of those tokens to its fixed static embedding in read-only memory. Nothing mysterious has happened yet.

“The dog barked” → [“the”, “dog”, “barked”] → [the_se, dog_se, barked_se]

(sentence → tokens → static embeddings)

The true black magic happens next. Our static embeddings are transformed into context-sensitive embeddings. (The “T” in “GPT” stands for “transformer”, the part of the system that performs this transformation.)

The static embeddings that are looked up in the fixed data store are essentially just “first pass” meanings. They are like entries in a dictionary (in fact, that’s exactly what they are). If you wanted to understand a sentence, you could look up each individual word in a dictionary and string those definitions together, but it usually wouldn’t result in a good interpretation of the sentence.

Rather, when words appear together, depending on their context (for example, their positions relative to one another, the punctuation marks that surround them, etc.), they take on subtly different meanings. For example, the word “bat” could refer to (1) an animal, (2) a piece of baseball equipment, or (3) the physical act of moving one’s hand in a swatting motion. This is the standard example of lexical ambiguity, but even non-obviously ambiguous words take on different meanings in different contexts, as is clear to most people who have reflected on language.

This is where transformers come in. A transformer is essentially just an algorithm that converts static embeddings into context-sensitive embeddings, and it does so for each token by looking at the other tokens around it (this is what we mean by “context-sensitive”).

Transformers use a process called self-attention to contextualize the static embeddings, that is, to convert them into context-sensitive vector representations.
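
Here is a drastically simplified sketch of the core idea: each token's new vector becomes a weighted blend of every token's vector, with the weights based on how relevant the tokens are to one another. Real transformers add learned projection matrices, multiple attention heads, positional information, and many stacked layers on top of this, and the numbers below are made up.

import numpy as np

def toy_self_attention(embeddings):
    # How strongly each token "attends" to every other token, via dot products.
    scores = embeddings @ embeddings.T
    # Softmax each row so the attention weights sum to 1.
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each output row is a weighted average of all the input rows.
    return weights @ embeddings

# Static embeddings for ["the", "dog", "barked"] (made-up numbers).
static = np.array([[0.1, 0.8, 0.2],
                   [0.2, 0.1, 0.8],
                   [0.5, 0.2, 0.1]])

print(toy_self_attention(static))  # each row now reflects its neighbors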

“The dog barked” → [“the”, “dog”, “barked”] → [the_se, dog_se, barked_se] → [the_ce, dog_ce, barked_ce]

(sentence → tokens → static embeddings → contextualized embeddings)

To use our example of “cat”, “dog” and “bat” from earlier, at the static embedding stage, “bat” might not be very close to “cat” and “dog” in the vector space, given that it is an ambiguous token whose meaning only partly or fractionally corresponds to a mammalian animal.

But in a sentence like “The bat flew out of the cave”, after contextualization (AKA transformation), you’d expect the vector representation to be closer to “cat” and “dog”, because it has been transformed into something representing a mammalian animal.

Transformation and the related concept of self-attention seem to me to be some of the more complex aspects of the LLM system design, and I don’t understand them too well yet, so I will save discussion of them for a future post.

What can we do with contextualized embeddings?

Now that we have contextualized embeddings, we’re done.

Okay, not exactly.

But we are basically done, because all of the hard work is done: we have converted some input text into a context-sensitive vector representation of its meaning. Once you have contextualized embeddings, the hard part (understanding) is over.

What follows is about putting that understanding to use: generation, classification, comparing, and reasoning.

Generation is the most common and famous use case. This is what ChatGPT does. Basically, it takes the context-sensitive vector representation of whatever you input and uses statistical analysis to predict the next word (or, more precisely, the next token). It repeats this process over and over, appending one word at a time, until it decides to stop and returns the output.
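
Conceptually, the loop looks something like this. The model object and its next_token_probabilities method are hypothetical stand-ins, and real systems usually sample from the probabilities rather than always picking the single most likely token:

def generate(prompt_tokens, model, max_new_tokens=50):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        # Hypothetical call: returns a probability for every token in the vocabulary.
        probabilities = model.next_token_probabilities(tokens)
        next_token = max(probabilities, key=probabilities.get)  # greedy choice
        if next_token == "<end>":  # assumed special "stop" token
            break
        tokens.append(next_token)
    return tokens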

Classification is even easier. Once the input is converted into a context-sensitive vector representation, a downstream system can analyze that representation and classify it according to sentiment (for example, is this review positive or negative?) or topic (for example, is this article about science, politics, or art?).
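
A toy version of sentiment classification might score the final vector representation against a weight vector learned during training (the numbers here are invented):

def classify_sentiment(embedding):
    # Weights like these would be learned from labeled examples, not hand-written.
    weights = [0.9, -1.2, 0.3, 2.0, -0.5]
    score = sum(w * x for w, x in zip(weights, embedding))
    return "positive" if score > 0 else "negative"

print(classify_sentiment([0.2, 0.1, 0.2, 0.5, 0.8]))  # "positive"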

Comparing involves taking your context-sensitive vector representation and searching for similar ones (for example, find pieces of text similar to those written by Nabokov, or find images similar to this one). Notice the use of “find” in the previous sentence, rather than “create” or “generate”.
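
The simplest version of that kind of search is to embed everything in advance, embed the query, and return whatever stored item is most similar. Here is a sketch with invented titles and numbers, using cosine similarity as the measure of closeness:

import math

def cosine_similarity(v, w):
    # Close to 1.0 means pointing in the same direction; near 0 means unrelated.
    dot = sum(a * b for a, b in zip(v, w))
    norm_v = math.sqrt(sum(a * a for a in v))
    norm_w = math.sqrt(sum(b * b for b in w))
    return dot / (norm_v * norm_w)

# Pretend these are embeddings of texts we stored earlier (numbers made up).
library = {
    "Nabokov passage": [0.9, 0.1, 0.4],
    "Dr. Seuss passage": [0.1, 0.9, 0.2],
}

query = [0.8, 0.2, 0.5]  # embedding of the text we want matches for
print(max(library, key=lambda title: cosine_similarity(query, library[title])))
# "Nabokov passage"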

Reasoning is perhaps the most interesting. Remember when I said earlier that the “neural” approach to language underpinning LLMs contrasts with classical pictures of meaning? Well, the same goes for reasoning. In classical philosophy of language, we model human reasoning using formal logical systems (so-called formal logics). There are vanilla systems like Propositional Logic, First-Order Predicate Logic, and Modal Logic, which you may have learned about if you ever took a symbolic logic course. But there are also exotic systems like Temporal Logic, Fuzzy Logic, Illocutionary Logic, Indexical Logic, and Plural Logic. These systems are predicated upon the so-called “symbolic” approach to language modeling, where human language and reasoning are modeled using well-defined syntaxes and semantics, with explicit mechanisms for defining and determining truth and falsity, along with explicit reasoning rules. There is no such inference engine, nor any notion of truth or falsity, baked into LLMs.

LLMs appear to do a good job at simulating reasoning precisely because the human language they are trained on was generated by minds that actually reason. But if you were to, say, train a model on a bunch of invalid arguments, or on a massive corpus of text that contained no instances of Modus Ponens, the resulting model would not appear to reason very well. The LLM can take the premises “Socrates is a man” and “All men are mortal” and generate “Socrates is mortal” only because this kind of syllogistic pattern can be found very frequently in the corpora it was trained on.

From what I can tell, one of the hot topics in AI research is the question of how to combine the power of the “neural” approach to language that underpins LLMs with the inferential and truth-preserving power of the “symbolic” approach to language that underpins classical formal logical systems. That’s a very interesting question to me.

Conclusion

That about does it. As far as I can tell, what I presented above are the broad strokes of how LLMs work. I am happy to have dug in a bit and dispelled for myself the notion that LLMs are impossible to understand. I am excited to learn more. If I got anything wrong, or you just want to chat about this stuff, please let me know.
