Silicon Picks and Transistor Shovels: The Infrastructure Behind the AI Gold Rush

Happy fall, everyone!

The pace has really picked up behind the scenes at Overlap—we’ve been heads down in execute mode and look forward to sharing some fun announcements soon. In the meantime, here’s an excellent (and timely) explainer Gauri wrote about AI and the hardware it relies upon.

This post is the first of a two-part series on the infrastructure needed to handle the surge in demand for Artificial Intelligence, and how the market is beginning to address it (and why Nvidia’s stock price keeps going up…). We hope you enjoy it as much as we did.


We’ve seen the articles. The Twitter threads. The memes. The many, many daily reminders that the AI revolution is here. How much of this is hysteria vs. reality? It's hard to tell. But parts of this shift are already a foregone conclusion: Don't expect your 8th grader to be writing their own book report anytime soon.

When we think of Artificial Intelligence, we typically think of software. However, AI applications bring with them a massive surge in computational needs, which necessitate more (and different kinds of) processing power than was needed previously. This increase will require hardware innovation to ripple through the data centers that underpin our information economy—and no matter which software developers win the AI gold rush, the companies creating these hardware “picks and shovels” stand to make a fortune.

But before we make our way to the kitchen, let's start with the front of the house.

What do people mean when they talk about Artificial Intelligence?

The phrase “AI” elicits images of a mechanized mind that can understand and perform meaningful, complex tasks through simple speech or text recognition. This popular notion of AI challenges what we’ve always been taught—that computers have no intelligence of their own—but the truth is a lot more nuanced.

Artificial Intelligence can take many different forms, and there are entire domains of academic study and literature that deal with these differences. The types of AI driving the current increase in practical use cases and hype are known as Large Language Models (LLMs). These are essentially algorithms that have been fed massive volumes of written content, and which use a mixture of pattern-matching and statistics to provide answers to questions. (In the case of AI-based art apps, they’re fed massive amounts of visual content and use pattern-matching and statistics to provide representative images.)

The most well-known of these is ChatGPT, whose maker, OpenAI, has pioneered many of the advancements that have made these models useful for real-life applications. As the craze has continued, firms such as Google, Microsoft, and even Snapchat have integrated LLM features into their offerings.

To be clear, these AI applications aren't sentient. While they may respond like humans, they don't think the way we do, and they don't actually know what they're saying. Instead, they've been programmed to take your query, weigh it against patterns learned from the millions (if not billions) of examples they were trained on, and determine the highest-probability response, blending both the information you're requesting and the optimal grammar/language/syntax to convey the answer.

The results weren’t always so convincing. Earlier GPT models had a clunky grasp of the intricacies of language and did not typically provide functional responses. However, after many years of throwing dollars and brains at the problem, the majority of outputs have gotten quite useful—almost to the point of seeming, well, human.

The complexity behind the simplicity

Unfortunately, while newer LLMs are easy to interact with, each response provided by these systems is highly computationally intensive. As a useful segue to this next part of the post, we've asked ChatGPT itself to explain why.

Imagine that ChatGPT is like having a friendly, highly knowledgeable librarian at a small-town library. This librarian, while incredibly smart and able to help with a wide range of general topics, has limited space. They can engage in in-depth conversations with you, understand context, and provide thoughtful responses, but they're focused on one conversation at a time.

On the other hand, the Google search engine is like a vast, state-of-the-art research library in a bustling metropolis. This library has an immense collection of books, articles, and resources on almost every topic imaginable. It can handle an enormous number of people simultaneously, each searching for their specific needs, and provide them with a vast array of information instantly. However, it might not engage in in-depth conversations the way our friendly small-town librarian does.

—ChatGPT

Because of this difference in the depth of the “conversation,” a single query in ChatGPT requires much more computational power than a single Google search. Breaking it down further, this is what happens when you ask an LLM a question:

  1. Your input query is sent to servers that host the language model (also known as its “neural network” 🧠);

  2. The input is broken down (or “tokenized,” in industry terms) into sentence fragments, individual words, and even sub-words, so that the model can then search for those words and concepts (the “tokens”) across its database of content (a short sketch after this list shows what tokenization looks like in practice). It needs to “understand” that when it sees certain groups of words in a row, they may mean a singular concept. For instance, it needs to correctly parse the difference between a query for “how to find remote work” and “why won't my remote control work?”;

  3. The model performs a significant amount of computation on these tokens, usually by assigning each of them to a different “processing core” so that the work can be done in parallel (remember that last part—we'll be talking more about it down below 👇). It analyzes the word(s) and context, and tries to provide an answer similar to ones provided for similar questions in its available database. This involves a massive number of complicated mathematical operations with names such as “matrix multiplications” and “non-linear activations”—all of which is a fancy way of saying statistics and probabilities;

  4. The model then takes these “answers” it has found in its database and structures them back into a plain-language response you can understand.
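
To make the tokenization step a little more concrete, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer library. The encoding name and example query are illustrative assumptions; different models use different tokenizers.

```python
import tiktoken

# "cl100k_base" is the encoding used by several recent OpenAI models; other
# models use different tokenizers, so treat this purely as an illustration.
enc = tiktoken.get_encoding("cl100k_base")

query = "why won't my remote control work?"
token_ids = enc.encode(query)  # a list of integer token IDs

# Map each ID back to the text fragment it represents.
tokens = [enc.decode([tid]) for tid in token_ids]
print(tokens)  # something like: ['why', ' won', "'t", ' my', ' remote', ' control', ' work', '?']
```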

Of course, there are a ton of additional complexities under the hood. For instance, ChatGPT doesn’t just give you the same canned response every time. Asking the exact same question multiple times will yield multiple responses because a slight amount of randomness is intentionally baked into the process—to provide a more varied speech pattern that better mimics the way humans speak, and to avoid unhelpful, circular responses (those who want to dive in further can check this out). Safeguards are also put in place to ensure that the system doesn’t produce vulgar or offensive responses.
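
That dash of randomness is typically controlled by a sampling "temperature." The toy sketch below (plain NumPy, with made-up candidate words and scores, not any model's real internals) shows the basic idea: raw scores are converted to probabilities, and a higher temperature flattens the distribution so less likely words get picked more often.

```python
import numpy as np

rng = np.random.default_rng()

# Hypothetical candidate next words and made-up raw scores ("logits").
candidates = ["great", "good", "fine", "spectacular"]
logits = np.array([2.0, 1.5, 0.8, 0.1])

def sample_next_word(logits, temperature=1.0):
    # Scale the scores by the temperature, then convert them to probabilities
    # (a softmax). Low temperature: nearly always the top word; high
    # temperature: flatter distribution, more varied picks.
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(candidates, p=probs)

print(sample_next_word(logits, temperature=0.2))  # almost always "great"
print(sample_next_word(logits, temperature=1.5))  # noticeably more variety
```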

The complexities of the process add to its computational intensity compared to a Google search, which (relatively “simply”) retrieves existing web pages based on keyword matching from a database that is maintained, ranked, and updated in the background. While Google's search infrastructure is extremely powerful and handles a huge number of queries, each individual query tends to be computationally MUCH less intensive than what's required for Large Language Models. A report published earlier this year estimates that the cost to Google or Microsoft of processing an individual AI query is up to 10x the cost of a standard web search.

Let's go back to our earlier analogy. Imagine the small-town librarian representing ChatGPT becomes incredibly well-known—so well-known, in fact, that everyone starts calling the library to ask any question they might have. Sounds like the system would get overwhelmed pretty quickly, right? The library would probably have to staff up with additional librarians (who, for the sake of this analogy, represent computer processors and memory); phone lines (routers, switches, and other input/output hardware); coffee for its stressed-out librarians (power storage and distribution); and a long list of ancillary needs.

This, essentially, is what is happening to data centers right now. As more and more searches get rerouted from the big-city library to the small-town one, the infrastructure is trying hard to keep up.

Behind the rack

Now that we’ve established the nature of these infrastructure needs, let’s take a step behind the metaphorical curtain and look at how a data center gets your question answered.

Each time you interact with the internet, it pings a server in a data center somewhere to respond to your request. Servers consist primarily of processors, memory (“storage”), a network interface, and a power supply. The part of the server that actually “does the work” to complete your request is the processor: When people talk about a computer chip or a semiconductor, this is the part of the system they’re talking about.

If you were to picture a data center as a restaurant kitchen where all the orders (information requests, or “input”) are received and completed, the processor would be the chef who actually prepares the dishes (responses) requested by the customers (users). The other components in the server, such as memory and network interface, are the staff who support the chef. They bring the chef the necessary ingredients (data) and deliver the completed dishes (“output”)—but there would be nothing to serve without the chef.

Processors (or semiconductors) are far and away the most expensive and complex part of the system, and we’ll be focusing on them for the rest of this post.

A processor has all of the circuitry responsible for interpreting and executing instructions. Given the higher level of computation required by AI applications, as discussed above, the biggest driver of profits and innovation today is migrating the industry toward processors that can better handle this increased load.

Getting even further into the weeds (trust me, it’s important), there are two main types of processors:

  • CPUs (short for “Central Processing Units”); and

  • GPUs (short for “Graphics Processing Units”—which, as we’ll explain in a minute, is a bit of an outdated term)

You’re probably mostly familiar with CPUs from everyday life—this is the type of chip that made Intel famous, and which powers the bulk of the processes on your personal computer.

Back in the day, CPUs could only do one calculation at a time, as all processes had to run through a single location, or “core.” These days, most CPUs have multiple cores, enabling the processor to run multiple calculations in parallel and complete tasks faster. This idea of “parallel processing” is the foundation of high-performance computing. Entire fields of engineering focus on increasingly clever ways to break up instructions so that work can be delegated appropriately across cores and completed in parallel as quickly as possible.
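
As a toy illustration of the idea (not how a real CPU scheduler actually works), here is a sketch using Python's standard library that fans four independent calculations out across worker processes, one per core. On a four-core machine, the batch finishes in roughly the time of the slowest task rather than the sum of all four.

```python
from concurrent.futures import ProcessPoolExecutor

def crunch(n):
    # Stand-in for a CPU-heavy calculation.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    workloads = [10_000_000, 12_000_000, 9_000_000, 11_000_000]

    # Hand the four independent tasks to a pool of worker processes
    # (one per core) instead of running them one after another.
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(crunch, workloads))

    print(results)
```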

The methods for breaking up these instruction sets and the “language” (i.e., code) used to communicate them have both been improved through techniques with impressive-sounding names like “hyperthreading.” You’ve probably seen the headlines about the recent IPO of the British chip designer Arm (a Softbank portfolio company), whose big breakthrough over traditional processors was the ability to use a simplified language with a reduced set of instructions to communicate common commands to a processor. Such an architecture trades a smaller overall level of functionality for reduced energy usage and size requirements—a great tradeoff for a smartphone CPU, and one that has led to outsized market share in that industry for Arm (and, ultimately, a large IPO at a nice premium for its largest shareholder, Softbank).

Now, CPUs are general-purpose processors—they can do pretty much everything, from playing movies to crunching spreadsheets to web browsing, and can bounce between tasks relatively quickly. The tradeoff of this ability to handle diverse processes is that they’re not great at parallel processing. As a result, a new type of processor was created that could tackle a narrower set of tasks with a much greater ability to run parallel processes and complete those tasks very quickly. This is where the GPU comes in.

GPUs were originally built to render graphics for computer video games. They had their first heyday in the ’90s, when 3D-rendered action video games such as “Doom” made their debut. As any gamer will tell you, the speed at which complex graphics are delivered to the monitor is immensely important. GPUs have thousands of cores, allowing them to process image pixels in parallel and thus reducing latency, or the time it takes to process a single image. By comparison, since CPUs have far fewer cores (at most, 64 for the highest-end consumer hardware), they cannot process anywhere near the same number of pixels in parallel. So, in a typical PC build, the CPU hands off the responsibility of what to display on the monitor to the GPU—the key component of what is known as the “graphics card,” whose sole focus is to process a game’s (or other application’s) graphics.
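
To get a feel for why per-pixel work rewards parallelism, here is a tiny CPU-only sketch (the frame and the brightness adjustment are made up): a single vectorized operation touches every pixel of a fake 1080p image at once, which is exactly the style of workload a GPU's thousands of cores are built to chew through.

```python
import numpy as np

# A fake 1080p RGB frame: roughly 2 million pixels, three color channels each.
frame = np.random.randint(0, 256, size=(1080, 1920, 3), dtype=np.uint8)

# One vectorized operation adjusts the brightness of every pixel at once,
# instead of looping over pixels one by one the way a single core would.
brightened = np.clip(frame.astype(np.int16) + 40, 0, 255).astype(np.uint8)

print(brightened.shape)  # (1080, 1920, 3)
```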

Until recently, the GPU market was primarily driven by graphics cards; the first company to mass-market GPUs was Nvidia, which had a strong product position but was primarily viewed as a niche player. Then Large Language Models started reaching the mainstream market, and the GPU developers found a massive new use case. Turns out, the ability for a GPU to efficiently parallel-process information made it a good fit for these models, given the complexity of the typical ChatGPT query (as discussed above, and in more detail below).

Let’s say you ask an AI search engine the following: “Find me some steep discounts at family-friendly hotels for Columbus Day weekend where I can use my frequent flyer miles.” In order to give you a response quickly, such a model needs to break that question into multiple parts and run down the meanings of each of them. Taking just a subset of the necessary parallel processes:

  • It needs to figure out when Columbus Day weekend falls

  • It needs to determine which hotels are considered “family-friendly”

  • It needs to understand what a “steep discount” is and apply that against the aforementioned hotel criteria

  • It needs to determine what a frequent flyer mile is—and better yet, if you have any, and with which airline/hotel programs

None of these are necessarily difficult answers to find on the internet, but they’re all pretty different. Given the speed at which we want to receive the answer to this query, it would be much better if these various questions could be handled in parallel, by different processor cores. This type of information gathering is where a GPU shines.
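
Here is a hypothetical sketch of that fan-out pattern. The lookup functions and their return values are placeholders invented for illustration, not real APIs; the point is simply that independent sub-questions can run concurrently instead of one after another.

```python
from concurrent.futures import ThreadPoolExecutor

# Placeholder lookups for each sub-question; the values are invented.
def lookup_columbus_day_weekend():
    return "Oct 7-9"

def lookup_family_friendly_hotels():
    return ["Hotel A", "Hotel B", "Hotel C"]

def lookup_steep_discount_threshold():
    return 0.30  # assume "steep" means 30%+ off

def lookup_frequent_flyer_balance():
    return {"program": "ExampleAir", "miles": 42_000}

sub_tasks = [
    lookup_columbus_day_weekend,
    lookup_family_friendly_hotels,
    lookup_steep_discount_threshold,
    lookup_frequent_flyer_balance,
]

# Run the independent sub-questions at the same time rather than in sequence.
with ThreadPoolExecutor(max_workers=len(sub_tasks)) as pool:
    futures = [pool.submit(task) for task in sub_tasks]
    partial_answers = [f.result() for f in futures]

print(partial_answers)  # later combined into a single plain-language reply
```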

As a result, the GPU—previously the CPU’s overlooked sibling—has seen a drastic surge in popularity as Large Language Models have become mainstream.

This post by Jerry Chen (a partner at the VC firm Greylock), “The New New Moats,” is a great read on moats in the context of AI and highlights the value created by GPUs:

We can see the platform shift to AI through the lens of the financial performance of NVIDIA, the primary provider of GPUs, versus Intel, the primary provider of CPUs. In 2020, NVIDIA overtook Intel as the most valuable provider of chips. In 2023, the company hit a trillion-dollar valuation.

Credit: The New New Moats, Jerry Chen

While GPUs have revolutionized AI computation, their singular focus on parallel processing has certain limitations—for instance, limited memory and the consumption of a ton of energy. In addition, the ease with which they disperse a workload across multiple cores makes it harder for them to centralize the results of these calculations, which can cause latency if/when they need to communicate information to other processors.

A whole ecosystem of companies is now developing novel hardware and chips to solve for these challenges. Some of them are focused on improving the performance of the GPU itself, like Cerebras—a semiconductor startup that built the world’s largest computer chip in 2019 (roughly the size of an iPad). Others are developing novel processors meant specifically to serve AI applications, such as Graphcore, which has developed an “Intelligence Processing Unit” with machine-learning pathways built directly into the processor, abstracting away much of the time a GPU would typically spend figuring out what step to do next. Companies like these will determine what the next generation of computing architecture looks like, with many fortunes made or lost along the way.

Once a novel technology has been developed and customers secured, the next challenge is building the manufacturing infrastructure to fulfill demand. Given that we’re dealing with semiconductor chips whose features are measured on a near-atomic scale, the manufacturing facilities themselves must be quite advanced to keep up with chip innovation. We’ll dive into that topic in detail in next month's piece, where we will discuss the manufacturing landscape for semiconductors, how it has evolved (and will continue to evolve) over time, and why it could even become a matter of national security.
