How Will A.I. Learn Next?

As chatbots threaten their own best sources of data, they will have to find new kinds of knowledge.
Illustration by Vivek Thakker

The Web site Stack Overflow was created in 2008 as a place for programmers to answer one another’s questions. At the time, the Web was thin on high-quality technical information; if you got stuck while coding and needed a hand, your best bet was old, scattered forum threads that often led nowhere. Jeff Atwood and Joel Spolsky, a pair of prominent software developers, sought to solve this problem by turning programming Q. & A. into a kind of multiplayer game. On Stack Overflow—the name refers to a common way that programs crash—people could earn points for posting popular questions and leaving helpful answers. Points earned badges and special privileges; users would be motivated by a mix of altruism and glory.

Within three years of its founding, Stack Overflow had become indispensable to working programmers, who consulted it daily. Pages from Stack Overflow dominated programming search results; the site had more than sixteen million unique visitors a month, a remarkable figure given that there were an estimated nine million programmers in the world. Almost ninety per cent of those visitors arrived through Google. The same story was playing out across the Web: this was the era of “Web 2.0,” and sites that could extract knowledge from people’s heads and organize it for others were thriving. Yelp, Reddit, Flickr, Goodreads, Tumblr, and Stack Overflow all launched within a few years of one another, during a period when Google was experiencing its own extraordinary growth. Web 2.0 and Google fuelled each other: by indexing these crowdsourced knowledge projects, Google could get its arms around vast, dense repositories of high-quality information for free, and those same sites could acquire users and contributors through Google. The search company’s rapacious pursuit of other people’s data was excused by the fact that it drove users toward the content it harvested. In those days, Google even measured its success partly by how quickly users left its search pages: a short stay meant that a user had found what they were looking for.

All this started to change almost as soon as it had begun. Around that time, Google launched the OneBox, a feature that provided searchers with instant answers above search results. (Search for movie times, and you’d get them in the OneBox, above a list of links to movie theatres.) The feature siphoned traffic from the very sites that made it possible. Yelp was an instructive case: Google wanted to compete in the “local” market but didn’t have its own repository of restaurant and small-business reviews. Luther Lowe, Yelp’s former head of public policy, told me recently that Google tried everything it could to claw its way in, from licensing Yelp’s data (Yelp declined) to encouraging its own users to write reviews (no one wanted to contribute at the time) to buying Yelp outright (Yelp declined again). “Once those strategies failed—license, compete on the merits, purchase the content—what did they have left?” Lowe said. “They had to steal it.” In 2010 and 2011, Lowe says, Yelp caught Google scraping its content with no attribution. The data gave Google just enough momentum to bootstrap its own reviews product. When Yelp publicly accused Google of stealing its data, Google stopped, but the damage had already been done. (A similar thing happened at a company I once worked for, called Genius. We sued Google for copying lyrics from our database into the OneBox; I helped prove that it was happening by embedding a hidden message into the lyrics, using a pattern of apostrophes that, in Morse code, spelled “RED HANDED.” Google won in appellate court, in the Second Circuit. Genius petitioned the Supreme Court to hear the case, but the court declined.)
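
For the curious, here is a rough sketch, in Python, of how such a watermark might work. The mapping below (straight apostrophes for dots, curly ones for dashes) and the lack of letter breaks are assumptions made for illustration, not a description of the exact scheme we used:

```python
# A toy apostrophe watermark. The dot/dash mapping below is an assumption.
MORSE = {"R": ".-.", "E": ".", "D": "-..", "H": "....", "A": ".-", "N": "-."}

def watermark(lyrics: str, message: str = "REDHANDED") -> str:
    # Flatten the message into one long run of dots and dashes.
    signal = "".join(MORSE[letter] for letter in message)
    out, i = [], 0
    for ch in lyrics:
        if ch in ("'", "\u2019") and i < len(signal):
            # Straight apostrophe encodes a dot, curly encodes a dash (assumed).
            out.append("'" if signal[i] == "." else "\u2019")
            i += 1
        else:
            out.append(ch)
    return "".join(out)
```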

In 2012, Google doubled down on the OneBox with a redesign that deëmphasized the classic blue links to external Web sites in favor of Google’s own properties, like Shopping and Maps, and immediate answers culled from sites like Wikipedia. This made Google even more convenient and powerful, but also had the effect of starving the Web of users: instead of a search leading you to a Wikipedia page, say, where you might join the small percentage of visitors who end up contributing, you’d get your answer straight from Google. According to Lowe, on pages of search results featuring the new design, as many as eighty per cent of searchers would leave without ever clicking on a link. Many Web 2.0 darlings, dense with user-generated content, saw visitor numbers decline. It was around this time that, in some sense, the quality of the Web as a whole began to decline, with the notable exception of the few crowdsourced knowledge projects that managed to survive. There’s a reason that appending “reddit” or “wiki” to search terms has become an indispensable productivity hack: in a hollowed-out Web overrun with spammers and content farms, these have become some of the last places where real, knowledgeable humans hang out.

Today, large language models, like OpenAI’s ChatGPT and Google’s Bard, are completing a process begun by the OneBox: their goal is to ingest the Web so comprehensively that it might as well not exist. The question is whether this approach is sustainable. L.L.M.s depend for their intelligence on vast repositories of human writing—the artifacts of our intelligence. They especially depend on information-dense sources. When OpenAI created ChatGPT, Wikipedia was its most important data set, followed by Reddit; about twenty-two per cent of GPT-3’s training data consisted of Web pages linked to and upvoted by Reddit users. ChatGPT is such a good programmer that the savvy developers I know aren’t using Stack Overflow anymore—and yet it’s partly by studying Stack Overflow that ChatGPT became such a good programmer. Recently, a group of researchers estimated that the number of new posts on Stack Overflow had decreased by sixteen per cent since the launch of ChatGPT.

I’m not a Stack Overflow power user, but I am a coder, and I’ve relied on the site for more than a decade. I’ve submitted projects to GitHub (a site for open-source code), posted on Reddit, and edited Wikipedia pages. Meanwhile, I’ve published blog posts and code to my Web site for years. Like everyone else, I didn’t suspect that I was producing GPT fodder; if I’d known, I might have asked for something in return, or even withheld my contributions. In April, the C.E.O. of Reddit announced that, from then on, any company that required large-scale data from its site would have to pay for the privilege. (Because the move threatened other, non-A.I.-related apps, Reddit users responded by “blacking out” huge swaths of the site, emphasizing that the company’s fortunes depended on uncompensated community contributions.) Stack Overflow has made a similar announcement.

Maybe the crowdsourcing sites will manage to wall off their content. But it may not matter. High-quality data is not necessarily a renewable resource, especially if you treat it like a vast virgin oil field, yours for the taking. The sites that have fuelled chatbots function like knowledge economies, using various kinds of currency—points, bounties, badges, bonuses—to broker information to where it is most needed, and chatbots are already thinning out the demand side of these marketplaces, starving the human engines that created the knowledge in the first place. This is a problem for us, of course: we all benefit from a human-powered Web. But it’s also a problem for A.I. It’s possible that A.I.s can only hoover up the whole Web once. If they are to continue getting smarter, they will need new reservoirs of knowledge. Where will it come from?

A.I. companies have already turned their attention to one possible source: chat. Anyone who uses a chatbot like Bard or ChatGPT is participating in a massive training exercise. In fact, one reason that these bots are provided for free may be that a user’s data is more valuable than her money: everything you type into a chatbot’s text box is grist for its model. Moreover, we aren’t just typing but pasting—e-mails, documents, code, manuals, contracts, and so on. We’re often asking the bots to summarize this material and then asking pointed questions about it, conducting a kind of close-reading seminar. Currently, there’s a limit to how much you can paste into a bot’s input box, but the amount of new data we can feed them at a gulp will only grow.

It won’t be long before many of us also start bulk-importing our most private documents into these models. A chatbot hasn’t yet asked me to grant it access to my e-mail archives—or to my texts, calendar, notes, and files. But, in exchange for a capable A.I. personal assistant, I could be tempted to compromise my privacy. A personal-assistant bot might nudge me to install a browser extension that tracks where I go on the Web so that it can learn from my detailed searching and browsing patterns. And ChatGPT and its ilk will soon become “multimodal,” able to fluidly blend and produce text, images, videos, and sound. Most language is actually spoken rather than written, and so bots will offer to help us by transcribing our meetings and phone calls, or even our everyday interactions.

Before models like GPT-3.5 and GPT-4 made their way into the user-facing ChatGPT product, they were tuned with what OpenAI calls “reinforcement learning from human feedback,” or R.L.H.F. Essentially, OpenAI paid human testers to have conversations with the raw model and rate the quality of its replies; the model learned from these ratings, aligning its responses ever more finely with our intentions. It’s because of R.L.H.F. that ChatGPT is so eerily good at understanding exactly what you’re asking and what a good answer should look like. This process was likely expensive. But now R.L.H.F. can be had for free, and at a much bigger scale, through conversations with real-world users. This is true even if you don’t click one of the thumbs-up, thumbs-down, or “This was helpful”-style buttons at the bottom of a chat transcript. GPT-4 is so good at interpreting writing that it can examine a chat transcript and decide for itself whether it did a good job serving you. One model’s conversations can even bootstrap another’s: it’s been claimed that rivals to ChatGPT, such as Google Bard, finished their training by consuming ChatGPT transcripts that had been posted online. (Google has denied this.)
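
To get a feel for the mechanics, here is a toy sketch, in Python, of the rating half of R.L.H.F.: a reward model learns to score the reply a human preferred above the one she rejected. The eight-dimensional “embeddings” and the preferences below are random stand-ins for real data:

```python
import torch

torch.manual_seed(0)
reward_model = torch.nn.Linear(8, 1)   # maps a reply "embedding" to a scalar reward
optimizer = torch.optim.Adam(reward_model.parameters(), lr=0.01)

# Fake data: each pair is (embedding of the reply a rater preferred, embedding of the rejected one).
pairs = [(torch.randn(8), torch.randn(8)) for _ in range(200)]

for preferred, rejected in pairs:
    # Bradley-Terry-style loss: push the preferred reply's score above the rejected reply's.
    loss = -torch.nn.functional.logsigmoid(
        reward_model(preferred) - reward_model(rejected)
    ).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# In full R.L.H.F., a language model would then be fine-tuned (e.g., with PPO)
# to produce replies that this learned reward model scores highly.
```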

The use of chatbots to evaluate and train other chatbots points the way toward the eventual goal of removing humans from the loop entirely. Perhaps the most fundamental limitation of today’s large language models is that they depend on knowledge that’s been generated by people. A sea change will come when the bots can generate knowledge for themselves. One possible path involves what’s known as synthetic data. For a long time now, A.I. researchers have padded their data sets as a matter of course: a neural network trained on images, for instance, might undergo a preprocessing step in which each image is rotated ninety degrees, or shrunk, or mirrored, creating for each picture eight or sixteen variants. But the doctoring can be much more involved than that. In autonomous-vehicle research, capturing real-world driving data is incredibly expensive, because you have to outfit an actual car with sensors and drive it around; it’s much cheaper to build a simulated car and run it through a virtual environment with simulated roads and weather conditions. It’s now typical to train state-of-the-art self-driving A.I.s by driving them for millions of miles on real roads and billions of miles in simulation.
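
Here is what that image padding looks like in miniature: a minimal sketch using the Pillow library, in which each picture yields eight variants (the file path is a placeholder):

```python
# A minimal image-augmentation sketch: four rotations, each also mirrored.
from PIL import Image, ImageOps

def augment(path: str):
    image = Image.open(path)          # path to any training image
    variants = []
    for angle in (0, 90, 180, 270):
        rotated = image.rotate(angle, expand=True)
        variants.append(rotated)
        variants.append(ImageOps.mirror(rotated))
    return variants                   # eight variants per original picture
```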

Sam Altman, the C.E.O. of OpenAI, has said that synthetic data might also soon overtake the real thing in training runs for L.L.M.s. The idea would be to have a GPT-esque model generate documents, conversations, and evaluations of how those conversations went, and then for another model—perhaps just a copy of the first—to ingest them. The hope is to enter a training regime similar to that of A.I.s designed for games like chess and Go, which learn largely through “self-play.” In each step of training, the A.I. learns something about the game by playing an opponent that’s exactly its equal; from that experience, it improves just a little bit, and then that slightly better version of the bot squares off against an equally improved copy of itself and improves again. Up and up it goes. By playing a perfectly matched opponent—itself—an A.I. can even get into interesting positions deep within games, exploring the game world at exactly the frontier of its existing knowledge in a way that humans never do. This strategy is uncannily effective: the game-playing A.I. AlphaZero started its training run knowing nothing but the rules of chess and, after four hours, had surpassed every player, human or machine, there had ever been.
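
The shape of that loop is easy to see in a toy setting. Below is a sketch, in Python, of self-play on the game of Nim (players alternately take one, two, or three stones; whoever takes the last stone wins), with a single agent playing both sides of every game and learning from the outcomes. Nim is my stand-in; the real systems use neural networks and far richer games:

```python
# A toy self-play loop: the same agent plays both sides of Nim and learns
# a value for each (stones remaining, move) pair from the results.
import random
from collections import defaultdict

random.seed(0)
Q = defaultdict(float)          # Q[(stones_left, move)]: estimated value for the mover
EPSILON, ALPHA = 0.1, 0.5

def choose(stones, explore=True):
    moves = [m for m in (1, 2, 3) if m <= stones]
    if explore and random.random() < EPSILON:   # explore occasionally
        return random.choice(moves)
    return max(moves, key=lambda m: Q[(stones, m)])

for _ in range(20000):
    stones, history = 15, []
    while stones > 0:
        move = choose(stones)
        history.append((stones, move))
        stones -= move
    # The player who took the last stone won; credit alternates back up the
    # game, since both sides were played by the same agent.
    reward = 1.0
    for state, move in reversed(history):
        Q[(state, move)] += ALPHA * (reward - Q[(state, move)])
        reward = -reward

print(choose(15, explore=False))   # should settle on taking 3, leaving a multiple of four
```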

Altman is bullish on synthetic data, but there are reasons to be skeptical—including the obvious one that, no matter how smart you are, you can’t learn new facts about the world by reviewing what you already know. In a recent study, researchers trained an A.I. model with synthetic images that it had generated; they then used the resulting model to generate even more training data. With each generation, the quality of the model actually degraded. It only improved when fresh, real images were introduced again. It stands to reason that some tasks are better suited to synthetic data than others: chess and Go require intelligence, but take place in closed worlds with rules that never change. Researchers working on A.I. “curriculum design” try to figure out how to challenge their systems with tasks that are just at the edge of their ability, the way a good coach would; in chess and Go, self-play allows for this kind of incremental improvement.
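
The degradation is easy to reproduce in miniature. In the sketch below, the “model” is nothing more than a word-frequency table, refit each generation to text sampled from the previous generation of itself. Rare words tend to fall out of the samples, and once a word’s frequency hits zero it can never return; the tails erode while nothing new is ever added. The tiny corpus is invented:

```python
# A toy illustration of training a model on its own synthetic output.
import random
from collections import Counter

random.seed(0)
corpus = ["the"] * 500 + ["cat"] * 80 + ["sat"] * 60 + ["quark"] * 3 + ["sesquipedalian"] * 1

for generation in range(8):
    counts = Counter(corpus)
    vocab, weights = list(counts), list(counts.values())
    print(f"generation {generation}: {len(vocab)} distinct words")
    # The next generation is "trained" only on text sampled from the current model.
    corpus = random.choices(vocab, weights=weights, k=len(corpus))
```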

But it seems much less clear how an A.I. could “self-play” its way to new ideas or to a more subtle appreciation of language. We don’t become better writers just by reading our own work, or purely by practicing writing sentences that we find more and more enjoyable. Our “curriculum” involves the fruits of other intelligences and the accrual of real-world experience. This curriculum is carefully designed by teachers, of course, but also by ourselves. When we seek knowledge, we don’t just blindly consume ever-larger data sets. Instead, we have things we want to know. Taylor Beck, a neuroscientist turned teacher, once pointed out to me that A.I. might be the only context in which you find truly unmotivated learning: the machine just ingests a mass of undifferentiated text, none of which it cares about. Natural intelligence, by contrast, is almost always accompanied by some want, or at least a goal—whether it’s a toddler in search of joy or an E. coli bacterium that, because it “wants” to eat, performs a sophisticated computation measuring the chemical gradients in its environment. In this view of intelligence, drive is primary. L.L.M.s like ChatGPT don’t have anything like drive; they just absorb and synthesize information. In this respect, they are fundamentally different from such systems as AlphaZero, which seek to win.

A major leap in A.I. may come when L.L.M.s start seeming curious, or bored. Curiosity and boredom sound like they belong to an organic mind, but here’s how they might be created inside an A.I. As a rule, chatbots today have a propensity to confidently make stuff up, or, as some researchers say, “hallucinate.” At the root of these hallucinations is an inability to introspect: the A.I. doesn’t know what it does and doesn’t know. As researchers begin to solve the problem of getting their models to express confidence and cite their sources, they will not just be making chatbots more credible—they will also be equipping them with a rudimentary kind of self-knowledge. An A.I. will be able to observe from reams of its own chat transcripts that it is prone to hallucination in a particular area; it will be only natural to let that tendency guide its ingestion of further training data. The model will direct itself toward sources that touch on topics it knows the least about—curiosity in its most basic form.
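
In its simplest form, that guidance could be little more than a weighted coin flip. The sketch below assumes the model can estimate its own hallucination rate by topic (the topics and rates are invented) and spends more of its reading time where it is shakiest:

```python
# A toy "curiosity" heuristic: study most where the hallucination rate is highest.
import random

# Hypothetical per-topic hallucination rates, as might be estimated from a
# model's own chat transcripts (all values made up for illustration).
hallucination_rate = {
    "tax law": 0.30,
    "rust lifetimes": 0.22,
    "organic chemistry": 0.08,
    "baking": 0.03,
}

topics = list(hallucination_rate)
weights = list(hallucination_rate.values())

def next_topic_to_study():
    # Weighted choice: the shakier the model's knowledge, the more often a topic is picked.
    return random.choices(topics, weights=weights, k=1)[0]

print(next_topic_to_study())
```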

If it can’t find the right kind of training data, a chatbot might solicit it. I imagine a conversation with some future version of ChatGPT in which, after a period of inactivity, it starts asking me questions. Perhaps, having observed my own questions and follow-ups, it will have developed an idea of what I know about. “You’re a programmer and a writer, aren’t you?” it might say to me. Sure, I’ll respond. “I thought so! I’m trying to get better at technical writing. I wonder if you could help me decide which of the following sentences is best?” Such an A.I. might ask my sister, who works at a construction company, about what’s going on in the local lumber market; it could ask my doctor friend, who does research on cancer, whether he could clear up something in a recent Nature paper. Such a system would be like Stack Overflow, Wikipedia, and Reddit combined—except that, instead of knowledge getting deposited into the public square, it would accumulate privately, in the mind of an ever-growing genius. Observing the Web collapse this way into a single gigantic chatbot would be a little like watching a galaxy spiral into a black hole.

If a curious machine were sufficiently empowered by its designers, it could become more than just a chatbot. Instead of merely asking us questions from within its own chat interface, it could send e-mails to people, or use speech synthesis and recognition to call them on the phone, the way a reporter would. If it were sufficiently intelligent, it might write a paper proposing a new physics experiment and submit it to physicists, asking them to execute it. Today, A.I.s already use A.P.I.s, or application programming interfaces, to interact with computer systems that control real-world machinery; perhaps a curious A.I. could requisition space in a robotically controlled biology lab. In just the last few years, we have progressed from a world in which A.I. merely repackages human knowledge to one in which it synthesizes and consolidates it. After learning to draw new knowledge out of us, it could start producing some of its own.

What’s frightening about all this is the immense concentration of power that it represents. Back in the early twenty-tens, when Google was contemplating making every out-of-print volume in Google Books available for free at library terminals, the company was criticized by observers who argued that it was seeking to become the sole steward of the world’s literature. But Bard and ChatGPT make the ambition of Google Books seem quaint. These models are eating the whole Web and will become increasingly hungry for every word that’s written, said, or sent; they aim to take all that knowledge and hide it in the huge opaque matrices of weights that define the neural network.

Where will this process take us? Stack Overflow was special because it drew out practical know-how that had, till then, lived only in programmers’ brains; it condensed and organized that knowledge so that everyone could see and benefit from it. Chatbots that slowly siphon traffic away from sites like Stack Overflow obviously threaten that process. But they may also renew it in a different form. An A.I. that roves curiously across new data sources, including direct conversations with working programmers, may be able to acquire more raw knowledge than Stack Overflow ever did. The oracular form this knowledge takes might be less public-spirited than the old Web, but it could also be more useful. In his novel “The Diamond Age,” Neal Stephenson imagined an artificially intelligent book called “A Young Lady’s Illustrated Primer”; in effect, it was a chatbot, built specifically to teach the protagonist everything she needed to know, with lessons that were always pitched at the right level and that adapted to her curiosity and feedback—in other words, a perfectly designed curriculum.

Such a resource would be a great boon. There is too much knowledge, and more of it every day; in some sense, we have outgrown the Web and maybe need something to take its place. New papers in physics are posted online faster than any physicist can read them; a chatbot that can retain and synthesize all that knowledge can’t come soon enough. On the other hand, it might not be wise to give everybody the librarian instead of the library. Perhaps we’ll become incapable of wandering the stacks ourselves. Google Maps has made us all perfect navigators, except that we never really know where we are. A world in which the crowdsourced Web no longer functions—in which human knowledge production and dissemination is mediated by privately owned, A.I.-based galaxy-brains—seems both convenient and quite dangerous.

It might be sensible, in the first stages of such a process, to keep humans in the loop as far as possible. As a start, we should demand that the A.I. companies behave less antisocially. Luther Lowe, of Yelp, has argued that Google could have prevented much of the damage it did to the Web in the past decade if, instead of passing off the Web’s intelligence as its own, it had made a point of pushing users to the places where it got its answers. “They could have said, ‘Let’s make the answer box a giant exit door with a forty-per-cent clickthrough rate,’ ” Lowe told me. “ ‘Let’s continue to oxygenate the Web.’ ” Recently, when I spoke to the C.E.O. of Stack Overflow about L.L.M.s, the idea of “attribution” came up about a half-dozen times; the same happened when I talked to a representative at Wikimedia, the foundation that operates Wikipedia. These Web sites want chatbots to give credit to their contributors; they want to see prominent links; they don’t want the flywheel that powers knowledge production in their communities to be starved of inbound energy.

Heeding their call might actually reinvigorate the Web—ushering in a golden age of human-led, A.I.-assisted collective knowledge production. And it would set the tone for the further development of A.I. It’s better, in general, to have models that respect human knowledge and encourage its continued production than to have models that treat us as mere stepping stones—the ladder you throw away once you’ve climbed it. In the meantime, I’m waiting for the first chatbot that wants to pick my brain. It’ll be flattering, in its way, and it might feel refreshingly honest. Instead of quietly taking the products of my thinking and trying to sell them back to me, the bot will come right out and ask me to teach it something it doesn’t already know. Maybe I’ll oblige. Or maybe I’ll just tell it, “I’m afraid I can’t do that.” ♦