Progress in AI systems often feels cyclical. Every few years, computers can suddenly do something they’ve never been able to do before. “Behold!” the AI true believers proclaim, “the age of artificial general intelligence is at hand!” “Nonsense!” counter the skeptics. “Remember self-driving cars?”
The truth usually lies somewhere in between.
We’re in another cycle, this time with generative AI. Media headlines are dominated by news about AI art, but there’s also unprecedented progress in many widely disparate fields. Everything from videos to biology, programming, writing, translation, and more is seeing AI progress at the same incredible pace.
Why is all this happening now?
You may be familiar with the latest happenings in the world of AI. You’ve seen the prize-winning artwork, heard the interviews between dead people, and read about the protein-folding breakthroughs. But these new AI systems aren’t just producing cool demos in research labs. They’re quickly being turned into practical tools and real commercial products that anyone can use.
There’s a reason all of this has come at once. The breakthroughs are all underpinned by a new class of AI models that are more flexible and powerful than anything that has come before. Because they were first used for language tasks like answering questions and writing essays, they’re often known as large language models (LLMs). OpenAI’s GPT-3 and Google’s BERT are both examples of LLMs.
But these models are extremely flexible and adaptable. The same mathematical structures have been so useful in computer vision, biology, and more that some researchers have taken to calling them “foundation models” to better articulate their role in modern AI.
Where did these foundation models come from, and how have they broken out beyond language to drive so much of what we see in AI today?
The foundation of foundation models
There’s a holy trinity in machine learning: models, data, and compute. Models are algorithms that take inputs and produce outputs. Data refers to the examples the algorithms are trained on. To learn something, there must be enough data with enough richness that the algorithms can produce useful output. Models must be flexible enough to capture the complexity of the data. And finally, there has to be enough computing power to run the algorithms.
The first modern AI revolution took place with deep learning in 2012, when solving computer vision problems with convolutional neural networks (CNNs) took off. CNNs are similar in structure to the brain’s visual cortex. They’ve been around since the 1990s but weren’t yet practical due to their intense computing power requirements.
In 2006, though, Nvidia introduced CUDA, a programming platform that allowed for the use of GPUs as general-purpose supercomputers. In 2009, Stanford AI researchers introduced ImageNet, a collection of labeled images used to train computer vision algorithms. In 2012, AlexNet combined CNNs trained on GPUs with ImageNet data to create the best visual classifier the world had ever seen. Deep learning and AI exploded from there.
CNNs, the ImageNet data set, and GPUs were a magic combination that unlocked tremendous progress in computer vision. The year 2012 set off a boom of excitement around deep learning and spawned entire industries, like those involved in autonomous driving. But we quickly learned there were limits to that generation of deep learning. CNNs were great for vision, but other areas did not have their model breakthrough. One huge gap was in natural language processing (NLP), that is, getting computers to understand and work with normal human language rather than code.
The problem of understanding and working with language is fundamentally different from that of working with images. Processing language requires working with sequences of words, where order matters. A cat is a cat no matter where it is in an image, but there’s a big difference between “this reader is learning about AI” and “AI is learning about this reader.”
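To make the point about word order concrete, here is a minimal Python sketch. The helper name bag_of_words is hypothetical and chosen only for illustration; it counts words while ignoring their order. Under that order-blind view, the two example sentences above are indistinguishable, even though they mean very different things.

```python
from collections import Counter

def bag_of_words(sentence):
    """Count the words in a sentence, ignoring order (hypothetical helper)."""
    return Counter(word.strip(".,!?").lower() for word in sentence.split())

a = "this reader is learning about AI"
b = "AI is learning about this reader"

# Same words, same counts: an order-blind representation sees no difference.
same_bag = bag_of_words(a) == bag_of_words(b)

# Compared as ordered sequences, the sentences are clearly different.
same_sequence = a.lower().split() == b.lower().split()

print(same_bag)       # True
print(same_sequence)  # False
```

Any model that is meant to tell these two sentences apart has to keep track of where each word sits in the sequence, not just which words appear.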
Until recently, researchers relied on models like recurrent neural networks (RNNs) and long short-term memory (LSTM) networks to process and analyze sequential data. These models were effective at recognizing short sequences, like spoken words from short phrases, but they struggled to handle longer sentences and paragraphs. The memory of these models was just not sophisticated enough to capture the complexity and richness of ideas and concepts that arise when sentences are combined into paragraphs and essays. They were great for simple Siri- and Alexa-style voice assistants but not for much else.
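One way to get a feel for this limited memory is a toy recurrence. The sketch below is not a trained RNN or LSTM; it is a single-number hidden state with made-up weights (w and u are assumptions chosen for illustration), updated one input at a time the way a recurrent network reads a sequence. It measures how much the very first input still affects the final state as the sequence gets longer.

```python
import math

def run_rnn(inputs, w=0.5, u=1.0):
    """Toy recurrent update: h = tanh(w * h + u * x), applied step by step.

    w and u are illustrative, hand-picked weights, not learned values.
    """
    h = 0.0
    for x in inputs:
        h = math.tanh(w * h + u * x)
    return h

def first_input_effect(length):
    """How much the final state changes when only the first input changes."""
    baseline = [0.0] * length
    bumped = [1.0] + [0.0] * (length - 1)
    return abs(run_rnn(bumped) - run_rnn(baseline))

for length in (2, 10, 30):
    print(length, first_input_effect(length))
```

With these weights, the trace of the first input shrinks by at least half at every step, so after a few dozen steps it has all but vanished. Real RNNs and LSTMs are far more capable than this toy, but the sketch hints at why information from the start of a long passage is hard to carry through to the end.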
Getting the right training data was another challenge. ImageNet was a collection of millions of labeled images that required significant human effort to generate, mostly by grad students and Amazon Mechanical Turk workers. And ImageNet was actually inspired by and modeled on an older project called WordNet, which tried to create a labeled data set for English vocabulary. While there is no shortage of text on the Internet, creating a meaningful data set to teach a computer to work with human language beyond individual words is incredibly time-consuming. And the labels you create for one application may not apply to another task, even on the same data.