Improvements in the performance of large language models such as ChatGPT are more predictable than they appear.
Matthew Hutson
Some researchers think that AI could eventually achieve general intelligence, matching and even exceeding humans on most tasks. Credit: Charles Taylor/Alamy
Will a superintelligent artificial intelligence (AI) appear suddenly, or will scientists see it coming and have a chance to warn the world? That question has received a lot of attention recently with the rise of large language models, such as ChatGPT, which have achieved vast new abilities as their size has grown. Some findings point to “emergence”, a phenomenon in which AI models gain intelligence in a sharp and unpredictable way. But a recent study calls these cases “mirages”, artefacts arising from how the systems are tested, and suggests that new abilities instead build more gradually.
“I think they did a good job of saying ‘nothing magical has happened’,” says Deborah Raji, a computer scientist at the Mozilla Foundation who studies the auditing of artificial intelligence. It’s “a really good, solid, measurement-based critique.”
The work was presented last week at the NeurIPS machine-learning conference in New Orleans.
Bigger is better
Large language models are typically trained using huge amounts of text, or other information, which they use to generate realistic answers by predicting what comes next. Even without explicit training, they manage to translate language, solve mathematical problems and write poetry or computer code. The bigger the model is (some have more than a hundred billion tunable parameters), the better it performs. Some researchers suspect that these tools will eventually achieve artificial general intelligence (AGI), matching and even exceeding humans on most tasks.
The new research tested claims of emergence in several ways. In one approach, the scientists compared the abilities of four sizes of OpenAI’s GPT-3 model to add up four-digit numbers. Measured by absolute accuracy, performance jumped from nearly 0% to nearly 100% between the third and fourth model sizes. But this trend is less extreme if the number of correctly predicted digits in the answer is considered instead. The researchers also found that they could dampen the curve by giving the models many more test questions: with enough questions, the smaller models answer correctly some of the time.
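The arithmetic behind this effect can be illustrated with a minimal sketch. It is not the study's code: it assumes, purely for illustration, that a model gets each digit of a five-digit answer right independently with the same probability, and the function name and the per-digit accuracy values are hypothetical.

```python
def exact_match_accuracy(per_digit_accuracy: float, answer_length: int = 5) -> float:
    """Chance that every digit is right, assuming digits are independent.

    Both the independence assumption and the default answer length of 5
    are illustrative simplifications, not details taken from the study.
    """
    return per_digit_accuracy ** answer_length


# Hypothetical per-digit accuracies for four model sizes, improving steadily.
per_digit_by_size = [0.50, 0.70, 0.90, 0.99]

for p in per_digit_by_size:
    # Per-digit accuracy rises smoothly; all-or-nothing accuracy lags, then leaps.
    print(f"per-digit: {p:.2f}   exact match: {exact_match_accuracy(p):.3f}")
```

Under these assumptions, a steady rise in per-digit accuracy turns into a much steeper curve once an answer counts only if every digit is correct, which is the kind of apparent jump the study attributes to the choice of metric.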
Next, the researchers looked at the performance of Google’s LaMDA language model on several tasks. The ones for which it showed a sudden jump in apparent intelligence, such as detecting irony or translating proverbs, were often multiple-choice tasks, with answers scored discretely as right or wrong. When, instead, the researchers examined the probabilities that the models placed on each answer — a continuous metric — signs of emergence disappeared.
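The difference between the two kinds of scoring can be shown with a short sketch. The probabilities below are invented for illustration and are not LaMDA outputs; the function name `score` and the model-size labels are likewise hypothetical.

```python
def score(probs: list[float], correct_index: int) -> tuple[float, float]:
    """Return (discrete score, continuous score) for one multiple-choice question.

    The discrete score is 1.0 only if the correct option has the highest
    probability; the continuous score is the probability placed on it.
    """
    predicted = max(range(len(probs)), key=lambda i: probs[i])
    discrete = 1.0 if predicted == correct_index else 0.0
    return discrete, probs[correct_index]


# Invented probabilities over four options; option 0 is the correct answer.
hypothetical_models = {
    "small": [0.22, 0.40, 0.20, 0.18],
    "medium": [0.30, 0.38, 0.17, 0.15],
    "large": [0.37, 0.36, 0.14, 0.13],
    "largest": [0.45, 0.30, 0.13, 0.12],
}

for name, probs in hypothetical_models.items():
    discrete, continuous = score(probs, correct_index=0)
    print(f"{name:8s} right/wrong: {discrete:.0f}   probability on correct answer: {continuous:.2f}")
```

In this toy example, the probability on the correct answer climbs gradually across model sizes, whereas the right-or-wrong score flips from 0 to 1 at the point where the correct option overtakes its nearest rival.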
Finally, the researchers turned to computer vision, a field in which there are fewer claims of emergence. They trained models to compress and then reconstruct images. By merely setting a strict threshold for correctness, they could induce apparent emergence. “They were creative in the way that they designed their investigation,” says Yejin Choi, a computer scientist at the University of Washington in Seattle who studies AI and common sense.
Nothing ruled out
Study co-author Sanmi Koyejo, a computer scientist at Stanford University in Palo Alto, California, says that it wasn’t unreasonable for people to accept the idea of emergence, given that some systems exhibit abrupt “phase changes”. He also notes that the study can’t completely rule it out in large language models — let alone in future systems — but adds that “scientific study to date strongly suggests most aspects of language models are indeed predictable”.
Raji is happy to see the community pay more attention to benchmarking, rather than to developing neural-network architectures. She’d like researchers to go even further and ask how well the tasks relate to real-world deployment. For example, does acing the LSAT exam for aspiring lawyers, as GPT-4 has done, mean that a model can act as a paralegal?
The work also has implications for AI safety and policy. “The AGI crowd has been leveraging the emerging-capabilities claim,” Raji says. Unwarranted fear could lead to stifling regulations or divert attention from more pressing risks. “The models are making improvements, and those improvements are useful,” she says. “But they’re not approaching consciousness yet.”
doi: https://doi.org/10.1038/d41586-023-04094-z