Large language model

A large language model (LLM) is a type of artificial intelligence that can process and generate human language. These models learn by studying huge amounts of text from books, websites, and other sources.^[1]

How they work

LLMs work by finding patterns in language. They learn grammar, facts, and how words relate to each other by looking at billions of examples. The most powerful LLMs use a design called the "transformer," which lets them process many words of a text at the same time instead of one by one, so they can handle large amounts of text quickly.^[2]
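The idea of learning patterns from examples can be shown with a very small sketch. The Python code below is only a toy and is not how a real LLM is built: it counts which word follows which in a tiny made-up text, then guesses the most common next word. Real LLMs use neural networks such as transformers trained on far more text. The function names `train_bigram_model` and `predict_next` and the sample text are invented for this illustration.

```python
from collections import Counter, defaultdict


def train_bigram_model(text):
    """Count, for each word, which words follow it in the text."""
    words = text.lower().split()
    following = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        following[current_word][next_word] += 1
    return following


def predict_next(model, word):
    """Return the word seen most often after `word`, or None if it was never seen."""
    candidates = model.get(word.lower())
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# A tiny made-up training text (an assumption for this sketch).
corpus = "the cat sat on the mat . the cat ate the fish . the cat slept ."
model = train_bigram_model(corpus)

print(predict_next(model, "the"))  # "cat" follows "the" three times, more than any other word
print(predict_next(model, "dog"))  # None, because "dog" never appears in the text
```

In this toy, the "pattern" is only a table of word counts. An LLM learns much richer patterns, but the basic task of predicting what comes next from examples is similar.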

Limitations

While LLMs are powerful, they can make mistakes. They sometimes include biases from their training data, and they can produce incorrect information. They learn from existing text rather than having true understanding like humans do.^[3]

History

Before 2017, language models were much simpler. The big change came in 2017, when researchers at Google introduced the "transformer" design, which made language models much more powerful.^[4]

Important developments include:

2018: BERT was released, which helped computers better understand language^[5]
2019: GPT-2 was created, and its creators considered it powerful enough that they worried about misuse^[6]
2022: ChatGPT was released and became very popular with the public^[7]
2023: GPT-4 came out and could understand both text and images^[8]

Modern developments

Today, there are many different LLMs available. Some are proprietary, like GPT-4, while others, like DeepSeek and Mistral, are openly available for anyone to use. As of 2024, GPT-4 was considered one of the most capable language models.^[9]

Large Language Model Media

The number of publications about large language models by year, grouped by publication type
The training compute of notable large models in FLOPs vs publication date over the period 2010–2024. The panels show overall notable models (top left), frontier models (top right), top language models (bottom left) and top models within leading companies (bottom right). The majority of these models are language models.
The training compute of notable large AI models in FLOPs vs publication date over the period 2017–2024. The majority of large models are language models or multimodal models with language capacity.
An illustration of the main components of the transformer model from the original paper, where layers were normalized after (instead of before) multiheaded attention
Each attention head calculates, according to its own criteria, how relevant the other tokens are to the "it_" token. The second attention head, represented by the second column, focuses most on the first two rows, i.e. the tokens "The" and "animal", while the third column focuses most on the bottom two rows, i.e. on "tired", which has been tokenized into two tokens.
At points referred to as breaks, the lines change slope, so that on a linear-log plot they appear as a series of straight segments connected by arcs.
According to research institute Epoch AI, energy consumption per typical ChatGPT query (0.3 watt-hours) is small compared to the average U.S. household consumption per minute (almost 20 watt-hours).