How AI Learned to Pay Attention
The little “highlighter” trick behind GPT
So far in this series:
We saw how early models played a guess-the-next-word game.
We met LSTMs, which made that game less forgetful.
We detoured through cats + GPUs, where computers learned to see patterns in images.
This article is about one last ingredient:
How modern AI learned to focus on the important parts of what someone says or writes.
That ingredient is called attention.
Quick reminder: where LSTMs left off
In the earlier article, LSTMs were the “better listener”:
Old models read a sentence and forgot almost everything by the end.
LSTMs were better at holding onto important bits and dropping noise as they moved word by word.
But they still had one big limitation:
They read like someone walking down a hallway with blinders on—
always moving forward, only really seeing what’s right in front of them.
If something crucial was said at the start, it had to survive being passed along through every step of the sentence or paragraph.
Helpful, but not how humans usually read or listen.
How people actually read notes
Take a line like this:
“Feeling more on edge in group settings, skipping team lunches, and noticing this has gotten worse over the past month since the promotion; denies full-blown panic attacks.”
A person doesn’t treat that as a long string of equal words.
The mind quietly does things like:
Connect “on edge” with “group settings” and “skipping team lunches.”
Link “worse over the past month” with “since the promotion.”
Notice “denies full-blown panic attacks” as a limit on how severe this is.
You skim, jump back and forth, and mentally highlight what matters:
What changed?
When did it change?
How bad is it?
What’s context and what’s noise?
That mental highlighting is the everyday version of attention.
The core idea: give AI a highlighter
Attention is basically this question, asked over and over inside the model:
“For what I’m doing right now,
which parts of this text should I pay the most attention to?”
Instead of carrying one fragile memory from left to right, the model can:
Look at the whole sentence or paragraph.
Decide which words are most related to each other.
Give the important parts a stronger “glow” inside its math.
Use that focused view to decide what to say next.
Back to the example:
Thinking about “on edge” → it leans on “group settings” and “skipping team lunches.”
Thinking about “worse” → it leans on “past month” and “since the promotion.”
Seeing “denies” → it leans on “full-blown panic attacks.”
Everything is still visible.
But the model stops treating everything as equally important.
That’s all “attention” really is here:
Noticing what matters, and quietly turning up the volume on those parts.
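If you want to see what that "volume knob" looks like under the hood, here is a minimal sketch in Python with PyTorch. It is not the full mechanism from a real model (those add learned query/key/value projections, multiple heads, and much bigger vectors); the word list and vector size here are made up purely for illustration.

```python
import torch
import torch.nn.functional as F

# Toy "sentence": 5 words, each represented by a 4-dimensional vector.
# (In a real model these vectors come from learned embeddings.)
torch.manual_seed(0)
words = ["feeling", "on", "edge", "in", "groups"]
x = torch.randn(len(words), 4)

# How related is each pair of words? (scaled dot products)
scores = x @ x.T / (x.shape[-1] ** 0.5)

# Turn scores into "highlighter strength": each row sums to 1.
weights = F.softmax(scores, dim=-1)

# Each word becomes a weighted mix of all the words it leaned on.
focused = weights @ x

print(weights)   # row i = how much word i attends to every other word
```

Each row of `weights` says how strongly one word leans on every other word; the related neighbours end up with bigger numbers, which is the "stronger glow" described above.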
How this is different from LSTMs
If you like visual metaphors:
LSTM: one long page of notes, and you keep writing at the bottom. You try to remember what mattered at the top.
Attention: the same page of notes, but now you can instantly scan the whole page and highlight whatever is relevant at this moment.
LSTMs helped with remembering.
Attention helps with finding and focusing.
Modern models still need both skills, but attention is what lets them handle much longer, messier text without getting lost.
The Famous Paper: “Attention Is All You Need”
In 2017, a group of researchers turned this highlighter idea into a full model design, called a Transformer.
The paper’s title was:
“Attention Is All You Need”
You can think of a Transformer as a stack of “focus blocks”:
Each block looks over the text and decides what to highlight.
Then it thinks a bit about what it just focused on.
Then it passes that forward to the next block, which does the same thing at a deeper level.
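As a rough sketch of what one of those "focus blocks" does, here is a toy version in PyTorch: an attention step (the highlighting) followed by a small feed-forward network (the "thinking a bit"), stacked a few times. The sizes, the single attention head, and the missing details (masking, dropout, positional information) are simplifications for illustration, not how production Transformers are configured.

```python
import torch
import torch.nn as nn


class FocusBlock(nn.Module):
    """One "focus block": highlight, then think about what was highlighted."""

    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        attended, _ = self.attn(x, x, x)   # every word looks at every other word
        x = self.norm1(x + attended)       # keep the original plus what was highlighted
        x = self.norm2(x + self.ff(x))     # "think a bit" about the focused view
        return x


# A Transformer body is essentially these blocks stacked on top of each other.
blocks = nn.Sequential(*[FocusBlock(dim=16) for _ in range(4)])
out = blocks(torch.randn(1, 10, 16))       # 1 sentence, 10 tokens, 16-dim vectors
print(out.shape)                           # torch.Size([1, 10, 16])
```

Each block hands its focused view to the next one, which is the "deeper level" of highlighting described in the list above.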
This design turned out to be:
Very good at language tasks (like translation, summarizing, etc.)
Very friendly to GPUs, because the whole sentence can be processed in parallel instead of one word at a time, so it could scale up to big models
That’s why almost every modern language model is built on some version of this idea.
From attention to GPT: the giant guessing game
Once this “highlighter” model existed, another group of researchers took it and said:
“Let’s give it a lot of text
and let it play the next-word guessing game for a very long time.”
They called this generative pre-training:
Show the model real text.
Hide the next word.
Ask it to guess.
Tell it if it was wrong.
Repeat this billions of times.
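Here is a deliberately tiny sketch of that guess-the-next-word loop in PyTorch. The toy sentence, vocabulary, and model (just an embedding plus one linear layer, with no attention blocks in between) are stand-ins for illustration; real generative pre-training runs the same loop over billions of tokens with a full Transformer in the middle.

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary; real pre-training uses billions of tokens.
text = "feeling on edge in group settings and skipping team lunches".split()
vocab = {w: i for i, w in enumerate(sorted(set(text)))}
ids = torch.tensor([vocab[w] for w in text])

# A tiny stand-in "language model": look at the current word, guess the next.
model = nn.Sequential(nn.Embedding(len(vocab), 32), nn.Linear(32, len(vocab)))
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):                  # "repeat this billions of times" in real training
    inputs, targets = ids[:-1], ids[1:]  # show the text, hide the next word
    logits = model(inputs)               # the model's guesses
    loss = loss_fn(logits, targets)      # tell it how wrong it was
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # nudge the weights so it guesses better next time

print(f"final loss: {loss.item():.3f}")
```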
The result is a Generative Pre-trained Transformer—or GPT:
Transformer → the attention-based “focus block” design
Pre-trained → it learned from a mountain of text before you ever typed anything
Generative → it can now write new text, not just label things
When you ask it to summarize, rewrite, or brainstorm, it’s basically doing:
“Given everything I’ve read in this conversation and everything I practiced on,
what’s a reasonable next sentence or phrase?”
Attention is the part that lets it look back over what you gave it and actually use the important bits, not just the last one or two lines.
Why this matters in day-to-day use
Because of attention, models can now:
Handle longer text
They can keep track of things said at the beginning, not just whatever is most recent.
Pick out the key themes
Repeated patterns (like "feeling like a burden" or "since the promotion") can get highlighted across multiple paragraphs.
Ignore a lot of boilerplate
Template sections and repeated phrases can be downplayed so the unique parts stand out.
Summarize without losing the point
Because the model is literally focusing on the important pieces, the "short version" has a better chance of capturing what matters.
LSTMs took us from “goldfish memory” to “slightly better memory.”
Attention takes us from “linear note-taker” to “someone who can flip through the whole notebook and quickly spot what’s important.”
LSTMs taught AI to remember more of what it heard.
Attention taught it to notice and highlight the most important parts—
and that little focusing trick is what made today’s GPT-style models feel much more present and useful. ⚡
References
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
PDF: https://www.bioinf.jku.at/publications/older/2604.pdf
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, 30 (NeurIPS 2017).
arXiv: https://arxiv.org/abs/1706.03762
NeurIPS version: https://papers.nips.cc/paper/7181-attention-is-all-you-need
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report.
PDF: https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., et al. (2020). Language models are few-shot learners. In Advances in Neural Information Processing Systems, 33 (NeurIPS 2020).
arXiv: https://arxiv.org/abs/2005.14165
NeurIPS PDF: https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

