NLP Overview

Natural language processing is teaching computers to understand and generate human language. The field has gone through dramatic shifts, with the latest (LLMs) reshaping what's possible.

This page surveys the field.

Historical eras

Rule-based (1950s-1990s)

Hand-written rules for syntax and semantics. Brittle but interpretable. Worked for limited domains.

Statistical (1990s-2010s)

Hidden Markov Models, Conditional Random Fields, Support Vector Machines. Required hand-engineered features.

Machine translation, named entity recognition, sentiment analysis worked acceptably.

Word embeddings (2013+)

word2vec, GloVe — dense vector representations. Captured semantic similarity.

Enabled deep learning to flourish in NLP.

Sequence models (2014+)

RNNs, LSTMs, GRUs handled sequences. Became standard for many NLP tasks.

Transformers (2017+)

"Attention is All You Need" introduced transformers. Replaced RNNs by 2019.

Architecture choice for nearly all modern NLP.

Pretrained transformers (2018+)

BERT, GPT-2 — pretrain on huge text corpora, fine-tune for tasks.

Set new state-of-the-art across NLP tasks.

Large language models (2020+)

GPT-3, GPT-4, Claude, Gemini. Few-shot and zero-shot capabilities.

Reshaped how NLP problems are approached.

Major tasks

Text classification

Assign categories to text:

Spam detection
Sentiment analysis
Topic classification
Intent recognition

Approaches:

TF-IDF + logistic regression (still a strong baseline)
Fine-tuned BERT
LLM zero-shot

Named entity recognition (NER)

Find entity mentions: people, places, organizations, dates, etc.

Pretrained models (spaCy, BERT-NER) are strong out of the box.

Information extraction

Extract structured information from text.

Often combines NER with relation extraction.

LLMs handle this well with prompting.

Sequence labeling

Per-token labels: part-of-speech, chunking, dependency parsing.

Mostly solved problems.

Coreference resolution

Determine which expressions refer to the same entity. ("Alice... she...")

Hard; getting better with modern models.

Machine translation

Translate between languages.

Transformer-based (Google Translate, DeepL) dominates. LLMs are competitive.

Summarization

Condense long text into shorter form.

Extractive: pick important sentences. Abstractive: generate new text.

LLMs handle this naturally.

Question answering

Extractive: find span in document that answers question. Generative: produce free-form answer.

Modern: retrieval + LLM (RAG).

Dialogue / conversation

Multi-turn interaction. Once specialized; now general LLM capability.

Text generation

Produce text from prompt or condition.

LLMs dominate.

Search and retrieval

Find relevant documents for a query.

Approaches:

BM25 (classical, still excellent)
Dense retrieval (sentence-transformers)
Hybrid (best in practice)
Cross-encoders for reranking

Speech recognition and synthesis are NLP-adjacent. Now end-to-end deep learning.

Building blocks

Tokenization

Splitting text into units. Modern: subword (BPE, WordPiece, SentencePiece).

Different tokenizers give different segmentations. Match the tokenizer to the model.

Embeddings

Dense vectors for tokens, words, sentences, or documents.

Semantic similarity = vector similarity (cosine).

Attention

Weighted combination of representations. Allows the model to focus on relevant parts.

The breakthrough enabling transformers.

Pretraining

Train on huge unlabeled text. Predict masked tokens (BERT) or next tokens (GPT).

Captures language patterns; transferable to many tasks.

Fine-tuning

Adapt pretrained model to a specific task with labeled data.

Less data needed than training from scratch.

Current landscape

Open models

Llama (Meta): foundation models in many sizes
Mistral / Mixtral: strong open weights
Qwen, Gemma, Phi: alternative families
Specialized: code (Codestral), embedding (BGE, E5)

Quality has caught up to closed models for many use cases.

Closed APIs

GPT-4, GPT-4o (OpenAI)
Claude (Anthropic)
Gemini (Google)

Best raw quality; pay per use; no data privacy guarantees by default.

Embeddings

OpenAI ada, text-embedding-3-*
Cohere Embed
Open source: BGE, E5, sentence-transformers

For most retrieval, modern open embeddings are competitive.

Practical patterns

RAG (Retrieval-Augmented Generation)

Retrieve relevant docs; pass to LLM. Standard for question answering on private data.

Few-shot prompting

Show examples in the prompt; LLM generalizes. Powerful for prototyping.

Fine-tuning small model

For high-volume tasks: fine-tune small specialized model rather than calling large API per request. Cheaper, faster.

Distillation from LLM

Use LLM to label training data; train smaller model.

Structured output

For information extraction: ask LLM for JSON; parse.

Tools like Outlines, Instructor help reliability.

Agents

LLMs that can call tools. Useful for complex multi-step workflows.

Common pitfalls

Tokenization mismatches

Subtle bugs from different tokenizers in training vs inference.

Hallucinations

LLMs generate plausible falsehoods. Always verify factual claims.

Prompt injection

User input that overrides instructions. Major security concern for production.

Distribution shift

Models trained on general English may fail on specialized text (medical, legal, code).

Underestimating data work

NLP success often depends on data quality more than model choice.

Overusing LLMs

LLMs are expensive. For high-volume simple tasks, smaller models work.

Evaluation

Metrics depend on task:

Classification: accuracy, F1
Generation: BLEU, ROUGE, METEOR (imperfect)
Retrieval: NDCG, recall@k
End-to-end: human evaluation

Don't trust automated generation metrics. Always sample outputs and read.

Where NLP is going

Multimodal models (text + image + audio)
Longer context (millions of tokens)
Better calibration and reduced hallucinations
Better small models
Domain specialization
Tool use / agents

NLP Overview

Historical eras

Rule-based (1950s-1990s)

Statistical (1990s-2010s)

Word embeddings (2013+)

Sequence models (2014+)

Transformers (2017+)

Pretrained transformers (2018+)

Large language models (2020+)

Major tasks

Text classification

Named entity recognition (NER)

Information extraction

Sequence labeling

Coreference resolution

Machine translation

Summarization

Question answering

Dialogue / conversation

Text generation

Search and retrieval

Speech (related)

Building blocks

Tokenization

Embeddings

Attention

Pretraining

Fine-tuning

Current landscape

Open models

Closed APIs

Embeddings

Practical patterns

RAG (Retrieval-Augmented Generation)

Few-shot prompting

Fine-tuning small model

Distillation from LLM

Structured output

Agents

Common pitfalls

Tokenization mismatches

Hallucinations

Prompt injection

Distribution shift

Underestimating data work

Overusing LLMs

Evaluation

Where NLP is going

Further Reading