NLP Overview
Natural language processing (NLP) is the practice of teaching computers to understand and generate human language. The field has gone through several dramatic shifts; the latest, large language models (LLMs), has reshaped what is possible.
This page surveys the field.
Historical eras
Rule-based (1950s-1990s)
Hand-written rules for syntax and semantics. Brittle but interpretable. Worked for limited domains.
Statistical (1990s-2010s)
Hidden Markov Models, Conditional Random Fields, Support Vector Machines. Required hand-engineered features.
Machine translation, named entity recognition, and sentiment analysis reached acceptable quality.
Word embeddings (2013+)
word2vec, GloVe — dense vector representations. Captured semantic similarity.
Enabled deep learning to flourish in NLP.
Sequence models (2014+)
RNNs, LSTMs, GRUs handled sequences. Became standard for many NLP tasks.
Transformers (2017+)
The paper "Attention Is All You Need" (Vaswani et al., 2017) introduced the transformer. Transformers had largely displaced RNNs by around 2019.
Now the default architecture for nearly all modern NLP.
Pretrained transformers (2018+)
BERT, GPT-2 — pretrain on huge text corpora, fine-tune for tasks.
Set new state-of-the-art across NLP tasks.
Large language models (2020+)
GPT-3, GPT-4, Claude, Gemini. Few-shot and zero-shot capabilities.
Reshaped how NLP problems are approached.
Major tasks
Text classification
Assign categories to text:
- Spam detection
- Sentiment analysis
- Topic classification
- Intent recognition
Approaches:
- TF-IDF + logistic regression (still a strong baseline)
- Fine-tuned BERT
- LLM zero-shot
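To make the first baseline concrete, here is a minimal standard-library sketch of TF-IDF features feeding a binary logistic regression. The toy spam data, the function names, and the hyperparameters (`lr`, `epochs`) are all illustrative assumptions; in practice a library such as scikit-learn would normally do this work.

```python
import math
from collections import Counter


def tokenize(text):
    # Whitespace tokenization is a simplification for this sketch.
    return text.lower().split()


def fit_tfidf(token_docs):
    """Learn a vocabulary index and smoothed IDF weights from tokenized docs."""
    n = len(token_docs)
    df = Counter(tok for doc in token_docs for tok in set(doc))
    vocab = {tok: i for i, tok in enumerate(sorted(df))}
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = [math.log((1 + n) / (1 + df[tok])) + 1.0 for tok in sorted(df)]
    return vocab, idf


def tfidf_vector(tokens, vocab, idf):
    """L2-normalized TF-IDF vector; tokens outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = count * idf[vocab[tok]]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, lr=0.5, epochs=200):
    """Binary logistic regression trained with batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


# Hypothetical toy training set: 1 = spam, 0 = not spam.
train = [
    ("win a free prize now", 1),
    ("free money claim your prize", 1),
    ("limited offer win cash now", 1),
    ("meeting moved to tuesday", 0),
    ("please review the attached report", 0),
    ("lunch on tuesday after the meeting", 0),
]
token_docs = [tokenize(text) for text, _ in train]
vocab, idf = fit_tfidf(token_docs)
X = [tfidf_vector(doc, vocab, idf) for doc in token_docs]
y = [label for _, label in train]
w, b = train_logreg(X, y)


def spam_probability(text):
    x = tfidf_vector(tokenize(text), vocab, idf)
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


print(spam_probability("claim your free prize"))
print(spam_probability("review the meeting report"))
```

A baseline like this trains in milliseconds and gives a reference score before reaching for BERT or an LLM.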
Named entity recognition (NER)
Find entity mentions: people, places, organizations, dates, etc.
Pretrained models (spaCy, BERT-NER) are strong out of the box.
Information extraction
Extract structured information from text.
Often combines NER with relation extraction.
LLMs handle this well with prompting.
Sequence labeling
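Per-token labels: part-of-speech tagging, chunking. Dependency parsing is closely related, though it predicts a tree over the tokens rather than one independent tag per token.
Largely solved for well-resourced languages and in-domain text.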
Coreference resolution
Determine which expressions refer to the same entity. ("Alice... she...")
Hard; getting better with modern models.
Machine translation
Translate between languages.
Transformer-based systems (Google Translate, DeepL) dominate. LLMs are competitive.
Summarization
Condense long text into shorter form.
Extractive: pick important sentences.
Abstractive: generate new text.
LLMs handle this naturally.
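The extractive approach can be sketched with a simple word-frequency scorer: sentences whose content words are frequent in the document are treated as important. This is one classical heuristic, not the only extractive method; the stopword list, the regex sentence splitter, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real systems use a much larger one.
STOPWORDS = {"the", "a", "an", "in", "on", "of", "to", "and", "is", "was",
             "that", "most", "it", "for"}


def split_sentences(text):
    # Naive split on whitespace that follows sentence-ending punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarize(text, k=2):
    """Return the k highest-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average document frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(range(len(sentences)),
                 key=lambda i: score(sentences[i]), reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


text = (
    "Transformers replaced recurrent networks in most NLP systems. "
    "The weather was pleasant that day. "
    "Transformers use attention to relate tokens. "
    "Attention lets transformers process tokens in parallel."
)
for sentence in summarize(text, k=2):
    print(sentence)
```

Abstractive summarization has no comparably small sketch, since it requires a generative model.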
Question answering
Extractive: find span in document that answers question.
Generative: produce free-form answer.
Modern: retrieval + LLM (RAG).
Dialogue / conversation
Multi-turn interaction. Once specialized; now general LLM capability.
Text generation
Produce text from prompt or condition.
LLMs dominate.
Search and retrieval
Find relevant documents for a query.
Approaches:
- BM25 (classical, still excellent)
- Dense retrieval (sentence-transformers)
- Hybrid (best in practice)
- Cross-encoders for reranking
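As a rough illustration of the classical and hybrid approaches listed above, the sketch below scores documents with Okapi BM25 and then fuses the BM25 ranking with a second ranking using reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here for brevity. The `dense_rank` list is a hard-coded stand-in for the output of an embedding model, and the parameter defaults (`k1`, `b`, `k`) are commonly used values, not tuned ones.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            # Non-negative IDF variant: ln(1 + (n - df + 0.5) / (df + 0.5))
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; ids ranked high in several lists win."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


docs = [
    "bm25 is a classical ranking function",
    "dense retrieval uses embeddings",
    "hybrid search combines bm25 and dense retrieval",
]
tokenized = [d.split() for d in docs]
scores = bm25_scores("bm25 ranking".split(), tokenized)
bm25_rank = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)

# Hypothetical ranking from a dense retriever, hard-coded for the example.
dense_rank = [2, 1, 0]
fused = reciprocal_rank_fusion([bm25_rank, dense_rank])
print(bm25_rank, fused)
```

RRF only needs ranks, not scores, so it sidesteps the problem of BM25 and cosine scores living on different scales.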
Speech (related)
Speech recognition and synthesis are NLP-adjacent. Both are now dominated by end-to-end deep learning.
Building blocks
Tokenization
Splitting text into units. Modern: subword (BPE, WordPiece, SentencePiece).
Different tokenizers give different segmentations. Match the tokenizer to the model.
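To show why segmentations differ between tokenizers, here is a toy byte-pair-encoding (BPE) learner in the style of the original subword algorithm. It works on characters rather than bytes, uses a four-word corpus, and breaks ties arbitrarily, so it is a teaching sketch and not how any production tokenizer is implemented. All names are illustrative.

```python
from collections import Counter

END = "</w>"  # end-of-word marker appended to every word


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Greedily learn up to num_merges merges, most frequent pair first."""
    vocab = Counter(tuple(w) + (END,) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            merged[merge_pair(symbols, best)] += freq
        vocab = merged
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to split a word into subword units."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=10)

# The same word segments differently under a smaller merge table.
print(segment("lowest", merges[:3]))
print(segment("lowest", merges))
```

Because the merge table is learned from a corpus, two tokenizers trained on different data (or with different merge budgets) will split the same string differently, which is why the tokenizer must match the model.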
Embeddings
Dense vectors for tokens, words, sentences, or documents.
Semantic similarity is approximated by vector similarity, most often cosine similarity.
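Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch follows; the three-dimensional vectors are made up for illustration, while real embeddings typically have hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, invented for this example.
cat = [0.9, 0.1, 0.3]
kitten = [0.8, 0.2, 0.35]
invoice = [0.1, 0.9, 0.0]

print(cosine_similarity(cat, kitten))
print(cosine_similarity(cat, invoice))
```

Cosine ignores vector length, so it compares direction only; many embedding models are trained with that in mind.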
Attention
Weighted combination of representations. Allows the model to focus on relevant parts.
The core mechanism of transformers, which rely on it in place of recurrence.
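The "weighted combination" can be written out as scaled dot-product attention: each query is compared with every key, the scores are turned into weights with a softmax, and the output is the weighted sum of the values. The sketch below uses plain Python lists and a single head with no learned projections, which is a simplification; the example vectors are arbitrary.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns (outputs, weights per query)."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
queries = [[4.0, -4.0]]  # points toward the first key, away from the second

outputs, weights = attention(queries, keys, values)
print(weights[0])
print(outputs[0])
```

The output is pulled toward the value whose key best matches the query, which is the sense in which the model "focuses" on relevant positions.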
Pretraining
Train on huge unlabeled text. Predict masked tokens (BERT) or next tokens (GPT).
Captures language patterns; transferable to many tasks.
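The two objectives differ mainly in how training examples are built from raw text. The sketch below constructs both kinds from a token list. It is deliberately simplified: the masking replaces every selected token with `[MASK]`, whereas BERT's published recipe sometimes keeps or randomizes the selected token, and the `mask_prob` value and function names are assumptions.

```python
import random

MASK = "[MASK]"


def masked_lm_example(tokens, mask_prob=0.15, rng=None):
    """BERT-style: hide some tokens; the label is the hidden token, else None."""
    rng = rng or random.Random(0)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)
        else:
            inputs.append(tok)
            labels.append(None)  # no loss is computed at this position
    return inputs, labels


def next_token_examples(tokens):
    """GPT-style: every prefix of the sequence predicts the token after it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()

inputs, labels = masked_lm_example(tokens, mask_prob=0.3)
print(inputs, labels)

for context, target in next_token_examples(tokens):
    print(context, "->", target)
```

Neither objective needs human labels: the text itself supplies the targets, which is what makes training on huge unlabeled corpora possible.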
Fine-tuning
Adapt pretrained model to a specific task with labeled data.
Less data needed than training from scratch.
Current landscape
Open models
- Llama (Meta): foundation models in many sizes
- Mistral / Mixtral: strong open weights
- Qwen, Gemma, Phi: alternative families
- Specialized: code (Codestral), embedding (BGE, E5)
Quality has caught up to closed models for many use cases.
Closed APIs
- GPT-4, GPT-4o (OpenAI)
- Claude (Anthropic)
- Gemini (Google)
Often the best raw quality; pay per use. Data is sent to a third party, so check each provider's retention and training-use terms.
Embeddings
- OpenAI: text-embedding-ada-002, text-embedding-3-small, text-embedding-3-large
- Cohere Embed
- Open source: BGE, E5, sentence-transformers
For most retrieval, modern open embeddings are competitive.
Practical patterns
RAG (Retrieval-Augmented Generation)
Retrieve relevant docs; pass to LLM. Standard for question answering on private data.
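The retrieve-then-generate flow can be sketched in a few lines. Everything here is a stand-in: `retrieve` ranks by simple word overlap where a real system would use BM25 or dense retrieval, and `llm` is a hypothetical callable that takes a prompt string and returns text, not any particular vendor's API.

```python
import re


def words(text):
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query, docs, k=2):
    """Rank docs by word overlap with the query (placeholder for a real retriever)."""
    q = words(query)
    return sorted(docs, key=lambda d: len(q & words(d)), reverse=True)[:k]


def build_prompt(query, passages):
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


def answer(query, docs, llm):
    # `llm` is an assumed callable: prompt string in, completion string out.
    return llm(build_prompt(query, retrieve(query, docs)))


docs = [
    "The refund window is 30 days from delivery.",
    "Support is open on weekdays.",
    "Shipping takes 3 to 5 business days.",
]
query = "What is the refund window?"
print(build_prompt(query, retrieve(query, docs)))
```

Numbering the passages in the prompt makes it possible to ask the model to cite which passage supports its answer.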
Few-shot prompting
Show examples in the prompt; LLM generalizes. Powerful for prototyping.
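A few-shot prompt is just labeled examples followed by the unlabeled input in the same layout. The sketch below assembles one; the `Text:` / `Label:` layout and the function name are arbitrary choices for illustration, and the resulting string would be sent to whichever LLM is in use.

```python
def few_shot_prompt(instruction, examples, query):
    """Build a prompt from an instruction, (text, label) pairs, and a new input."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Text: {text}", f"Label: {label}", ""]
    # The final item repeats the layout but leaves the label for the model.
    lines += [f"Text: {query}", "Label:"]
    return "\n".join(lines)


examples = [
    ("The battery died after a week.", "negative"),
    ("Setup took two minutes and it just works.", "positive"),
]
prompt = few_shot_prompt(
    "Classify the sentiment of each text as positive or negative.",
    examples,
    "Great screen, terrible speakers.",
)
print(prompt)
```

Keeping the example format identical to the final item matters, since the model continues the pattern it sees.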
Fine-tuning small model
For high-volume tasks, fine-tune a small specialized model rather than calling a large API per request. Cheaper and faster at scale.
Distillation from LLM
Use LLM to label training data; train smaller model.
Structured output
For information extraction: ask the LLM for JSON, then parse and validate it.
Tools like Outlines, Instructor help reliability.
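A hand-rolled version of the parse step might look like the sketch below: pull the JSON object out of the reply, check required fields, and retry with the error message if validation fails. This does not use the Outlines or Instructor APIs; the `REQUIRED` schema, the function names, and the `llm` callable (prompt in, text out) are assumptions for illustration.

```python
import json

# Hypothetical schema: field name -> expected Python type.
REQUIRED = {"name": str, "date": str}


def parse_extraction(raw):
    """Parse an LLM reply expected to contain a JSON object; return (data, error)."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        return None, "no JSON object found"
    try:
        data = json.loads(raw[start:end + 1])
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    for field, expected_type in REQUIRED.items():
        if not isinstance(data.get(field), expected_type):
            return None, f"missing or mistyped field: {field}"
    return data, None


def extract_with_retry(llm, prompt, max_attempts=3):
    """Call the model until its reply validates, feeding the error back each time."""
    for _ in range(max_attempts):
        data, error = parse_extraction(llm(prompt))
        if error is None:
            return data
        prompt = (f"{prompt}\n\nYour previous reply was rejected ({error}). "
                  "Reply with JSON only.")
    raise ValueError("no valid JSON after retries")


print(parse_extraction('Sure! {"name": "Alice", "date": "2024-01-05"}'))
print(parse_extraction('{"name": "Alice"}'))
```

Tolerating text around the JSON object handles the common case where the model adds a preamble; constrained-decoding tools avoid the problem at the source.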
Agents
LLMs that can call tools. Useful for complex multi-step workflows.
Common pitfalls
Tokenization mismatches
Subtle bugs from different tokenizers in training vs inference.
Hallucinations
LLMs generate plausible falsehoods. Always verify factual claims.
Prompt injection
Untrusted input (user messages, retrieved documents, tool output) that overrides the system's instructions. Major security concern for production.
Distribution shift
Models trained on general English may fail on specialized text (medical, legal, code).
Underestimating data work
NLP success often depends on data quality more than model choice.
Overusing LLMs
LLM calls are expensive at scale. For high-volume simple tasks, smaller models are often sufficient.
Evaluation
Metrics depend on task:
- **Classification**: accuracy, F1
- **Generation**: BLEU, ROUGE, METEOR (imperfect)
- **Retrieval**: NDCG, recall@k
- **End-to-end**: human evaluation
Don't rely on automated generation metrics alone. Always sample outputs and read them.
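For reference, the classification and retrieval metrics named above are short enough to write out. The sketch below implements precision, recall, and F1 for one positive class, recall@k, and NDCG@k with the common log2 discount; the function names and the toy inputs are illustrative, and generation metrics such as BLEU and ROUGE are not covered here.

```python
import math


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for a single positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids) if relevant_ids else 0.0


def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k; `gains` maps doc id to graded relevance, missing ids count as 0."""
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs))

    actual = dcg([gains.get(d, 0) for d in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
print(recall_at_k(["a", "b", "c"], {"a", "c"}, k=2))
print(ndcg_at_k(["b", "a"], {"a": 3, "b": 1}, k=2))
```

Accuracy alone can mislead on imbalanced classes, which is why F1 is usually reported alongside it.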
Where NLP is going
- Multimodal models (text + image + audio)
- Longer context (millions of tokens)
- Better calibration and reduced hallucinations
- Better small models
- Domain specialization
- Tool use / agents
Further Reading
- [DataScienceNLP](DataScienceNLP) — Practical NLP for data science
- [TransformerArchitecture](TransformerArchitecture) — Modern foundation
- [SentimentAnalysisWithMachineLearning](SentimentAnalysisWithMachineLearning) — A specific task
- [ML Hub](MLHub) — Cluster index