Data Science NLP

Most data science work touches text eventually — customer feedback, support tickets, descriptions, logs. NLP turns text into features or predictions.

This page covers the practical pipeline from raw text to models.

The text-data pipeline

1. Collect/load text

2. Clean and normalize

3. Tokenize

4. Convert to numerical representation

5. Apply model

6. Evaluate

Each step has choices that affect downstream quality.

Cleaning

Text data is messy:

- HTML/XML tags

- URLs, email addresses

- Unicode oddities

- Misspellings

- Emojis

- Multiple languages

Decisions:

- Strip or preserve HTML?

- Lowercase or preserve case?

- Remove punctuation or keep?

- Handle emojis as features or noise?

Context-dependent. Sentiment analysis on tweets cares about emojis; legal document analysis doesn't.

Tokenization

Splitting text into units (tokens).

Word tokenization

Split on whitespace, handle punctuation. Simple but breaks for languages without word boundaries.

Subword tokenization

BPE, WordPiece, SentencePiece — break words into subword units.

Used by all modern transformers. Handles rare words and morphology.

Character tokenization

Each character is a token. Maximum vocabulary efficiency, longer sequences.

Use the tokenizer matching your model. Don't custom-tokenize for pretrained models.

Stop words and stemming

Classical preprocessing:

- **Stop words**: remove common words ("the", "is")

- **Stemming**: reduce to root form ("running" → "run")

- **Lemmatization**: smarter version using grammar

For modern transformer models: skip these. The model handles them.

For classical models (TF-IDF + logistic regression): may help.

Numerical representations

Bag of words

Each document is a vector of word counts. Simple, interpretable, sparse.

TF-IDF

Word counts weighted by inverse document frequency. Common words get lower weight.

Strong baseline for many text tasks.

Word embeddings

Dense vectors per word: word2vec, GloVe, fastText.

Words with similar meanings have similar vectors.

Pre-deep-learning innovation; still useful.

Contextual embeddings

BERT, RoBERTa, etc. Embeddings depend on context.

"Bank" in "river bank" vs "bank account" gets different vectors.

Sentence/document embeddings

Sentence-transformers, OpenAI ada-002 — turn whole text into single vector.

Useful for similarity, clustering, retrieval.

Modern approach

For most tasks: use a pretrained transformer's embedding layer. Sentence-transformers for sentence-level work.

Common NLP tasks

Classification

Sentiment, topic, intent, spam.

Modern: fine-tune a transformer on labeled data. Or use embeddings + classifier.

Named entity recognition (NER)

Find names, places, organizations, dates.

Pretrained models (spaCy, BERT-based) work well out of the box.

Sequence labeling

Part-of-speech tagging, chunking, parsing.

Mostly solved problems with pretrained models.

Information extraction

Extract structured info from text. Often combines NER + relation extraction.

Topic modeling

Discover themes in document collections.

LDA (classical), BERTopic (modern with embeddings).

Summarization

Abstract or extract summaries.

LLMs handle this well.

Question answering

Extract answers from documents (extractive) or generate answers (generative).

Search/retrieval

Find relevant documents for a query.

BM25 (classical), dense retrieval (modern), hybrid (best in practice).

Practical workflow

Start simple

TF-IDF + logistic regression is a baseline that works surprisingly well. Establish before going complex.

Use pretrained when possible

Fine-tuning a pretrained model beats training from scratch with limited data.

Embeddings + classifier

For many tasks: pre-compute embeddings, train a classifier on top.

Cheaper than fine-tuning. Often comparable quality.

Consider LLM for prototyping

For prototyping: zero-shot or few-shot LLM may answer the question.

For production: usually fine-tune a smaller model for cost.

Working with multilingual data

- mBERT, XLM-R: multilingual transformers

- Translation as preprocessing

- Language-specific tokenizers

Be wary of "English-trained model on other language" — quality drops.

Common pitfalls

Tokenizer mismatch

Using one tokenizer in training and another in inference. Subtle, often silent.

Ignoring class imbalance

Spam detection has 99% non-spam. Accuracy 99% by predicting "not spam" always.

Overfitting on small datasets

Aggressive regularization, data augmentation, or smaller models.

Distribution shift

Training on news; deploying on tweets. Doesn't transfer well.

Test contamination

Especially for retrieval evaluation: ensure test queries aren't in training data.

Treating language as solved

NLP works well for English on common domains. Specialized text (medical, legal, code) needs domain adaptation.

Tools

- **NLTK / spaCy**: classical NLP, fast preprocessing

- **scikit-learn**: TF-IDF, classical classifiers

- **Hugging Face Transformers**: pretrained models, fine-tuning

- **Sentence Transformers**: embeddings

- **gensim**: word2vec, topic modeling

- **OpenAI/Anthropic APIs**: zero/few-shot

Evaluation

Beyond accuracy: precision, recall, F1.

For text generation: BLEU, ROUGE, METEOR (imperfect), and human evaluation (gold standard).

For retrieval: NDCG, MRR, recall@k.

Don't trust automated metrics blindly. Sample outputs and read them.

When to use LLMs vs traditional ML

LLMs:

- Few labeled examples

- Complex reasoning required

- Prototyping

- Diverse, open-ended tasks

Traditional ML:

- Lots of labeled data

- Latency-sensitive

- Cost-sensitive at scale

- Simple, focused tasks

Hybrid: use LLMs to label data; train smaller model.

Further Reading

- [NLPOverview](NLPOverview) — Broader NLP context

- [SentimentAnalysisWithMachineLearning](SentimentAnalysisWithMachineLearning) — Sentiment specifically

- [TransformerArchitecture](TransformerArchitecture) — The model architecture

- [ML Hub](MLHub) — Cluster index