Computer Vision Fundamentals

Computer vision teaches machines to extract meaning from images and video. Classical CV used hand-engineered features (edges, corners, SIFT). Modern CV is dominated by deep learning, particularly CNNs and increasingly transformers.

The shift from hand-engineered to learned features in 2012 (AlexNet) is the defining moment of modern CV.

Image representation

Images are arrays of pixels. A grayscale image is a 2D array; a color image is 3D (height × width × channels, usually RGB).

Common formats:

- JPEG: lossy, small files

- PNG: lossless, larger

- WebP: modern, good compression

- RAW: sensor data, large

For ML: typically loaded as float arrays, normalized (often to [0,1] or standardized to mean 0).

Classical techniques

Edge detection

Sobel, Canny — find intensity gradients. Still used for preprocessing.

Feature descriptors

SIFT, SURF, ORB — detect keypoints invariant to rotation and scale.

Used for:

- Image stitching (panoramas)

- Object matching

- SLAM (Simultaneous Localization and Mapping)

Image segmentation

Watershed, k-means clustering — partition images into regions.

Optical flow

Track pixel movement across video frames. Lucas-Kanade and Farnebäck are classics.

These techniques still matter for low-power, real-time, or interpretable systems.

Convolutional Neural Networks (CNNs)

Inspired by biological vision. Convolutional layers learn local features; pooling layers reduce dimension; fully-connected layers classify.

Key architectures:

- **LeNet** (1998): the prototype

- **AlexNet** (2012): the breakthrough — ImageNet error dropped dramatically

- **VGG** (2014): deeper, simpler

- **ResNet** (2015): skip connections enable very deep networks

- **EfficientNet** (2019): compound scaling for accuracy/efficiency

- **ConvNeXt** (2022): modern CNN competitive with transformers

ResNet remains a strong default for image classification.

Vision Transformers (ViTs)

Treat image as sequence of patches; apply transformer architecture.

ViTs need more data than CNNs to train from scratch but excel with pretraining.

Hybrid CNN-transformer architectures often work best in practice.

Common tasks

Image classification

Single label per image. The benchmark task; ImageNet is the canonical dataset.

Object detection

Find and classify multiple objects per image. Bounding boxes + labels.

Architectures:

- **YOLO** (You Only Look Once): real-time, fast

- **Faster R-CNN**: two-stage, accurate

- **DETR**: transformer-based, end-to-end

Semantic segmentation

Pixel-level classification. Each pixel gets a class label.

Architectures: U-Net, DeepLab, Mask R-CNN.

Instance segmentation

Like semantic but distinguishes individual objects.

Pose estimation

Find keypoints (joints, landmarks). Used for human pose, hand tracking, faces.

Image generation

GANs, diffusion models generate novel images. Stable Diffusion, DALL-E.

Pretraining and transfer learning

Most practical CV uses pretrained models:

- Take a model trained on ImageNet (or larger)

- Fine-tune on your task

This dramatically reduces data requirements. With 1000 examples, fine-tuning a pretrained model often beats training from scratch on millions.

Data augmentation

Synthetically increase training data:

- Rotation, scaling, cropping

- Color jitter, brightness changes

- Mixup, CutMix (combining images)

- Random erasing

Aggressive augmentation helps with limited data.

Deployment considerations

Inference speed

CNNs run efficiently on GPUs. For mobile/edge:

- Quantization (int8, int4)

- Pruning

- Distillation

- Architecture choices (MobileNet, EfficientNet)

Latency budgets

Real-time CV needs:

- 30+ FPS for video (33ms per frame)

- Sub-100ms for interactive applications

Memory

Model weights + activation memory. Affects where the model can run.

Common failure patterns

Distribution shift

Models trained on ImageNet may fail on rotated, low-light, or domain-specific images.

Adversarial examples

Tiny pixel perturbations can fool models. Robustness research is ongoing.

Bias in training data

Models reflect biases in data. Face recognition has had documented racial bias issues.

Overfitting on small datasets

Without enough data or augmentation, deep networks memorize training set.

Confusing classification confidence with calibration

Softmax outputs aren't well-calibrated probabilities by default.

Practical workflow

1. Define the task precisely

2. Gather and label data (often the hardest part)

3. Start with a pretrained model

4. Fine-tune with appropriate data augmentation

5. Evaluate on a held-out test set

6. Iterate on data, not just model

7. Profile for deployment constraints

The model is rarely the bottleneck. Data quality and quantity usually matter more.

Where CV is going

- Foundation models (CLIP, SAM) for general-purpose visual understanding

- Multimodal models (text + image)

- Video understanding (longer context)

- 3D understanding from 2D images

- Edge deployment improvements

Further Reading

- [TransformerArchitecture](TransformerArchitecture) — ViT foundation

- [MlModelDeployment](MlModelDeployment) — Getting models to production

- [InferenceServing](InferenceServing) — Serving infrastructure

- [ML Hub](MLHub) — Cluster index