Tag: cv
36 topic(s)
- ImageNet Dataset: ImageNet is a large, hierarchically labeled image dataset whose 1000-class ILSVRC benchmark became the defining testbed for modern computer vision. AlexNet's 2012 win on ImageNet triggered the deep learning shift by showing that GPU-trained CNNs could dramatically beat hand-engineered pipelines.
- Convolutional neural network (CNN): A convolutional neural network uses learned convolution filters with local receptive fields and weight sharing to process grid-like data such as images. Those inductive biases make CNNs especially effective and parameter-efficient for visual pattern recognition. (Sketch after the list.)
- Vision Language Model (VLM): A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- Vision Encoder: A vision encoder maps an image into features or tokens that downstream modules can use for classification, retrieval, or generation. CNNs and Vision Transformers are common vision encoders, differing mainly in how they represent spatial structure.
- CLIP (Contrastive Language-Image Pre-training): CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels. (Sketch after the list.)
- He Initialization (Kaiming Initialization): He initialization sets weight variance to roughly 2/fan-in so ReLU-like activations preserve signal magnitude through depth. It improves on Xavier initialization for one-sided activations that zero out about half the inputs. (Sketch after the list.)
- Vision Transformer (ViT): Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern visual representation learners and vision-language models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work. (Sketch after the list.)
- Masked Autoencoder (MAE): A self-supervised ViT pretraining objective: randomly mask 75% of image patches and train an asymmetric encoder–decoder to reconstruct pixel values from the visible 25%. MAE is simple, compute-efficient (the encoder sees only unmasked patches), and produces state-of-the-art ImageNet fine-tuning representations. (Sketch after the list.)
- DINOv2: A self-supervised ViT pretraining recipe from Meta (Oquab et al., 2023) that combines a DINO-style self-distillation objective with an iBOT masked-patch prediction objective and a curated 142M-image dataset. DINOv2 produces general-purpose frozen visual features that outperform task-specific supervised baselines on classification, segmentation, depth, and correspondence.
- SigLIP: A contrastive image–text pretraining method (Zhai et al., 2023) that replaces CLIP's softmax-over-batch contrastive loss with a pairwise sigmoid binary cross-entropy. SigLIP removes the need for large global batches, behaves well across a wide range of batch sizes, and achieves CLIP-level or better zero-shot accuracy at a fraction of the training compute. (Sketch after the list.)
- Diffusion Transformers (DiT): Peebles & Xie (2022) replace the U-Net backbone of latent diffusion with a standard Transformer over VAE-latent patches. DiT scales predictably with compute, matches or exceeds U-Net quality, and is the architectural backbone of Stable Diffusion 3, Sora, and most frontier text-to-image/video diffusion models.
- Stable Diffusion Pipeline: A text-to-image pipeline composed of (i) a VAE that compresses pixels into a latent that is 8× smaller per spatial dimension, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. The three pretrained components are wired together by cross-attention on the text embeddings, with classifier-free guidance applied at inference to strengthen prompt adherence. (Sketch after the list.)
- Contrastive Learning (SimCLR / MoCo): Self-supervised visual representation learning via the InfoNCE loss: pull together two augmented views of the same image (positives) while pushing apart views of all other images (negatives). SimCLR uses in-batch negatives; MoCo uses a queued momentum encoder, enabling large effective negative pools with small batches. (Sketch after the list.)
- BYOL / Self-Distillation: Bootstrap Your Own Latent (Grill et al., 2020) shows that strong visual representations can be learned without negatives: an online network predicts the output of a momentum-updated target network on a different augmentation of the same image. The same template underlies DINO, MoCo-v3, and other non-contrastive SSL methods. (Sketch after the list.)
- JEPA (Joint Embedding Predictive Architecture): LeCun's self-supervised template (2022) that predicts the representation of a target from the representation of a context, rather than predicting the target itself. By regressing in embedding space, JEPA avoids wasting capacity on irrelevant per-pixel detail; I-JEPA and V-JEPA are concrete instantiations for images and video.
- Vision-Language Contrastive Objectives Beyond CLIP: Successors to CLIP refine its symmetric InfoNCE loss for better efficiency, finer-grained alignment, and scaling. SigLIP replaces the softmax with a pairwise sigmoid; LiT freezes a pretrained image tower; ALIGN scales to large, noisy alt-text data; FILIP contrasts at the token level; CoCa adds a captioning head. All share the joint-embedding template.
- Residual Networks (ResNet as Architecture): A residual network replaces a plain layer stack with blocks that learn a residual update \(F(x)\) and add it back to the input, so each block computes \(y = x + F(x)\). This makes very deep CNNs much easier to optimize and triggered the shift from VGG-style stacks to residual architectures. (Sketch after the list.)
- LLaVA: Visual Instruction Tuning: LLaVA (Liu et al., 2023) connects a frozen CLIP vision encoder to an LLM via a small learned projection, pretrains only the projector, then instruction-tunes the projector and LLM on GPT-4-generated multimodal dialogues. The recipe is minimal (a single linear projector, a two-layer MLP in later versions) yet competitive with closed VLMs, establishing the canonical open VLM pattern: (pretrained vision encoder) + (learned bridge) + (pretrained LLM) + (visual instruction data).
- GAN Family: WGAN, StyleGAN, BigGAN: Three architectural and objective milestones: WGAN uses the Kantorovich–Rubinstein dual of \( W_1 \) as a smoother critic objective, StyleGAN introduces AdaIN-controlled style injection for image generation, and BigGAN scales class-conditional GANs to 512×512 with orthogonal regularisation and truncation tricks.
- U-Net Architecture: A fully-convolutional encoder–decoder with symmetric skip connections between contracting and expanding paths. Designed for biomedical segmentation; now the standard backbone of Stable Diffusion and most pixel-to-pixel models because skip connections preserve spatial detail across downsampling.
- Semantic & Instance Segmentation: Semantic segmentation assigns a class label to every pixel; instance segmentation further distinguishes object instances (two cats become two masks). Panoptic segmentation unifies them: one label per pixel with 'thing' vs 'stuff' classes. Backbones: FCN, DeepLab, Mask R-CNN, DETR/Mask2Former.
- Object Detection: R-CNN → Faster R-CNN → DETR: R-CNN ran a CNN classifier on externally-proposed regions; Fast R-CNN shared backbone features across proposals; Faster R-CNN introduced a learned Region Proposal Network; DETR replaced the entire region-proposal pipeline with a transformer that predicts a fixed set of boxes via bipartite matching.
- YOLO Family (v1–v10): Single-stage detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from a single CNN forward pass. YOLOv1 was real-time but coarse; YOLOv3–v10 progressively adopted anchor boxes, FPN, CSP blocks, decoupled heads, and finally anchor-free / NMS-free designs for edge deployment.
- Neural Radiance Fields (NeRF) & 3D Gaussian Splatting: NeRF encodes a 3-D scene as a continuous function \( (x, y, z, \theta, \phi) \to (\text{colour}, \text{density}) \) queried along camera rays and volume-rendered into pixels. 3D Gaussian Splatting replaces the implicit MLP with an explicit set of anisotropic Gaussians rasterised in real time. (Sketch after the list.)
- PointNet & 3D Deep Learning on Point Clouds: PointNet processes an unordered point set by applying a shared MLP to each point, then pooling across points with a symmetric function (max-pool). Permutation-invariant by construction; PointNet++ adds local-region hierarchies to capture geometric structure. (Sketch after the list.)
- Siamese Networks & Metric Learning: Train a shared encoder so that semantically similar inputs map to nearby embeddings (contrastive / triplet loss) or that query-key scores reflect similarity directly. Used in face verification, signature matching, and image retrieval; the conceptual parent of SimCLR, CLIP, and bi-encoder retrieval. (Sketch after the list.)
- Capsule Networks: Hinton's alternative to CNN pooling: neurons are grouped into 'capsules' whose vector output encodes both the existence and the pose of an entity. Dynamic routing by agreement replaces max-pooling: each capsule routes its output to the higher-level capsule whose prediction it agrees with most. Historically significant; practically superseded by transformers.
- Focal Loss & Class-Imbalance Objectives: Focal loss \( \text{FL}(p_t) = -(1-p_t)^\gamma \log p_t \) down-weights the loss contribution of confidently classified easy examples, focusing gradient on hard ones. Designed for extreme foreground-background imbalance in one-stage detection; widely used in segmentation and long-tailed classification. (Sketch after the list.)
- Text-to-Image: DALL-E Lineage & Imagen: Autoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any): Single models that process and generate multiple modalities (text, image, audio, video) through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in the context of speech, audio generation from visual cues, unified embeddings.
- Video Diffusion (Sora, Veo, Gen-3): Extend image-diffusion recipes to video with 3D patch embeddings, temporal attention, and long-context handling. Sora (OpenAI), Veo (Google), and Gen-3 (Runway) train DiT-style transformers over space-time patches of 1–60 second clips, conditioning on rich text captions for controllable generation.
- Text-to-Image Alignment: Text-to-image (T2I) alignment is the task of making generated images faithfully follow textual prompts, covering spatial layout, attribute binding, count, and style. Modern alignment relies on text embeddings from encoders such as CLIP, SigLIP, or T5 injected via cross-attention into a diffusion or flow backbone, plus classifier-free guidance, RLHF-style preference fine-tuning, and reward models that grade prompt adherence.
- Generative Model Evaluation (FID, IS, and their limits): Fréchet Inception Distance (FID) and Inception Score (IS) are the standard automated metrics for image generative models; both rely on Inception-v3 features and have well-known biases. Modern T2I evaluation supplements them with CLIPScore, prompt-adherence benchmarks (T2I-CompBench, GenEval), human-preference Elo (ImageReward, HPS), and likelihood / NLL where applicable.
- Neural Fields / Implicit Neural Representations: A neural field represents a continuous signal with a neural network that maps coordinates to values such as color, density, or signed distance. This makes the model itself a compact continuous representation of an image, shape, or scene, with NeRF as the best-known example.
- Vision Transformer (ViT) Variants: Since the original ViT, a wide family of variants has emerged that improve data efficiency, locality, hierarchy, and pretraining objective. The most influential are DeiT (training recipe), Swin (windowed hierarchical attention), MAE (masked-image pretraining), DINOv2 (self-distilled features), and SigLIP (sigmoid contrastive pretraining). Each addresses a specific weakness of the vanilla ViT.
- AlexNet: AlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe (ImageNet-scale data, GPU training, ReLU, dropout, and augmentation), not from a single isolated trick.
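
Code sketches for several entries above. All are minimal, illustrative PyTorch written for this index: layer sizes, hyperparameters, and helper names are assumptions, not the reference implementations.

Convolutional neural network (CNN): a tiny classifier showing the two inductive biases named above, local receptive fields (3×3 kernels) and weight sharing (the same kernel slides over every spatial position).

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Minimal CNN: each Conv2d kernel is shared across all positions (weight
    sharing) and looks only at a small neighbourhood (local receptive field)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 3x3 receptive field
            nn.ReLU(),
            nn.MaxPool2d(2),                              # downsample 2x
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                      # global average pool
        )
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):                                 # x: (B, 3, H, W)
        return self.head(self.features(x).flatten(1))

logits = TinyCNN()(torch.randn(2, 3, 32, 32))             # -> (2, 10)
```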
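CLIP zero-shot classification: assuming image and text embeddings already produced by CLIP's two towers (here replaced by random tensors), classification reduces to cosine similarity against one text prompt per candidate label.

```python
import torch
import torch.nn.functional as F

def clip_zero_shot(image_emb, text_embs):
    """Compare one image embedding against one text embedding per label
    (e.g. prompts of the form "a photo of a {label}")."""
    img = F.normalize(image_emb, dim=-1)          # (D,)
    txt = F.normalize(text_embs, dim=-1)          # (num_labels, D)
    logits = txt @ img                            # cosine similarities
    return logits.softmax(dim=-1)                 # probabilities over labels

# Stand-in embeddings; a real pipeline would take them from CLIP's towers.
probs = clip_zero_shot(torch.randn(512), torch.randn(3, 512))
```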
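He (Kaiming) initialization: setting the weight standard deviation to \( \sqrt{2/\text{fan\_in}} \), shown both manually and with PyTorch's built-in initializer.

```python
import math
import torch.nn as nn

layer = nn.Linear(1024, 512)

# Manual He/Kaiming init: weight std = sqrt(2 / fan_in), zero bias.
fan_in = layer.weight.shape[1]                    # 1024 input features
nn.init.normal_(layer.weight, mean=0.0, std=math.sqrt(2.0 / fan_in))
nn.init.zeros_(layer.bias)

# Equivalent built-in initializer for ReLU networks.
nn.init.kaiming_normal_(layer.weight, mode="fan_in", nonlinearity="relu")
```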
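Vision Transformer patchification: a strided convolution turns an image into a sequence of patch tokens that feed a standard Transformer encoder (positional embeddings omitted for brevity).

```python
import torch
import torch.nn as nn

# Stride == kernel_size == patch size: non-overlapping patches, each linearly
# projected to one token.
patch, dim = 16, 768
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

x = torch.randn(1, 3, 224, 224)
tokens = to_tokens(x).flatten(2).transpose(1, 2)   # (1, 196, 768): 14x14 patches
cls = torch.zeros(1, 1, dim)                       # learnable [CLS] token in practice
tokens = torch.cat([cls, tokens], dim=1)           # (1, 197, 768) -> Transformer encoder
```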
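MAE random masking: the sampling step that keeps a random 25% of patch tokens for the encoder; the decoder (not shown) reconstructs pixels at the masked positions.

```python
import torch

def random_masking(tokens, mask_ratio=0.75):
    """Keep a random (1 - mask_ratio) fraction of patch tokens per image; the
    encoder sees only these, and the masked indices are returned for the decoder."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                         # per-patch random scores
    shuffle = noise.argsort(dim=1)                   # random permutation of patches
    keep_idx = shuffle[:, :num_keep]                 # visible patches
    mask_idx = shuffle[:, num_keep:]                 # patches to reconstruct
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx, mask_idx

visible, keep_idx, mask_idx = random_masking(torch.randn(2, 196, 768))
```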
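SigLIP sigmoid loss, simplified: every image-text pair in the batch becomes an independent binary example, so no batch-wide softmax normalisation is needed (in the paper the temperature t and bias b are learnable; fixed values are assumed here).

```python
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss: +1 label on the diagonal (matched pairs),
    -1 everywhere else, each scored with binary log-sigmoid."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.T * t + b                        # (B, B) pairwise scores
    labels = 2 * torch.eye(logits.shape[0]) - 1         # +1 diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_loss(torch.randn(8, 512), torch.randn(8, 512))
```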
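Stable Diffusion pipeline: a hedged usage sketch with Hugging Face diffusers; the model ID and settings are assumptions and a CUDA GPU is assumed. The pipeline object bundles the three components listed above (pipe.vae, pipe.text_encoder, pipe.unet) plus a noise scheduler.

```python
import torch
from diffusers import StableDiffusionPipeline

# Assumed model ID for illustration; any Stable Diffusion checkpoint works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")                                     # assumes a CUDA device is available

image = pipe(
    "a watercolour painting of a lighthouse at dawn",
    guidance_scale=7.5,        # classifier-free guidance strength
    num_inference_steps=30,    # denoising steps in latent space
).images[0]
image.save("lighthouse.png")
```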
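Contrastive learning (SimCLR-style InfoNCE), simplified: each view-1 embedding treats its matching view-2 embedding as the positive and the rest of the batch as negatives; the full NT-Xent loss also contrasts within each view.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE over two augmented views: positives sit on the diagonal of the
    (B, B) similarity matrix, everything else in the batch is a negative."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.shape[0])            # index of each positive
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```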
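BYOL self-distillation: the two ingredients are an EMA (momentum) update for the target network and a stop-gradient regression loss from the online predictor's output to the target projection.

```python
import torch

@torch.no_grad()
def ema_update(target_net, online_net, tau=0.996):
    """Momentum update: the target network is an exponential moving average
    of the online network and is never updated by gradients."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1 - tau)

def byol_loss(online_pred, target_proj):
    """Regress the online prediction onto the stop-gradient target projection;
    with unit-norm vectors this equals 2 - 2 * cosine similarity."""
    p = torch.nn.functional.normalize(online_pred, dim=-1)
    z = torch.nn.functional.normalize(target_proj.detach(), dim=-1)
    return (2 - 2 * (p * z).sum(dim=-1)).mean()

online, target = torch.nn.Linear(8, 8), torch.nn.Linear(8, 8)   # toy networks
ema_update(target, online)
```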
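Residual block: \(y = x + F(x)\) with F as two 3×3 convolutions; the identity shortcut is what makes very deep stacks optimizable. (Assumes the residual branch preserves channel count and resolution.)

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(x + F(x))."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.f(x))   # identity shortcut plus residual update

y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))
```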
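NeRF volume rendering: the discrete compositing step that turns per-sample densities and colours along one ray into a single pixel colour.

```python
import torch

def volume_render(densities, colors, deltas):
    """Discrete volume rendering along one ray:
    alpha_i = 1 - exp(-sigma_i * delta_i), T_i = prod_{j<i}(1 - alpha_j),
    pixel colour = sum_i T_i * alpha_i * c_i."""
    alphas = 1.0 - torch.exp(-densities * deltas)            # (num_samples,)
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=0)       # transmittance
    trans = torch.cat([torch.ones(1), trans[:-1]])           # shift so T_1 = 1
    weights = trans * alphas
    return (weights.unsqueeze(-1) * colors).sum(dim=0)       # (3,) RGB

rgb = volume_render(torch.rand(64), torch.rand(64, 3), torch.full((64,), 0.05))
```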
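PointNet: a shared per-point MLP followed by a symmetric max-pool, giving permutation invariance over the unordered point set (the original model also uses learned T-Net alignment, omitted here).

```python
import torch
import torch.nn as nn

class PointNetEncoder(nn.Module):
    """Shared per-point MLP + symmetric max-pool: the output does not depend
    on the ordering of the input points."""
    def __init__(self, out_dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, points):                 # points: (B, N, 3), unordered
        per_point = self.mlp(points)           # same weights applied to every point
        return per_point.max(dim=1).values     # (B, out_dim) global feature

feat = PointNetEncoder()(torch.randn(2, 2048, 3))
```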
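Siamese metric learning: the triplet loss that pulls an anchor toward a positive and away from a negative by a margin, applied to embeddings from a shared encoder (random stand-in embeddings here).

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Push the anchor-positive distance below the anchor-negative distance
    by at least `margin`; zero loss once the margin is satisfied."""
    d_pos = (anchor - positive).pow(2).sum(dim=-1)
    d_neg = (anchor - negative).pow(2).sum(dim=-1)
    return F.relu(d_pos - d_neg + margin).mean()

a, p, n = (F.normalize(torch.randn(16, 128), dim=-1) for _ in range(3))
loss = triplet_loss(a, p, n)
```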
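Focal loss (binary form): the \( (1-p_t)^\gamma \) factor from the entry above, with the optional class-balancing weight \( \alpha \) from the RetinaNet paper.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
    Easy examples (p_t near 1) contribute almost nothing to the gradient."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```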