Tag: path-vision
21 topic(s)
- ImageNet Dataset: ImageNet is a large, hierarchically labeled image dataset whose 1000-class ILSVRC benchmark became the defining testbed for modern computer vision. AlexNet's 2012 win on ImageNet triggered the deep learning shift by showing that GPU-trained CNNs could dramatically beat hand-engineered pipelines.
- Reparameterization Trick (VAE): The reparameterization trick writes a stochastic latent sample as a differentiable transformation of parameters and noise, typically \(z = \mu + \sigma \cdot \epsilon\). This lets gradients flow through sampling and makes variational autoencoder training practical with backpropagation.
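A minimal numpy sketch of the trick; `reparameterize` is a hypothetical helper, and `mu` and `log_var` stand in for VAE encoder outputs:

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I).

    Because z is a deterministic function of (mu, log_var) once eps
    is drawn, gradients can flow through the sampling step.
    """
    eps = rng.standard_normal(mu.shape)
    sigma = np.exp(0.5 * log_var)
    return mu + sigma * eps

rng = np.random.default_rng(0)
mu = np.zeros(3)
log_var = np.zeros(3)  # log_var = 0 means sigma = 1
z = reparameterize(mu, log_var, rng)
```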
- GAN Minimax Objective: The GAN minimax objective sets up a two-player game in which a generator tries to produce samples that fool a discriminator, while the discriminator tries to distinguish real from generated data. At equilibrium the generator matches the data distribution, though the training game is often unstable in practice.
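In symbols, the two-player game above is the standard value function, with generator \(G\), discriminator \(D\), data distribution \(p_{\text{data}}\), and noise prior \(p_z\):

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\!\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z}\!\left[\log\bigl(1 - D(G(z))\bigr)\right]
```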
- Unsupervised learning: Unsupervised learning tries to discover structure in data without labeled targets, such as clusters, latent factors, or a density model. It is used for representation learning, dimensionality reduction, clustering, and generative modeling when explicit supervision is unavailable.
- Convolutional neural network (CNN): A convolutional neural network uses learned convolution filters with local receptive fields and weight sharing to process grid-like data such as images. Those inductive biases make CNNs especially effective and parameter-efficient for visual pattern recognition.
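The weight-sharing idea can be shown with a naive single-channel convolution; `conv2d_valid` is a hypothetical helper, not a library call:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image (single channel, stride 1).

    Weight sharing: the same kernel weights are applied at every
    spatial location, which is what makes CNNs parameter-efficient.
    """
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

edge = np.array([[1.0, -1.0]])          # tiny horizontal-edge detector
image = np.array([[0.0, 0.0, 1.0, 1.0],
                  [0.0, 0.0, 1.0, 1.0]])
fmap = conv2d_valid(image, edge)        # responds at the 0 -> 1 boundary
```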
- Vision Language Model (VLM): A vision-language model jointly processes images and text so it can describe, answer questions about, or reason across both modalities. Most VLMs combine a vision encoder with a language model through projection layers, cross-attention, or joint multimodal pretraining.
- Vision Encoder: A vision encoder maps an image into features or tokens that downstream modules can use for classification, retrieval, or generation. CNNs and Vision Transformers are common vision encoders, differing mainly in how they represent spatial structure.
- CLIP (Contrastive Language-Image Pre-training): CLIP learns a shared embedding space for images and text by pulling matched image-caption pairs together and pushing mismatched pairs apart. This contrastive objective enables zero-shot classification by comparing an image embedding against text prompts for candidate labels.
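Zero-shot classification then reduces to a nearest-text-embedding lookup; the toy vectors below stand in for real CLIP encoder outputs:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Return the index of the text prompt most similar to the image.

    Sketch of CLIP-style zero-shot classification: L2-normalize both
    sides so the dot product is cosine similarity, then take argmax.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([[1.0, 0.0, 0.0],   # e.g. "a photo of a dog"
                      [0.0, 1.0, 0.0]])  # e.g. "a photo of a cat"
pred = zero_shot_classify(image_emb, text_embs)
```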
- Latent Space: A latent space is the internal feature space in which a model represents inputs after transformation, often in a form that is more compact or task-relevant than raw data. Distances or directions in latent space can encode meaningful variation, but only relative to the model and objective that learned it.
- He Initialization (Kaiming Initialization): He initialization sets weight variance to roughly 2/fan-in so ReLU-like activations preserve signal magnitude through depth. It improves on Xavier initialization for one-sided activations that zero out about half the inputs.
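A quick numpy check of the variance rule; `he_init` is a hypothetical helper:

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """Draw a weight matrix with variance 2 / fan_in (He/Kaiming normal)."""
    std = np.sqrt(2.0 / fan_in)
    return rng.standard_normal((fan_in, fan_out)) * std

rng = np.random.default_rng(0)
W = he_init(1024, 512, rng)
# Empirical variance should be close to 2 / 1024 ~ 0.00195.
```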
- Vision Transformer (ViT): Dosovitskiy et al. (2020) showed that a pure Transformer applied to fixed-size image patches as tokens matches or exceeds state-of-the-art CNNs on ImageNet when pretrained on enough data. ViT is the backbone of modern vision-language models (CLIP, SigLIP, DINOv2, MAE) and the foundation of nearly all 2020s visual representation work.
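The patches-as-tokens step can be sketched with numpy reshapes; `patchify` is a hypothetical helper, and a real ViT follows it with a learned linear projection plus position embeddings:

```python
import numpy as np

def patchify(image, patch_size):
    """Split an H x W x C image into a sequence of flattened patches."""
    h, w, c = image.shape
    p = patch_size
    patches = image.reshape(h // p, p, w // p, p, c)
    patches = patches.transpose(0, 2, 1, 3, 4)   # group by patch grid
    return patches.reshape(-1, p * p * c)        # (num_patches, p*p*c)

image = np.zeros((224, 224, 3))
tokens = patchify(image, 16)   # ViT-Base geometry: 14 * 14 = 196 tokens
```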
- Masked Autoencoder (MAE): A self-supervised ViT pretraining objective: randomly mask 75% of image patches and train an asymmetric encoder–decoder to reconstruct pixel values from the visible 25%. MAE is simple, compute-efficient (the encoder sees only unmasked patches), and produces state-of-the-art ImageNet fine-tuning representations.
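The masking step can be sketched in a few lines of numpy; `random_masking` and the 196-patch / 8-dim shapes below are illustrative, not MAE's actual implementation:

```python
import numpy as np

def random_masking(tokens, mask_ratio, rng):
    """Keep a random subset of patch tokens (MAE-style masking).

    tokens: (num_patches, dim). Returns the visible tokens plus the
    kept indices, which the decoder would use to restore ordering.
    """
    n = tokens.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = rng.permutation(n)
    keep = np.sort(perm[:n_keep])
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = np.arange(196 * 8, dtype=float).reshape(196, 8)
visible, keep = random_masking(tokens, 0.75, rng)
# The encoder runs on `visible` only, i.e. 25% of the sequence.
```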
- DINOv2: A self-supervised ViT pretraining recipe from Meta (Oquab et al., 2023) that combines a DINO-style self-distillation objective with an iBOT masked-patch prediction objective and a curated 142M-image dataset. DINOv2 produces general-purpose frozen visual features that outperform task-specific supervised baselines on classification, segmentation, depth, and correspondence.
- SigLIP: A contrastive image–text pretraining method (Zhai et al., 2023) that replaces CLIP's softmax-over-batch contrastive loss with a pairwise sigmoid binary cross-entropy. SigLIP removes the need for large global batches, scales batch-size-efficiently, and achieves CLIP-level or better zero-shot accuracy at a fraction of the training compute.
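A rough numpy sketch of the pairwise sigmoid loss under simplifying assumptions: in SigLIP the temperature `t` and bias `b` are learnable scalars, and the fixed values below are only illustrative.

```python
import numpy as np

def sigmoid_pairwise_loss(img_embs, txt_embs, t=10.0, b=-10.0):
    """Pairwise sigmoid BCE over all image-text pairs in a batch.

    Diagonal pairs are positives (label +1), off-diagonal pairs are
    negatives (label -1); every pair is scored independently, so no
    softmax over the whole batch is needed.
    """
    logits = t * (img_embs @ txt_embs.T) + b
    n = logits.shape[0]
    labels = 2.0 * np.eye(n) - 1.0            # +1 on diagonal, -1 elsewhere
    z = labels * logits
    return np.mean(np.logaddexp(0.0, -z))     # mean of -log sigmoid(z)

# Perfectly aligned unit embeddings give a small, finite loss.
loss = sigmoid_pairwise_loss(np.eye(2), np.eye(2))
```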
- Contrastive Learning (SimCLR / MoCo): Self-supervised visual representation learning via the InfoNCE loss: pull together two augmented views of the same image (positives) while pushing apart views of all other images (negatives). SimCLR uses in-batch negatives; MoCo uses a queued momentum encoder, enabling large effective negative pools with small batches.
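A one-directional InfoNCE sketch in numpy; `info_nce` and `tau=0.1` are illustrative, and the full SimCLR loss additionally symmetrizes over all 2N views:

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """InfoNCE for a batch of paired views (one direction only).

    z1[i] and z2[i] embed two augmentations of image i (positives);
    z2[j] for j != i serve as in-batch negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = (z1 @ z2.T) / tau                     # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # positives on diagonal
```

With matched views the diagonal dominates and the loss is near zero; shuffling the pairing drives it up.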
- BYOL / Self-Distillation: Bootstrap Your Own Latent (Grill et al., 2020) shows that strong visual representations can be learned without negatives: an online network predicts the output of a momentum-updated target network on a different augmentation of the same image. The same template underlies DINO, MoCo-v3, and other non-contrastive SSL methods.
- JEPA (Joint Embedding Predictive Architecture): LeCun's self-supervised template (2022) that predicts the representation of a target from the representation of a context, rather than predicting the target itself. By regressing in embedding space, JEPA avoids wasting capacity on irrelevant per-pixel detail; I-JEPA and V-JEPA are concrete instantiations for images and video.
- Vision-Language Contrastive Objectives Beyond CLIP: Successors to CLIP refine its symmetric InfoNCE loss for better efficiency, finer-grained alignment, and scaling. SigLIP replaces the softmax with a pairwise sigmoid; LiT freezes a pretrained image tower; ALIGN scales to noisy web alt-text data; FILIP does token-level contrast; CoCa adds a captioning head. All share the joint-embedding template.
- Residual Networks (ResNet as Architecture): A residual network replaces a plain layer stack with blocks that learn a residual update \(F(x)\) and add it back to the input, so each block computes \(y = x + F(x)\). This makes very deep CNNs much easier to optimize and triggered the shift from VGG-style stacks to residual architectures.
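The block equation \(y = x + F(x)\) can be sketched with a convolution-free \(F\) (two linear layers with a ReLU between); this is an illustration, not an actual ResNet block, which uses convolutions, batch norm, and a final ReLU on the sum:

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute y = x + F(x) with F a tiny two-layer MLP."""
    f = np.maximum(0.0, x @ w1) @ w2   # F(x)
    return x + f                       # identity shortcut

# With F initialized near zero the block is close to the identity map,
# which is part of why very deep residual stacks are easy to optimize.
x = np.ones((2, 4))
w1 = np.zeros((4, 4))
w2 = np.zeros((4, 4))
y = residual_block(x, w1, w2)   # equals x exactly when F(x) == 0
```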
- LLaVA (Visual Instruction Tuning): LLaVA (Liu et al., 2023) connects a frozen CLIP vision encoder to a pretrained LLM via a small learned projection, then instruction-tunes the combined model on GPT-4-generated multimodal dialogues. The recipe is minimal, with a single linear (later a two-layer MLP) projector as the bridge, yet competitive with closed VLMs, establishing the canonical open VLM pattern: (pretrained vision encoder) + (learned bridge) + (pretrained LLM) + (visual instruction data).
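The learned bridge can be sketched as a single linear projection; the dimensions below (576 patch tokens of 1024-d CLIP features mapped into a 4096-d LLM embedding space) are illustrative assumptions, and `project_visual_tokens` is a hypothetical helper:

```python
import numpy as np

def project_visual_tokens(vision_feats, w, b):
    """Map vision-encoder patch features into the LLM embedding space.

    Each projected row becomes one "visual token" that is concatenated
    with the text token embeddings fed to the LLM.
    """
    return vision_feats @ w + b

rng = np.random.default_rng(0)
vision_feats = rng.standard_normal((576, 1024))   # patch features
w = rng.standard_normal((1024, 4096)) * 0.01      # learned projection
b = np.zeros(4096)
visual_tokens = project_visual_tokens(vision_feats, w, b)
```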
- AlexNet: AlexNet was the deep convolutional network that won ILSVRC 2012 by a huge margin and triggered the modern deep-learning wave in vision. Its impact came from the full recipe (ImageNet-scale data, GPU training, ReLU, dropout, and augmentation), not from a single isolated trick.