Tag: product
58 topics
- Constitutional Classifiers: Constitutional Classifiers is a production-oriented jailbreak defense that uses context-aware classifiers and a cascade of cheap and expensive checks to block harmful exchanges efficiently. The system is designed to keep refusal rates and serving cost low while still catching universal jailbreaks that earlier, response-only filters missed.
- Chain-of-Thought Monitorability: Chain-of-thought monitorability is the safety claim that when a model needs explicit reasoning to complete a task, its written chain of thought can be monitored for harmful intent or deception. The key property is monitorability rather than perfect faithfulness: hiding the reasoning tends to become harder when the reasoning itself is load-bearing for success.
- ImageNet Dataset: ImageNet is a large, hierarchically labeled image dataset whose 1000-class ILSVRC benchmark became the defining testbed for modern computer vision. AlexNet's 2012 win on ImageNet triggered the deep learning shift by showing that GPU-trained CNNs could dramatically beat hand-engineered pipelines.
- Open-Weight Model: An open-weight model is a model whose trained weights are publicly released for download and local use. That is more specific than 'open source': the weights may be open even when the training data, code, or full recipe are not.
- Needle in a Haystack: Needle in a Haystack is a long-context benchmark that tests whether a model can retrieve a small target fact embedded inside a large distractor context. It is useful for measuring position-sensitive retrieval, but strong needle scores do not guarantee broader long-document reasoning.
- Vector Database: A vector database is a system optimized for storing embeddings and retrieving nearest neighbors together with metadata filtering, updates, and persistence. It is the common serving layer behind semantic search and many RAG systems.
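  The core retrieval operation can be sketched as brute-force cosine similarity plus a metadata filter. The `search` helper and the index layout below are illustrative only; real vector databases replace the linear scan with an approximate index such as HNSW or IVF.

  ```python
  import math

  def cosine(a, b):
      """Cosine similarity between two dense vectors."""
      dot = sum(x * y for x, y in zip(a, b))
      na = math.sqrt(sum(x * x for x in a))
      nb = math.sqrt(sum(x * x for x in b))
      return dot / (na * nb)

  def search(index, query_vec, k=2, where=None):
      """Brute-force nearest-neighbor search with an optional metadata filter."""
      candidates = [
          (cosine(query_vec, item["vec"]), item)
          for item in index
          if where is None or all(item["meta"].get(f) == v for f, v in where.items())
      ]
      candidates.sort(key=lambda t: t[0], reverse=True)
      return [item for _, item in candidates[:k]]

  index = [
      {"id": "a", "vec": [1.0, 0.0], "meta": {"lang": "en"}},
      {"id": "b", "vec": [0.9, 0.1], "meta": {"lang": "de"}},
      {"id": "c", "vec": [0.0, 1.0], "meta": {"lang": "en"}},
  ]
  hits = search(index, [1.0, 0.05], k=1, where={"lang": "en"})
  ```

  The filter runs before scoring, which is why production systems integrate metadata predicates into the index rather than post-filtering the neighbor list.
  
  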
- Inference Optimization: Inference optimization is the set of techniques that reduce serving latency, memory use, and cost while preserving acceptable quality. Common methods include quantization, batching, KV-cache optimizations, kernel fusion, speculative decoding, and architecture choices that trade a little flexibility for much higher throughput.
- Benchmark (ML Evaluation): A benchmark in ML evaluation is a standardized task, dataset, metric, and protocol used to compare systems reproducibly. Benchmarks are useful because they make progress measurable, but they can be gamed, saturated, or misaligned with real-world performance.
- Human Evaluation: Human evaluation uses people to judge outputs on qualities such as helpfulness, factuality, coherence, or safety that automated metrics often miss. It is usually the most trustworthy evaluation for subjective tasks, but it is expensive, slow, and sensitive to rubric design and annotator variance.
- Automated Evaluation: Automated evaluation scores model outputs with metrics or model-based judges instead of human raters. It is fast, scalable, and reproducible, but its usefulness depends on how well the metric correlates with the human judgment that actually matters.
- BLEU: BLEU is a machine-translation metric based mainly on n-gram precision against one or more reference texts, combined with a brevity penalty. It is useful for corpus-level comparison, but it often misses meaning-preserving paraphrases and is weak as a sentence-level quality measure.
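  The metric's two ingredients, clipped n-gram precision and the brevity penalty, can be sketched for the single-reference, single-sentence case. Real BLEU is corpus-level and smoothed; this toy version simply returns 0 when any n-gram order has no overlap.

  ```python
  import math
  from collections import Counter

  def ngrams(tokens, n):
      """Count the n-grams of a token list."""
      return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

  def bleu(candidate, reference, max_n=4):
      """Simplified single-reference BLEU: clipped n-gram precision + brevity penalty."""
      log_precisions = []
      for n in range(1, max_n + 1):
          cand, ref = ngrams(candidate, n), ngrams(reference, n)
          overlap = sum(min(c, ref[g]) for g, c in cand.items())  # clipped counts
          total = sum(cand.values())
          if total == 0 or overlap == 0:
              return 0.0   # real BLEU smooths; this sketch just bottoms out
          log_precisions.append(math.log(overlap / total))
      bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
      return bp * math.exp(sum(log_precisions) / max_n)

  ref = "the cat sat on the mat".split()
  score = bleu("the cat sat on the mat".split(), ref)   # perfect match -> 1.0
  ```
  
  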
- A/B Testing (ML Systems): A/B testing in ML systems is a randomized online experiment that serves different model variants to different user groups and compares outcome metrics. It is the standard way to measure real production impact, because offline wins do not always translate into better user experience.
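  A minimal significance check for a conversion-rate A/B test is the two-proportion z-test; the counts below are made up for illustration.

  ```python
  import math

  def two_proportion_z(conv_a, n_a, conv_b, n_b):
      """Two-proportion z-test for an A/B experiment on conversion counts."""
      p_a, p_b = conv_a / n_a, conv_b / n_b
      p_pool = (conv_a + conv_b) / (n_a + n_b)          # pooled rate under H0
      se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
      return (p_b - p_a) / se

  # Variant B converts 2.6% vs 2.0% for A on 10k users each.
  z = two_proportion_z(conv_a=200, n_a=10_000, conv_b=260, n_b=10_000)
  significant = abs(z) > 1.96   # ~5% two-sided threshold
  ```

  Real experiment platforms add sequential-testing corrections and guardrail metrics on top of this basic comparison.
  
  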
- AI Safety: AI safety is the broader field concerned with preventing harmful or catastrophic outcomes from advanced AI systems. It includes alignment, robustness, misuse prevention, monitoring, control, and governance, so it is wider than just making a chatbot refuse bad requests.
- Alignment (AI): Alignment in AI is the problem of making an AI system’s objectives and behavior match human intentions and values rather than a flawed proxy. The hard part is not only teaching what humans say they want, but ensuring the system pursues that goal robustly in new situations.
- Foundation Model: A foundation model is a large general-purpose model pretrained on broad data and then adapted to many downstream uses through prompting, fine-tuning, or tool use. Its defining property is transfer: one base model can support many tasks rather than being built for just one.
- Conversational AI: Conversational AI is a class of systems designed for multi-turn interaction, where the model must respond helpfully while tracking context, intent, and dialogue state. The hard part is not generating one good answer, but remaining coherent and useful across an extended interaction.
- Shadow Deployment: Shadow deployment runs a new model in production alongside the live system without letting its outputs affect users. This makes it possible to compare latency, quality, and failure modes on real traffic before committing to a risky rollout.
- Feedback Loop (ML Systems): A feedback loop in an ML system occurs when the model’s outputs change the data it will later train on or be evaluated against. These loops can reinforce bias, distort demand, and make offline metrics look better even while the real system gets worse.
- Transparency (AI Systems): Transparency in AI systems means making system behavior, limitations, provenance, and decision pathways inspectable to users, developers, or regulators. It is broader than interpretability because it includes documentation, reporting, and operational visibility, not just internal model analysis.
- RLAIF (RL from AI Feedback): RLAIF replaces human preference labels with judgments produced by another AI model following a rubric. It scales alignment data collection much more cheaply than RLHF, but it also transfers the biases and blind spots of the judge model into the training signal.
- Reasoning Models (o1 / R1-style Long-CoT): Reasoning models in the o1 or R1 style are language models trained or prompted to spend extra inference compute on long multi-step reasoning before answering. Their key idea is that better reasoning can come not only from bigger models, but from better search, verification, and credit assignment at inference and post-training time.
- vLLM & Continuous Batching: vLLM is an LLM serving system built around PagedAttention and continuous batching. Instead of waiting for a batch to finish, it admits and schedules requests at each decoding step, which reduces padding waste and improves throughput for variable-length generations.
- Core LLM Benchmarks (MMLU, HumanEval, GSM8K, MATH): MMLU tests broad academic knowledge, HumanEval tests code generation by unit tests, GSM8K tests grade-school math word problems, and MATH tests harder symbolic reasoning. Together they cover knowledge, code, and reasoning, but all can be gamed or saturated, so they are only a partial view of model quality.
- LMSYS Chatbot Arena: LMSYS Chatbot Arena is a crowdsourced pairwise-evaluation platform where users compare two anonymous models by chatting and voting. Its Elo-style ranking captures interactive preference better than a single benchmark, but it is noisy and sensitive to traffic mix and prompt selection.
- Stable Diffusion Pipeline: A text-to-image pipeline composed of (i) a VAE that compresses pixels to a 64×-smaller latent, (ii) a text encoder (CLIP) that provides conditioning, and (iii) a diffusion U-Net (or DiT) that denoises in latent space. All three pretrained components are glued by classifier-free guidance at inference.
- Whisper (Speech-to-Text): OpenAI's 2022 encoder-decoder Transformer trained on 680k hours of weakly supervised multilingual audio-text pairs. Whisper performs speech recognition, translation, and voice-activity / language ID from a single model, with strong zero-shot robustness to noise, accent, and domain shift.
- KV Cache Compression (H2O, SnapKV): Inference-time methods that shrink a long-context KV cache by evicting tokens that contribute little to future attention. H2O (Zhang et al., 2023) evicts by cumulative attention score; SnapKV (Li et al., 2024) observes that recent queries already reveal which past tokens matter, enabling one-shot pre-fill-time compression.
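  The H2O eviction idea can be sketched as "keep a recency window plus the tokens with the highest cumulative attention". The scores and budget below are toy values; real implementations operate per attention head on GPU tensors.

  ```python
  def evict_heavy_hitters(cum_attention, keep_recent, budget):
      """H2O-style sketch: retain the most recent tokens plus the 'heavy hitter'
      tokens with the highest cumulative attention, up to a total budget."""
      n = len(cum_attention)
      recent = set(range(max(0, n - keep_recent), n))
      older = [i for i in range(n) if i not in recent]
      older.sort(key=lambda i: cum_attention[i], reverse=True)
      heavy = set(older[: max(0, budget - len(recent))])
      return sorted(recent | heavy)   # indices of KV entries to keep

  # 8 cached tokens, keep the last 2 plus the 2 strongest heavy hitters.
  scores = [0.9, 0.1, 0.05, 0.8, 0.02, 0.03, 0.01, 0.04]
  kept = evict_heavy_hitters(scores, keep_recent=2, budget=4)
  ```

  Tokens 0 and 3 survive because their accumulated attention dominates, matching the observation that a few tokens receive most future attention mass.
  
  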
- Chunked Prefill: A serving-time technique that breaks the long prefill of a prompt into small chunks and interleaves them with decode steps of other requests. By keeping GPU utilisation high during prefill and avoiding long tail latencies, chunked prefill dramatically improves throughput in mixed-batch LLM serving.
- Disaggregated Prefill/Decode Serving: Disaggregated prefill/decode serving splits prompt processing and token-by-token decoding onto different GPU pools and transfers the KV cache between them. This reduces contention because prefill is throughput-heavy while decode is latency-sensitive, improving utilization in large serving clusters.
- Paged vs Block KV Cache: Two allocation strategies for an LLM's growing KV cache. Block (contiguous) allocation pre-reserves the worst-case length per request and wastes memory. Paged (PagedAttention, vLLM 2023) allocates fixed-size pages on demand and chains them like OS virtual memory, yielding 2–4× higher batch-size at the cost of kernel-level bookkeeping.
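  A toy allocator shows the paged strategy: a new page is handed out only when a request's current pages fill up, so memory tracks actual sequence length instead of a worst-case reservation. The class and method names are illustrative, not vLLM's API.

  ```python
  class PagedKVAllocator:
      """Toy paged KV-cache allocator: fixed-size pages granted on demand and
      chained per request, in the style of OS virtual memory / PagedAttention."""

      def __init__(self, num_pages, page_size):
          self.page_size = page_size
          self.free = list(range(num_pages))
          self.tables = {}   # request_id -> (page id list, tokens stored)

      def append_token(self, rid):
          pages, n = self.tables.get(rid, ([], 0))
          if n == len(pages) * self.page_size:   # current pages are full
              if not self.free:
                  raise MemoryError("cache full; a real server would preempt")
              pages = pages + [self.free.pop()]
          self.tables[rid] = (pages, n + 1)

      def release(self, rid):
          pages, _ = self.tables.pop(rid)
          self.free.extend(pages)               # pages return to the pool

  alloc = PagedKVAllocator(num_pages=4, page_size=2)
  for _ in range(3):
      alloc.append_token("req0")                # 3 tokens -> only 2 pages used
  pages_used = len(alloc.tables["req0"][0])
  ```

  With block allocation the same request would have reserved its maximum length up front; here it holds only the two pages it actually filled.
  
  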
- Best-of-N Sampling and its Scaling: An inference-time boost: sample \( N \) responses from a language model, score them with a reward model or verifier, and return the best. Quality scales with \( \log N \); scaling laws predict the knob where best-of-\( N \) inference compute equals extra training compute, and motivate distillation of best-of-\( N \) behaviour back into the policy via RFT.
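  The mechanism is a few lines once the policy is abstracted as a sampler and the reward model as a scorer; both are stand-in toy functions here.

  ```python
  import random

  def best_of_n(sample, score, n=8, seed=0):
      """Best-of-N sketch: draw n candidates and return the one the scorer prefers.
      `sample` stands in for the policy, `score` for a reward model or verifier."""
      rng = random.Random(seed)
      candidates = [sample(rng) for _ in range(n)]
      return max(candidates, key=score)

  # Toy setup: 'responses' are numbers and the verifier prefers values near 10,
  # while the policy's average draw sits at 5.
  answer = best_of_n(sample=lambda rng: rng.gauss(5, 3),
                     score=lambda x: -abs(x - 10), n=64)
  better_than_mean = abs(answer - 10) < abs(5 - 10)
  ```

  Pushing `n` higher keeps helping only logarithmically, which is where the compute-tradeoff scaling laws mentioned above come in.
  
  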
- Weak-to-Strong Generalization: OpenAI's empirical analogy for scalable oversight (Burns et al., 2023) asks whether a strong model, fine-tuned on labels from a weaker supervisor, can generalise beyond the supervisor's capability. Experiments on NLP and chess tasks show it partially can; the residual quality gap motivates future work on supervising superhuman models.
- Tool Use / Function Calling Benchmarks (BFCL, τ-bench): BFCL (Berkeley Function Calling Leaderboard, 2024) and τ-bench (2024) evaluate LLMs' ability to select, parameterise, and sequence API calls. BFCL is single-turn function-call accuracy; τ-bench is multi-turn dialogue in simulated customer-service environments with realistic state and policy constraints.
- Agent Benchmarks (SWE-bench, GAIA, WebArena): SWE-bench tests whether a model can fix real GitHub issues in code repositories, GAIA tests general tool-using problem solving with automatically checked answers, and WebArena tests web-navigation agents in simulated sites. Together they measure software, reasoning, and browser-action competence rather than just one-shot text generation.
- XGBoost, LightGBM, and CatBoost: Three production gradient-boosted-decision-tree implementations with distinct tree-construction strategies: XGBoost does level-wise exact/approximate splits with a second-order Taylor objective; LightGBM uses histogram-based leaf-wise growth and GOSS subsampling; CatBoost uses ordered boosting to avoid target leakage with categorical features.
- YOLO Family (v1–v10): Single-stage detectors that divide the image into a grid and predict bounding boxes and class probabilities directly from a single CNN forward pass. YOLOv1 was real-time but coarse; YOLOv3–v10 progressively adopted anchor boxes, FPN, CSP blocks, decoupled heads, and finally anchor-free / NMS-free designs for edge deployment.
- Data Curation & Quality Filters (FineWeb, Dolma): Modern pretraining pipelines filter terabytes of web data through language ID, heuristic rules (repetition, punctuation ratios), classifier-based quality scoring, and toxicity / PII removal. The FineWeb and Dolma recipes document which filters mattered — often delivering per-token quality gains equivalent to 2–3× scale-up.
- Synthetic Data Generation for Post-Training: Modern instruction-tuning and RL-based alignment rely on LLM-generated synthetic data: self-instruct / Evol-Instruct expand seed prompts, teacher models produce high-quality completions, and process-reward models validate chain-of-thought steps. The backbone of the post-ChatGPT post-training stack.
- Continual Pretraining & Mid-Training: Continue pretraining an existing base model on a domain or task-focused corpus (code, math, a new language) before final post-training. Achieves domain gains that would cost 10× more to obtain by fine-tuning alone. Sits between pretraining and SFT in modern recipes.
- Long-Context Data Recipes (RULER, Needle Variants): Extending effective context beyond 128k requires (a) RoPE-scaling or position-interpolation to keep positional encodings sane, (b) a continued-pretraining dataset with real long documents and synthetic stitched tasks, and (c) evaluation beyond simple needle-in-a-haystack — RULER adds multi-needle, multi-hop, and aggregation subtasks that expose superficial-match shortcuts.
- Constrained Decoding (Grammars, JSON Mode, Regex): Grammar-constrained decoding masks any next token that would violate a target grammar or schema. This guarantees outputs such as valid JSON, XML, or regex-matching strings by restricting generation to the language accepted by a finite-state machine or pushdown automaton.
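  The masking step can be sketched with a toy constraint that admits exactly one legal continuation per position; real systems compile a grammar or JSON Schema into a finite-state machine and mask logits over the full vocabulary at every step.

  ```python
  def constrained_decode(step_logits, allowed, vocab):
      """Constrained-decoding sketch: at each step, discard every token the
      constraint disallows, then greedily pick the best remaining one."""
      out = []
      for logits in step_logits:
          ok = allowed(out)                     # set of legal next tokens
          best = max((t for t in vocab if t in ok), key=lambda t: logits[t])
          out.append(best)
      return out

  # Toy 'grammar': a JSON-ish fragment that must follow a fixed token order.
  vocab = ["{", "}", '"k"', ":", "1"]
  def allowed(prefix):
      order = ["{", '"k"', ":", "1", "}"]
      return {order[len(prefix)]}

  # Unconstrained argmax would emit "1" every step; masking forces legality.
  step_logits = [{t: 0.0 for t in vocab} | {"1": 5.0} for _ in range(5)]
  tokens = constrained_decode(step_logits, allowed, vocab)
  ```

  A realistic `allowed` would return the full set of transitions the automaton permits from its current state rather than a single token.
  
  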
- Model Context Protocol (MCP): Model Context Protocol is an open client-server protocol for connecting models and agents to external tools, resources, and prompts through a common interface. It standardizes capability discovery and tool invocation so one MCP-aware client can talk to many different servers without bespoke integrations.
- Agentic Workflows & Multi-Agent Orchestration: Systems that compose multiple LLM calls — planner, executor, critic, tool-user — into an end-to-end workflow. Patterns range from ReAct loops to fixed DAGs (LangGraph) to role-playing ensembles (AutoGPT, BabyAGI). Success requires careful handoff design, termination criteria, and cost control.
- Radix / Prefix-Cache Attention (SGLang): Share the KV cache across requests that start with a common prompt prefix. Store prefix trees keyed by token sequence; on a new request, find the longest matching prefix in the cache and reuse it. Cuts prefill latency and memory use for chat applications with shared system prompts or few-shot contexts.
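  The lookup is a token-level trie walk. `insert` and `longest_cached_prefix` are illustrative helpers; the returned depth stands in for the number of KV entries that can be reused instead of recomputed.

  ```python
  def insert(cache, tokens):
      """Record a served request's tokens in the prefix trie."""
      node = cache
      for t in tokens:
          node = node.setdefault(t, {})

  def longest_cached_prefix(cache, tokens):
      """Walk the trie and return how many leading tokens already have
      cached KV entries that a new request can reuse."""
      node, depth = cache, 0
      for t in tokens:
          if t not in node:
              break
          node, depth = node[t], depth + 1
      return depth

  cache = {}
  insert(cache, ["<sys>", "You", "are", "helpful"])
  # A new chat shares the system-prompt prefix, so 3 tokens of KV are reused.
  reused = longest_cached_prefix(cache, ["<sys>", "You", "are", "brief"])
  ```

  SGLang's radix tree additionally compresses runs of tokens into single edges and evicts cold branches under memory pressure, which this sketch omits.
  
  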
- Quantized KV Cache (int4 / int8 / KIVI): Store the KV cache at lower precision — int8 or int4 — instead of fp16. Halves or quarters the memory footprint of long contexts at negligible quality cost. Different quantisation per key / value (K usually int8, V int4 via grouping) and per-head asymmetric scales are the main tricks.
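  The basic mechanism is per-group quantisation: store small integer codes plus one scale per group, with reconstruction error bounded by half a quantisation step. This sketch uses symmetric scaling on plain Python lists; real kernels work per head, often with the asymmetric zero-points mentioned above.

  ```python
  def quantize_int8(values):
      """Symmetric per-group int8 quantisation sketch for a group of KV entries."""
      scale = max(abs(v) for v in values) / 127 or 1.0   # guard all-zero groups
      q = [round(v / scale) for v in values]             # integer codes in [-127, 127]
      return q, scale

  def dequantize(q, scale):
      """Recover approximate fp values from codes and the shared scale."""
      return [x * scale for x in q]

  k = [0.12, -0.5, 0.33, 0.01]      # toy key-vector group
  q, scale = quantize_int8(k)
  k_hat = dequantize(q, scale)
  max_err = max(abs(a - b) for a, b in zip(k, k_hat))
  ```

  Each fp16 entry shrinks to one byte (half a byte for int4), which is where the 2–4× context-memory savings come from.
  
  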
- Continuous vs Static Batching: Static batching groups requests before a forward pass and runs them to completion together — tail latency is set by the slowest request. Continuous batching (Orca, vLLM) evicts finished requests mid-step and admits new ones each iteration, keeping GPU utilisation high and tail latency bounded. Default in production LLM serving.
- Federated Learning (FedAvg): Train a shared model across many clients (phones, hospitals) without centralising data. Each round: clients train locally for a few epochs, server averages their weight updates. Introduces non-IID-data, communication-cost, and privacy / security challenges absent in centralised training.
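  One aggregation round reduces to a data-weighted average of client weight vectors; the two-client numbers below are toy values.

  ```python
  def fedavg(client_weights, client_sizes):
      """One FedAvg aggregation step: average client weight vectors,
      weighted by how many examples each client trained on."""
      total = sum(client_sizes)
      dim = len(client_weights[0])
      return [
          sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
          for i in range(dim)
      ]

  # Two clients with unequal data: the global model leans toward the larger one.
  global_w = fedavg([[1.0, 0.0], [0.0, 1.0]], client_sizes=[300, 100])
  ```

  With non-IID client data this weighted average can drift from what centralised training would learn, which is the core FedAvg failure mode.
  
  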
- Model Stealing & Extraction Attacks: Model stealing attacks recover a useful copy of a deployed model by querying it and training a substitute on the outputs. Extraction attacks go further and try to recover hidden parameters, decision rules, or embeddings directly, which matters for both proprietary models and privacy-sensitive systems.
- Data Poisoning & Backdoor Attacks: Insert malicious training examples so the model learns a targeted behaviour — misclassification on a trigger pattern, backdoored refusal bypasses, or degraded accuracy on specific classes. BadNets demonstrated pixel-trigger backdoors; modern LLM poisoning targets alignment-layer susceptibilities and pretraining data.
- LLM Watermarking (Kirchenbauer et al.): Embed a statistical signature into generated text that is invisible to humans but detectable by an algorithm with the watermarking secret. Kirchenbauer et al. (2023) partition the vocabulary into a pseudo-random green / red list per step, biasing generation toward green; later detection uses a \( z \)-test on green-token frequency.
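  The detection side can be sketched end to end: a pseudo-random green list seeded by the previous token, and a z-test on how often generated tokens land in it. The seeding scheme and vocabulary size here are illustrative, not the paper's exact construction.

  ```python
  import math
  import random

  def green_list(prev_token, vocab_size, gamma=0.5, key=42):
      """Pseudo-random 'green' fraction gamma of the vocabulary, seeded by the
      previous token and a secret key (illustrative seeding scheme)."""
      rng = random.Random(key * 1_000_003 + prev_token)
      ids = list(range(vocab_size))
      rng.shuffle(ids)
      return set(ids[: int(gamma * vocab_size)])

  def detect(tokens, vocab_size, gamma=0.5, key=42):
      """z-test on green-token frequency; a large z suggests watermarked text."""
      green = sum(
          1 for prev, tok in zip(tokens, tokens[1:])
          if tok in green_list(prev, vocab_size, gamma, key)
      )
      n = len(tokens) - 1
      return (green - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

  # A 'generator' that always emits a green token scores far above chance.
  tokens = [0]
  for _ in range(100):
      tokens.append(min(green_list(tokens[-1], vocab_size=1000)))
  z = detect(tokens, vocab_size=1000)   # 100/100 green -> z = 10.0
  ```

  Unwatermarked text lands near z = 0, since roughly a gamma fraction of tokens falls in the green list by chance.
  
  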
- Dangerous-Capability Evaluations (Bio, Cyber, Persuasion, Autonomy): Dangerous-capability evaluations are targeted tests for whether a model can meaningfully assist with high-consequence harms such as bio misuse, cyber offense, persuasive manipulation, or autonomous scheming. They are used as deployment-gating evidence because ordinary benchmark gains do not tell you whether a model has crossed a safety-relevant threshold.
- Prompt Injection (Taxonomy & Defences): Adversarial instructions embedded in model-accessible content — tool outputs, retrieved documents, emails — that override the user's original task. Direct (in user prompt) vs indirect (in external content). Defences include input filtering, dual-model separation, and structured prompt templates; none is a complete solution.
- Text-to-Image (DALL-E Lineage & Imagen): Autoregressive (DALL-E 1, Parti) vs diffusion (DALL-E 2, DALL-E 3, Imagen, Stable Diffusion, Flux) lineages for prompt-to-pixel generation. DALL-E 3 uses a specialised caption-rewriting stage; Imagen emphasises text-encoder scale (T5-XXL) as the dominant quality lever.
- Unified Multimodal Models (GPT-4o / Gemini any-to-any): Single models that process and generate multiple modalities — text, image, audio, video — through a shared backbone with per-modality tokenisers. Native multimodal training yields far richer cross-modal reasoning than cascaded pipelines: image understanding in context of speech, audio generation from visual cues, unified embeddings.
- Serving LLMs at Scale: Serving LLMs at scale is a systems problem of jointly optimizing prompt prefill throughput, token-by-token decode latency, KV-cache memory, batching policy, and fleet utilization. Modern serving stacks rely on continuous batching, prefix caching, PagedAttention, speculative decoding, and sometimes prefill/decode disaggregation to keep both tail latency and GPU cost under control.
- Dataset Versioning & Lineage: Dataset versioning and lineage track exactly which raw data, labels, transformations, and filters produced a training or evaluation set. They matter because reproducibility, compliance, rollback, and debugging all depend on being able to answer "which data built this model?" with more precision than a folder name or timestamp.
- Feature Stores: Feature stores are systems for defining, computing, and serving reusable machine-learning features consistently across training and production. Their core promise is point-in-time correctness and train/serve consistency: the feature a model saw offline should match the feature served online for the same entity and timestamp.
- ML System Monitoring & Drift Detection: ML system monitoring tracks whether a deployed model is still receiving the kind of data it was built for and whether its business and technical behavior remain acceptable. Drift detection is one part of that job: teams also monitor latency, calibration, feature freshness, label delay, feedback loops, and downstream outcomes, because data drift alone does not tell the whole production story.
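  Drift detection over a single feature is often a histogram comparison; the Population Stability Index is one common score. The thresholds below are a widely used rule of thumb, and the inputs are assumed to be pre-binned proportions.

  ```python
  import math

  def psi(expected, actual):
      """Population Stability Index over pre-binned proportions.
      Rule of thumb: < 0.1 stable, 0.1-0.25 moderate drift, > 0.25 major drift."""
      eps = 1e-6   # avoid log(0) on empty bins
      return sum(
          (a - e) * math.log((a + eps) / (e + eps))
          for e, a in zip(expected, actual)
      )

  train_dist = [0.25, 0.25, 0.25, 0.25]             # feature histogram at training time
  stable = psi(train_dist, [0.24, 0.26, 0.25, 0.25])   # tiny shift
  drifted = psi(train_dist, [0.05, 0.15, 0.30, 0.50])  # mass moved to high bins
  ```

  PSI only flags marginal input drift; the monitoring jobs listed above still have to catch label delay, calibration decay, and feedback-loop effects that PSI cannot see.
  
  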