Tag: path-alignment
19 topics
- Synthetic Data Generation for Post-Training. Modern instruction-tuning and RL-based alignment rely on LLM-generated synthetic data: Self-Instruct / Evol-Instruct expand seed prompts, teacher models produce high-quality completions, and process-reward models validate chain-of-thought steps. The backbone of the post-ChatGPT post-training stack.
- Self-Rewarding Language Models. Train a model to both generate responses and judge them, using the same weights. Each iteration: generate a pool of candidates, self-rate them, extract preference pairs, DPO-train on the pairs (sketch below). Recursive improvement without external reward data; bottlenecks surface around judgement quality and diversity collapse.
- ORPO (Odds-Ratio Preference Optimization). A reference-model-free preference-optimisation objective that combines SFT and preference learning in one loss: \( \mathcal{L}_{\text{ORPO}} = \mathcal{L}_{\text{SFT}} - \lambda \log \sigma(\log \text{odds}(y_w) - \log \text{odds}(y_l)) \). Eliminates DPO's reference-model requirement, halving training memory (sketch below).
- SimPO (Simple Preference Optimization). Reference-free preference objective that replaces DPO's reference ratio with a length-normalised policy log-likelihood: \( r(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y\mid x) \). Adds a target margin \( \gamma \) between the preferred and rejected rewards (sketch below). Simpler than DPO, matches or beats it, and length normalisation reduces verbosity exploitation.
- IPO (Identity Preference Optimization). Replaces DPO's sigmoid objective with a squared-error criterion on preference probabilities: \( \mathcal{L}_{\text{IPO}} = \mathbb{E}[(h_\theta(y_w, y_l) - 1/(2\beta))^2] \), where \( h_\theta \) is the difference of policy-to-reference log-ratios for the preferred and rejected responses (sketch below). Prevents DPO's tendency to over-separate preferred and rejected responses on easy pairs, reducing overfitting and improving generalisation.
- RLVR: RL from Verifiable Rewards. Train a reasoning policy with pure RL signals from tasks whose answers are automatically verifiable: math (exact match), code (unit-test execution), proof (checker), chess (engine eval). No preference model needed (sketch below). The method behind DeepSeek-R1-Zero and the o1-style long-CoT reasoning families.
- Reward Hacking & Specification Gaming in RLHF. When a learned reward model is a proxy for human preference, RL optimisation finds adversarial inputs that maximise reward without matching the true objective: verbose apologies, sycophancy, confident wrong answers, format exploitation. Goodhart's law in practice. Mitigations range from KL penalty and reward normalisation to process supervision and debate (sketch below).
- Refusal Training & Harmlessness Objectives. Teach a model to decline unsafe or out-of-policy queries while remaining helpful on benign ones. Typical recipe: SFT on curated refusal examples, preference optimisation with harm-labelled pairs, red-team-driven iteration. Badly done, this causes over-refusal (the 'brittle helpfulness' regime); done well, it achieves a high refusal rate on unsafe queries with minimal utility loss.
- Transcoders & Sparse Crosscoders. Interpretability models that learn sparse dictionaries spanning more than one activation space: transcoders approximate a layer's input-to-output map with sparse features, while crosscoders tie features across layers (or across models) rather than explaining one layer in isolation. Both are used to trace how a concept is transformed, preserved, or split as it moves through a network.
- Causal Scrubbing & Mediation Analysis. Rigorous protocols for validating interpretability hypotheses. Causal scrubbing replaces the hypothesised-irrelevant computations with samples from a distribution that should preserve the output; mediation analysis tests whether a candidate component mediates the causal effect of an input on an output. Tools for turning 'this feature looks meaningful' into falsifiable claims.
- Circuit Discovery Pipelines (ACDC, Attribution Patching). Automated methods to locate the minimal sub-graph of attention heads and MLP components responsible for a given behaviour. ACDC greedily ablates edges in a causal graph, keeping only those whose removal degrades the behaviour; attribution patching approximates the effect of every candidate patch at once with a first-order, gradient-based estimate from a single forward and backward pass (sketch below).
- Representation Engineering & the Refusal Direction. Shift model behaviour by directly manipulating residual-stream activations along interpretable directions. The 'refusal direction' (Arditi et al. 2024) is a single direction in activation space whose ablation jailbreaks open-weight chat models, and whose injection forces refusal; evidence that safety training installs a shallow, targeted feature (sketch below).
- Concept Erasure & Null-Space Projection. Remove a protected concept (gender, ethnicity, refusal, a specific memory) from representations by iteratively projecting activations onto the null space of linear classifiers for that concept (sketch below). Achieves provable linear guardedness, meaning no linear classifier can recover the erased attribute from the projected representations, with bounded utility loss.
- Membership Inference Attacks (MIA). Determine whether a specific example was in the training set of a deployed model. Attacks exploit loss / confidence gaps between seen and unseen examples: trained models are typically more confident on memorised training points. A baseline for privacy leakage in ML systems (sketch below).
- Model Stealing & Extraction Attacks. Model stealing attacks recover a useful copy of a deployed model by querying it and training a substitute on the outputs. Extraction attacks go further and try to recover hidden parameters, decision rules, or embeddings directly, which matters for both proprietary models and privacy-sensitive systems.
- Data Poisoning & Backdoor Attacks. Insert malicious training examples so the model learns a targeted behaviour: misclassification on a trigger pattern, backdoored refusal bypasses, or degraded accuracy on specific classes. BadNets demonstrated pixel-trigger backdoors; modern LLM poisoning targets instruction-tuning and preference data as well as web-scale pretraining corpora.
- LLM Watermarking (Kirchenbauer et al.). Embed a statistical signature into generated text that is invisible to humans but detectable by an algorithm with the watermarking secret. Kirchenbauer et al. (2023) partition the vocabulary into a pseudo-random green / red list per step, biasing generation toward green; later detection uses a \( z \)-test on green-token frequency (sketch below).
- Dangerous-Capability Evaluations (Bio, Cyber, Persuasion, Autonomy). Targeted tests for whether a model can meaningfully assist with high-consequence harms such as bio misuse, cyber offence, persuasive manipulation, or autonomous scheming. Used as deployment-gating evidence because ordinary benchmark gains do not tell you whether a model has crossed a safety-relevant threshold.
- Prompt Injection: Taxonomy & Defences. Adversarial instructions embedded in model-accessible content (tool outputs, retrieved documents, emails) that override the user's original task. Direct (in the user prompt) vs indirect (in external content). Defences include input filtering, dual-model separation, and structured prompt templates; none is a complete solution.
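
The sketches below are illustrative code for several of the entries above; helper names, hyperparameters, and data conventions are assumptions, not canonical implementations. First, one iteration of the self-rewarding loop: `generate`, `judge`, and `dpo_train` are hypothetical callables standing in for the model's sampling, self-scoring, and training interfaces.

```python
from typing import Callable, List, Tuple

def self_rewarding_iteration(
    prompts: List[str],
    generate: Callable[[str, int], List[str]],                 # hypothetical: sample k candidates for a prompt
    judge: Callable[[str, str], float],                        # hypothetical: same model scores (prompt, response)
    dpo_train: Callable[[List[Tuple[str, str, str]]], None],   # hypothetical: DPO update on (prompt, chosen, rejected)
    k: int = 4,
) -> None:
    """One self-rewarding iteration: generate a candidate pool, self-rate it,
    turn the best / worst candidates into preference pairs, and DPO-train on them."""
    pairs: List[Tuple[str, str, str]] = []
    for prompt in prompts:
        candidates = generate(prompt, k)
        scores = [judge(prompt, c) for c in candidates]
        ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0], reverse=True)
        # Only keep pairs where the judge actually separates best from worst
        if len(ranked) >= 2 and ranked[0][0] > ranked[-1][0]:
            pairs.append((prompt, ranked[0][1], ranked[-1][1]))
    dpo_train(pairs)
```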
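
A minimal PyTorch sketch of the ORPO loss, assuming `logp_chosen` / `logp_rejected` are summed token log-probabilities of each response under the policy; the odds use the length-normalised sequence probability.

```python
import torch
import torch.nn.functional as F

def orpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
              len_chosen: torch.Tensor, len_rejected: torch.Tensor,
              lam: float = 0.1) -> torch.Tensor:
    """ORPO objective from summed token log-probs of the chosen / rejected responses.

    logp_*: summed log-likelihood of each response under the policy, shape (batch,)
    len_*:  response lengths in tokens, shape (batch,)
    """
    # Length-normalised log-likelihoods, i.e. log P(y|x) per response
    avg_w = logp_chosen / len_chosen
    avg_l = logp_rejected / len_rejected
    # log odds(y) = log P - log(1 - P); the clamp keeps log1p away from -1
    log_odds_w = avg_w - torch.log1p(-torch.exp(avg_w).clamp(max=1 - 1e-6))
    log_odds_l = avg_l - torch.log1p(-torch.exp(avg_l).clamp(max=1 - 1e-6))
    # Odds-ratio term: push the chosen response's odds above the rejected one's
    l_or = -F.logsigmoid(log_odds_w - log_odds_l)
    # SFT term: per-token NLL on the chosen response
    l_sft = -avg_w
    return (l_sft + lam * l_or).mean()
```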
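
SimPO reduces to a logistic loss on length-normalised rewards with a target margin; a minimal sketch under the same summed-log-prob convention:

```python
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
               len_chosen: torch.Tensor, len_rejected: torch.Tensor,
               beta: float = 2.0, gamma: float = 0.5) -> torch.Tensor:
    """SimPO: logistic loss on length-normalised implicit rewards with margin gamma."""
    r_w = beta * logp_chosen / len_chosen      # reward of the preferred response
    r_l = beta * logp_rejected / len_rejected  # reward of the rejected response
    return -F.logsigmoid(r_w - r_l - gamma).mean()
```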
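
IPO regresses the policy-vs-reference preference margin onto the constant \( 1/(2\beta) \); a minimal sketch (unlike ORPO and SimPO it still needs reference-model log-probs):

```python
import torch

def ipo_loss(logp_w: torch.Tensor, logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """IPO: regress the implicit preference margin onto the constant 1 / (2 * beta)."""
    # h = (policy-vs-reference log-ratio on the preferred) - (same on the rejected)
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return ((h - 1.0 / (2.0 * beta)) ** 2).mean()
```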
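
Two toy verifiable-reward functions in the RLVR spirit: exact match for math answers and unit-test execution for code. The subprocess runner is a deliberate simplification (no sandboxing, resource limits, or answer parsing), and the function names are illustrative.

```python
import subprocess
import sys
import tempfile

def math_reward(model_answer: str, gold_answer: str) -> float:
    """Exact-match verifier: 1.0 iff the final answer string equals the gold answer."""
    return 1.0 if model_answer.strip() == gold_answer.strip() else 0.0

def code_reward(solution: str, test_code: str, timeout_s: int = 5) -> float:
    """Unit-test verifier: run the candidate solution plus its tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path], capture_output=True, timeout=timeout_s)
        return 1.0 if result.returncode == 0 else 0.0
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating or too-slow solutions score zero
```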
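
The KL-penalty mitigation for reward hacking is usually applied as per-token reward shaping against a frozen reference policy; a minimal sketch with assumed tensor shapes:

```python
import torch

def kl_shaped_reward(rm_scores: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.05) -> torch.Tensor:
    """Shape rewards for PPO-style RLHF: penalise divergence from the reference policy
    at every token and add the reward-model score at the final token.

    rm_scores: (batch,) scalar RM scores; *_logprobs: (batch, seq) per-token log-probs.
    """
    per_token_kl = policy_logprobs - ref_logprobs   # sample-based KL estimate per token
    shaped = -beta * per_token_kl
    shaped[:, -1] = shaped[:, -1] + rm_scores        # terminal reward from the reward model
    return shaped
```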
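
A minimal sketch of the attribution-patching score: the effect of patching a component's activation from the clean run to the corrupted run is approximated first-order as the activation difference dotted with the clean-run gradient of the metric. Capturing the cached tensors with hooks is model-specific and omitted.

```python
import torch

def attribution_patching_scores(clean_acts: dict, corrupt_acts: dict,
                                clean_grads: dict) -> dict:
    """First-order estimate of each component's patching effect on a metric.

    For a cached activation a, the change in the metric if a were patched from its
    clean value to its corrupted value is approximated as
        delta ~= (a_corrupt - a_clean) . d(metric)/da   (gradient taken on the clean run).
    All three dicts map component names to tensors of identical shape, captured with
    forward / backward hooks on a clean run plus a forward pass on the corrupted run.
    """
    return {
        name: torch.sum((corrupt_acts[name] - clean_acts[name]) * clean_grads[name]).item()
        for name in clean_acts
    }
```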
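
A minimal sketch of the difference-of-means refusal direction and projection ablation, assuming residual-stream activations have already been collected at a chosen layer and token position for harmful and harmless prompts.

```python
import torch

def refusal_direction(acts_harmful: torch.Tensor, acts_harmless: torch.Tensor) -> torch.Tensor:
    """Unit difference-of-means direction between harmful- and harmless-prompt activations.

    acts_*: residual-stream activations at one layer / position, shape (n, d_model).
    """
    direction = acts_harmful.mean(dim=0) - acts_harmless.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(activations: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component of every activation along the refusal direction."""
    coeffs = activations @ direction                     # projection coefficient per activation
    return activations - coeffs.unsqueeze(-1) * direction
```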
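
An iterative null-space projection (INLP-style) sketch: repeatedly fit a linear probe for the concept, project its direction out, and compose the projectors. `scikit-learn`'s logistic regression stands in for the concept classifier.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def inlp_projection(X: np.ndarray, y: np.ndarray, n_iters: int = 10) -> np.ndarray:
    """Iterative null-space projection: compose projectors that hide a binary concept.

    X: (n, d) representations; y: (n,) binary labels for the concept to erase.
    Returns a (d, d) matrix P such that X @ P is (approximately) linearly guarded.
    """
    d = X.shape[1]
    P = np.eye(d)
    X_proj = X.copy()
    for _ in range(n_iters):
        clf = LogisticRegression(max_iter=1000).fit(X_proj, y)
        w = clf.coef_ / np.linalg.norm(clf.coef_)   # (1, d) unit direction used by the probe
        P_i = np.eye(d) - w.T @ w                   # projector onto the probe's null space
        P = P @ P_i                                 # accumulate so that X @ P == X_proj
        X_proj = X_proj @ P_i                       # remove that direction, then refit
    return P
```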
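
A loss-threshold membership-inference baseline, the simplest attack in this family: predict 'member' when an example's loss falls below a threshold calibrated on known members and non-members. Variable names are illustrative.

```python
import numpy as np

def loss_threshold_attack(losses: np.ndarray, threshold: float) -> np.ndarray:
    """Predict membership (1) when per-example loss is below the threshold."""
    return (losses < threshold).astype(int)

def calibrate_threshold(member_losses: np.ndarray, nonmember_losses: np.ndarray) -> float:
    """Pick the threshold maximising balanced accuracy on a calibration split."""
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(np.concatenate([member_losses, nonmember_losses])):
        acc = 0.5 * ((member_losses < t).mean() + (nonmember_losses >= t).mean())
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```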
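
A toy detection sketch for the Kirchenbauer-style watermark: re-derive each step's pseudo-random green list from the previous token and the secret key, count green hits, and compute a z-score against the null hypothesis of a \( \gamma \) hit rate. The hash-based seeding here is illustrative, not the paper's exact construction.

```python
import hashlib
import math
import random
from typing import List, Set

def green_list(prev_token: int, vocab_size: int, gamma: float = 0.5,
               key: str = "secret") -> Set[int]:
    """Pseudo-random green list for one step, seeded by the previous token and a secret key."""
    seed = int(hashlib.sha256(f"{key}:{prev_token}".encode()).hexdigest(), 16)
    rng = random.Random(seed)
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def watermark_z_score(tokens: List[int], vocab_size: int, gamma: float = 0.5,
                      key: str = "secret") -> float:
    """One-proportion z-test on the fraction of tokens falling in their step's green list."""
    n = len(tokens) - 1
    hits = sum(
        1 for prev, tok in zip(tokens[:-1], tokens[1:])
        if tok in green_list(prev, vocab_size, gamma, key)
    )
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```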