Top Five Ideas from Every Class
May 2024
At junior year’s end, that cliff face of wondering and flight, I record again the top insights.
6.7800 Inference and Information
- Exponential families are powerful. They come in the form P = { p_y(y; x) = exp(\lambda(x) t(y) - \alpha(x) + \beta(y)) }. Lambda is the natural parameter, t is the natural statistic, alpha is the log partition function, and beta is the log base distribution. When your model family is exponential, life is good. You get a finite-dimensional sufficient statistic t. When \lambda(x) = x, the family is called canonical. Then \E[t(y)] = \dot{\alpha}(x), where \dot{...} denotes a derivative; the Fisher information is J_y(x) = \ddot{\alpha}(x), where \ddot{...} denotes a second derivative; \alpha(x) is strictly convex; and the family has a conjugate prior family with natural statistic [\lambda(x) ; \alpha(x)]. Exponential families with parameters in \R^d just take a natural statistic t : \mathcal{Y} \to \R^d and use a dot product for \lambda(x) \cdot t(y). (A numerical check of these identities appears in the first sketch after this list.)
- Exponential families are geodesics in the probability simplex. The geometric mean p_0^{1-x} p_1^x / Z(x) connects p_0 to p_1 (second sketch after this list). A linear family is \mathcal{L} = { p_y(y; x) : \E[t_k(y)] = \bar{t}_k, k = 0, ..., K-1 }. An exponential family with natural statistic t matching a linear family is orthogonal to that linear family. Choosing the log base function roots the exponential family so that it passes through a specific distribution q. An analog of Pythagoras's theorem holds: for all p \in \mathcal{L}, D(p||q) = D(p||p^*) + D(p^*||q) for p^* the I-projection of q on to \mathcal{L}, p^* = \argmin_{p' \in \mathcal{L}} D(p'||q). Gaussian, Poisson, Beta, Bernoulli, Geometric, and Categorical are all exponential families.
- Entropy, mutual information, and divergence all tell a story. Entropy is the expected bits to describe a realization of a random variable. Mutual information is the expected bits saved in describing a realization of X once a realization of Y is known. Divergence D(p||q) is the expected extra bits to describe draws from p using a code designed for q. The Data Processing Inequality says if x<-->y<-->t (statistic), then I(x; y) \geq I(x; t), with equality iff x<-->t<-->y (sufficient statistic), where the arrow notation denotes a Markov chain. Notation \mathcal{X} denotes the alphabet of parameters, \mathcal{Y} the alphabet of data. Notation y_n denotes the nth element of a sequence, while y^N denotes a sequence of N. For example, we could write p_{y^N | x} or p_{x | y^N}. It's useful to write subscripts on probability distributions, like p_{y|x}(y|x), so that the function itself is unique: p_{y|x}. It's useful to write upright font for random variables and italic font for realizations of random variables. (These quantities are computed for a toy joint in the third sketch after this list.)
- Universal inference says that we can converge to the true distribution after learning from enough data, so long as the true parameter exists in our model and the model capacity grows sublinearly in the data length. Model capacity is C = \max_{p_x} I(x; y). The argmax prior p_x is called the least informative prior. The mixture q(y) = \sum_x p_x(x) p_y(y; x) over p_x satisfies equidistance: D(p_y(\cdot; x)||q) = C whenever p_x(x) > 0 and |\mathcal{X}| < \infty. The universal predictor is a mixture over all parameters, weighted by likelihood of explaining the data seen so far (fourth sketch after this list). Interpretation: ask everyone you know; heed the advice according to each person's demonstrated prior success; listen to everybody, even if just a little; this way you will always converge to the actual truth, given time.
- Typical sets let us quantify extreme events. A sequence is typical if its empirical entropy (its negative log-probability per symbol) is within \epsilon of the entropy of the distribution that generated it. By the weak law of large numbers, the typical set carries almost all the probability as N grows. The probability of a sequence drawn from q looking typical for p decays as 2^{-ND(p||q)}. Sanov's theorem: if data is generated from q, the chance of an extreme event (the empirical distribution landing in a set S) decays as 2^{-ND(p^*||q)}, where p^* = \argmin_{p \in cl(S)} D(p||q) is the I-projection of q on to S. Conditional limit theorem: the way the extreme event occurs is as p^*, asymptotically. Example: if we roll a die 1000 times and observe that the sum of the rolls is 5000, it's extremely rare, but what is the likeliest way it happened? Answer: \argmin_{p \in cl(S)} D(p||q) for S = { p_y : \E[y] \geq 5 } and q = \Unif, over alphabet \mathcal{Y} = {1,2,3,4,5,6} (computed in the fifth sketch after this list).
- Machine learning is all about finding the maximum likelihood estimator \hat{x}^{ML}. Minimizing cross-entropy finds it, since cross-entropy is really just an entropy plus a divergence. In an exponential family, the asymptotics go as \hat{x}^{ML} \to \mathcal{N}(x, 1/(NJ_y(x))); the ML estimate is the parameter that matches the sample natural statistics. The Cramér-Rao bound says an unbiased \hat{x} satisfies \var(\hat{x} ; x) \geq 1/J_y(x). But there are other estimators, too. To distinguish between hypotheses, you can do no better than to decide based on the likelihood ratio of the data under each hypothesis (final sketch after this list). The choice of cutoff reflects an operating frontier of tradeoffs between detection probability and false-alarm probability, often chosen to optimize failure costs.
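On the exponential-family bullet: a minimal numerical check, assuming the Bernoulli family in canonical form with t(y) = y and \alpha(x) = \log(1 + e^x); the function names and the finite-difference step are illustrative choices, not anything from the course.

```python
import numpy as np

# Bernoulli as a canonical exponential family:
# p(y; x) = exp(x * y - alpha(x)), with natural statistic t(y) = y
# and log partition function alpha(x) = log(1 + e^x).
def alpha(x):
    return np.log1p(np.exp(x))

x, h = 0.7, 1e-5                # natural parameter (the logit); step size
p1 = np.exp(x * 1 - alpha(x))   # P(y = 1)
p0 = np.exp(x * 0 - alpha(x))   # P(y = 0)
assert np.isclose(p0 + p1, 1.0)

# E[t(y)] should match the first derivative of alpha at x.
mean_t = 1 * p1 + 0 * p0
alpha_dot = (alpha(x + h) - alpha(x - h)) / (2 * h)
print(mean_t, alpha_dot)        # both ~0.668, the sigmoid of 0.7

# Var[t(y)] should match the second derivative (the Fisher information).
var_t = p1 * (1 - p1)
alpha_ddot = (alpha(x + h) - 2 * alpha(x) + alpha(x - h)) / h**2
print(var_t, alpha_ddot)
```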
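On the geodesics bullet: a sketch of the geometric-mean path between two distributions on a four-letter alphabet; the particular p_0 and p_1 are made up for illustration.

```python
import numpy as np

# Geometric bridge p_x(y) = p0(y)^(1-x) * p1(y)^x / Z(x), for x in [0, 1].
# This one-parameter curve is an exponential family with natural statistic
# t(y) = log(p1(y)/p0(y)), rooted at base distribution p0.
p0 = np.array([0.7, 0.1, 0.1, 0.1])
p1 = np.array([0.1, 0.2, 0.3, 0.4])

def bridge(x):
    unnormalized = p0 ** (1 - x) * p1 ** x
    return unnormalized / unnormalized.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    p = bridge(x)
    print(f"x={x:.2f}  D(p||p0)={kl(p, p0):.3f}  D(p||p1)={kl(p, p1):.3f}")
# As x sweeps 0 -> 1, p marches from p0 to p1 in divergence.
```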
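On the entropy bullet: computing all three quantities in bits for a toy 2x2 joint distribution (the joint is invented for illustration), and checking the identity I(x; y) = D(p_{xy} || p_x p_y).

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])     # toy joint distribution over x, y
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

def H(p):                         # entropy in bits
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def D(p, q):                      # KL divergence in bits
    return np.sum(p * np.log2(p / q))

# Mutual information two ways: bits saved describing x once y is known,
# and divergence between the joint and the product of the marginals.
I = H(p_x) + H(p_y) - H(p_xy.flatten())
print(I, D(p_xy.flatten(), np.outer(p_x, p_y).flatten()))  # both ~0.278
```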
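On the universal inference bullet: a sketch of the mixture predictor for coin flips, assuming a uniform grid prior over the bias; the grid size, seed, and true bias are arbitrary choices.

```python
import numpy as np

# Universal prediction for coin flips: mix over candidate Bernoulli
# parameters, reweighting each by its likelihood on the data so far.
rng = np.random.default_rng(0)
grid = np.linspace(0.01, 0.99, 99)   # candidate biases x
log_w = np.zeros_like(grid)          # log weights (uniform prior)
true_x = 0.8

for n in range(200):
    y = rng.random() < true_x        # observe one flip
    log_w += np.log(grid) if y else np.log(1 - grid)

w = np.exp(log_w - log_w.max())
w /= w.sum()
predict = np.sum(w * grid)           # mixture probability of heads
print(f"after 200 flips, predicted P(heads) = {predict:.3f}")  # -> ~0.8
```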
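On the typical-sets bullet: solving the die example numerically, using the fact that the I-projection onto this linear family takes the exponential-tilt form p^*(y) \propto e^{\lambda y}, with \lambda chosen so the mean is exactly 5.

```python
import numpy as np
from scipy.optimize import brentq

# Conditional limit theorem for the die: given an average roll of 5,
# the likeliest empirical distribution is the exponential tilt of the
# uniform die whose mean equals 5.
y = np.arange(1, 7)

def tilted_mean(lam):
    w = np.exp(lam * y)
    return np.sum(y * w) / w.sum()

lam = brentq(lambda l: tilted_mean(l) - 5.0, 0.0, 10.0)
p_star = np.exp(lam * y) / np.exp(lam * y).sum()
print(lam)      # positive tilt favoring high faces
print(p_star)   # mass skews heavily toward 5s and 6s
```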
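On the estimation bullet: a Monte Carlo sketch of a likelihood ratio test between two Gaussian hypotheses for the mean; the hypotheses, sample size, and cutoffs are invented for illustration.

```python
import numpy as np

# H0: i.i.d. N(0, 1) samples; H1: i.i.d. N(1, 1) samples. The log
# likelihood ratio reduces to sum(y - 0.5); sweeping the cutoff trades
# detection probability against false-alarm probability.
rng = np.random.default_rng(1)
N, trials = 10, 5000

def log_lr(y):
    return np.sum(y - 0.5, axis=-1)

lr_h0 = log_lr(rng.normal(0.0, 1.0, (trials, N)))
lr_h1 = log_lr(rng.normal(1.0, 1.0, (trials, N)))
for cutoff in [0.0, 2.5, 5.0]:
    p_fa = np.mean(lr_h0 > cutoff)   # false-alarm probability
    p_d = np.mean(lr_h1 > cutoff)    # detection probability
    print(f"cutoff={cutoff}: P_FA={p_fa:.3f}, P_D={p_d:.3f}")
```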
“Who’s excited to hear some more about Harrison and his clocks?” - Professor Greg Wornell
8.044 Statistical Physics
- The microcanonical ensemble is a closed, isolated system at constant energy: all microstates are equally likely (the ergodic hypothesis). The canonical ensemble is a closed system thermally connected to a bath, with probability of exploring a microstate with energy E proportional to e^{-\beta E}. The grand canonical ensemble is a system thermally and permeably connected to a particle reservoir, with probability of exploring a microstate with energy E and particle count N proportional to e^{-\beta(E - \mu N)} for \mu the chemical potential (the energy to add another particle). Insight: softmax from machine learning follows thermal-equilibrium Boltzmann statistics, p(E) \propto e^{-E / k_B T}: logits play the role of negative energies, and the user selects the temperature (first sketch after this list).
- Four thermodynamic potentials: U is internal energy; H = U + PV is enthalpy, which includes the energy to push the atmosphere out of the way to make space for the system; F = U - TS is Helmholtz free energy, the right potential at constant volume; G = U + PV - TS is Gibbs free energy, the right potential at constant pressure (like chemistry experiments, vials open to the air). Reactions are spontaneous if and only if the relevant free energy decreases (second sketch after this list). Free energy includes a tax for decreasing entropy, forcing us to increase the entropy of the surrounding reservoir even more to sustain the second law of thermodynamics: entropy is nondecreasing. Entropy is S = k_B \ln(\#microstates). Temperature is 1/T = \partial S/\partial U holding volume fixed.
- All of thermodynamics falls out of the partition function, Z = \sum_j e^{-\beta E_j}. Average internal energy is U = d ln(Z) / d(-\beta), just like cumulant generating functions in exponential families. Helmholtz free energy is F = -k_B T ln(Z). Since also F = U - TS, we can apply partial-derivative-jitsu to rewrite dF = dU - d(TS) = (TdS - PdV) - (TdS + SdT) = -PdV - SdT, using the first law of thermodynamics that dU = d(heat) + d(work) = TdS - PdV. Reversible heat is Q = TdS; pressure-volume work is -PdV. Therefore \partial F/\partial V|_T = -P and \partial F/\partial T|_V = -S, with the bar notation denoting which variable is held fixed while taking the partial derivative. Similar partial-derivative-jitsu can recover innumerable relations between thermodynamic quantities. For example, dH = dU + d(PV) = (TdS - PdV) + (PdV + VdP) = TdS + VdP; in conjunction with dH = (\partial H/\partial S|_P) dS + (\partial H/\partial P|_S) dP, which holds because H is a state function and so dH is an exact differential, we simply read out the matches to learn that T = \partial H/\partial S|_P and V = \partial H/\partial P|_S. Equating mixed second partials of such expressions yields the Maxwell relations. (A numerical two-level-system example appears in the third sketch after this list.)
- The most efficient heat engine (converting heat into work) has efficiency 1 - T_-/T_+. Hotter highs and lower lows increase efficiency. For a typical T_+ = 500 K and T_- = 300 K, the maximum efficiency would be 40%. Refrigerators are heat engines in reverse; they transport heat from a cold object into a hot object. Heat pumps for homes are the same idea. The efficiency of heat pumps is bounded by T_+ / (T_+ - T_-). For a typical T_+ = 300 K and T_- = 270 K, the maximum efficiency would be 1000%. Brilliant!
- Thermodynamics is all about equilibrium. When change happens, as in an engine, the change must be quasi-static (infinitely slow); reversibility means the tiniest nudge would cause the process to proceed in the other direction. Real life isn't quasi-static or reversible. Non-equilibrium thermodynamics is a fascinating field, whose ideas gave rise to diffusion models in machine learning in 2015, through joint work of Jascha Sohl-Dickstein and Surya Ganguli done while Jascha was a resident scholar at Khan Academy.
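On the ensembles bullet: a minimal sketch of softmax as Boltzmann sampling; the logits and temperatures are chosen arbitrarily.

```python
import numpy as np

# Softmax with temperature is a Boltzmann distribution: treat each logit
# as a negative energy, divide by T, exponentiate, normalize.
def softmax(logits, T=1.0):
    z = np.exp((logits - logits.max()) / T)   # subtract max for stability
    return z / z.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax(logits, T=0.1))   # low T: nearly deterministic argmax
print(softmax(logits, T=1.0))   # T = 1: ordinary softmax
print(softmax(logits, T=10.0))  # high T: nearly uniform exploration
```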
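On the potentials bullet: the spontaneity criterion as arithmetic, with made-up (not physical) reaction numbers showing how an entropy gain can pay the enthalpy bill above a crossover temperature.

```python
# Spontaneity at constant T and P: delta_G = delta_H - T * delta_S < 0.
delta_H = 10_000.0   # J/mol: endothermic, absorbs heat
delta_S = 50.0       # J/(mol K): entropy increases

for T in [100.0, 200.0, 300.0]:
    delta_G = delta_H - T * delta_S
    print(f"T={T:.0f} K: delta_G={delta_G:+.0f} J/mol, "
          f"spontaneous: {delta_G < 0}")
# Crossover at T = delta_H / delta_S = 200 K.
```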
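On the partition-function bullet: extracting U, F, and S from Z for a two-level system with energies 0 and 1 (the units with k_B = 1 are an arbitrary convenience here).

```python
import numpy as np

# Two-level system: Z = 1 + e^(-beta * eps). Then U = -d ln(Z)/d beta,
# F = -kT ln(Z), and S = (U - F)/T.
kB, eps = 1.0, 1.0

def lnZ(beta):
    return np.log(1 + np.exp(-beta * eps))

for T in [0.2, 1.0, 5.0]:
    beta, h = 1.0 / (kB * T), 1e-6
    U = -(lnZ(beta + h) - lnZ(beta - h)) / (2 * h)  # numerical -d lnZ/d beta
    F = -kB * T * lnZ(beta)
    S = (U - F) / T
    print(f"T={T}: U={U:.3f}, F={F:.3f}, S={S:.3f}")
# As T grows, U -> eps/2 and S -> kB*ln(2), as expected for two states.
```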
"Today's an exciting day, guys. We'll be using a lot of chalk!" - Professor Richard Fletcher
"The derivation just takes a couple boards." - Professor Richard Fletcher
21A.550 DV Lab: Documenting Science through Video and New Media
- Ask "when" questions in an interview. These will elicit stories, which are the raw material of an emotional journey. Another tip is to let one crew member handle all the tech while the other converses with the participant to make them feel comfortable. Start filming before everything is set up so there is no scary moment when you say, "Now we're rolling." Errol Morris's technique of waiting uncomfortably long after a response can often elicit further, deeper reflection from the subject. Balance it with kindness. Interviewees should answer in full thoughts, introduce themselves at the beginning by spelling their name, and often be asked to repeat lines to get better takes. Remember to sign release forms.
- Audio is more important than visuals. Use a lavalier (lav) mic when possible, or else a shotgun mic (mounted on top of the camera) or a boom mic (held on a rod just above the frame). The mic connects through a wire to a mixer box that screws under the camera, whence it connects via a short curly wire to the camera. It's best to record between -18 and -6 dB, because the goal when recording is never to clip; we can adjust the volume in post. For the camera, set shutter speed to twice the framerate (e.g., 48 for 24). Set exposure in the ballpark of 400 for inside, 2000 for outside. Raising the f-stop between 1 and 5 or so reduces light intake by shrinking the aperture (whose diameter is inversely proportional to the f-stop). For a moving object, without loss of generality moving right, frame it in the left third of the screen to let the movement breathe.
- Adobe Premiere is solid editing software. Key concepts include keyframes (continuous transitions of any feature, e.g., volume, opacity, speed), sequences (to divide up work), traversal of the footage (j scoots left, k pauses, l scoots right, where multiple taps cause fast-forwarding), preview clips (i to begin a selection, o to end a selection, comma or period to place it into the timeline), folder structure (capture, footage, audio, etc.), and autosave. Adobe Podcast lets you edit speech like a Google doc, plus enhances speech with AI. Udio creates AI music.
- Dziga Vertov's dream was to create a language of movement to write video, like how musical notation writes sound. The closest humanity has gotten so far is the vector-quantized latent space of diffusion models, on whose tokens we can train an autoregressive transformer.
- When filming a documentary, surround yourself with luminous advisors. Wyatt Roy, Chris Boebel, and others are the thinking partners who will enrich the journey and dispense timely wisdom about story arc, characters, and technical questions.
6.S953 Embodied Intelligence
- The literature in AI and robotics has lots and lots of ideas from the 1990s, 2000s, and 2010s, before the field converged on the deep learning paradigm. It is useful to understand where a field comes from, especially because meditating on old ideas can often help to break the present methodological dogma. For example, an old method like RRT-Connect may find new application by operating in the rich representation space of a frontier model. In robotics, Fast, Cheap, and Out of Control (Brooks and Flynn, 1989) proposes that we should explore other planets like Mars not with one big specialized robot, but with hundreds of lightweight insect robots.
- Curiosity means exploring states for their own sake, not strictly for reward. The question is what constitutes interesting or helpful states to explore, since exploring everything takes too long (Lehman and Stanley, 2011). Planning as Heuristic Search (Bonet and Geffner, 2000) framed the problem as finding useful heuristics to guide graph search algorithms, with no curiosity. One approach to curiosity is IW(k) (Lipovetzky and Geffner, 2012), which iteratively explores in a hierarchy from more fundamental to less fundamental new information, called the width. Another approach is Upper Confidence Bounds applied to Trees (UCT; Kocsis and Szepesvári, 2006), which encourages exploration by adding to a state's historical sample reward a curiosity term like sqrt(ln(T)/N), for T the total timesteps and N the number of times the agent has selected the particular action so far (first sketch after this list). Therefore, all neglected actions get selected infinitely often, driving average regret to zero in the limit. Monte Carlo Tree Search (MCTS) with random rollouts is another technique, with landmark achievements in self-play as in AlphaZero (Silver et al., 2017).
- Reinforcement learning is tricky to get right, because noisy experiments require many random seeds to prove effects (Deep RL that Matters; Henderson et al., 2017). Often, newfangled algorithms like RL^2 (Duan et al., 2016) do not stand up to scrutiny, with a meta-analysis finding that vanilla approaches like PPO perform equally well or better in replication studies (Assessing Generalization in Deep Reinforcement Learning; Packer et al., 2018). Using 5 random seeds doesn't cut it!
- Mutual information is a useful quantity to maximize as a deep learning objective, e.g., in Contrastive Predictive Coding (van den Oord et al., 2018; second sketch after this list). For multi-goal RL, the right objects to optimize might be value functions that are quasimetric in latent space (Wang et al., 2023), which break symmetry while preserving the other properties of a distance.
- One theme in deep learning is that rich representations are gold, and if there is anything you don't know, you can learn it. For example, in CICERO (FAIR, 2022), the agent behavior-clones human play so that, while planning, it can trade off guesses at human-like play against search over perfectly rational play. Whatever you don't know, learn it.
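On the curiosity bullet: a minimal bandit-flavored sketch of the UCB rule that underlies UCT; the arm payoffs and horizon are invented for illustration.

```python
import numpy as np

# UCB selection: pick the arm maximizing sample mean + sqrt(ln(T)/N).
# Neglected arms accumulate a large bonus and eventually get tried again.
rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])   # hidden Bernoulli arm payoffs
counts = np.ones(3)                      # N per arm (seeded by one pull each)
sums = (rng.random(3) < true_means).astype(float)

for t in range(2, 10_000):
    ucb = sums / counts + np.sqrt(np.log(t) / counts)
    a = int(np.argmax(ucb))
    sums[a] += rng.random() < true_means[a]
    counts[a] += 1

print(counts / counts.sum())   # pulls concentrate on the best arm
```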
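On the mutual-information bullet: a numpy sketch of the InfoNCE objective from CPC, with fake "encoded" vectors standing in for a real encoder; the batch size, dimension, and noise scale are arbitrary.

```python
import numpy as np

# InfoNCE: score every (context, future) pair, then classify the true
# pair against in-batch negatives with softmax cross-entropy. Minimizing
# this loss maximizes a lower bound on I(context; future).
rng = np.random.default_rng(0)
B, d = 8, 16
z_context = rng.normal(size=(B, d))                   # stand-in encodings
z_future = z_context + 0.1 * rng.normal(size=(B, d))  # matched positives

scores = z_context @ z_future.T                 # [B, B] similarities
scores -= scores.max(axis=1, keepdims=True)     # stabilize the softmax
log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
info_nce = -np.mean(np.diag(log_probs))         # true pairs on the diagonal
print(info_nce)
```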
"Horizon-aware RL^2,
Some questions still up in the air,
But thank you to William and Phil
For a semester that was a thrill." - Country jingle, composed with GPT-4o, sung by Udio, thanking Professor Isola and TA William Shen after my final project presentation.
6.3100 Introduction to Controls
- PID control stands for proportional, integral, derivative. We want to make a measured time series variable y_n approach a setpoint y_desired. The difference is the error err_n. The control signal (e.g., a force) we apply is then K_p err_n + K_d err_deriv_n + K_i err_cum_n, where K_p, K_d, and K_i are numbers called gains, err_deriv_n is the finite-difference derivative of the error, and err_cum_n is the cumulative sum of the error over time. Measurements are noisy. The hope is to directly combat error (P), gradually fight steady-state error (I), and instantly correct jerks away from the desired position (D). (A toy loop appears in the first sketch after this list.)
- State space control is more elegant. We track a vector of states x. We evolve by \dot{x} = A x + B u, y = C x, u = K_r r - K x, where u is the vector we control. The entries of x can be quantities that are derivatives of each other, such as position, velocity, and acceleration. The brilliance is that high-order differential equations can be split into a system of many first-order differential equations. And track does not mean measure. The second brilliant idea is to track more quantities than we can measure, updating our guesses of the hidden quantities through their known relationships with the measured ones. State space tracking is smoother and more accurate than PID, and often preferable. (Second sketch after this list.)
- The linear-quadratic regulator (LQR) method is helpful for determining the gains in state space control. LQR minimizes the squared error of tracking each quantity, weighted as we desire, against the control effort spent. (Third sketch after this list.)
- Transfer functions are ratios of polynomials in s that summarize differential equations by applying Laplace-Fourier style transformations, namely d/dt --> s, \int dt --> s^{-1}. We can time-evolve systems given only a transfer function (fourth sketch after this list).
- Magnetic levitation can be achieved with an optical sensor to measure time series distance data along with three coiled electromagnets arranged at right angles to control the wiggling. The magnetic levitation control can send the magnet along arbitrary curves, including up and down along a sine wave.
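On the PID bullet: a toy discrete PID loop driving a made-up first-order plant toward a setpoint; the plant model and gains are illustrative, not from the lab kit.

```python
# Discrete PID on a toy plant dy/dt = -y + u, driven toward y_desired.
y, y_desired = 0.0, 1.0
Kp, Ki, Kd = 2.0, 0.5, 0.1            # illustrative gains
dt, err_cum = 0.01, 0.0
err_prev = y_desired - y

for step in range(5000):
    err = y_desired - y
    err_cum += err * dt                # integral of error
    err_deriv = (err - err_prev) / dt  # finite-difference derivative
    u = Kp * err + Ki * err_cum + Kd * err_deriv
    err_prev = err
    y += dt * (-y + u)                 # integrate the plant forward

print(y)   # settles near 1.0; the I term removes the steady-state error
```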
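On the state-space bullet: full-state feedback on a double integrator (position and velocity), with hand-picked gains; the reference gain K_r is chosen to cancel steady-state offset for this particular A, B, K.

```python
import numpy as np

# Double integrator: x = [position, velocity], x_dot = A x + B u,
# with feedback u = Kr * r - K @ x driving position to the reference r.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[4.0, 2.0]])   # hand-picked gains on position and velocity
Kr, r = 4.0, 1.0             # reference gain and desired position

x, dt = np.zeros((2, 1)), 0.001
for _ in range(10_000):      # simulate 10 seconds with forward Euler
    u = Kr * r - K @ x
    x = x + dt * (A @ x + B @ u)

print(x.ravel())             # -> approximately [1.0, 0.0]
```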
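On the LQR bullet: a sketch computing LQR gains for the same double integrator with scipy; the Q and R weights are arbitrary choices expressing "care about position error ten times more than velocity, and a little about effort".

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# LQR: minimize the integral of x^T Q x + u^T R u subject to
# x_dot = A x + B u. Solve the Riccati equation for P; then K = R^-1 B^T P.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([10.0, 1.0])   # state weights: position error matters most
R = np.array([[1.0]])      # control effort weight

P = solve_continuous_are(A, B, Q, R)
K = np.linalg.solve(R, B.T @ P)
print(K)                   # plug u = -K @ x into the loop above
```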
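On the transfer-function bullet: time-evolving a system from its transfer function alone, using scipy; the second-order H(s) is an invented example.

```python
from scipy import signal

# H(s) = 1 / (s^2 + 2s + 4) encodes y'' + 2y' + 4y = u.
H = signal.TransferFunction([1.0], [1.0, 2.0, 4.0])
t, y = signal.step(H)      # unit step response, no hand-written ODE solver
print(y[-1])               # settles at H(0) = 1/4
```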
"Brrrrrr, BRRRRRR, Brrrrrr, BRRRRRR" - Chopsticks with wings, in lab, whirring to the correct angle of the square wave control.
18.650 Fundamentals of Statistics
- Linear regression from Y = X \beta + \eps, with \eps ~ N(0, \sigma^2), produces a least squares parameter estimate \hat{\beta} ~ N(\beta, \sigma^2 (X^T X)^{-1}). These variances allow us to test whether a coefficient for a feature is far enough from zero to say that the feature matters in the prediction (first sketch after this list).
- The delta method lets us find the variance of smooth functions of sample means. If \sqrt{n}(\overline{X_n} - \mu) \to N(0, \sigma^2), and if g : \R \to \R is continuously differentiable with g'(\mu) \neq 0, then \sqrt{n}(g(\overline{X_n}) - g(\mu)) \to N(0, \sigma^2 g'(\mu)^2). For random vectors, the form is N(0, \nabla g(\mu)^T \Sigma \nabla g(\mu)) for X_n with covariance \Sigma. (Second sketch after this list.)
- The permutation test lets us distinguish control from experiment directly. We gather N samples from each, then pool them all together. We compute the difference in the sample means of the two groups. Then, about 1000 times, we permute the 2N data points and again compute the difference in the sample means of the two halves. By the end, we have a distribution of differences of sample means. The last step is to look at where the original difference of sample means lies in that distribution: the fraction of permuted differences at least as extreme is the p-value. If the original split of the data is that extreme a rarity among the permutations, it is strong evidence that the control and experiment were different. If the original split is typical of the permutations, then we conclude the experiment had little effect. The permutation test was used to detect the Higgs boson! (Third sketch after this list.)
- Usually, you put all the data into creating a sample mean and have no data left to ask about the variance of that estimator. Bootstrapping lets us estimate the variance. If we have N data points, we draw 1000 bootstrap sample means by picking N data points from our pool uniformly at random with replacement; the variance across the bootstrap means estimates the variance of the estimator. For nuance: conditional on the data we happened to gather, the bootstrap variance really reflects the sample variance of that data; but before we even gather data, the expected variance of the bootstrap is correct. (Fourth sketch after this list.)
- The chi-squared goodness of fit test lets us determine whether our data was drawn from a certain categorical distribution. By comparing p_{xy} with p_x p_y, chi-squared can also test for independence over finite alphabets. The Kolmogorov-Smirnov test can distinguish whether samples originated from a certain continuous CDF. The Kolmogorov-Lilliefors test can distinguish whether samples originated from a Gaussian with the observed sample mean and sample variance. (Final sketch after this list.)
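On the regression bullet: a simulation assuming a made-up design with one real feature and one useless one, computing per-coefficient t-statistics from \sigma^2 (X^T X)^{-1}.

```python
import numpy as np

# OLS: beta_hat = (X^T X)^{-1} X^T y, with beta_hat ~ N(beta, s^2 (X^T X)^{-1}).
rng = np.random.default_rng(0)
n, sigma = 200, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([1.0, 2.0, 0.0])      # the third feature is useless
y = X @ beta_true + rng.normal(0.0, sigma, n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
s2 = resid @ resid / (n - X.shape[1])      # unbiased estimate of sigma^2
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
print(beta_hat / se)   # |t| >> 2 for real features, near 0 for the useless one
```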
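On the delta-method bullet: a Monte Carlo check for g(x) = x^2 applied to the sample mean of Exponential(1) draws, where \mu = 1, \sigma^2 = 1, and g'(\mu) = 2; the sample sizes are arbitrary.

```python
import numpy as np

# Delta method predicts Var(sqrt(n) * (g(mean) - g(mu))) -> sigma^2 * g'(mu)^2.
rng = np.random.default_rng(0)
n, reps = 1000, 20_000
means = rng.exponential(1.0, (reps, n)).mean(axis=1)
lhs = np.sqrt(n) * (means**2 - 1.0)   # g(x) = x^2, g(mu) = 1
print(lhs.var())                      # -> approximately 1 * 2^2 = 4
```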
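On the permutation-test bullet: the whole procedure in a few lines, with synthetic control and treated groups (true effect 0.4) standing in for real data.

```python
import numpy as np

rng = np.random.default_rng(0)
control = rng.normal(0.0, 1.0, 50)
treated = rng.normal(0.4, 1.0, 50)
observed = treated.mean() - control.mean()

pooled = np.concatenate([control, treated])
diffs = []
for _ in range(1000):                    # re-split the pooled data at random
    perm = rng.permutation(pooled)
    diffs.append(perm[50:].mean() - perm[:50].mean())

p_value = np.mean(np.abs(diffs) >= abs(observed))   # two-sided p-value
print(observed, p_value)
```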
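On the bootstrap bullet: estimating the variance of a sample mean by resampling, compared against the classical \hat{\sigma}^2 / N; the exponential data is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(2.0, 100)     # N = 100 draws with true variance 4

boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(1000)]  # resample with replacement, re-average
print(np.var(boot_means))            # bootstrap variance of the mean
print(data.var(ddof=1) / data.size)  # classical estimate sigma^2 / N
# Both approximate Var(mean) = 4 / 100 = 0.04.
```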
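On the goodness-of-fit bullet: testing 600 simulated die rolls against the fair-die hypothesis with scipy's chisquare.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=600)             # fair die, faces 1..6
observed = np.bincount(rolls, minlength=7)[1:]   # counts of faces 1..6
stat, p = chisquare(observed, f_exp=np.full(6, 100.0))
print(stat, p)   # large p: no evidence against the fair die
```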