Entropy, cross-entropy, KL: cost of the wrong model
How costly is it to use the wrong model? $D_{\mathrm{KL}}(P \Vert Q)$ measures the extra bits you pay when encoding samples from $P$ with a code that is optimal for $Q$.
Method · KL divergence
Intro
Entropy is your information budget for the true distribution. Cross-entropy is what you actually pay when you encode $P$-data with $Q$'s code. KL divergence is the difference between the two, $D_{\mathrm{KL}}(P \Vert Q) = H(P, Q) - H(P)$: the "excess" cost of using the wrong distribution. KL is always non-negative and zero iff $P = Q$. ML training minimises cross-entropy for exactly this reason: since $H(P)$ is fixed by the data, minimising cross-entropy minimises the KL divergence to the data distribution.
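A minimal Python sketch of the three quantities (in nats; the helper names are just for illustration, not part of the tutorial's widgets) that checks the identity $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \Vert Q)$ numerically:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats: best achievable average code cost for P."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats: average cost of encoding P-data with Q's code."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence D_KL(P || Q) in nats: the excess cost, H(P, Q) - H(P)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = (0.5, 0.5)
Q = (0.9, 0.1)
print(entropy(P))        # ~0.693 nats (= ln 2)
print(cross_entropy(P, Q))  # ~1.204 nats
print(kl(P, Q))          # ~0.511 nats
print(math.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```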
Try first (productive failure)
Before the worked example: spend 60 seconds taking your best shot at this. A guess is fine; being briefly wrong about a problem makes the explanation land harder when you read it. This appears once per tutorial; skip it if you already know the trick.
Worked example
True distribution $P = (0.5,\ 0.5)$. Model $Q = (0.9,\ 0.1)$. Compute $D_{\mathrm{KL}}(P \Vert Q)$ in nats. (3 d.p.)
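If you want to check your attempt, one way to expand the definition (using natural logs, since the answer is requested in nats):

$$
D_{\mathrm{KL}}(P \Vert Q) = 0.5\ln\tfrac{0.5}{0.9} + 0.5\ln\tfrac{0.5}{0.1}
= 0.5\ln\tfrac{5}{9} + 0.5\ln 5
= \ln 5 - \ln 3 \approx 0.511\ \text{nats}.
$$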
Practice 1 of 3
Type a fraction, decimal, or expression; mathjs parses it.
Reflection
Why is KL asymmetric, and when in practice would you compute $D(P \Vert Q)$ vs $D(Q \Vert P)$? What's the connection between minimising cross-entropy and maximum likelihood?
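If a numeric anchor helps with the first question, a quick sketch (reusing the worked example's $P$ and $Q$; the `kl` helper is just for illustration) shows the two directions give different numbers:

```python
import math

def kl(p, q):
    # D_KL(p || q) in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P, Q = (0.5, 0.5), (0.9, 0.1)
print(kl(P, Q))  # ~0.511 nats
print(kl(Q, P))  # ~0.368 nats: a different value, so KL is not symmetric
```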