Entropy, cross-entropy, KL: cost of the wrong model
How costly is it to use the wrong model? $D_{\mathrm{KL}}(P \Vert Q)$ measures the extra bits you pay when encoding samples from $P$ with a code that is optimal for $Q$.
Method · KL divergence
Intro
Entropy is your information budget for the true distribution. Cross-entropy is what you actually pay when you encode $P$-data with $Q$'s code. KL divergence is the difference between the two, $D_{\mathrm{KL}}(P \Vert Q) = H(P, Q) - H(P)$: the "excess" cost of using the wrong distribution. KL is always non-negative and zero iff $P = Q$. ML training minimises cross-entropy for exactly this reason: since $H(P)$ is fixed by the data, minimising cross-entropy minimises the KL divergence to the data distribution.
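A minimal Python sketch of the three quantities (in nats; the helper names are just for illustration, not part of the tutorial's widgets) that checks the identity $H(P, Q) = H(P) + D_{\mathrm{KL}}(P \Vert Q)$ numerically:

```python
import math

def entropy(p):
    """Shannon entropy H(P) in nats: best achievable average code cost for P."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(P, Q) in nats: average cost of encoding P-data with Q's code."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    """KL divergence D_KL(P || Q) in nats: the excess cost, H(P, Q) - H(P)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = (0.5, 0.5)
Q = (0.9, 0.1)
print(entropy(P))        # ~0.693 nats (= ln 2)
print(cross_entropy(P, Q))  # ~1.204 nats
print(kl(P, Q))          # ~0.511 nats
print(math.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```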
Try first (productive failure)
Before the worked example: spend 60 seconds taking your best shot at this. A guess is fine; being briefly wrong about a problem makes the explanation land harder when you read it. This appears once per tutorial; skip it if you already know the trick.
Worked example
True distribution $P = (0.5,\ 0.5)$. Model $Q = (0.9,\ 0.1)$. Compute $D_{\mathrm{KL}}(P \Vert Q)$ in nats. (3 d.p.)
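If you want to check your attempt, one way to expand the definition (using natural logs, since the answer is requested in nats):

$$
D_{\mathrm{KL}}(P \Vert Q) = 0.5\ln\tfrac{0.5}{0.9} + 0.5\ln\tfrac{0.5}{0.1}
= 0.5\ln\tfrac{5}{9} + 0.5\ln 5
= \ln 5 - \ln 3 \approx 0.511\ \text{nats}.
$$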
Practice 1 of 3
Type a fraction, decimal, or expression; mathjs parses it.
Reflection
Why is KL asymmetric, and when in practice would you compute $D(P \Vert Q)$ vs $D(Q \Vert P)$? What's the connection between minimising cross-entropy and maximum likelihood?
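If a numeric anchor helps with the first question, a quick sketch (reusing the worked example's $P$ and $Q$; the `kl` helper is just for illustration) shows the two directions give different numbers:

```python
import math

def kl(p, q):
    # D_KL(p || q) in nats
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P, Q = (0.5, 0.5), (0.9, 0.1)
print(kl(P, Q))  # ~0.511 nats
print(kl(Q, P))  # ~0.368 nats: a different value, so KL is not symmetric
```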