# The Maximum Entropy Formalism

Posted on Jan 1, 2013 in Science

The maximum entropy formalism is the work of E.T. Jaynes. It has seen applications in everything from spectral analysis to black hole entropy.

According to Jaynes (in *Where do we stand on maximum entropy?*):

> In principle, every different kind of testable information will generate a different kind of mathematical problem. But there is one important class of problems for which the general solution was given once and for all, by Gibbs.

This mathematical form consists of identifying a discrete probability distribution, $p_i$, over $n$ discrete values $x_i$, where the distribution is constrained by the specified mean values of certain functions $\{f_1(x), f_2(x), \ldots, f_m(x)\}$:

$\sum _{i=1}^n p_i f_k\left(x_i\right)=F_k, 1\leq k\leq m$

where the sequence of mean values, $\{F_k\}$, is given in the problem statement. Assuming that $m < n$, the problem is under-determined; that is, there are too few equations to determine all of the unknown variables. [This is even more true when we carry the problem over to the identification of a continuous probability distribution over a (possibly infinite) range of real numbers.]

The solution for the $p_i$ is found by maximizing the entropy of the distribution:

$S = -\sum _{i=1}^n p_i \ln p_i$

subject to the $m$ constraints on the mean values listed above together with the constraint that

$\sum _{i=1}^n p_i = 1$

The method of solution introduces $m+1$ Lagrange multipliers. The constraint on the total probability being $1$ leads to the definition of the “partition function”, $Z$ as

$Z\left(\lambda _1, \lambda _2, \ldots, \lambda _m\right) \equiv \sum _{i=1}^n \exp \left[-\lambda _1f_1\left(x_i\right)-\lambda _2f_2\left(x_i\right)- \cdots -\lambda _mf_m\left(x_i\right)\right]$

where the $\{\lambda_k\}$ are the sequence of $m$ Lagrange multipliers corresponding to the $m$ constraints.

The problem has the formal solution that the $p_i$ are

$p _i=\frac{\exp \left[-\lambda _1f_1\left(x_i\right)-\lambda _2f_2\left(x_i\right)- \cdots -\lambda _mf_m\left(x_i\right)\right]}{Z\left(\lambda _1, \lambda _2, \ldots, \lambda _m\right)}$

The Lagrange multipliers themselves are given by the $m$ equations

$F_k = -\frac{\partial}{\partial \lambda_k} \log Z , 1\leq k\leq m$

The resulting entropy maximum depends only on the given data and is

$S(F_1, F_2, \ldots, F_m) = \log Z + \sum _k \lambda_k F_k$

If this function can be explicitly found, then the $\lambda_k$ are just

$\lambda_k = \frac{\partial S}{\partial F_k} , 1\leq k\leq m$
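As a concrete numerical sketch (my own example, not from the post), take Jaynes' well-known die problem: values $x_i = 1, \ldots, 6$ with a single constraint fixing the mean at $F = 4.5$. Solving $F = -\partial \log Z / \partial \lambda$ for the one multiplier by Newton's method gives the maximum entropy distribution:

```python
import numpy as np

# Jaynes' die example (illustrative; not from the post): values 1..6,
# one constraint f_1(x) = x with prescribed mean F = 4.5.
x = np.arange(1, 7, dtype=float)
F = 4.5

def solve(lam):
    """Return (p, <x>, var(x)) for p_i = exp(-lam*x_i)/Z."""
    w = np.exp(-lam * x)
    p = w / w.sum()
    mean = p @ x
    var = p @ x**2 - mean**2
    return p, mean, var

# Newton's method on <x>(lam) = F, using d<x>/dlam = -var(x).
lam = 0.0
for _ in range(50):
    p, mean, var = solve(lam)
    lam += (mean - F) / var

p, mean, var = solve(lam)
S = np.log(np.exp(-lam * x).sum()) + lam * F   # S = log Z + lam*F
print(lam, p, S)
```

The multiplier comes out negative, tilting probability toward the high faces, and the resulting entropy $\log Z + \lambda F$ necessarily falls below the unconstrained maximum $\ln 6$.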

Given the set of probabilities defined by this maximum entropy solution, then the best prediction that we can make of any other quantity, $q(x)$, related to the problem statement is just its expectation

$\langle q(x)\rangle =\sum _{i=1}^n p_iq\left(x_i\right)$

Likewise, many statements about the variances, covariances, and reciprocities between such additional quantities can be taken from the identity

$\langle q f_k \rangle - \langle q \rangle \langle f_k \rangle = -\frac{\partial \langle q \rangle}{\partial \lambda_k}$

There is a useful special case in which $q(x) = f_j(x)$ and $j=k$:

$\mathrm{var}(f_k) = \langle f_k^2 \rangle - \langle f_k \rangle^2 = \frac{\partial^2}{\partial \lambda_k^2} \log Z$

and likewise

$\mathrm{cov}(f_k, f_j) = \langle f_k f_j \rangle - \langle f_k \rangle \langle f_j \rangle = \frac{\partial^2}{\partial \lambda_k \partial \lambda_j} \log Z$
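These derivative identities are easy to check numerically. The sketch below (my construction, with two constraint functions and arbitrarily chosen multipliers rather than ones fitted to data) compares the directly computed variance, covariance, and reciprocity expressions against central finite differences:

```python
import numpy as np

# Illustrative check (my construction): values 1..6 with constraint
# functions f1(x) = x and f2(x) = x^2, at arbitrary multipliers.
x = np.arange(1, 7, dtype=float)
f = np.vstack([x, x**2])              # f[k, i] = f_k(x_i)
lam = np.array([0.2, -0.05])
h = 1e-5

def logZ(l):
    return np.log(np.exp(-(l @ f)).sum())

def probs(l):
    w = np.exp(-(l @ f))
    return w / w.sum()

p = probs(lam)
mean = f @ p                              # <f_k>
var1 = (f[0]**2) @ p - mean[0]**2         # var(f_1) directly
cov12 = (f[0]*f[1]) @ p - mean[0]*mean[1]

def d2(k, j):
    """Central finite difference for d^2 log Z / d lam_k d lam_j."""
    ek, ej = np.eye(2)[k]*h, np.eye(2)[j]*h
    return (logZ(lam+ek+ej) - logZ(lam+ek-ej)
            - logZ(lam-ek+ej) + logZ(lam-ek-ej)) / (4*h*h)

# Reciprocity identity <q f_1> - <q><f_1> = -d<q>/d lam_1, with q = x^3.
q = x**3
lhs_q = (q*f[0]) @ p - (q @ p)*mean[0]
e1 = np.array([h, 0.0])
rhs_q = -((q @ probs(lam+e1)) - (q @ probs(lam-e1))) / (2*h)

print(var1, d2(0, 0))    # should agree
print(cov12, d2(0, 1))   # should agree
print(lhs_q, rhs_q)      # should agree
```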

The various functions, $f_k(x)$, may depend upon certain parameters $\left\{\alpha _r\right\}$, thus

$f_k = f_k(x; \alpha_1, \alpha_2, \ldots, \alpha_r)$

In any given problem, these parameters may have a given physical meaning such as volume, electric field intensity, dipole moment, or whatever. In this instance, we can consider variations in the maximum entropy problem such that

$\delta S = \sum _k \lambda _k \delta Q_k$

where

$\delta Q_k = \delta \langle f_k \rangle - \langle \delta f_k \rangle .$

The notation here has been chosen for a specific historical reason. In thermodynamics, $Q$ is heat when the constraint on $F_1$ is the mean energy, and the heat is considered not to be a function of state so that $\delta Q$ is not considered to be an exact differential of the state of a thermodynamic system, unlike, for example, the energy or entropy. Nonetheless, the temperature is considered to be an integrating factor for the heat so that $\delta Q/T$ is an exact differential of a state function; viz., the entropy, $S(F_1, F_2, ..., F_m; \alpha_1, \alpha_2, ..., \alpha_r)$.

In classical thermodynamics, the commonly used label for the first Lagrange multiplier is $\beta = 1/(kT)$ where $k$ is Boltzmann’s constant; in our current notation, $\lambda_1 \equiv \beta$. In the more general application of the formalism, we see an echo of the distinction between exact differentials of state functions and the use of the Lagrange multipliers as integrating factors for more general functions that arise in the problem. In Gibbs’s original exposition of the method, the additional constraints were the particle numbers of the various components of a heterogeneous equilibrium mixture, and the other Lagrange multipliers, generally written as $\mu_k$, were the chemical potentials of these various species.

## Fluctuations and Variances

In the context of a consideration of fluctuations in the functions given in the statement of the problem, it is worth pointing out a distinction between the degree of accuracy of an estimate of a mean value and the physical fluctuations in the function. These two are often conflated inappropriately.

Consider any function $f(t)$ of a physical system. If we have a probability distribution for $f(t)$, then the best prediction that we can make for $f(t)$ in the sense of the minimum expected square error is
$\langle f(t) \rangle = \langle f \rangle$

The reliability of this estimator is related to the variance in this mean value; viz.,
$[\Delta f(t)]^2 = \left\langle f^2\right\rangle - \langle f\rangle ^2$
Assuming that the system under consideration is at equilibrium, then the mean and variance are stable estimators independent of the time, $t$. The estimator $\langle f \rangle$ is reliable only to the extent that $\Delta f / \langle f \rangle \ll 1$.

To be clear, $\Delta f$ is a measure of uncertainty in the value of $\langle f \rangle$ as a predictor. Often, this number is also taken to be an estimator of the measurable root mean square fluctuations in $f$. Statistically speaking, there is a distinction between the standard deviation of the expectation and the expectation of the standard deviation. Said another way, knowledge of the mean of $f$ to an accuracy of, say ±5% does not imply that $f$ fluctuates by ±5%.

Let us say that we can find a time average for $f(t)$ through actual measurement as

$\bar{f} \equiv \frac{1}{T}\int _0^T f(t)dt$

Assuming that there is independent knowledge of the probability distribution for $f(t)$ then the best estimator for $\bar f$ is

$\langle \bar f \rangle = \left\langle \frac{1}{T}\int _0^T f(t)\,dt \right\rangle = \frac{1}{T}\int _0^T \langle f \rangle \,dt$

So, to the extent that $\langle f \rangle$ is independent of the time, then
$\langle \bar f \rangle = \langle f \rangle$
which is to say that while any specific time average of a physical variable may not equal the true mean, the expectation of the time average and the true mean are equal.
But even this equality does not tell whether any given value of $\bar f$ is a reliable or accurate one. For this, again, we compute the variance:

$\left(\Delta \bar{f}\right)^2\equiv \left\langle \left(\bar{f}-\left\langle \bar{f}\right\rangle \right)^2\right\rangle = \frac{1}{T^2}\int _0^T\int _0^T[\langle f(\tau )f(\upsilon )\rangle -\langle f(\tau )\rangle \langle f(\upsilon )\rangle ]d\tau d\upsilon$

And again, the value of $\bar f$ is a good estimator if and only if $[\Delta \bar f / \langle \bar f \rangle ] \ll 1$.

In equilibrium systems, we can introduce the covariance function of $f(t)$ which is a function only of the difference in time values; viz.,
$\phi(\tau) \equiv \langle f(t)f(t+\tau)\rangle - \langle f(t)\rangle \langle f(t+\tau)\rangle = \langle f(0)f(\tau)\rangle - \langle f\rangle^2$

So the variance in the time average can be written in terms of the covariance function as

$\left(\Delta \bar{f}\right)^2 = \frac{2}{T^2}\int _0^T(T-\tau )\phi (\tau )d\tau$

Let us now define the correlation time $\tau_c$ as

$\tau _c\equiv \frac{\int _0^{\infty }\tau \phi (\tau )d\tau }{\int _0^{\infty }\phi (\tau )d\tau }$

If the integrals in the definition converge and $\tau_c$ is finite, then in the limit we have that

$\left(\Delta \bar{f}\right)^2 \sim \frac{2}{T}\int _0^{\infty }\phi (\tau )d\tau$

which implies that $\Delta \bar f$ will tend to 0 in proportion to $1/\sqrt{T}$. This may be seen as a feature of time-sampling over independent time windows of duration about $\tau_c$. On the other hand, if $\tau_c$ is not finite and correlations persist beyond any practical measurement time, then there can be no accurate value for $\bar f$ since the variance will tend to infinity for large sample times.
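As a rough numerical illustration (my construction, not from the post), consider a discrete-time stationary AR(1) process, $f_{t+1} = a f_t + \sqrt{1-a^2}\,\epsilon_t$, whose covariance function $\phi(\tau) = a^{|\tau|}$ decays with a finite correlation time. Quadrupling the averaging length should then cut the variance of the time average by about a factor of four:

```python
import numpy as np

# AR(1) toy process (illustrative): f[t+1] = a f[t] + sqrt(1-a^2)*noise,
# stationary with <f> = 0, var(f) = 1, covariance phi(tau) = a^|tau|.
rng = np.random.default_rng(0)
a = 0.9

def time_average_var(T, n_runs=2000):
    """Sample variance of the time average f_bar over n_runs realizations."""
    f = rng.normal(size=n_runs)    # draw from the stationary distribution
    total = f.copy()
    for _ in range(T - 1):
        f = a*f + np.sqrt(1 - a*a)*rng.normal(size=n_runs)
        total += f
    return (total / T).var()

v1 = time_average_var(2000)
v2 = time_average_var(8000)
# With a finite correlation time, var(f_bar) ~ const/T, so v1/v2 ~ 4.
print(v1, v2, v1 / v2)
```

This is the $\Delta \bar f \propto 1/\sqrt{T}$ behavior claimed above: here $\int_0^{\infty}\phi(\tau)\,d\tau$ is finite, so the effective number of independent samples grows linearly with $T$.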

Now, this entire argument can be repeated for measurable fluctuations in $f(t)$; viz., $\delta f(t)$,

$( \delta f)^2 \equiv \frac{1}{T} \int_0^T \left[f(t)-\bar{f}\right]^2 dt$

Once more, we turn to the expectation of this measured fluctuation. Writing $(\delta f)^2 = \frac{1}{T}\int_0^T f^2(t)\,dt - \bar{f}^2$ and using $\langle \bar f \rangle = \langle f \rangle$, we find

$\langle (\delta f)^2 \rangle = (\Delta f)^2 - (\Delta \bar f)^2$

From this it is apparent that the expected measurable fluctuation $\delta f$ is not the same as $\Delta f$ unless $\Delta \bar f$ is negligible compared to $\Delta f$, which depends on certain characteristics of the system and a long averaging time.
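This bookkeeping can be checked on an AR(1) toy process (again my construction, not from the post): per realization, compute the time average and the measured fluctuation, then compare the ensemble averages of the two sides:

```python
import numpy as np

# Illustrative check (my construction) on a stationary AR(1) process:
# since (delta f)^2 = (1/T) int f^2 dt - fbar^2 and <fbar> = <f>,
# taking expectations gives <(delta f)^2> = (Delta f)^2 - (Delta fbar)^2.
rng = np.random.default_rng(1)
a, T, n_runs = 0.8, 500, 4000

traj = np.empty((T, n_runs))
traj[0] = rng.normal(size=n_runs)      # stationary start: <f> = 0, var = 1
for t in range(1, T):
    traj[t] = a*traj[t-1] + np.sqrt(1 - a*a)*rng.normal(size=n_runs)

fbar = traj.mean(axis=0)                      # time average, per run
delta_sq = ((traj - fbar)**2).mean(axis=0)    # measured (delta f)^2, per run

lhs = delta_sq.mean()                  # <(delta f)^2>
var_f = traj.var()                     # (Delta f)^2, about 1 here
var_fbar = fbar.var()                  # (Delta fbar)^2
print(lhs, var_f - var_fbar)           # the two should nearly coincide
```

With $T$ only a few dozen correlation times long, $(\Delta \bar f)^2$ is not yet negligible and $\langle (\delta f)^2 \rangle$ slightly undershoots $(\Delta f)^2$.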

To see whether or not the estimator $\langle (\delta f)^2 \rangle$ is reliable, we have to go one more step up the hierarchy and consider its variance:

$V = \langle (\delta f)^4 \rangle - \langle (\delta f)^2 \rangle^2$

This variance can be expressed as

$V = \frac{1}{T^4}\int _0^T\int _0^T\int _0^T\int _0^T\psi \left(t_1,t_2,t_3,t_4\right)dt_1dt_2dt_3dt_4$

where

$\psi \left(t_1,t_2,t_3,t_4\right) = \langle f(t_1)f(t_2)f(t_3)f(t_4)\rangle -2\langle f(t_1)f(t_2)f^2(t_3)\rangle + \langle f^2(t_1)f^2(t_2)\rangle - \left[(\Delta f)^2 - (\Delta \bar f)^2\right]^2$

which employs symmetry in the domain of integration over $T$.

Now, going back to where we started; viz., the assumed equivalence between $\Delta f$, the estimator of the standard deviation of the mean of $f$, and $\delta f$, the measurable fluctuation of $f$, we find that this equivalence depends in fact on some fairly significant assumptions about the nature of the system under consideration.

There are many situations in which we find that $\Delta f$ fails to converge. This does not imply that the measurable fluctuations, $\delta f$, are infinite. Rather, this simply implies that the information available in the problem fails to make any reliable prediction of the mean, $\langle f \rangle$.

## Summary

In this post, I have simply set forth the basic elements of the formalism as given by Jaynes. None of this is original work in any way. I am only setting this down in one place for reference. Since the Wikipedia article on the maximum entropy method does not contain any reference to the variance, covariance, and fluctuation aspects of the formalism, I thought it was worth including these explicitly.

The formalism can be extended to the continuous case. I’ll consider that in another post.