Here's a general case in which one-step learning fails to achieve the optimal parameters, inspired by [1].
Consider the exponential family model
$$p(\mathbf{x};\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp\Big(\sum_i \theta_i f_i(\mathbf{x})\Big),$$
where the $f_i$ are fixed feature functions (we can also consider it as a multi-expert system). Assume the variable $\mathbf{x}$ takes discrete states, and take a transition operator $T$ s.t. for any $\mathbf{x}$ there exists an integer $n$ such that $T^n(\mathbf{x}) = \mathbf{x}$. Then we can find a data set whose empirical feature statistics $\langle f_i(\mathbf{x})\rangle$ are left unchanged by one step of the chain constructed below.
With these assumptions we construct a Markov chain that uses $T$ to generate the transition in each step of the Metropolis method; then throughout training the one-step update $\Delta\boldsymbol{\theta} \propto \langle f_i(\mathbf{x})\rangle_{P_0} - \langle f_i(\mathbf{x})\rangle_{P_1}$ (where $P_0$ is the empirical data distribution and $P_1$ the distribution after one step of the chain) stays at zero, which means that we cannot reach the optimum that the maximum likelihood algorithm does.
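To make the construction concrete, here is a minimal numpy sketch (the state space, features, operator $T$ and data set are all placeholders I made up; the proposal is symmetrized so that the plain Metropolis acceptance rule applies) comparing the one-step update with the exact ML gradient on a model small enough to normalize exactly:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-state exponential family p(x) ∝ exp(sum_i theta_i f_i(x));
# the state-space size, features and data below are made-up placeholders.
K = 8
F = rng.normal(size=(K, 3))            # f_i(x) for each of the K states, 3 experts
theta = rng.normal(size=3)

log_p_unnorm = lambda x: F[x] @ theta

# Deterministic operator T(x) = x + 1 (mod K), so T^K(x) = x for every state.
T = lambda x: (x + 1) % K

def one_metropolis_step(x):
    """Propose T(x) or its inverse with equal probability (a symmetric proposal),
    then accept with the Metropolis rule."""
    prop = T(x) if rng.random() < 0.5 else (x - 1) % K
    if np.log(rng.random()) < log_p_unnorm(prop) - log_p_unnorm(x):
        return prop
    return x

# A small data set and its one-step reconstruction under the current chain.
data = np.array([0, 1, 4, 5] * 100)
x1 = np.array([one_metropolis_step(x) for x in data])

# One-step ("CD-1") update: data statistics minus one-step statistics.
delta_onestep = F[data].mean(axis=0) - F[x1].mean(axis=0)

# Exact ML gradient: data statistics minus model statistics (Z is tractable here).
p_model = np.exp(F @ theta); p_model /= p_model.sum()
delta_ml = F[data].mean(axis=0) - p_model @ F

print(delta_onestep)
print(delta_ml)   # the two updates generally differ; CD-1 follows the first, ML the second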
The statement indicates that the performance of one-step learning depends on the model distribution and on the transition used in each step of the Markov chain. Although the failure above goes away given enough training data points, in some cases the parameters never converge, or converge to points other than the ML result; see the noisy swirl and noisy star trek operators in [1] for details. Fortunately, training energy models by sampling from the factored likelihood and posterior in each step, as CD-k does, yields good results, which will be explained below.
Let's consider the justification of the CD-k algorithm that Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn energy models by the maximum likelihood method, i.e. we maximize
$$\mathbb{E}_{\mathbf{x}\sim P}\big[\log Q(\mathbf{x};\boldsymbol{\theta})\big],$$
where $Q$ is our constructed (optimal) model and $P$ is the data distribution. It's easy to see
$$\mathbb{E}_{\mathbf{x}\sim P}\big[\log Q(\mathbf{x};\boldsymbol{\theta})\big] = -H(P) - \mathbb{E}_{\mathbf{x}\sim P}\!\left[\log\frac{P(\mathbf{x})}{Q(\mathbf{x};\boldsymbol{\theta})}\right],$$
and the right-hand side term is the KL divergence $\mathrm{KL}(P\,\|\,Q)$, so maximizing the expected log-likelihood amounts to minimizing $\mathrm{KL}(P\,\|\,Q)$. For the energy model with coefficients $\boldsymbol{\theta}$ we have
$$Q(\mathbf{x};\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp\big(-E(\mathbf{x};\boldsymbol{\theta})\big),$$
where $E(\mathbf{x};\boldsymbol{\theta})$ is differentiable w.r.t. $\boldsymbol{\theta}$ and $Z(\boldsymbol{\theta})$ is the normalizer. Then by doing some math we get the derivative of the expectation of the log-likelihood w.r.t. $\boldsymbol{\theta}$ as follows:
$$\frac{\partial}{\partial\boldsymbol{\theta}}\,\mathbb{E}_{\mathbf{x}\sim P}\big[\log Q(\mathbf{x};\boldsymbol{\theta})\big]
= -\,\mathbb{E}_{\mathbf{x}\sim P}\!\left[\frac{\partial E(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right]
+ \mathbb{E}_{\mathbf{x}\sim Q}\!\left[\frac{\partial E(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right].$$
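The "doing some math" here is just the derivative of the log-partition function; spelled out for completeness:
$$\log Q(\mathbf{x};\boldsymbol{\theta}) = -E(\mathbf{x};\boldsymbol{\theta}) - \log Z(\boldsymbol{\theta}),
\qquad
\frac{\partial \log Z(\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}
= \frac{1}{Z(\boldsymbol{\theta})}\sum_{\mathbf{x}}\exp\big(-E(\mathbf{x};\boldsymbol{\theta})\big)\left(-\frac{\partial E(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right)
= -\,\mathbb{E}_{\mathbf{x}\sim Q}\!\left[\frac{\partial E(\mathbf{x};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right],$$
and taking the expectation of $\partial_{\boldsymbol{\theta}}\log Q$ over $\mathbf{x}\sim P$ gives the two-term expression above.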
In the RBM case, with visible units $\mathbf{v}$ and hidden units $\mathbf{h}$, we have the model joint distribution
$$Q(\mathbf{v},\mathbf{h};\boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})}\exp\big(-E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})\big),$$
and similarly we get the derivative
$$\frac{\partial}{\partial\boldsymbol{\theta}}\,\mathbb{E}_{\mathbf{v}\sim P}\big[\log Q(\mathbf{v};\boldsymbol{\theta})\big]
= -\,\mathbb{E}_{\mathbf{v}\sim P,\;\mathbf{h}\sim Q(\mathbf{h}\mid\mathbf{v})}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right]
+ \mathbb{E}_{(\mathbf{v},\mathbf{h})\sim Q}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right].$$
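For a single visible vector, the "similarly" goes through the marginal $Q(\mathbf{v}) = \sum_{\mathbf{h}} Q(\mathbf{v},\mathbf{h})$; spelled out:
$$\frac{\partial}{\partial\boldsymbol{\theta}}\log Q(\mathbf{v};\boldsymbol{\theta})
= \frac{\partial}{\partial\boldsymbol{\theta}}\log\sum_{\mathbf{h}}e^{-E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})} - \frac{\partial}{\partial\boldsymbol{\theta}}\log Z(\boldsymbol{\theta})
= -\,\mathbb{E}_{\mathbf{h}\sim Q(\mathbf{h}\mid\mathbf{v})}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right]
+ \mathbb{E}_{(\mathbf{v},\mathbf{h})\sim Q}\!\left[\frac{\partial E(\mathbf{v},\mathbf{h};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}}\right],$$
and averaging over $\mathbf{v}\sim P$ gives the expression above.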
However, the expectation over $P$ is intractable, and in practice we only have a training set $\{\mathbf{v}^{(n)}\}_{n=1}^{N}$ sampled from $P$. In this case we approximate the first term of the derivative by the sample average
$$\frac{1}{N}\sum_{n=1}^{N}\frac{\partial E(\mathbf{v}^{(n)},\mathbf{h}^{(n)};\boldsymbol{\theta})}{\partial\boldsymbol{\theta}},$$
where $\mathbf{h}^{(n)}$ is either set to the conditional probabilities of, or sampled from, the conditional distribution $Q(\mathbf{h}\mid\mathbf{v}^{(n)})$. For better generalization, we don't want the ML algorithm to minimize the KL divergence exactly to 0, as $P$ is then effectively the average of Dirac functions
$$P(\mathbf{v}) = \frac{1}{N}\sum_{n=1}^{N}\delta\big(\mathbf{v}-\mathbf{v}^{(n)}\big),$$
which makes the probability zero at points other than those in the training set. That's why approximating the ML optimum, rather than reaching it exactly, can help smooth the model distribution.
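To make the positive-phase approximation concrete, here is a small numpy sketch for the standard binary-RBM energy $E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^\top W\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$ (the parameters and toy data are made-up placeholders), computing the $W$-statistic both with $\mathbf{h}$ set to its conditional probabilities and with $\mathbf{h}$ sampled:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary-RBM parameters; only W and the hidden bias c enter Q(h | v).
n_vis, n_hid, N = 6, 4, 200
W = 0.1 * rng.normal(size=(n_vis, n_hid))
c = np.zeros(n_hid)

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

V = (rng.random((N, n_vis)) < 0.5).astype(float)   # toy training vectors v^(n)

# Q(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j}) for the energy above.
ph = sigmoid(V @ W + c)

# Positive-phase W-statistic <v h^T>: with h replaced by its conditional probabilities...
pos_W_mean = V.T @ ph / N
# ...or with h sampled from Q(h | v); the two agree in expectation.
h_samples = (rng.random(ph.shape) < ph).astype(float)
pos_W_sampled = V.T @ h_samples / N

print(np.abs(pos_W_mean - pos_W_sampled).max())
```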
Also notice that the derivative of the log-likelihood of the RBM contains the expectation of $\partial E/\partial\boldsymbol{\theta}$ over both $\mathbf{v}$ and $\mathbf{h}$ under the joint distribution $Q(\mathbf{v},\mathbf{h})$, which is also intractable. In general we use samples from some MCMC method, e.g. Gibbs sampling, to approximate that term. Without loss of generality, assume $\mathbf{v}$ and $\mathbf{h}$ take discrete states and the Markov chain is aperiodic (and irreducible); then, sampling alternately from the conditional distributions $Q(\mathbf{h}\mid\mathbf{v})$ and $Q(\mathbf{v}\mid\mathbf{h})$, the Gibbs chain will converge to the equilibrium distribution $Q(\mathbf{v},\mathbf{h})$, and hence to the marginals $Q(\mathbf{v})$ and $Q(\mathbf{h})$.
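As a sanity check on that convergence claim, here is a small numpy sketch that enumerates a tiny RBM exactly and power-iterates the full-step Gibbs kernel; the sizes and parameters are arbitrary placeholders.

```python
import numpy as np
from itertools import product

# Tiny binary RBM, small enough to enumerate every (v, h) state exactly.
rng = np.random.default_rng(1)
n_vis, n_hid = 3, 2
W = rng.normal(size=(n_vis, n_hid))
b = rng.normal(size=n_vis)
c = rng.normal(size=n_hid)

vs = np.array(list(product([0, 1], repeat=n_vis)), dtype=float)   # all visible configs
hs = np.array(list(product([0, 1], repeat=n_hid)), dtype=float)   # all hidden configs

def energy(v, h):
    return -(v @ W @ h + b @ v + c @ h)

# Exact joint Q(v, h) by enumeration, and the exact conditionals derived from it.
Q = np.array([[np.exp(-energy(v, h)) for h in hs] for v in vs])
Q /= Q.sum()
Q_h_given_v = Q / Q.sum(axis=1, keepdims=True)       # [v, h] -> Q(h | v)
Q_v_given_h = (Q / Q.sum(axis=0, keepdims=True)).T   # [h, v] -> Q(v | h)

# One full Gibbs step maps (v, h) -> (v', h'): h' ~ Q(h | v), then v' ~ Q(v | h').
nV, nH = len(vs), len(hs)
K = np.zeros((nV * nH, nV * nH))
for iv in range(nV):
    for ih in range(nH):
        for ih2 in range(nH):
            for iv2 in range(nV):
                K[iv * nH + ih, iv2 * nH + ih2] = Q_h_given_v[iv, ih2] * Q_v_given_h[ih2, iv2]

# Start from an arbitrary point mass and iterate full Gibbs steps.
p = np.zeros(nV * nH); p[0] = 1.0
for _ in range(200):
    p = p @ K
print(np.abs(p - Q.reshape(-1)).max())   # ~0: the chain has reached Q(v, h)
```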
To make the statement precise, let $P_0(\mathbf{v},\mathbf{h}) = P(\mathbf{v})\,Q(\mathbf{h}\mid\mathbf{v})$ be the data distribution with hidden units filled in, let $P_1(\mathbf{v},\mathbf{h})$ be the joint distribution we get after one full step of Gibbs sampling starting from the training data, and let $P_k$ be the joint after $k$ full steps; then we have
$$\mathrm{KL}(P_k\,\|\,Q) \;\le\; \mathrm{KL}(P_1\,\|\,Q) \;\le\; \mathrm{KL}(P_0\,\|\,Q),
\qquad P_k \longrightarrow Q(\mathbf{v},\mathbf{h}) \ \text{as } k\to\infty.$$
This justifies truncating the Gibbs chain after $k$ full steps of sampling, and Hinton asserted that experiments showed that $k = 1$ already yields good training results.
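Finally, here is a minimal numpy sketch of the resulting CD-k training loop for a binary RBM with energy $E(\mathbf{v},\mathbf{h}) = -\mathbf{v}^\top W\mathbf{h} - \mathbf{b}^\top\mathbf{v} - \mathbf{c}^\top\mathbf{h}$; the layer sizes, learning rate and toy data are placeholders I made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Binary RBM with E(v, h) = -v^T W h - b^T v - c^T h; sizes and data are placeholders.
n_vis, n_hid = 6, 4
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)   # visible bias
c = np.zeros(n_hid)   # hidden bias

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
bernoulli = lambda p: (rng.random(p.shape) < p).astype(float)

def cd_k_gradients(v0, k=1):
    """CD-k gradient estimates from a batch of visible vectors v0 (shape N x n_vis)."""
    # Positive phase: hidden probabilities under Q(h | v) for the data.
    ph0 = sigmoid(v0 @ W + c)
    # Negative phase: k full Gibbs steps (h ~ Q(h|v), then v ~ Q(v|h)) from the data.
    vk = v0
    for _ in range(k):
        hk = bernoulli(sigmoid(vk @ W + c))
        vk = bernoulli(sigmoid(hk @ W.T + b))
    phk = sigmoid(vk @ W + c)
    N = v0.shape[0]
    # Data statistics minus k-step reconstruction statistics.
    dW = (v0.T @ ph0 - vk.T @ phk) / N
    db = (v0 - vk).mean(axis=0)
    dc = (ph0 - phk).mean(axis=0)
    return dW, db, dc

# Toy training loop using k = 1.
data = np.array([[1, 1, 1, 0, 0, 0], [0, 0, 0, 1, 1, 1]] * 50, dtype=float)
lr = 0.1
for epoch in range(200):
    dW, db, dc = cd_k_gradients(data, k=1)
    W += lr * dW
    b += lr * db
    c += lr * dc
```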
References
[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.