Here's a general case where one-step learning fails to achieve the optimal parameters, inspired by [1].
Consider the exponential family model

$$p(\mathbf{x} \mid \boldsymbol{\theta}) = \frac{1}{Z(\boldsymbol{\theta})} \exp\!\Big( \sum_{k} \theta_k f_k(\mathbf{x}) \Big), \qquad k \in [1, N].$$
The statement indicates that the performance of one-step learning depends on the model distribution and on the transition used in each step of the Markov chain. Although the failure above can be mitigated by providing enough training data points, in some cases the parameters never converge, or converge to points other than the ML solution; see the noisy swirl operator and noisy star trek operator cases in [1] for details. Fortunately, training energy models by sampling from the factorised likelihood and posterior in each step yields good results with CD-k, which will be explained below.
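To make the update itself concrete, here is a minimal numpy sketch of the one-step learning rule for the exponential family model above, assuming a toy factorised case with identity features $f_k(\mathbf{x}) = x_k$ on binary variables (so one Gibbs sweep happens to be an exact sample); the function names and fake data are illustrative only and are not the operators analysed in [1].

```python
import numpy as np

rng = np.random.default_rng(0)
D = 5  # number of binary variables; here N = D features with f_k(x) = x_k

def features(x):
    # Sufficient statistics f_k(x), k in [1, N]; identity features in this toy case.
    return x.astype(float)

def gibbs_sweep(x, theta):
    # One Markov-chain transition: resample each x_k from its conditional.
    # With identity features the variables are independent, so
    # p(x_k = 1 | rest) = sigmoid(theta_k).
    x = x.copy()
    for k in range(D):
        x[k] = rng.random() < 1.0 / (1.0 + np.exp(-theta[k]))
    return x

def one_step_update(theta, x_data, lr=0.1):
    # One-step learning: contrast the data statistics with the statistics
    # obtained after a single Markov-chain transition started at the data.
    x_one = np.array([gibbs_sweep(x, theta) for x in x_data])
    positive = features(x_data).mean(axis=0)   # E_data[f_k(x)]
    negative = features(x_one).mean(axis=0)    # one-step estimate of E_model[f_k(x)]
    return theta + lr * (positive - negative)

# Illustration on fake binary data.
x_data = rng.integers(0, 2, size=(100, D))
theta = np.zeros(D)
for _ in range(50):
    theta = one_step_update(theta, x_data)
```

For this factorised toy model the one-step estimate of the model statistics is unbiased, so the update behaves well; in the counterexamples of [1] a single transition leaves a systematic bias, and the fixed point of this update differs from the ML solution.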
Let's consider the justification of the CD-k algorithm that Rich and I discussed in our meeting, to answer the questions I raised in the last research note. We learn energy models by maximum likelihood, i.e. we maximize

$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{v} \sim P}\big[\log p(\mathbf{v} \mid \boldsymbol{\theta})\big],$$

where $P$ denotes the data distribution; this is equivalent to minimizing $\mathrm{KL}(P \,\|\, p(\cdot \mid \boldsymbol{\theta}))$.
In the RBM case, we have the model joint distribution $p(\mathbf{v}, \mathbf{h} \mid \boldsymbol{\theta}) = \exp(-E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})) / Z(\boldsymbol{\theta})$, and the derivative of the log-likelihood is

$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\theta}} = -\,\mathbb{E}_{\mathbf{v} \sim P}\,\mathbb{E}_{p(\mathbf{h} \mid \mathbf{v}, \boldsymbol{\theta})}\!\left[\frac{\partial E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right] + \mathbb{E}_{p(\mathbf{v}, \mathbf{h} \mid \boldsymbol{\theta})}\!\left[\frac{\partial E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}}\right].$$

However, the expectation over $P$ is intractable, and in practice we only have the training set sampled from $P$. In this case we approximate the first term of the derivative by

$$-\frac{1}{M} \sum_{m=1}^{M} \frac{\partial E(\mathbf{v}^{(m)}, \mathbf{h}^{(m)}; \boldsymbol{\theta})}{\partial \boldsymbol{\theta}},$$

where $\mathbf{h}^{(m)}$ is set to the mean of, or sampled from, the conditional distribution $p(\mathbf{h} \mid \mathbf{v}^{(m)}, \boldsymbol{\theta})$. For better generalization, we don't want the ML algorithm to minimize the KL-divergence exactly to 0: the empirical distribution is an average of Dirac functions,

$$P(\mathbf{v}) = \frac{1}{M} \sum_{m=1}^{M} \delta(\mathbf{v} - \mathbf{v}^{(m)}),$$

which puts zero probability on points outside the training set. That's why approximating the ML optimum can help to smooth the model distribution.
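As a concrete illustration of the positive-phase approximation above, here is a short numpy sketch for a binary RBM, assuming the common parameterisation $E(\mathbf{v}, \mathbf{h}; \boldsymbol{\theta}) = -\mathbf{v}^\top W \mathbf{h} - \mathbf{b}^\top \mathbf{v} - \mathbf{c}^\top \mathbf{h}$ (this specific energy and the function name `positive_phase` are my assumptions, not anything fixed by the note):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def positive_phase(v_data, W, b, c, rng=None):
    # Positive-phase statistics -dE/dtheta averaged over the training batch,
    # assuming E(v, h) = -v^T W h - b^T v - c^T h.
    p_h = sigmoid(v_data @ W + c)   # p(h_j = 1 | v, theta) for each training point
    # h is either the conditional mean of p(h | v, theta) or a sample from it,
    # matching the "mean of, or sampled from" choice in the text.
    h = p_h if rng is None else (rng.random(p_h.shape) < p_h).astype(float)
    grad_W = v_data.T @ h / len(v_data)   # E_data E_{h|v}[ v h^T ]
    grad_b = v_data.mean(axis=0)          # E_data[ v ]
    grad_c = h.mean(axis=0)               # E_data E_{h|v}[ h ]
    return grad_W, grad_b, grad_c
```

Using the conditional mean instead of a sample leaves the expectation of this term unchanged while reducing its variance.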
Also notice that the derivative of the log-likelihood of the RBM contains the expectation of the derivative of the energy function over both $\mathbf{v}$ and $\mathbf{h}$ under the joint distribution $p(\mathbf{v}, \mathbf{h} \mid \boldsymbol{\theta})$, which is also intractable. In general we use samples from some MCMC method, e.g. Gibbs sampling, to approximate that derivative. Without loss of generality, assume $\mathbf{v}$ and $\mathbf{h}$ have discrete states and the Markov chain is aperiodic; then, sampling alternately from the conditional probabilities $p(\mathbf{h} \mid \mathbf{v}, \boldsymbol{\theta})$ and $p(\mathbf{v} \mid \mathbf{h}, \boldsymbol{\theta})$, the Gibbs chain converges to the equilibrium distributions $p(\mathbf{v} \mid \boldsymbol{\theta})$ and $p(\mathbf{h} \mid \boldsymbol{\theta})$.
To make the statement precise, consider the transition kernel of the chain on $\mathbf{v}$,

$$T(\mathbf{v}' \mid \mathbf{v}) = \sum_{\mathbf{h}} p(\mathbf{v}' \mid \mathbf{h}, \boldsymbol{\theta})\, p(\mathbf{h} \mid \mathbf{v}, \boldsymbol{\theta}), \qquad \sum_{\mathbf{v}} T(\mathbf{v}' \mid \mathbf{v})\, p(\mathbf{v} \mid \boldsymbol{\theta}) = p(\mathbf{v}' \mid \boldsymbol{\theta}),$$

so $p(\mathbf{v} \mid \boldsymbol{\theta})$ is a stationary distribution of the Gibbs transition.
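Below is a hedged numpy sketch of the resulting CD-k update for the same assumed binary RBM parameterisation: the negative phase runs $k$ alternating Gibbs transitions starting from the data instead of waiting for the chain to reach equilibrium. The function name `cd_k_update` and the hyperparameters are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v_data, W, b, c, k=1, lr=0.01, rng=None):
    # One CD-k parameter update for a binary RBM with assumed energy
    # E(v, h) = -v^T W h - b^T v - c^T h.
    rng = np.random.default_rng(0) if rng is None else rng

    # Positive phase: conditional means p(h | v_data, theta).
    p_h_data = sigmoid(v_data @ W + c)

    # Negative phase: k alternating Gibbs transitions v -> h -> v starting
    # from the data; each transition leaves p(v, h | theta) invariant.
    v = v_data.astype(float)
    for _ in range(k):
        h = (rng.random(p_h_data.shape) < sigmoid(v @ W + c)).astype(float)
        v = (rng.random(v.shape) < sigmoid(h @ W.T + b)).astype(float)
    p_h_model = sigmoid(v @ W + c)

    # Contrast positive and negative statistics (approximate ML gradient step).
    M = len(v_data)
    W += lr * (v_data.T @ p_h_data - v.T @ p_h_model) / M
    b += lr * (v_data - v).mean(axis=0)
    c += lr * (p_h_data - p_h_model).mean(axis=0)
    return W, b, c
```

As $k \to \infty$ the negative-phase samples come from the equilibrium distribution $p(\mathbf{v}, \mathbf{h} \mid \boldsymbol{\theta})$ and the update recovers the exact ML gradient; CD-k trades that exactness for a much cheaper $k$-step chain started at the data.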
References
[1] David J. C. MacKay, Failures of the One-Step Learning Algorithm, 2001.