Trace Estimation
2025-01-11
The other day on twitter, I got involved in a discussion about trace estimation, and decided to write down some thoughts on the matter.
Typically this is posed as follows. You are given a matrix such that we can only access it through matrix-vector multiplication . Estimate the trace of the matrix. We may not be able to take this sum directly either due to size or representation of , this constraint of only accessing the matrix through it’s vector product is familiar in numerical linear algebra.
An easy false start, for someone mainly experienced with eigenvalue problems, is to remember that the trace of a matrix is exactly the sum of the eigenvalues. One would then ask, how can the eigenvalues be computed using only matrix-vector operations, and remember the Arnoldi iteration, an extension of the power method that constructs the Krylov subspace. After orthogonalizing the Krylov subspace with a Gram-Shmidt process into , we can create an upper Hessenburg matrix . We can then apply QR to this smaller matrix ( ) and use Rayleigh-Ritz to recover the associated eigenvalues of .
Does this help us estimate the trace? Not really! We most likely have recovered the eigenvalues of greatest magnitude (remember, power method), and to recover the rest of them, we either need to expand the Krylov subspace such that , which is totally intractable in this scenario, or we need to do restarted Arnoldi while shifting out converged eigenvalues, also likely intractable. The usefulness of Arnoldi and other iterative eigenvalue algorithms is largely due to not needing all of the eigenvalues of these matrices. This is not an entirely fruitless direction though, as the Rayleigh-Ritz procedure gives us a hint.
Let us recall the Rayleigh quotient, and how it can be written in terms of the eigenbasis . If we then choose such that we get (henceforth ignoring the denominator which is basically one-ish). We then use this to construct an estimator, using linearity.
Let us take a detour and think how else we might come to this kind of conclusion. What if we use indicator vectors to recover the diagonal elements of ? Given the constraints, it is unlikely doing matrix-vector multiplications is feasible, but we can choose a random sampling of the for an approximate answer.
In light of sampling from the diagonal directly, our Arnoldi approach can be viewed as (biased) sampling from the diagonal in the eigenbasis, and the Rayleigh quotient approach is like sampling from the diagonal in a random basis. All of the important analysis of this problem comes from the choice of distribution of .
In the literature, the standard method is due to Hutchinson, where he chooses with elements sampled from the Rademacher distribution, that is, each element is with equal odds. The general requirement for the estimator to be unbiased is , the expectation of the outer product being the identity. This is more general than our previous constraint.
Here is a simple implementation in Haskell using hmatrix.
-- | Generate a random Rademacher vector (+1/-1 with equal probability)
generateRademacherVector :: Int -> IO (Vector R)
= do
generateRademacherVector n <- replicateM n randomIO
signs return $ vector $ map (\b -> if b then 1 else -1) signs
-- | Estimate the trace using the Girard-Hutchinson estimator
estimateTrace :: Matrix R -- ^ Matrix
-> Int -- ^ Matrix Dimension
-> Int -- ^ Number of samples
-> IO R -- ^ Estimated trace
= do
estimateTrace a n numSamples -- Generate random vectors and compute estimates
<- replicateM numSamples $ do
estimates <- generateRademacherVector n
v -- Rayleigh quotient: we can only access a through mat-vec
return $ v <.> (a #> v)
-- Average the estimates
return $ sum estimates / fromIntegral numSamples
The actual performance of this algorithm, and the correct choice of distribution of is heavily dependent on the structure of the matrix. If the density of the trace is heavily concentrated, for example one value of 100 and the rest zeros, we would expect indicator vectors to perform very poorly! In general, through some contortion of the central limit theorem, we should expect convergence to be on the order of where is the number of samples, which is not very fast, but good enough in practice for anything that might warrant such an approximation the begin with.