Trace Estimation

2025-01-11


The other day on Twitter, I got involved in a discussion about trace estimation and decided to write down some thoughts on the matter.

Typically this is posed as follows. You are given a matrix $A \in F^{N\times N}$ that we can only access through matrix-vector multiplication $x \mapsto Ax$. Estimate the trace of the matrix.

$$ \text{tr} A = \sum_{i=1}^N A_{i,i} $$

We may not be able to take this sum directly, either due to the size or the representation of $A$; this constraint of only accessing a matrix through its action on vectors is familiar in numerical linear algebra.
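
To make the access model concrete, here is a minimal sketch of what "we only get a matrix-vector product" looks like in code. The `LinOp` alias and `rankOneUpdate` example are my own names for illustration, not part of hmatrix; the point is just that the estimator may apply the operator but never index into it.

import Numeric.LinearAlgebra

-- | A linear operator we can only apply, never inspect entrywise.
--   (Hypothetical alias for illustration.)
type LinOp = Vector R -> Vector R

-- | Wrap a dense matrix as a black-box operator.
asOperator :: Matrix R -> LinOp
asOperator a = (a #>)

-- | An operator that is never stored as a matrix at all:
--   A = I + v v^T, applied in O(N) work.
rankOneUpdate :: Vector R -> LinOp
rankOneUpdate v x = x + scale (v <.> x) v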

An easy false start, for someone mainly experienced with eigenvalue problems, is to remember that the trace of a matrix is exactly the sum of its eigenvalues. One would then ask how the eigenvalues can be computed using only matrix-vector operations, and remember the Arnoldi iteration, an extension of the power method that constructs the Krylov subspace

$$ K_m = [x, A x, A^2 x, \dots, A^m x] $$

After orthogonalizing the Krylov subspace with a Gram-Schmidt process into $Q_m$, we can form an upper Hessenberg matrix $H_m = Q^*_m A Q_m$. We can then apply QR to this smaller matrix ($m \ll N$) and use Rayleigh-Ritz to recover approximate eigenvalues of $A$.
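
As a rough sketch of that procedure (not a real Arnoldi iteration, which builds $Q$ and $H$ incrementally; here hmatrix's SVD-based `orth` stands in for Gram-Schmidt, and `krylovRitz` is my own naming):

import Data.Complex (Complex)
import Numeric.LinearAlgebra

-- | Ritz values from an m-step Krylov subspace, given only x |-> Ax.
krylovRitz :: (Vector R -> Vector R)   -- ^ black-box operator x |-> Ax
           -> Vector R                 -- ^ starting vector x
           -> Int                      -- ^ subspace size m
           -> Vector (Complex Double)  -- ^ Ritz values (approximate eigenvalues)
krylovRitz applyA x0 m = eigenvalues h
  where
    krylov = take (m + 1) (iterate applyA x0)       -- [x, Ax, ..., A^m x]
    q      = toColumns (orth (fromColumns krylov))  -- orthonormal basis of the Krylov subspace
    aq     = map applyA q                           -- A Q, one mat-vec per basis vector
    h      = fromLists [ [ qi <.> aqj | aqj <- aq ] | qi <- q ]  -- H = Q^* A Q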

Does this help us estimate the trace? Not really! We have most likely recovered the eigenvalues of greatest magnitude (remember, power method), and to recover the rest of them we would either need to expand the Krylov subspace until $m = N$, which is totally intractable in this scenario, or do restarted Arnoldi while shifting out converged eigenvalues, also likely intractable. The usefulness of Arnoldi and other iterative eigenvalue algorithms comes largely from not needing all of the eigenvalues of these matrices. This is not an entirely fruitless direction though, as the Rayleigh-Ritz procedure gives us a hint.

Let us recall the Rayleigh quotient and how it can be written in terms of the eigenbasis $(\lambda_i, v_i)$.

$$ \frac{x^* A x}{x^* x} = \frac{\sum_{i=1}^N \lambda_i (v_i^* x)^2}{\sum_{i=1}^N (v_i^* x)^2} $$

If we then choose $x$ such that $E[(v_i^* x)^2] = 1$, we get (henceforth dropping the denominator and working with the quadratic form $x^* A x$ alone)

$$ E[x^* A x] = \sum_i \lambda_i = \text{tr} A $$

We then use this to construct an estimator, using linearity.

$$ \hat{\text{tr}} A = \frac{1}{M} \sum_{i=1}^M x_i^* A x_i $$
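
Spelled out, the expectation step is nothing more than linearity applied to the eigenbasis expansion (assuming, as above, that $A$ has eigenpairs $(\lambda_i, v_i)$):

$$ E[x^* A x] = E\left[\sum_{i=1}^N \lambda_i (v_i^* x)^2\right] = \sum_{i=1}^N \lambda_i \, E[(v_i^* x)^2] = \sum_{i=1}^N \lambda_i = \text{tr} A $$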

Let us take a detour and think about how else we might come to this kind of conclusion. What if we use indicator vectors $e_i$ to recover the diagonal elements of $A$?

$$ \text{tr} A = \sum_{i=1}^N e_i^T A e_i $$

Given the constraints, doing $N$ matrix-vector multiplications is unlikely to be feasible, but we can choose a random sampling of the $e_i$ for an approximate answer.
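
A sketch of that diagonal-sampling idea (the name `estimateTraceDiag` is mine; it samples indices uniformly with replacement and rescales by $N/M$ so the estimator stays unbiased):

import Control.Monad (replicateM)
import Numeric.LinearAlgebra
import System.Random (randomRIO)

-- | Estimate the trace by sampling random diagonal entries e_i^T A e_i.
--   Each index is uniform over {0..n-1}; rescaling by n/m makes it unbiased.
estimateTraceDiag :: (Vector R -> Vector R)  -- ^ black-box operator x |-> Ax
                  -> Int                     -- ^ matrix dimension n
                  -> Int                     -- ^ number of samples m
                  -> IO R
estimateTraceDiag applyA n m = do
    samples <- replicateM m $ do
        i <- randomRIO (0, n - 1)
        let ei = assoc n 0 [(i, 1)]   -- indicator vector e_i
        return $ ei <.> applyA ei     -- picks out A_{i,i}
    return $ fromIntegral n / fromIntegral m * sum samples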

In light of sampling from the diagonal directly, our Arnoldi approach can be viewed as (biased) sampling from the diagonal in the eigenbasis, and the Rayleigh quotient approach is like sampling from the diagonal in a random basis. All of the important analysis of this problem comes from the choice of distribution of $x$.

In the literature, the standard method is due to Hutchinson, who chooses $x$ with elements sampled from the Rademacher distribution, that is, each element is $\pm 1$ with equal odds. The general requirement for the estimator to be unbiased is $E[x x^*] = I$, the expectation of the outer product being the identity. This is more general than our previous constraint.
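
To see why $E[x x^*] = I$ is the right condition, one line with the cyclic property of the trace suffices:

$$ E[x^* A x] = E[\text{tr}(x^* A x)] = E[\text{tr}(A x x^*)] = \text{tr}(A\, E[x x^*]) = \text{tr}(A I) = \text{tr} A $$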

Here is a simple implementation in Haskell using hmatrix.

import Control.Monad (replicateM)
import Numeric.LinearAlgebra
import System.Random (randomIO)

-- | Generate a random Rademacher vector (+1/-1 with equal probability)
generateRademacherVector :: Int -> IO (Vector R)
generateRademacherVector n = do
    signs <- replicateM n randomIO
    return $ vector $ map (\b -> if b then 1 else -1) signs

-- | Estimate the trace using the Girard-Hutchinson estimator
estimateTrace :: Matrix R   -- ^ Matrix
              -> Int        -- ^ Matrix dimension
              -> Int        -- ^ Number of samples
              -> IO R       -- ^ Estimated trace
estimateTrace a n numSamples = do
    -- Generate random vectors and compute estimates
    estimates <- replicateM numSamples $ do
        v <- generateRademacherVector n
        -- Quadratic form v^T A v: we only access a through its mat-vec
        return $ v <.> (a #> v)
    -- Average the estimates
    return $ sum estimates / fromIntegral numSamples
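
A quick usage sketch, assuming the imports and definitions above (the `main` name and the 4x4 test matrix are just for illustration): build a small symmetric matrix and compare the exact trace against the estimate.

main :: IO ()
main = do
    let n = 4
        a = (4><4) [ 4, 1, 0, 2
                   , 1, 3, 1, 0
                   , 0, 1, 5, 1
                   , 2, 0, 1, 2 ] :: Matrix R
    est <- estimateTrace a n 1000
    putStrLn $ "exact trace:     " ++ show (sumElements (takeDiag a))
    putStrLn $ "estimated trace: " ++ show est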

The actual performance of this algorithm, and the correct choice of distribution for $x$, is heavily dependent on the structure of the matrix. If the trace is heavily concentrated in a few diagonal entries, for example one value of 100 and the rest zeros, we would expect indicator vectors to perform very poorly! In general, through some contortion of the central limit theorem, we should expect the error to shrink on the order of $1/\sqrt{M}$, where $M$ is the number of samples, which is not very fast, but good enough in practice for anything that might warrant such an approximation to begin with.
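
Concretely, the central-limit argument is just the usual Monte Carlo rate: the samples $x_i^* A x_i$ are i.i.d. with some variance $\sigma^2$ that depends on $A$ and on the distribution of $x$, so

$$ \text{Var}\left[\hat{\text{tr}} A\right] = \frac{\sigma^2}{M}, \qquad \text{i.e. the typical error scales like } \frac{\sigma}{\sqrt{M}}. $$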