All the best reanalyses include uncertainty estimates, and I've been using fog to mark times and regions where the analyses are too uncertain to provide much value. This works well with 20CR - it's easy to make a compelling plot showing a relatively low spread in regions with lots of observations, and a much wider spread elsewhere.
But there are many different possible metrics for uncertainty: I originally chose the ratio of ensemble standard deviation (esd) to climatological standard deviation (csd), and this mostly works, but it undervalues the reanalysis in places where there is a lot of weather. If a big storm is coming but the magnitude and location are uncertain, esd can be very large, but the reanalysis is still informative. So I should include signal magnitude as well as spread.
The obvious approach is some combination of csd/esd and (em-cm)/csd (em=ensemble mean, cm=climatological mean). Gil pointed me at the concept of Relative Entropy - more impressively known as Kullback-Leibler divergence. The K-L divergence between the reanalysis ensemble and the climatological distribution (Dkl(R||C)) is the information lost when using the climatology as an estimator of the reanalysis. In general it's hard to calculate, but under a reasonable set of simplifying assumptions (essentially that both distributions are Gaussian):
Dkl(R||C) = 1/2*(ln(csd**2/esd**2) + (esd**2/csd**2)-1 + (em-cm)**2/csd**2)
(That's for a single variable - I'm only using MSLP here. It does generalise to multiple variables (could add temperature, precip etc.), but results are dominated by the variable where 20CR has most skill, so just using MSLP is reasonable.)
So where the analysis is unconstrained by observations, esd=csd and em=cm and Dkl(R||C)=0 (might as well use the climatology). As we add observational constraints either esd will shrink or em will become different from cm or both, and Dkl(R||C) becomes positive.
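A minimal sketch of that single-variable formula in Python, assuming both distributions are Gaussian. The function name `dkl_gaussian` and the example values (MSLP in hPa) are made up for illustration, not taken from the code behind the videos.

```python
import numpy as np


def dkl_gaussian(em, esd, cm, csd):
    """Dkl(R||C) for two univariate Gaussians.

    em, esd: ensemble mean and standard deviation (reanalysis, R)
    cm, csd: climatological mean and standard deviation (C)
    Inputs may be scalars or arrays of matching shape (e.g. lat x lon).
    """
    em, esd, cm, csd = (np.asarray(x, dtype=float) for x in (em, esd, cm, csd))
    var_ratio = esd**2 / csd**2
    # -log(var_ratio) is the ln(csd**2/esd**2) term of the formula
    return 0.5 * (-np.log(var_ratio) + var_ratio - 1.0 + (em - cm) ** 2 / csd**2)


# Hypothetical values, hPa: climatology of 1012 +/- 8
unconstrained = dkl_gaussian(em=1012.0, esd=8.0, cm=1012.0, csd=8.0)
tight_spread = dkl_gaussian(em=1012.0, esd=2.0, cm=1012.0, csd=8.0)
big_storm = dkl_gaussian(em=980.0, esd=8.0, cm=1012.0, csd=8.0)
print(unconstrained, tight_spread, big_storm)
```

The first case is exactly zero, and both shrinking the spread and moving the mean away from climatology push the value up, which is the behaviour described above.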
This would be fine if the reanalysis were based on a perfect model, but in reality, even in the absence of observational constraints, esd!=csd and em!=cm, so Dkl(R||C) is large. I wangle round this by replacing the climatology with a somewhat arbitrary Uninformed distribution (U), choosing usd=max(esd,csd) and um=the 30-year running mean from the reanalysis. Dkl(R||U) is then 0 where the reanalysis is unconstrained and increases with observational constraints.
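The Uninformed-distribution version can be sketched the same way. This is an illustration under assumptions: `dkl_uninformed` and `fog_mask` are hypothetical names, `um` is taken as a precomputed input (the 30-year running mean is not calculated here), the default threshold of 1 is the fog threshold this post uses, and the numbers are invented.

```python
import numpy as np


def dkl_uninformed(em, esd, csd, um):
    """Dkl(R||U), with U defined by usd = max(esd, csd) and mean um.

    um is assumed to be supplied by the caller (e.g. a 30-year running
    mean of the reanalysis); it is not computed in this sketch.
    """
    em, esd, csd, um = (np.asarray(x, dtype=float) for x in (em, esd, csd, um))
    usd = np.maximum(esd, csd)  # U is never narrower than the ensemble
    var_ratio = esd**2 / usd**2
    return 0.5 * (-np.log(var_ratio) + var_ratio - 1.0 + (em - um) ** 2 / usd**2)


def fog_mask(dkl, threshold=1.0):
    """True where the reanalysis adds too little information (gets fogged)."""
    return np.asarray(dkl) <= threshold


# Hypothetical MSLP values in hPa.
# No observations: model spread wider than climatology, mean at um.
no_obs = dkl_uninformed(em=1012.0, esd=12.0, csd=8.0, um=1012.0)
# Well observed: tight spread and a modest anomaly.
well_obs = dkl_uninformed(em=1016.0, esd=1.0, csd=8.0, um=1012.0)
print(no_obs, fog_mask(no_obs))
print(well_obs, fog_mask(well_obs))
```

Taking the maximum means an ensemble that is wider than climatology is compared against itself, so the spread terms vanish there instead of producing a spuriously large divergence.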
The results look like this (plots show prmsl and 10m wind actuals, and air.2m anomalies - yellow dots mark observations):
and for 1918:
I do still have to choose an arbitrary threshold for fog (here Dkl(R||U) <=1) but it doesn't make an enormous difference as Dkl(R||U) goes from zero to big quite abruptly. I'd like to make such a video for the entire 140-year span of 20CR, to show the fog dissipating as observation coverage increases, but it would take forever to render and hours even to watch. But the method is general - it should work at any timescale - so I've made the long video using monthly data:
Monthly weather is a bit random and discontinuous, but it does show the improvement in 20CR as observation coverage increases. I like the variable but persistent effect of the Port Stanley observation in the Falklands.
It still feels a bit wrong to me: I was originally working just with the ensemble spread, and Dkl is far more sensitive to the mean anomaly than to the spread (it grows with the square of the anomaly, but only with the logarithm of the spread ratio) - so the weather matters much more than the analysis precision. But the literature seems clear that it's the right metric.
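To see that imbalance in numbers, the Gaussian formula can be split into a spread-only part and a mean-only part. This is a rough sketch: the helper names `spread_term` and `mean_term` and the sample ratios are illustrative choices, not anything from the analysis itself.

```python
import math


def spread_term(ratio):
    """Contribution to Dkl from spread alone; ratio = esd/csd."""
    r2 = ratio**2
    return 0.5 * (-math.log(r2) + r2 - 1.0)


def mean_term(z):
    """Contribution to Dkl from the mean alone; z = (em-cm)/csd."""
    return 0.5 * z**2


# Shrinking the spread: each halving adds roughly a constant amount
for ratio in (1.0, 0.5, 0.25, 0.125):
    print(f"esd/csd = {ratio:5.3f}  spread term = {spread_term(ratio):.3f}")

# Growing the anomaly: each doubling quadruples the contribution
for z in (0.5, 1.0, 2.0, 4.0):
    print(f"(em-cm)/csd = {z:3.1f}  mean term = {mean_term(z):.3f}")
```

A two-standard-deviation anomaly contributes 2 to Dkl on its own, whereas halving the spread contributes only about 0.3.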