Several issues arise when quality controlling (QC) station data.
In HadISD version 1, the UK Met Office Hadley Centre is using the following procedures to subset and QC the data:
[RJHD] I've added a one sentence description of the QC tests used in HadISD which were mostly outlined in the talk - just to jog memories.[/RJHD]
Prior to QC:
Duplicate Stations: full station repetition, i.e. the data under one station ID are repeated under another station ID (for example, if the merging of stations did not happen correctly).
Duplicate Months: full month repetition within station
Frequent Value Check: identify frequently occurring values over the total dataset, then flag on an annual basis if still anomalous.
Gap Check: flag months whose median differs from the average, and flag observations that differ from the rest of the population (both on a calendar month basis)
Streak Check: streaks of the same value are flagged
Spike Check: spikes of up to three points are removed
Climatological Check: observations which are different to the climatology for each hour/month as calculated from the station are flagged.
Variance Check: months with differing within-station variance (high/low) to the rest of the record are flagged
Odd Clusters: short isolated clusters of data are flagged
Diurnal Cycle: periods with diurnal cycles apparently significantly offset to the rest of the series are flagged
Humidity Checks: excessively long periods of supersaturation and dew-point depression are removed.
Cloud Cover Logical Checks: logical consistency checks for Low, Mid, High and Total cloud amounts
Neighbour Check: up to 10 neighbours within 300km and 500m height are used to check if station values are reasonable. Can also remove flags from certain tests.
Clean-Up: flag months with excessive numbers of flags or very few remaining observations.
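As an illustration of how one of the simpler tests above might look in code, here is a minimal Python sketch of a streak check. The function name `flag_streaks` and the `min_length` threshold are hypothetical and chosen for illustration only; they are not the HadISD implementation or its thresholds.

```python
from typing import List, Optional, Sequence


def flag_streaks(values: Sequence[Optional[float]], min_length: int = 10) -> List[bool]:
    """Flag observations that belong to a run of identical consecutive values.

    `min_length` is an illustrative threshold (an assumption, not a HadISD value).
    Missing observations (None) are never flagged and always end a run.
    """
    flags = [False] * len(values)
    n = len(values)
    start = 0
    while start < n:
        end = start + 1
        if values[start] is not None:
            # Extend the run while the next value equals the first value of the run.
            while end < n and values[end] == values[start]:
                end += 1
            if end - start >= min_length:
                for i in range(start, end):
                    flags[i] = True
        start = end
    return flags


# Example: a run of three identical values with an illustrative threshold of 3.
obs = [1.0, 2.0, 2.0, 2.0, 3.0, None, None, None]
streak_flags = flag_streaks(obs, min_length=3)
```

In practice the threshold would probably need to depend on the variable and on the reporting resolution of the station, since coarsely reported values repeat more often by chance.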
Issues with these procedures for hourly and sub-daily data include:
- the 300 km radius may mingle stations that have different climates when performing a neighbor check.
- weighting in favor of the nearest stations during the evaluations might be important
- surface pressure should be included as a processed variable to allow derivation of specific humidity from the dewpoint
- Coordinated and somewhat formalized feedback to NCDC should be considered
- e.g. when and how will NCDC use recommendations for station fixes uncovered by HadISD - users will eventually want to know this.
- When examining the dual valued curves, as shown in many of the QC examples, one should check that there is only one value at each time interval, i.e. that there are no duplicates at 0Z
- As much source metadata as possible from NCDC ISD should be carried forward in the netCDF metadata files, e.g.
- which ISD version is the starting point for HadISD,
- station metadata (lat, lon, elev),
- software version,
- Possibly, detailed inventories of flagging, step by step through the QC process, could be useful for examining systematic errors. These could also be station by station and maintained online for user viewing. A condensed version of this should be published, as is planned.
- Test whether a Gaussian is a good fit, and try other distributions, when characterising the width of the population distribution
- Use wind direction and wind speed together when doing QC
- If QC removes large sections of data after a clear break in the system, homogenisation checks won't find the now-removed break
- Provide metrics for flags and unflagging for final products.
- More than just 2 levels of flagging
- Use two missing data indicators, one for missing, one for flagged/removed
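To make the surface pressure point above concrete, the following Python sketch derives specific humidity from dew point and station-level pressure. It uses the Bolton (1980) approximation for vapour pressure as a stand-in; this is an assumption for illustration and not necessarily the formulation that would be used for HadISD. The function names are hypothetical.

```python
import math


def vapour_pressure_hpa(dewpoint_c: float) -> float:
    """Vapour pressure over water (hPa) from dew point (deg C).

    Bolton (1980) approximation, used here as an assumed stand-in.
    """
    return 6.112 * math.exp(17.67 * dewpoint_c / (dewpoint_c + 243.5))


def specific_humidity(dewpoint_c: float, pressure_hpa: float) -> float:
    """Specific humidity (kg/kg) from dew point (deg C) and station pressure (hPa)."""
    e = vapour_pressure_hpa(dewpoint_c)
    # 0.622 is the ratio of the molar masses of water vapour and dry air.
    return 0.622 * e / (pressure_hpa - 0.378 * e)


# Example: dew point of 10 deg C at a station pressure of 1000 hPa.
q = specific_humidity(dewpoint_c=10.0, pressure_hpa=1000.0)
print(f"q = {1000.0 * q:.2f} g/kg")
```

The dependence on pressure is why the dew point alone is not sufficient: the same dew point at a high-elevation station gives a noticeably larger specific humidity.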
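The suggestion above to weight the neighbour check in favor of the nearest stations could be sketched as follows. This is only one possible scheme (inverse-distance weighting of neighbours that pass the 300 km and 500 m limits); the function names, the tuple layout, and the 1 km floor on distance are assumptions made for this example.

```python
import math
from typing import List, Optional, Tuple

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def weighted_neighbour_mean(
    target: Tuple[float, float, float],
    neighbours: List[Tuple[float, float, float, float]],
    max_km: float = 300.0,
    max_dz_m: float = 500.0,
    max_neighbours: int = 10,
) -> Optional[float]:
    """Inverse-distance weighted mean of neighbour values.

    target is (lat, lon, elev_m); each neighbour is (lat, lon, elev_m, value).
    Returns None when no neighbour passes the distance and height limits.
    """
    lat0, lon0, elev0 = target
    candidates = []
    for lat, lon, elev, value in neighbours:
        dist = haversine_km(lat0, lon0, lat, lon)
        if dist <= max_km and abs(elev - elev0) <= max_dz_m:
            candidates.append((dist, value))
    if not candidates:
        return None
    candidates.sort(key=lambda c: c[0])
    nearest = candidates[:max_neighbours]
    # Floor the distance at 1 km (an assumption) to avoid division by zero.
    weights = [1.0 / max(dist, 1.0) for dist, _ in nearest]
    total = sum(w * v for w, (_, v) in zip(weights, nearest))
    return total / sum(weights)


# Example: two usable neighbours, one too far away, one too high.
estimate = weighted_neighbour_mean(
    (0.0, 0.0, 0.0),
    [(0.0, 1.0, 0.0, 10.0), (0.0, 2.0, 0.0, 20.0),
     (0.0, 5.0, 0.0, 100.0), (0.0, 0.5, 1000.0, 50.0)],
)
```

A station value could then be compared against this estimate rather than against an unweighted neighbour average. Distance weighting does not by itself address the concern about mixing climates within 300 km, which might need an additional similarity criterion.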