Data Rescue

Created by Cathy.Smith@noaa.gov on - Updated on 08/09/2016 11:36

This area is devoted to issues affecting those who are rescuing historical weather data for submission to the reanalyses data repositories. Finding, imaging (perhaps), and digitising the data may be just a starting point, especially if you are working with data you are not intimately familiar with. Re-casting that data in a format, and at a quality level, you feel comfortable with raises a number of concerns, many of which will be common to the data rescue community.

As questions are posted and answered, we hope you will develop the confidence to submit your datasets to these worldwide repositories.

File containing UDUNITS-compliant unit names.

Data Rescue Discussion and Planning Pages

Canadian station data rescue - concentrates on sources of station data over Canada from the 18th to 20th centuries. (login required)

 

Some international data rescue and integration efforts:

The International Surface Pressure Databank collects land station and marine surface and sea level pressure data.

The International Surface Temperature Initiative is developing a Global Land Surface Databank, which will cover all stages of data rescue, from images of any meteorological variables to keyed-in temperature observations. There is a series of blog posts on the recently released beta version of the global land surface databank at http://surfacetemperatures.blogspot.com/ and a static summary at www.surfacetemperatures.org/databank/ . The latter includes a PDF describing how to submit digitized observations.

ISTI has many activities as part of its data rescue task (described here).

The activities of the Atmospheric Circulation Reconstructions over the Earth working group on data rescue are described here.  ACRE coordinates with the World Meteorological Organization Data Rescue, IEDRO, and others as well as ISTI and ISPD and facilitates data rescue for these and other organizations.

The International Environmental Data Rescue Organization (IEDRO) works to rescue and digitize historical environmental data at risk, creating a safer and better global society. To help, you can volunteer, donate, or shop.

oldWeather is recovering historical weather observations from the logbooks of the Royal Navy and the US Navy, Coast Guard, and Coast Survey. It is a citizen science project, with a large community of volunteers reading and transcribing the logbooks. See the website and the blog.

The RECovery of Logbooks And International Marine data project (RECLAIM; Woodruff et al. 2014) is a cooperative international effort to locate and image historical ship logbooks and related marine data and metadata from archives across the globe, and to digitize the meteorological and oceanographic observations for merger into ICOADS.

Citizen science volunteers are also typing in Canadian data from archival sources at the Canadian Volunteer Data Rescue project (ACRE-Canada).

 

References:

    Woodruff, S., E. Freeman, C. Wilkinson, R. Allan, H. Anderson, P. Brohan, G. Compo, S. Claesson, W. Gloeden, F. Koek, S. Lubker, C. Marzin, G. Rosenhagen, T. Ross, M. Seiderman, S. Smith, D. Wheeler, and S. Worley, 2014: Technical Report: ICOADS Marine Data Rescue: Status and Future Priorities, 38 pp. [http://icoads.noaa.gov/reclaim/pdf/marine-data-rescue.pdf].

Peter Siegmund (not verified)

Wed, 06/29/2016 - 02:16

To stimulate and coordinate data rescue activities, the International Data Rescue (I-DARE) Portal has recently been set up at http://www.idare-portal.org. The Portal provides a single point of entry for information on the status of data awaiting rescue, on past and present data rescue projects worldwide, and on the best methods and technologies involved in data rescue. The Portal is supervised by the WMO's Expert Team on Data Rescue, under the auspices of the Global Framework for Climate Services, and is operated by the Royal Netherlands Meteorological Institute (KNMI).

victoria.slonosky

Tue, 11/11/2014 - 16:43

[Image: Mtl_McGillObs_Baro_Jan1_1877.png]
An image of some of the sub-daily McGill pressure readings. Work in progress to follow Christa and get an online data rescue & digitization project going.

Christa Pudmenzky (not verified)

Tue, 04/15/2014 - 01:36

Every year the Australian Broadcasting Corporation (ABC) runs ABC Science's National Science Week Citizen Science Project. I submitted a proposal to have the thousands of Clement Wragge's logbook images I have photographed digitised as part of the project, and my proposal has been successful. The project runs for the month of August. Christa Pudmenzky, International Centre for Applied Climate Sciences (ICACS), University of Southern Queensland, Toowoomba, Australia

In digitising the lighthouse records, the thermometer readings are so constant that we have to conclude they were taken indoors. Can we presume these readings are still valid for the ISPD? thanks, Mac

Mac, these readings are also still good. Correct using the temperature from the thermometer that is attached to the barometer. thanks! gil compo

We are digitising pressure readings from original lighthouse records. There are up to 14 sub-daily readings. Is this level of detail useful for the reanalysis model, or should we be aiming for something less? The data entry person works rapidly, so a useful answer would be either "1-3 readings only" or "all in".

Mac, Sorry for not answering this. Digitize "all in". You never know what will be useful. The reanalysis systems can make sense of the data. best wishes, gil compo

Mac, would you provide the link to how to submit images to the Global Land Surface Databank? I can't find it. thanks, gil

Peter Thorne (not verified)

Thu, 10/18/2012 - 14:19

Mac, for land meteorological data: 1. We would love to have the images if nobody else can curate them. Or even if they can... 2. We would take any elements of digitized data, but particularly temperature. It is great to have data with a full provenance trail back to the original observation. There is a series of blog posts on the recently released beta version of the global land surface databank at http://surfacetemperatures.blogspot.com/ and a static summary at www.surfacetemperatures.org/databank/ . The latter includes a PDF describing how to submit the data. We would love to see some early Australian data. Good luck, Peter

Peter, are there guidelines for how to submit images? If so, would you provide a link in the text above? thanks,
gil

There is a link to how to submit at http://www.surfacetemperatures.org/databank/DataSubmission-Stage1-Guidance.pdf?attredirects=0

Just to put a further strain on the good nature of the ISPD folk, I'd like to propose a discussion on what we do with the images we create or discover during our data digitisation activities. Our project ensures we take high-quality images of all our data and then store them in a retrievable format, so they are available for future reference and as part of the audit trail. Importantly, we are only digitising the surface pressure and some temperature data, but we often pass by other structured and unstructured weather information that could be of use in the future. In our case we have 50,000+ images which contain a wealth of background on Australian meteorology. I'm aware of very scattered repositories for these images, but nothing structured enough for us all to use, or indeed to act as the final backstop for the Data Manifest I've suggested in the previous posting in this wiki.

We're not only creating a resource of data; we're also creating a resource of images: weather journals, ships' logs, records kept by early weather enthusiasts, government reports, explorers' journals, etc. Surely there is a curating job here that needs to be considered and acted on.

Any thoughts? Presumably the WMO should be taking an interest (!!).

On submitting documents to archives, a data manifest is usually included. For the purposes of the ISPD, a manifest would be a super-metadata document describing each collection as it is submitted. This would give researchers the opportunity to judge the quality of the data, and to be informed about its provenance and any notable issues associated with the digitising of that data. These issues may not have been important for the electronic data that originally populated the ISPD; presumably it came from NMS datasets and its provenance was unassailable.

However, with the increasing shift to deep historical records, matters of interpretation are creeping in, and issues of second- and third-hand data are appearing. For instance, our project group in Australia has been working from newsprint, which raises issues of multiple handling, typos, the rigours of data entry, etc. Though this might make some of you feel insecure, the ISPD is becoming increasingly reliant on human systems as it moves away from the e-record.

Thus, I propose we establish a format for a dataset manifest that can be referenced from any of its data items in the ISPD. This would require another field in the ISPD, the manifest ID. It would also require the archiving and retrieval of manifests.

Having completed a career in systems analysis, I'm aware that you inevitably get one run at data uptake and you have to get it right; otherwise future generations will curse you. It's essential to specify for future reference where the data came from, what its special properties are (e.g. handwritten/typed, NMS data or enthusiastic amateur, etc.), and who digitised it and how. These are the starting points of a manifest document. We have a standard set by the Australian National Archives that we've worked to in our imaging project, but it's too comprehensive for our needs.

Thus, I'd be interested in hearing from anyone with ideas on the subject. I'll then collate them for submission to Gil for inclusion in the ISPD submission guidelines.
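As a discussion starter, here is a minimal sketch of what one manifest entry might look like, expressed as a Python dictionary. Every field name and value below is a hypothetical illustration drawn from the points above, not an agreed ISPD standard.

```python
# A sketch of one dataset manifest entry, expressed as a Python dict.
# Every field name below is a hypothetical illustration, not a standard.
manifest = {
    "manifest_id": "AUS-NEWSPRINT-001",   # ID a future ISPD field could reference
    "source_type": "newsprint",           # e.g. handwritten, typed, newsprint
    "provider": "enthusiastic amateur",   # or an NMS dataset
    "provenance": "Australian newspapers, 1860-1890",
    "digitisation": {
        "method": "manual keying",
        "operators": "volunteer team",
        "date": "2012-07",
    },
    "known_issues": ["multiple handling", "typographical errors in newsprint"],
}
```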

In amongst the WMO-recognised stations we are digitising, we have a clutch of smaller stations appearing erratically in our records. They don't appear in the ISPD, WMO, or local weather bureau station listings. However, we are able to identify geographically where the data was observed. Should we include these items with their best-guess lat/long and give them an ID code of 9999...?

Mac,
yes please include them with your best guess at lat/lon. You could set the station ID code to missing.
Some groups put the name of the station as the ID code when no other ID code existed.
The point of the ID code is to allow traceability back to the source provider of the data. If no code exists, whatever is going to help us relate information about this station back to you is what is most important.

thanks,
gil

Since our data is derived from secondary sources, sometimes with multiple handling involved, how can we flag that there is a degree of unreliability in it? I cannot find an explanation for the QC indicator referenced in this field, the "ISH data quality flag". Is this something we can use as a metadata quality flag?

Mac,

Just label the metadata quality flag as missing. Carrying flags that other sources have well-defined is one additional step to allow full traceability back to the source observation. In your case, it is best not to try to make an estimate without a full study.

We should modify the NCDC ASCII documentation to reference the old Integrated Surface Hourly (not Integrated Surface Database) documentation which defines the flags.

thanks for pointing this out,

gil

Most of the readings we are digitising come from secondary sources: newspapers, synoptic charts, weather journals. Should there be a code here to reflect this environment, or should we just use code 000?

(and that should do us, hopefully!)

I've checked the UDUNITS text file, and it's a complex choice for the correct reference for a reading taken with a barometer in inches of mercury. Given that our observations were made by competent late-19th-century meteorologists, which of the codes do we use in field 19?

In an earlier post, it was indicated that these fields were adjusted to GMT. Since we're working with continent-wide historical data with no proof of how the observers determined their local time (this was before standard time zones were set in Australia), we cannot truthfully convert to GMT. We note that there are no "original" time/date fields. Do we just make a best guess, or do you have some way of making the conversion yourself?

A complication for us is that most observations are at 9 am which, for eastern continental times, is most probably UTC plus 10. Thus, beginning-of-month observations have to be dated the last day of the previous month, which would require some fairly fancy software coding! thanks

Mac, if you know (or can best-guess) that the observation time is LOCAL TIME, using longitude/15 as the UTC offset is a good choice. When converting to GMT you may have to change the year and month too. I have the software to do that quickly. So if you want us to make the conversion, just fill in whatever local time is on the document in the time field, but leave Field 10 BLANK; and please also make a note, "local time used, needs conversion", when you submit the data so we will not miss it. --Yin
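For anyone who wants to try the conversion themselves, here is a minimal Python sketch of the longitude/15 approach Yin describes. The function name is an illustrative assumption; Python's datetime arithmetic handles the day/month/year rollover mentioned above automatically.

```python
from datetime import datetime, timedelta

def local_to_gmt(year, month, day, hour, lon_deg_east):
    """Convert a local-time observation to GMT using longitude/15 as the
    UTC offset. datetime arithmetic handles day/month/year rollover."""
    offset = timedelta(hours=lon_deg_east / 15.0)  # east of Greenwich is ahead of GMT
    return datetime(year, month, day, hour) - offset

# Example: 9 am local on 1 Jan 1900 at 150E (UTC+10) falls in the previous
# day, month, and year once converted.
print(local_to_gmt(1900, 1, 1, 9, 150.0))  # -> 1899-12-31 23:00:00
```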

Mac Benoy *Aus… (not verified)

Fri, 07/27/2012 - 00:45

Field 2, "Observation ID Type" shows the type of station ID in Field 1. We are using either the station ID in the ispd database or the one assigned on the Australian Bureau of Met website. What code(s) do we put in this field?

Mac Benoy *Aus… (not verified)

Thu, 07/26/2012 - 21:46

Fields 23, 24 & 25 (Latitude, Longitude and elevation) appear to be the same as fields 11, 12 & 13 "Observed lat/long/elevation"

Fields 23-25 are solely to provide the original data "as is". Fields 11-13 (latitude, longitude, and elevation) impose range restrictions, units, and an elevation definition. So the two sets will be the same only if the original data followed the conventions of latitudes running from -90.00 to 90.00, longitudes running from 000.00 to 359.99, and elevation given in metres relative to mean sea level.
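A minimal Python sketch of mapping a position onto those conventions, assuming the only adjustment needed is wrapping west longitudes into the 000.00-359.99 range; the function name is illustrative.

```python
def to_ispd_ranges(lat, lon, elev_m):
    """Map a position onto the Field 11-13 conventions: latitude -90.00 to
    90.00, longitude 000.00 to 359.99, elevation in metres above mean sea
    level. A sketch only."""
    if not -90.0 <= lat <= 90.0:
        raise ValueError(f"latitude out of range: {lat}")
    return lat, lon % 360.0, elev_m  # wraps west longitudes, e.g. -73.6 -> 286.4

print(to_ispd_ranges(45.5, -73.6, 57.0))  # -> (45.5, 286.4, 57.0)
```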

Mac Benoy *Aus… (not verified)

Thu, 07/26/2012 - 21:44

Is there a list of UDUNITS-compliant unit names?

Chesley.McColl

Mon, 08/06/2012 - 15:15

In reply to by Mac Benoy *Aus… (not verified)

This is the current list of UDUNITS compliant unit names.

Mac Benoy *Aus… (not verified)

Thu, 07/26/2012 - 21:43

Field 10 requests "Time Code". We are working with 19th-century data that used local time, generally based on the station's longitude, so do we enter "001"?

Chesley.McColl

Mon, 08/06/2012 - 14:31

In reply to by Mac Benoy *Aus… (not verified)

From your previous question, it sounds like you are using the longitude from the source to convert the times in Fields 4-8 to GMT; in that case you would need the "007" code, not the "001".

Mac Benoy *Aus… (not verified)

Thu, 07/26/2012 - 21:37

Field 4, Pos. 19-22, of the submission guidelines is "Year (GMT) of the observation record". Is this the local year, or is it corrected to GMT? E.g., would a 9 am 1 Jan 1900 reading for a station that is GMT+10 be corrected to GMT 1899? Similar issues arise for month, day, and hour.

Eric (not verified)

Thu, 01/12/2012 - 20:53

I am interested in the comparison methods: the reanalysis value for comparison with a station is obtained from the inverse-distance-weighted average of the reanalysis values of the four grid boxes whose centers lie closest to the station. How do I find the four grid boxes and compute their inverse-distance-weighted average? Do people here have some code, for Matlab perhaps? Thanks.
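No Matlab code to hand, but here is a minimal Python sketch of the method as described: find the four grid boxes whose centres lie closest to the station and average them with inverse-distance weights. It assumes a regular latitude/longitude grid and haversine distances; treat it as an illustration, not the exact code used in any published comparison.

```python
import numpy as np

def idw_at_station(grid_lats, grid_lons, field, st_lat, st_lon):
    """Inverse-distance-weighted average of the four grid boxes whose
    centres lie closest to (st_lat, st_lon). Assumes a regular lat/lon
    grid: grid_lats and grid_lons are 1-D, field is 2-D (lat, lon)."""
    lat2d, lon2d = np.meshgrid(grid_lats, grid_lons, indexing="ij")
    # Haversine great-circle distance from the station to every grid centre
    p1, p2 = np.radians(st_lat), np.radians(lat2d)
    a = (np.sin((p2 - p1) / 2) ** 2
         + np.cos(p1) * np.cos(p2) * np.sin(np.radians(lon2d - st_lon) / 2) ** 2)
    dist = 2 * 6371.0 * np.arcsin(np.sqrt(a))        # km
    idx = np.argsort(dist.ravel())[:4]               # four closest grid boxes
    w = 1.0 / np.maximum(dist.ravel()[idx], 1e-6)    # inverse-distance weights
    return np.sum(w * field.ravel()[idx]) / np.sum(w)
```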

Gil Compo (not verified)

Thu, 11/10/2011 - 21:53

Mac, this is a great question. The ISPD contains the information on how the reanalysis system used the data, including whether there were rejections. This so-called "feedback" information is intended to do just what you suggest. The idea is to preserve all of the necessary metadata from the source so that the data can be returned to the source for just the sort of improvement you are mentioning. That full feedback step has not happened yet, but it is possible, and we will be following up with the collection providers to do just that. thanks, gil compo

We do several levels of data cleaning to ready our data for submission to the ISPD. Knowing that the ISPD data-uptake system includes an even more refined form of data cleaning, further data outliers will no doubt be discovered. Presumably the ISPD system can clean most of these outliers, BUT there must still be the occasional mystery left over. Do you "return" these mysteries to the owners for possible resolution, or do you just ignore those particular data points?

I'm sure the data-uptake routines of the ISPD look for outliers in the data banks that are submitted, but we would like to do our own data cleaning first, because we may be able to make corrections on the spot before they're discovered by the ISPD routines. Thus, when comparing data day to day, what is an acceptable range of daily variance in one station's pressure readings? We are working in inches of mercury, so would a daily change of +/- 1 be significant? 1.5, 1.8, or 2? Is there a meteorologist who has an opinion on this? With the range defined, we can then determine the outliers and get to work checking the validity of our data entry or, indeed, the original observation.
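While waiting for a meteorologist's opinion on the threshold, the day-to-day jump check itself is easy to sketch in Python. The 1.0 inHg default below is only a placeholder taken from the question, not a recommended meteorological value.

```python
import numpy as np

def flag_pressure_jumps(daily_inhg, max_change=1.0):
    """Return indices of readings involved in a day-to-day change larger
    than max_change (inches of mercury). The 1.0 default is a placeholder
    from the question above, not a recommended threshold."""
    jumps = np.abs(np.diff(np.asarray(daily_inhg, dtype=float)))
    bad = np.where(jumps > max_change)[0]
    return np.unique(np.concatenate([bad, bad + 1]))  # flag both ends of each jump

# Example: the 27.8 reading triggers flags on itself and its neighbours.
print(flag_pressure_jumps([29.9, 29.8, 27.8, 29.7]))  # -> [1 2 3]
```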

Where can I get a succinct and refined list of pressure data already digitised for my area of interest? It would be ideal if this list were specific to a geographical area and a time period. With this in hand, I'd know we won't be expending effort on readings already digitised. BTW, the graphical listing at http://www.esrl.noaa.gov/psd/data/ISPD/ doesn't seem to show station names, meaning you have to guess what the dots on the map are.

The dots on the maps correspond to the history list for that version. An uncompressed version of the station history file for ISPDv3 is at https://reanalyses.org/sites/default/files/groups/users/gilbert.p.compo… which can be imported into Excel following this format https://reanalyses.org/sites/default/files/groups/users/gilbert.p.compo….

Also, the status of many stations around the world as to whether they are known to be in hard copy, scanned images, or digitized can be accessed from the link VIEW_ISPD_STATIONS_EXCEL_FILE.html at http://badc.nerc.ac.uk/browse/badc/corral/images/metobs.

I am not a weather professional, but I can see the logic and value of pressure data, because it can be used to construct isobaric maps; I can see and touch this product of our digitising work. Why should my team of volunteers spend time digitising temperatures (rainfall, cloud, wind direction, river levels, etc.)?

Mac,

the facile answer is that there is substantial value in digitizing everything now. None of us owns a crystal ball. What may seem unimportant now, or to you, may be important to someone else, or even to you in the future. There is also the question of effort and data security. It is less overall effort to digitize everything now than element by element over a period of time. Also, no medium lasts forever, and you may not have the opportunity to rescue these records 5 or 10 years down the line. In addition, it is not just the data that is important: metadata (station history) is also important for you and others to understand the data.

On the specifics of temperature, I chair the International Surface Temperature Initiative (www.surfacetemperatures.org), and the foundation of this effort is the raw data. All ways of analyzing the data benefit from having more data (more series to intercompare, or longer series), so there is substantial value in rescuing any temperature data you come across. We are starting to populate a global land surface databank with data provenance and version control (http://www.gosic.org/GLOBAL_SURFACE_DATABANK/GBD.html), which thus far has involved collating a number of sources and converting them to a common format. The next step is to merge these individual holdings (http://editthis.info/intl_surface_temp_initiative/Main_Page). In the current phase the databank is only processing temperature data, but the long-term aim is to process multi-element data (T, q, p, precipitation, etc.). So we will accept more than just temperature data at stage 1.

Submission guidance is at http://www.surfacetemperatures.org/databank/DataSubmission-Stage1-Guida…

We would welcome submissions of whatever size from yourself or any other interested parties.

Peter

How useful is geographically contiguous data? Given that digitising data is a time-consuming task, how important is it to the reanalysis models that geographically contiguous data is digitised? Are we wasting our time by doing a series of stations only 100/200/300/etc. km apart? Yes, you want EVERYTHING, but is there a priority issue here?

Mac,

Stations 100-200 km apart are actually ideal. The observations will cross-check each other through the reanalysis system. As a first priority, though, a spatially complete network across a region is most important. For the mid-latitudes, stations every 300-500 km would be the highest priority.

Where on this website do we find the definitive source of information on the long/lat of each station? Do we need to supply this data if we have applied the persistent station identifier (station ID) for each reading?

In some cases we have multiple data sources for the same reading. Occasionally they disagree. Do we submit all the multiple readings, submit single readings and choose the best fit where they conflict, or dump conflicting readings altogether?

If you could submit the data by source, that would be best. The ISPD can take all of the sources and keep them separate. If digitizing all the sources of the "same" reading is too much, pick one source and identify it. Different ISPD collection names can be used for the different sources.

Also, how much disagreement are you seeing?

Old pressure readings were recorded in inches of mercury. Do these have to be converted to pascals, or is this done electronically during the upload process?

You can supply the data in inches of mercury. If you would like to supply the data in the ASCII exchange format, it should be stored in inches of mercury in the "Original" section and converted to hPa in the data section (Field 14 for sea level pressure and Field 16 for surface pressure).
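For reference, a minimal sketch of the conversion, using the standard factor of about 33.8639 hPa per inch of mercury (at 0 degrees C and standard gravity); the function name is illustrative.

```python
INHG_TO_HPA = 33.8639  # hPa per inch of mercury at 0 C and standard gravity

def inhg_to_hpa(p_inhg):
    """Convert an inches-of-mercury reading to hPa for Fields 14/16,
    keeping the original inHg value for the 'Original' section."""
    return p_inhg * INHG_TO_HPA

print(round(inhg_to_hpa(29.92), 1))  # -> 1013.2, about standard sea level pressure
```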

We have dozens of stations digitised, but a couple of them are consistently incorrect (they just don't tally with nearby stations). Do we attempt a correction ourselves (e.g., determine the average error and adjust the series accordingly), or does the ISPD input process identify and correct/quarantine these readings?

Submit the data as is, or attempt a correction. If you attempt a correction, provide the original value as well. The ISPD station component can include your correction as metadata in the homogenization fields described in http://reanalyses.org/sites/default/files/groups/ASCII_transfer_v1_0.pdf . Either way is fine.

The ISPD itself will not quarantine the data, but the data assimilation system will and will attempt to correct it.

How important is it that our data is normalised to sea level, taking into account temperature, height above sea level, height of instrument, etc.? Do the data input processes of the ISPD determine in some fashion whether these activities have been carried out?

For the purposes of data assimilation, reducing to sea level is not important at all. We need the time of observation and the location, including the elevation of the instrument, latitude, and longitude. The data do need to be corrected for temperature and gravity; most likely, for late-19th-century data, this was already done.

The ISPD includes flags to indicate whether it is known if the source did these corrections and whether the ISPD did these corrections. In all cases, the original observations as provided by the source are maintained so that any corrections done by ISPD can be changed.
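As an illustration of the gravity part of those corrections, here is a hedged Python sketch that scales a mercury-barometer reading by the ratio of local to standard gravity, using the international gravity formula for the sea-level latitude dependence. This is the textbook form of the correction, not the ISPD's own procedure, and it ignores the (small) elevation dependence of gravity.

```python
import math

G_STANDARD = 9.80665  # m/s^2, standard gravity

def reduce_to_standard_gravity(p_obs, lat_deg):
    """Scale a mercury-barometer reading by local/standard gravity, using
    the international gravity formula for sea-level latitude dependence.
    Illustrative only; not the ISPD's own correction procedure."""
    phi = math.radians(lat_deg)
    g_local = 9.780327 * (1 + 0.0053024 * math.sin(phi) ** 2
                          - 0.0000058 * math.sin(2 * phi) ** 2)
    return p_obs * g_local / G_STANDARD

# At the equator the correction is about a quarter of a percent.
print(round(reduce_to_standard_gravity(1000.0, 0.0), 2))  # -> 997.32
```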
