CMIP5 Best Practices for Data Publication
Guidelines for CMIP5 data publishers.
This document provides guidelines and recommended best practices for CMIP5 data publishers. Data publishers should be familiar with the following documents:
- CMIP5 Data Reference Syntax (DRS). The naming conventions in this document are guided by the DRS.
- CMIP5 Controlled Vocabulary
- CMIP5 Data Description and CMIP5 Experimental Design
Each CMIP5 dataset has a unique identifier generated at publication time. Dataset identifiers have the form:
where <product>, <institute>, etc. are defined by the ESG Data Reference Syntax. For example, the requested monthly data for the GFDL CM2.1 model, historical experiment, atmosphere realm, initial run would be:
- Datasets are defined at the ensemble level: all variables associated with the ensemble/realm are contained within the dataset. The identifier does not contain a variable name. This organization greatly improves publication performance relative to publication of DRS 'atomic' datasets.
- The permitted values for each DRS category are defined in the CMIP5 Controlled Vocabulary. It is critical that the ESG publication client configuration (esg.ini) be kept up to date with respect to the vocabulary, to ensure that CMIP5 datasets are searchable in a consistent fashion across the federation.
- The ESG publication client chooses a version number when the dataset is published, which may be overridden if necessary. When the dataset is published, the version number is appended to the dataset identifier to create a 'dataset version' identifier of the form <dataset_id>.vN, where N is the version number.
For CMIP5 publication, it is strongly recommended that you use 'date style' versions of the form vYYYYMMDD. To generate this form at publication time, set the version_by_date option in the [project:cmip5] section of the publisher configuration (see ESG Publisher Configuration):
[project:cmip5]
...
version_by_date = true
- The <institute> indicates where the data originated, not the group that published it.
- The <cmor_table> field was added for consistency with version 0.28 (14 September 2010) of the Data Reference Syntax, in which it is referred to as 'MIP table'. <cmor_table> is needed to distinguish between variables in the case where the CMOR table is 6hrLev or 6hrPlev.
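The identifier rules above can be illustrated with a short Python sketch. It is not part of the ESG publication client. The field order in DRS_FIELDS, the leading "cmip5" component, and the example values are assumptions made for illustration and should be checked against the DRS document and the CMIP5 Controlled Vocabulary. The sketch also assumes that a date-style version is the year, month, and day written as one number.

```python
from datetime import date

# Assumed DRS field order for a CMIP5 dataset identifier (verify against the DRS).
DRS_FIELDS = ("product", "institute", "model", "experiment",
              "time_frequency", "realm", "cmor_table", "ensemble")


def dataset_id(**fields):
    """Join the DRS components into a dot-separated dataset identifier."""
    missing = [name for name in DRS_FIELDS if name not in fields]
    if missing:
        raise ValueError("missing DRS fields: " + ", ".join(missing))
    # Assumption: the identifier starts with the project name.
    return ".".join(["cmip5"] + [fields[name] for name in DRS_FIELDS])


def date_version(day):
    """Return a date-style version number, e.g. 20110315 for 15 March 2011."""
    return int(day.strftime("%Y%m%d"))


def dataset_version_id(ds_id, version):
    """Append the version number to form <dataset_id>.vN."""
    return "{}.v{}".format(ds_id, version)


# Illustrative values only; permitted values come from the Controlled Vocabulary.
example = dataset_id(product="output1", institute="NOAA-GFDL", model="GFDL-CM2p1",
                     experiment="historical", time_frequency="mon", realm="atmos",
                     cmor_table="Amon", ensemble="r1i1p1")
print(dataset_version_id(example, date_version(date(2011, 3, 15))))
```

Note that no variable name appears among the fields, which reflects the ensemble-level definition of a dataset.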
Standard Variable Names
In order for variable names to be searchable at the data portal, the variables must have an associated standard name as defined by the CF convention. Data may be published without a standard name that appears in the CF standard name table, but in that case the data may not be searchable by variable name.
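Before publishing, it can be useful to check which variables would not be searchable by name. The following Python sketch is one possible check, not a feature of the publication client. It assumes the XML name table uses entry and alias elements with an id attribute, and the mapping from variable names to standard names is a hypothetical stand-in for values read from the data files.

```python
import xml.etree.ElementTree as ET


def load_standard_names(xml_text):
    """Return the set of names (entries and aliases) in a CF standard name table."""
    root = ET.fromstring(xml_text)
    names = {entry.get("id") for entry in root.iter("entry")}
    names |= {alias.get("id") for alias in root.iter("alias")}
    return names


def unsearchable(variables, standard_names):
    """Return the variables whose standard_name is absent from the table."""
    return sorted(var for var, std in variables.items()
                  if std not in standard_names)


# Tiny stand-in for the real table, using the assumed entry layout.
SAMPLE_TABLE = """<standard_name_table>
  <entry id="air_temperature"><canonical_units>K</canonical_units></entry>
  <entry id="precipitation_flux"><canonical_units>kg m-2 s-1</canonical_units></entry>
</standard_name_table>"""

names = load_standard_names(SAMPLE_TABLE)
# Hypothetical variable -> standard_name mapping; "foo" has no standard name.
variables = {"tas": "air_temperature", "pr": "precipitation_flux", "foo": None}
print(unsearchable(variables, names))  # ['foo']
```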
Note that the ESG publication client keeps a copy of the CF standard name table. When a new client distribution is released, the latest name table is included in the distribution. If the publisher is not updated for a period of time, the name table can fall out of date. To install the latest name table without upgrading the publisher:
- Copy the latest XML version of the CF standard name table to $HOME/.esgcet/cf-standard-name-table.xml
- In the [initialize] section of esg.ini, set
initial_standard_name_table = %(home)s/.esgcet/cf-standard-name-table.xml
- Run esginitialize to load the new table into the database:
% esginitialize -c
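To confirm that the configured path points at the copied table before running esginitialize, a check along the following lines may help. This is a minimal sketch and assumes that esg.ini follows Python configparser interpolation rules, with home supplied as a default value; the function name is hypothetical.

```python
import configparser
import os

# Fragment of esg.ini as described above; only this one option is shown.
ESG_INI = """
[initialize]
initial_standard_name_table = %(home)s/.esgcet/cf-standard-name-table.xml
"""


def name_table_path(ini_text, home):
    """Resolve initial_standard_name_table, expanding %(home)s."""
    # Assumption: 'home' is provided as a default for interpolation.
    parser = configparser.ConfigParser(defaults={"home": home})
    parser.read_string(ini_text)
    return parser.get("initialize", "initial_standard_name_table")


path = name_table_path(ESG_INI, os.path.expanduser("~"))
print(path, "exists" if os.path.exists(path) else "is missing")
```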
The default location of the name table is:
Requested vs Extended Datasets
The CMIP5 Data Description defines a set of requested CMIP5 datasets. It is possible to publish data to CMIP5 that falls outside the requested data. For example, an experiment may be extended beyond the requested time range. In this case, the dataset should be published in two parts:
1. The requested portion should be published with product = 'output1'
2. The remaining portion should be published with product = 'output2'
This convention facilitates replicating datasets at multiple data nodes. It is expected that replicating sites will only replicate the requested portion of datasets. Note that the esgpublish utility will automatically assign the value of product.
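The time-range example can be sketched in Python as follows. This only illustrates the convention and is not how esgpublish assigns the product. The function name, the file names, and the requested end year are all hypothetical.

```python
def assign_product(files, requested_end_year):
    """Split (filename, start_year, end_year) tuples into output1 and output2."""
    products = {"output1": [], "output2": []}
    for name, start, end in files:
        if end <= requested_end_year:
            # Entirely within the requested range.
            products["output1"].append(name)
        elif start > requested_end_year:
            # Entirely beyond the requested range.
            products["output2"].append(name)
        else:
            raise ValueError(name + " straddles the requested range; split it first")
    return products


# Hypothetical files for a run extended past a requested end year of 2005.
files = [("tas_1850-1949.nc", 1850, 1949),
         ("tas_1950-2005.nc", 1950, 2005),
         ("tas_2006-2012.nc", 2006, 2012)]
result = assign_product(files, requested_end_year=2005)
print(result["output1"])
print(result["output2"])
```

A file that spans the boundary is rejected rather than assigned to either product, so that a replicating site copying only output1 receives exactly the requested portion.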
Each ESG gateway has a unique string identifier. The currently defined gateway identifiers are:
The gateway identifier is used to indicate the 'master gateway' or origin of the dataset when a replica dataset is published.
When a dataset is replicated to a different data node, it should be published with the same dataset identifier as the original dataset. If a dataset is published as a replica, for example with the --replica option of the esgpublish client, it is identified internally as a replica, and its originating gateway is retained.
Local Directory Structures
It is recommended that data nodes organize their local CMIP5 data directories according to the DRS directory layout guidelines. BADC has made the isenes.drslib package available to simplify this task.
It is also recommended that the number of root data directories be kept to a minimum. This will serve to:
- Simplify THREDDS and GridFTP installation. In the case of GridFTP it will be necessary to create a mount point for each root data directory.
- Simplify installation of VM-based data nodes.
The CMIP5 Controlled Vocabulary describes how the above recommendations can be configured in the ESG publication client. Two files are of particular importance:
- Configuration file (esg.ini): The [project:cmip5] section contains CMIP5-specific configuration options. If the data node installation script is used, the configuration file will be initialized with this project section. Otherwise, the configuration parameters are contained in the publisher software in:
- Models table (esgcet_models_table.txt): For CMIP5 the table should contain the CMIP5 model entries as listed in the CMIP5 Controlled Vocabulary. They are entered into the data node database with:
% esginitialize -c
The location of the models table is identified by the initial_models_table option in the [initialize] section of esg.ini:
[initialize]
...
initial_models_table = %(home)s/.esgcet/esgcet_models_table.txt
Refer to the ESG publisher configuration documentation for details.
ESG Gateway Dataset Hierarchy
Each ESG gateway stores a hierarchical representation of its datasets. Each dataset is published into a parent dataset, also known as a collection. Each collection has a dataset identifier and is associated with one or more groups that have read and/or write permission for the datasets in the collection. The details of collection identifiers are left to the respective gateway administrators.
On the PCMDI gateway, the CMIP5 collections have identifiers of the form 'pcmdi.<publishing_institution>', where <publishing_institution> is the institution that owns the publishing data node associated with the PCMDI gateway. For example, if institution FOO has a data node that will publish data to the PCMDI gateway, the parent collection identifier for that institution's datasets will be 'pcmdi.FOO'. In addition, there will be a group named FOO, which will be the only group with write permission to the pcmdi.FOO collection.