
Preparing predictor data

miturbide edited this page Jun 20, 2018 · 21 revisions

Introduction

This document illustrates the preparation of different predictor configurations in perfect-prog experiments.

Broadly speaking, there are two main configurations (1 and 2), which can optionally be complemented with local information (3):

  1. Using atmospheric fields "as they are" for a given spatial domain.
  2. Using principal components (PCs) obtained from these fields. These can be PCs calculated for each variable separately and/or combined PCs calculated jointly from several predictor variables.
  3. In addition to the raw atmospheric fields or PCs, which act as synoptic descriptors, local information on a particular variable or set of variables can also be included as a predictor in the calibration phase (e.g. the surface temperature in the nearest grid cells around a given station).

The large number of options involved in fine-tuning a downscaling method calls for a flexible yet easily configurable interface that lets users launch complex experiments to test different predictor setups. The function prepareData has been designed for this purpose. A few reproducible examples are presented in this vignette.

A note on the terminology used

In the climate4R bundle (see e.g. Cofiño et al. 2017), atmospheric variables are stored in so-called data grids. To handle efficiently the multiple variables used as predictors in downscaling experiments, "stacks" of grids are used. These are known as multiGrids, and can be built with the constructor makeMultiGrid from a set of dimensionally consistent grids.

Example data

Predictors

Daily data from the NCEP reanalysis (Kalnay et al. 1996) are used as an example. In particular, a domain centered on the Iberian Peninsula is considered, and three variables (mean sea-level pressure psl, specific humidity at 850 mb hus850 and air temperature at 850 mb ta850) are used as predictors. These are built-in example datasets from the package transformeR of the climate4R bundle. See for instance help("NCEP_Iberia_hus850", package = "transformeR") for further details.

library(transformeR)
data("NCEP_Iberia_hus850", "NCEP_Iberia_psl", "NCEP_Iberia_ta850")

Here we use function spatialPlot from visualizeR (also part of the climate4R bundle) to visualize grids and/or multiGrids.

require(visualizeR)
spatialPlot(climatology(NCEP_Iberia_psl), backdrop.theme = "coastline",
                main = "Mean DJF SLP (1983-2002)")
spatialPlot(climatology(NCEP_Iberia_hus850), backdrop.theme = "coastline",
                main = "Mean DJF hus850 (1983-2002)")
spatialPlot(climatology(NCEP_Iberia_ta850), backdrop.theme = "coastline",
                main = "Mean DJF ta850 (1983-2002)")

The grids are already spatially and temporally consistent, so they can be stacked in a multiGrid structure:

x <- makeMultiGrid(NCEP_Iberia_hus850, NCEP_Iberia_psl, NCEP_Iberia_ta850)

Predictands

The predictands correspond to the observations used for downscaling. These are typically meteorological station data or interpolated gridded datasets. In this example, we use a subset of the stations used in a large intercomparison experiment of downscaling methods performed in the framework of the COST Action VALUE (Maraun et al. 2015). The dataset loaded below, VALUE_Iberia_tas, contains daily mean surface temperature (tas), which is therefore the target variable here.

data("VALUE_Iberia_tas")
y <- VALUE_Iberia_tas
spatialPlot(climatology(y), backdrop.theme = "countries", cex = 1.5,
                main = "Mean winter daily tas (1983-2002)")

Although this example only prepares the predictors (the predictands are not manipulated), the predictand is always required in order to ensure the spatio-temporal consistency of the experiment. The function internally handles non-overlapping temporal periods, ensuring a perfect match between predictors and predictand prior to model calibration.
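
The temporal matching described above can be sketched in base R. This is only an illustration of the idea with hypothetical date vectors (x_dates and y_dates are not part of the example datasets), and it is not the code of getTemporalIntersection:

```r
# Hypothetical daily time axes for the predictors (x) and the predictand (y)
x_dates <- seq(as.Date("1983-01-01"), as.Date("1983-01-10"), by = "day")
y_dates <- seq(as.Date("1983-01-06"), as.Date("1983-01-15"), by = "day")

# Dates present in both series (indexing keeps the Date class)
common <- x_dates[x_dates %in% y_dates]

# Positions of the common dates in each series, used to subset both objects
x_idx <- which(x_dates %in% common)
y_idx <- which(y_dates %in% common)

# After subsetting, both time axes are identical
stopifnot(identical(x_dates[x_idx], y_dates[y_idx]))
length(common)  # 5 overlapping days (6 to 10 January 1983)
```

Only the days shared by both series would enter the calibration; the rest are discarded on either side.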

Worked examples

library(downscaleR)

Brief description of the arguments in prepareData

  • y: This is the predictand object. If required, it is a subset of the original one in order to ensure the temporal consistency with the predictors (this is achieved internally through the helper function getTemporalIntersection in transformeR).

  • global.vars: Optional character vector with the short names of the variables of the input x multigrid to be retained as global predictors (use the getVarNames helper if not sure about variable names). This argument just produces a call to subsetGrid, but it is included here for greater flexibility in downscaling experiments (predictor screening, etc.). For instance, it allows using some of the variables contained in x as local predictors and the remaining ones, specified in global.vars, either as raw global predictors or to construct the combined PC.

  • combined.only: Optional, and only used if spatial.predictors parameters are passed. Should the combined PC be used as the only global predictor? Defaults to TRUE. Otherwise, the combined PC constructed with the which.combine argument of prinComp is appended to the PCs of the remaining variables within the grid.

  • spatial.predictors: Optional named list of arguments in the form argument = value, with the arguments to be passed to prinComp to perform Principal Component Analysis of the predictors grid (x).

  • local.predictors: Named list of arguments in the form argument = value, with the following arguments:

    • vars: names of the variables in x to be used as local predictors.
    • fun: Optional. Aggregation function for the selected local neighbours. The aggregation function is specified as a list, indicating the name of the aggregation function in first place (as character), followed by any other optional arguments to be passed to it. For instance, to compute the average skipping missing values: fun = list(FUN = "mean", na.rm = TRUE).
    • n: Number of nearest neighbours to use. If a single value is introduced, and there is more than one variable in vars, the same value is used for all variables. Otherwise, this should be a vector of the same length as vars to indicate a different number of nearest neighbours for different variables.
  • extended.predictors: This is a parameter related to the extreme learning machine and reservoir computing framework where input data is randomly projected into a new space of size n. It is a named list of arguments in the form argument = value, with the following arguments:

    • n: A numeric value. Indicates the size of the random nonlinear dimension where the input data is projected.
    • module: A numeric value (Optional). Indicates the size of the mask's module. Belongs to a specific type of ELM called RF-ELM.
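
To give an intuition of the random projection mentioned for extended.predictors, the following base R sketch maps a hypothetical predictor matrix into a nonlinear random space of size n. It is a generic random-feature projection written for illustration; the weight ranges, the tanh activation and all object names are assumptions, and it is not meant to reproduce the internals of downscaleR (the module option is not covered):

```r
set.seed(1)

# Hypothetical predictor matrix: 100 days x 6 predictors
X <- matrix(rnorm(100 * 6), nrow = 100, ncol = 6)

# Size of the random nonlinear space (plays the role of 'n')
n <- 20

# Random input weights and biases, drawn once and then kept fixed
W <- matrix(runif(ncol(X) * n, min = -1, max = 1), nrow = ncol(X), ncol = n)
b <- runif(n, min = -1, max = 1)

# Projection: linear map plus bias, followed by a nonlinear activation
H <- tanh(sweep(X %*% W, 2, b, "+"))

dim(H)  # 100 rows (days) and 20 columns (random features)
```

In this kind of scheme only the mapping from H to the predictand is fitted, while W and b stay untouched.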

Case studies

Using the raw predictor variables

In this situation, the multigrid is directly passed to prepareData. Note that only x (predictors multigrid) and y (predictand, stations) are passed to the function, while the rest of arguments are left with their default value NULL (they could be omitted):

out <- prepareData(x = x,
                          y = y,
                          global.vars = NULL,
                          spatial.predictors = NULL,
                          local.predictors = NULL)
str(out)
str(attributes(out))

Using PCs as predictors

In this example, instead of using the raw fields as predictors, we will use principal components. The principal component analysis can be tuned by passing any of the arguments accepted by transformeR::prinComp (detailed in the help of that function) to the argument spatial.predictors of prepareData, in the form of a named list.

Note that the multigrid contains 3 input variables, which are:

getVarNames(x)

Here we retain the first 10 PCs of the first variable and the first 5 PCs of each of the other two (n.eofs takes one value per variable, in the order of the variables in the multigrid), hence:

out <- prepareData(x = x,
                          y = y,
                          spatial.predictors = list(n.eofs = c(10,5,5))
)

In this case, the output element pca contains the full output of prinComp, which will be needed in subsequent steps of the downscaling. There is also a global attribute PCA.pars containing metadata of the PCA options chosen. Note that the x.global element of the output now contains the PC matrix of all the variables in the input grid, and thus it has $10+5+5=20$ columns, corresponding to the number of PCs indicated for each variable in n.eofs.

str(out)
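
The column count of the global predictor matrix can be reproduced with a small base R sketch. It uses stats::prcomp on random, hypothetical fields (the list fields and the 30 grid points per variable are assumptions), so it only illustrates the bookkeeping of "a given number of PCs per variable" and not the actual prinComp computation:

```r
set.seed(42)
n_days <- 200

# Three hypothetical variables, each flattened to a days x grid-points matrix
fields <- list(
  hus850 = matrix(rnorm(n_days * 30), n_days, 30),
  psl    = matrix(rnorm(n_days * 30), n_days, 30),
  ta850  = matrix(rnorm(n_days * 30), n_days, 30)
)

# Number of PCs to retain for each variable, in the same order as 'fields'
n_eofs <- c(10, 5, 5)

# PCA of each variable separately, keeping the first k PCs of each
pcs <- Map(function(field, k) {
  pca <- prcomp(field, center = TRUE, scale. = TRUE)
  pca$x[, seq_len(k), drop = FALSE]
}, fields, n_eofs)

# Global predictor matrix: the retained PCs of all variables side by side
x_global <- do.call(cbind, pcs)
dim(x_global)  # 200 rows (days) and 10 + 5 + 5 = 20 columns
```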

Introducing local predictors

Finally, it is possible to specify local predictors using the corresponding argument. It is passed as a named list with the following elements:

  • vars: the names of the variables to be used as local predictors
  • n: the number of nearest points/grid-boxes to the predictand location to be used
  • fun: the aggregation function of the selected neighbours (if any). If NULL (the default), the nearest neighbours are not aggregated but just appended to the predictor matrix.

TIP: In order to select variables, either for local predictors or for multigrid subsetting via global.vars, the helper getVarNames from transformeR may be useful:

getVarNames(x)

Next, we will use the raw fields of air temperature at 850 mb and sea-level pressure as global predictors (global.vars = c("ta@850", "psl")). In addition, specific humidity at 850 mb will be used as a local predictor (vars = "hus@850"). In this example, we will consider the 4 closest points to the predictand location (n = 4), and no aggregation of the neighbours is undertaken (fun = NULL; this argument could be omitted, as this is the default).

out <- prepareData(x = x,
                          y = y,
                          global.vars = c("ta@850", "psl"),
                          local.predictors = list(vars = "hus@850",
                                                  n = 4,
                                                  fun = NULL)
)
str(out)

Instead of using all 4 neighbours separately, these can be aggregated, for instance using their average value as local predictor (fun = list(FUN = "mean")). Any other function, including a user-defined one, can be used by adding the additional arguments it needs to the fun list (for instance, the 90th percentile would be specified as fun = list(FUN = "quantile", probs = 0.9)).

out <- prepareData(x = x,
                          y = y,
                          global.vars = c("ta@850", "psl"),
                          local.predictors = list(vars = "hus@850",
                                                  n = 4,
                                                  fun = list(FUN = "mean"))
)
str(out)
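
The two local-predictor options above (neighbours kept separately or aggregated) can be sketched in base R as follows. The grid, the station coordinates, the field values and the plain Euclidean distance in degrees are all assumptions made for illustration; the neighbour search actually performed by prepareData may differ:

```r
set.seed(7)

# Hypothetical regular grid and one station location (degrees)
grid <- expand.grid(lon = seq(-10, 5, by = 2.5), lat = seq(35, 45, by = 2.5))
station <- c(lon = -3.7, lat = 40.4)

# Hypothetical local variable: 50 days x number of grid points
field <- matrix(rnorm(50 * nrow(grid)), nrow = 50)

# Squared Euclidean distance in degrees (a simplification of geographic distance)
d2 <- (grid$lon - station[["lon"]])^2 + (grid$lat - station[["lat"]])^2
nn <- order(d2)[1:4]  # indices of the 4 nearest grid points

# fun = NULL: the 4 neighbours are kept as separate predictor columns
x_local_raw <- field[, nn, drop = FALSE]

# fun = list(FUN = "mean", na.rm = TRUE): the neighbours are aggregated per day.
# The first element names the function; the rest are extra arguments for it.
fun <- list(FUN = "mean", na.rm = TRUE)
x_local_mean <- apply(x_local_raw, 1, function(v) do.call(fun$FUN, c(list(v), fun[-1])))

# The same mechanism with another function and its own arguments
fun_q <- list(FUN = "quantile", probs = 0.9)
x_local_q90 <- apply(x_local_raw, 1, function(v) do.call(fun_q$FUN, c(list(v), fun_q[-1])))

dim(x_local_raw)      # 50 days x 4 neighbours
length(x_local_mean)  # one aggregated value per day
```

Without aggregation each station contributes n columns per local variable, whereas with aggregation it contributes a single column.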

Combining PCs with local predictors

This is probably the most typical configuration of predictors. The global information of the PCs is complemented with local information from an additional variable. In this example we retain temperature and sea-level pressure as global predictors, using their PCs, and include local information for specific humidity at 850 mb. In this case, we indicate that we want to keep the PCs explaining no less than 95% of the total variance for each global variable (v.exp = c(0.95, 0.95)). Note that in the previous PCA example we indicated the number of PCs (argument n.eofs) instead of the amount of explained variance:

out <- prepareData(x = x,
                          y = y,
                          global.vars = c("ta@850", "psl"),
                          spatial.predictors = list(v.exp = c(.95, .95)),
                          local.predictors = list(vars = "hus@850",
                                                  n = 4,
                                                  fun = NULL)
)
str(out)

The x.global matrix now has 7 columns, which is the total number of PCs retained for sea-level pressure and air temperature in order to reach 95% of explained variance for each variable.
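
How a variance threshold translates into a number of PCs can be sketched with base R. The field below is synthetic (three latent patterns plus noise, an assumption made only to obtain a spatially correlated field), and stats::prcomp stands in for prinComp:

```r
set.seed(3)
n_days <- 300

# Hypothetical spatially correlated field: 3 latent patterns plus small noise
latent   <- matrix(rnorm(n_days * 3), n_days, 3)
loadings <- matrix(rnorm(3 * 40), 3, 40)
field    <- latent %*% loadings + matrix(rnorm(n_days * 40, sd = 0.1), n_days, 40)

pca <- prcomp(field, center = TRUE, scale. = FALSE)

# Cumulative fraction of variance explained by the leading PCs
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Smallest number of PCs whose cumulative explained variance reaches 95%
v_exp  <- 0.95
n_keep <- which(explained >= v_exp)[1]

pcs <- pca$x[, seq_len(n_keep), drop = FALSE]
ncol(pcs) == n_keep
```

The number of columns therefore depends on the data, which is why it may differ from one variable to another even with the same threshold.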

Another option is to use the joint/combined PCs as predictors. To do so, we pass the variables to be combined when calculating the PCs to the argument which.combine.

out <- prepareData(x = x,
                          y = y,
                          spatial.predictors = list(v.exp = 0.95, which.combine = getVarNames(x)),
                          local.predictors = list(vars = "hus@850",
                                                  n = 4,
                                                  fun = NULL)
)
str(out)
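
The idea of a combined PC can be illustrated with a base R sketch: the fields of several variables are standardized, joined column-wise and analysed with a single PCA. The synthetic fields, the per-grid-point standardization with scale() and the use of stats::prcomp are assumptions for illustration; the scaling applied by prinComp may be different:

```r
set.seed(11)
n_days <- 150

# Hypothetical fields with very different magnitudes (days x grid points)
fields <- list(
  hus850 = matrix(rnorm(n_days * 20, mean = 0.005,  sd = 0.002), n_days, 20),
  psl    = matrix(rnorm(n_days * 20, mean = 101300, sd = 800),   n_days, 20),
  ta850  = matrix(rnorm(n_days * 20, mean = 280,    sd = 5),     n_days, 20)
)

# Standardize each field so that no variable dominates because of its units
scaled <- lapply(fields, scale)

# Join all variables into a single days x (3 * grid points) matrix
joint <- do.call(cbind, scaled)

# One PCA over the joint matrix yields the combined PCs
pca <- prcomp(joint, center = FALSE, scale. = FALSE)
explained <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
n_keep <- which(explained >= 0.95)[1]
combined_pcs <- pca$x[, seq_len(n_keep), drop = FALSE]

dim(joint)          # 150 days x 60 columns
nrow(combined_pcs)  # 150 days, one column per retained combined PC
```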

References

  • Cofiño, A.S., Bedia, J., Iturbide, M., Vega, M., Herrera, S., Fernández, J., Frías, M.D., Manzanas, R., Gutiérrez, J.M., 2017. The ECOMS User Data Gateway: Towards seasonal forecast data provision and research reproducibility in the era of Climate Services. Climate Services. doi:10.1016/j.cliser.2017.07.001

  • Kalnay, E., Kanamitsu, M., Kistler, R., Collins, W., Deaven, D., Gandin, L., Iredell, M., Saha, S., White, G., Woollen, J., Zhu, Y., Leetmaa, A., Reynolds, R., Chelliah, M., Ebisuzaki, W., Higgins, W., Janowiak, J., Mo, K.C., Ropelewski, C., Wang, J., Jenne, R., Joseph, D., 1996. The NCEP/NCAR 40-Year Reanalysis Project. Bulletin of the American Meteorological Society 77, 437–471. doi:10.1175/1520-0477(1996)077<0437:TNYRP>2.0.CO;2

  • Maraun, D., Widmann, M., Gutiérrez, J.M., Kotlarski, S., Chandler, R.E., Hertig, E., Wibig, J., Huth, R., Wilcke, R.A.I., 2015. VALUE: A framework to validate downscaling approaches for climate change studies. Earth’s Future 3, 2014EF000259. doi:10.1002/2014EF000259



Session info

print(sessionInfo(package = c("transformeR", "downscaleR")))

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 14.04.5 LTS
## 
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=es_ES.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=es_ES.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=es_ES.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## character(0)
## 
## other attached packages:
## [1] transformeR_1.3.3 downscaleR_3.0.0 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.15        compiler_3.4.3      highr_0.6          
##  [4] methods_3.4.3       bitops_1.0-6        iterators_1.0.8    
##  [7] utils_3.4.3         tools_3.4.3         grDevices_3.4.3    
## [10] deepnet_0.2         digest_0.6.13       dotCall64_0.9-5.2  
## [13] evd_2.3-2           gtable_0.2.0        evaluate_0.10.1    
## [16] lattice_0.20-35     Matrix_1.2-7.1      foreach_1.4.3      
## [19] yaml_2.1.16         parallel_3.4.3      spam_2.1-2         
## [22] akima_0.6-2         gridExtra_2.2.1     stringr_1.2.0      
## [25] knitr_1.18          raster_2.6-7        gridGraphics_0.2   
## [28] graphics_3.4.3      datasets_3.4.3      stats_3.4.3        
## [31] fields_9.6          maps_3.2.0          rprojroot_1.3-2    
## [34] grid_3.4.3          glmnet_2.0-13       base_3.4.3         
## [37] rmarkdown_1.8       sp_1.2-7            magrittr_1.5       
## [40] backports_1.1.2     codetools_0.2-15    htmltools_0.3.6    
## [43] MASS_7.3-44         abind_1.4-5         stringi_1.1.5      
## [46] RCurl_1.95-4.10     RcppEigen_0.3.3.3.1