
Return more things #18

Open · dmi3kno opened this issue May 20, 2021 · 10 comments

@dmi3kno commented May 20, 2021

Awesome package. I switched to using fmcmc and love it! Really streamlined interface, burgeoning functionality, stellar performance.

There's one feature I miss from adaptMCMC, and that is the ability to return multiple things from the log-density function. Here's a quote describing the p argument of adaptMCMC::MCMC():

function that returns a value proportional to the log probability density to sample from. Alternatively it can be a function that returns a list with at least one element named log.density.
In some cases the function p may not only calculate the log density but return a list containing also other values. For example, if p is a log posterior one may be also interested to store the corresponding prior and likelihood values. The function must either return always a scalar or always a list; however, the length of the list may vary.

I understand that parsing the returned list will cost some performance, but it would enable more interesting output from MCMC and, in particular, allow users to transform parameters (similar to the transformed parameters block in Stan).
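
For illustration, this is roughly what such a target function looks like on the adaptMCMC side (just a sketch with a made-up normal model, where D is the data vector):

# An adaptMCMC-style target that returns a list instead of a scalar;
# the only required element is log.density, the rest is extra output
p <- function(x) {
  lprior <- sum(dnorm(x, mean = c(2, 3), log = TRUE)) # log-prior for (mu, sigma)
  llik   <- sum(dnorm(D, x[1], x[2], log = TRUE))     # log-likelihood
  list(log.density = lprior + llik, prior = lprior, likelihood = llik)
}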

@gvegayon (Member)

Interesting! This sounds like a good feature to add. Perhaps we could take advantage of the fmcmc::last_() function. Values from the last run are stored in a special environment.

For the moment, take a look at this example, in which I compute the prior and keep track of the last value in LAST_MCMC:

library(fmcmc)

set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])
fun <- function(x) {
    
    # Computing the prior log-densities for (mu, sigma) and storing them in LAST_MCMC
    priorval <- dnorm(x, c(2, 3), log = TRUE)
    LAST_MCMC$priorval <- priorval
    
    sum(dnorm(D, x[1], x[2], log = TRUE)) + sum(priorval)
}

# Calling MCMC, but first, loading the coda R package for
# diagnostics
library(coda)
ans <- MCMC(
    fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
    kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0)
)

last_("priorval")
#>         mu      sigma 
#> -1.0795954 -0.9221123

Created on 2021-05-20 by the reprex package (v2.0.0)

I will add this example to the manual, thanks!

@dmi3kno (Author) commented May 20, 2021

This looks cool, but right now it can store only a single value, right? I was thinking of returning the values of several (derived) parameters at every iteration, so that the final object would contain not only the parameter draws but also these computed values for every iteration. I am doing some parameter transforms and would love to report my transformed parameter alongside the pretty boring uniform prior. You could probably report "generated values" as well, by drawing a random value at every iteration using the sampled parameter value; this could be useful for prior predictive checks. BTW, Stan reports lp__, a per-draw vector of log-density values; did you ever consider recording it in the fit object as well?
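
To be concrete, this is the kind of log-density I have in mind (a sketch building on your LAST_MCMC pattern; whether LAST_MCMC can accumulate values like this is exactly my question):

fun <- function(x) {
  # Derived ("transformed") parameter, e.g. the ratio of sigma to mu
  LAST_MCMC$cv <- c(LAST_MCMC$cv, x[2] / x[1])
  # "Generated value": one simulated draw per iteration, for predictive checks
  LAST_MCMC$yrep <- c(LAST_MCMC$yrep, rnorm(1, x[1], x[2]))
  sum(dnorm(D, x[1], x[2], log = TRUE))
}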

@gvegayon (Member) commented Jun 8, 2021

That is possible too. For example, if you wanted to keep the values for each iteration, you'd replace fun with the following:

library(fmcmc)

set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])
fun <- function(x) {
  
  # Computing the prior and appending it to LAST_MCMC
  priorval <- dnorm(x, c(2, 3), log = TRUE)
  LAST_MCMC$priorval <- rbind(LAST_MCMC$priorval, priorval)
  
  sum(dnorm(D, x[1], x[2], log = TRUE)) + sum(priorval)
}

# Calling MCMC, but first, loading the coda R package for
# diagnostics
library(coda)
ans <- MCMC(
  fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0)
)

head(last_("priorval"))
#>                 mu     sigma
#> priorval -1.418939 -2.918939
#> priorval -1.418939 -2.918939
#> priorval -1.429507 -2.657561
#> priorval -1.429036 -2.526366
#> priorval -1.547106 -2.339950
#> priorval -1.688518 -2.572419

Created on 2021-06-07 by the reprex package (v2.0.0)

Regarding returning the vector of computed likelihoods, I can see how beneficial that could be, especially if you are dealing with a function that is particularly slow to compute. I'll add that to the next iteration.

@dmi3kno (Author) commented Jun 23, 2021

Hi, and thank you for this. Extremely helpful.
A couple of questions:

  • LAST_MCMC seems to contain a couple of pre-warmup iterations with the initial values, so the total object size in LAST_MCMC does not match the size of the posterior sample. Can you confirm that the odd samples are always at the beginning, so I can discard them? The objective is to merge the samples with the posterior using posterior::bind_draws() or dplyr::left_join() once I restore the index in the prior draws (see the sketch after this list).
library(fmcmc)
library(coda)
set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])

fun <- function(x) {
  # Storing the parameter values at which fun gets evaluated
  LAST_MCMC$prior <- rbind(LAST_MCMC$prior, x)
  # Computing the prior and appending it to LAST_MCMC
  priorval <- dnorm(x, c(2, 3), log = TRUE)
  LAST_MCMC$priorval <- rbind(LAST_MCMC$priorval, priorval)
  # Appending the log-posterior (log-likelihood plus log-prior)
  loglik <- sum(dnorm(D, x[1], x[2], log = TRUE)) + sum(priorval)
  LAST_MCMC$loglik <- rbind(LAST_MCMC$loglik, loglik)
  loglik
}


ans <- MCMC(
  fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0)
)

dim(last_("prior"))
#> [1] 2002    2
head(last_("prior"))
#>          mu    sigma
#> x 1.0000000 1.000000
#> x 1.0000000 1.000000
#> x 0.9894871 1.135263
#> x 0.9899534 1.206998
#> x 0.8791363 1.314170
#> x 0.7593717 1.181495
dim(last_("priorval"))
#> [1] 2002    2
head(last_("priorval"))
#>                 mu     sigma
#> priorval -1.418939 -2.918939
#> priorval -1.418939 -2.918939
#> priorval -1.429507 -2.657561
#> priorval -1.429036 -2.526366
#> priorval -1.547106 -2.339950
#> priorval -1.688518 -2.572419
dim(last_("loglik"))
#> [1] 2002    1
head(last_("loglik"))
#>             [,1]
#> loglik -6431.756
#> loglik -6431.756
#> loglik -5336.329
#> loglik -4902.609
#> loglik -4495.771
#> loglik -5317.829
head(ans)
#> Markov Chain Monte Carlo (MCMC) output:
#> Start = 1 
#> End = 7 
#> Thinning interval = 1 
#>          mu    sigma
#> 1 0.9894871 1.135263
#> 2 0.9899534 1.206998
#> 3 0.8791363 1.314170
#> 4 0.8791363 1.314170
#> 5 0.8791363 1.314170
#> 6 0.8791363 1.314170
#> 7 1.1162015 1.397068

Created on 2021-06-23 by the reprex package (v2.0.0)

  • LAST_MCMC only works with multicore = FALSE.
  • Things become a little fuzzier with nchains > 1. As far as I can tell, I would need to separate the last_() output by chain and then discard the first samples from each chain.
  • Is the only way to empty last_() to restart the session? Is there some "purging" function I could call to empty the log inside LAST_MCMC? For now, I do the following prior to sampling (which looks hacky, but works; see the sketch after this list):
LAST_MCMC$prior <- NULL
LAST_MCMC$priorval <- NULL
LAST_MCMC$loglik <- NULL
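
To make the first and last points concrete, here is the post-processing I am doing right now (a sketch; it assumes the two extra rows always sit at the top, as the output above suggests, and that LAST_MCMC is a plain environment):

# Discard the two pre-warmup rows so the stored values line up with the chain
prior_draws <- last_("prior")[-(1:2), , drop = FALSE]
stopifnot(nrow(prior_draws) == nrow(ans))

# Hacky "flush": reset the accumulated objects before the next MCMC() call
for (nm in c("prior", "priorval", "loglik"))
  assign(nm, NULL, envir = LAST_MCMC)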

Again, thanks for this tip. This looks really helpful and extensible.

@gvegayon (Member) commented Jun 24, 2021

All great points. Let me go one by one:

Sample size not matching

Yes, there is a difference. The current version actually calls the main function two times before entering the MCMC loop; thus, if running without thinning or burn-in, you should have two extra observations.

Still, you can find out where in the run you are by accessing the calling environment with parent.frame() (a hack :P). There you should be able to find a variable named i, the step number:

library(fmcmc)

set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])
fun <- function(x) {
  
  # Getting the calling frame. This function is wrapped by `MCMC`, so we need
  # to go two levels up to take a look at the environment where `MCMC` is calling
  # the function.
  mcmc_env <- parent.frame(2) # This should contain -i-, the step number
  
  # Computing the prior, and storing it in LAST_MCMC  
  priorval <- dnorm(x, c(2, 3), log = TRUE)
  LAST_MCMC$priorval <- rbind(
    LAST_MCMC$priorval,
    c(ifelse(exists("i", envir = mcmc_env), mcmc_env$i, NA), priorval)
  )
  
  sum(dnorm(D, x[1], x[2], log = TRUE)) + sum(priorval)
}

# Calling MCMC, but first, loading the coda R package for
# diagnostics
library(coda)
ans <- MCMC(
  fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0)
)

head(last_("priorval"))
#>                mu     sigma
#> [1,] NA -1.418939 -2.918939
#> [2,]  2 -1.429507 -2.657561
#> [3,]  3 -1.429036 -2.526366
#> [4,]  4 -1.547106 -2.339950
#> [5,]  5 -1.688518 -2.572419
#> [6,]  6 -1.786716 -2.669918

Created on 2021-06-24 by the reprex package (v2.0.0)

If you run the chain using thin = 1 and burnin = 0, you should get the following:

ans <- MCMC(
  fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0),
  burnin = 0,
  thin = 1
)
nrow(last_("priorval"))
#> [1] 2002
nrow(ans)
#> [1] 2000

THIS IS SOMETHING I HAVE FIXED IN THE DEVELOPMENT VERSION.

LAST_MCMC only works with multicore = FALSE

Yes, that is true. I need to figure out a way to solve that. Perhaps, just like the mcmc.list object, I could return a sort of list for each created element. What do you think? Alternatively, you can always create your own cluster object and recover the LAST_MCMC object on your own, e.g.:

library(fmcmc)

set.seed(1231)
n <- 1e3
pars <- c(mean = 2.6, sd = 3)

# Generating data and writing the log likelihood function
D <- rnorm(n, pars[1], pars[2])
fun <- function(x) {
  
  # Getting the calling frame. This function is wrapped by `MCMC`, so we need
  # to go two levels up to take a look at the environment where `MCMC` is calling
  # the function.
  mcmc_env <- parent.frame(2) # This should contain -i-, the step number
  
  # Computing the prior, and storing it in LAST_MCMC  
  priorval <- dnorm(x, c(2, 3), log = TRUE)
  LAST_MCMC$priorval <- rbind(
    LAST_MCMC$priorval,
    c(ifelse(exists("i", envir = mcmc_env), mcmc_env$i, NA), priorval)
  )
  
  sum(dnorm(D, x[1], x[2], log = TRUE)) + sum(priorval)
}

# Calling MCMC, but first, loading the coda R package for
# diagnostics
library(coda)

library(parallel)
cl <- makePSOCKcluster(2)
clusterEvalQ(cl, library(fmcmc)) # to create the LAST_MCMC obj
#> [[1]]
#> [1] "fmcmc"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
#> [7] "methods"   "base"     
#> 
#> [[2]]
#> [1] "fmcmc"     "stats"     "graphics"  "grDevices" "utils"     "datasets" 
#> [7] "methods"   "base"
clusterExport(cl, c("D"))

ans <- MCMC(
  fun, initial = c(mu=1, sigma=1), nsteps = 2e3,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0),
  burnin = 0,
  thin   = 1,
  # Running multicore with my own cluster
  multicore = TRUE,
  cl        = cl,
  nchains   = 2
)
#> Warning: While using multiple chains, a single initial point has been passed via
#> `initial`: c(1, 1). The values will be recycled. Ideally you would want to start
#> each chain from different locations.

clusterEvalQ(cl, head(last_("priorval")))
#> [[1]]
#>                mu     sigma
#> [1,] NA -1.418939 -2.918939
#> [2,] NA -1.418939 -2.918939
#> [3,]  1 -1.373670 -2.662039
#> [4,]  2 -1.426267 -3.076379
#> [5,]  3 -1.304141 -2.593316
#> [6,]  4 -1.275321 -2.866264
#> 
#> [[2]]
#>                mu     sigma
#> [1,] NA -1.418939 -2.918939
#> [2,] NA -1.418939 -2.918939
#> [3,]  1 -1.193565 -2.714025
#> [4,]  2 -1.194003 -2.840829
#> [5,]  3 -1.126817 -2.769653
#> [6,]  4 -1.095577 -2.511297
stopCluster(cl)

Created on 2021-06-24 by the reprex package (v2.0.0)

Things become fuzzy when nchains > 1

This is closely related to the previous question. Again, this could be hacked by adding an nchain variable to the stored data, or simply by keeping track of when the next i == 1 shows up, so that you can tell when the next chain started.
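
For instance, the chain ids could be recovered after the fact along these lines (a sketch; it assumes everything accumulated into a single matrix with the step number in the first column, as above, and that i == 1 was recorded for each chain):

pv <- last_("priorval")
# Rows where the step counter restarts at 1 mark the start of each chain;
# rows before the first restart (the NA warm-up calls) get chain id 0
starts   <- which(!is.na(pv[, 1]) & pv[, 1] == 1)
chain_id <- findInterval(seq_len(nrow(pv)), starts)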

Flush things in LAST_MCMC

That is a great suggestion, I can certainly add a function last_flush() to do exactly that.

HIH

@dmi3kno (Author) commented Jun 24, 2021

Hi, thank you for the detailed answers.

Sample size not matching

Still, you can find out where in the run you are by accessing the calling environment with parent.frame() (a hack :P). There you should be able to find a variable named i, the step number:

I think this could be made slightly more accessible, i.e. without requiring the parent-frame hack. Since you are saving some samples in LAST_MCMC anyway, could that list contain a draw_index element, which could hold precisely that: two NAs followed by the draw number? You could think of this element of LAST_MCMC as the set of identifying columns used in {posterior}: c(.chain, .iteration, .draw). Perhaps it should be a matrix with these three columns (to mimic the structure of the coda-like object created by Stan).

If you run the chain using thin = 1 and burnin = 0, you should get the following:
nrow(last_("priorval"))
#> [1] 2002
nrow(ans)
#> [1] 2000
THIS IS SOMETHING I HAVE FIXED IN THE DEVELOPMENT VERSION.

If those two initial samples are for some reason important, I have no objections. I just want to be able to merge the generated variables with the posterior draws easily (and indexing via c(.chain, .iteration, .draw) can help me do that).

LAST_MCMC only works with multicore = FALSE

Perhaps, just like the mcmc.list object, I could return a sort of list for each created element. What do you think?

Again, the more the package resembles the "senior brothers" Stan and JAGS, the better. Mimicking the structure of mcmc.list is a great idea.

Alternatively, you can always create your own cluster object and recover the LAST_MCMC object on your own, e.g.

This whole "make your own cluster" trick is a thing from the past. Similar to {rstan} and other senior packages, all the user has to do is reserve cores, the rest should be done more or less automatically (especially now that Henrik has streamlined multicore experience so much with the future package).

Things become fuzzy when nchains > 1

Again, this could be hacked by adding an nchain variable to the stored data, or simply by keeping track of when the next i == 1 shows up, so that you can tell when the next chain started.

And my proposal would cover the same: keep the "chain id" next to the "draw id".

Flush things in LAST_MCMC

I can certainly add a function last_flush() to do exactly that.

Not sure I like the name, but OK. Why can't you flush automatically every time fmcmc::MCMC() is called?

@dmi3kno (Author) commented Jun 28, 2021

I think we need a couple of non-pure functions (rough sketch after this list):

  • add_to_mcmc_output(x, name), which should take the calculated value and tuck it under LAST_MCMC$user_output as one of the sub-elements. Needless to say, every element of that list should be indexed (preferably as a data frame with a c(.chain, .iteration, .draw) identifier), otherwise the different outputs may not be mergeable afterwards. This function can be called many times, each time adding a new output element, and it does all the rbind magic behind the scenes.
  • retrieve_mcmc_output(names = NULL), with an optional vector of names, for retrieving the outputs from LAST_MCMC$user_output and adding them as variables to the coda-compatible output object.
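
Roughly, something like this (a sketch only; both function names and the user_output slot are suggestions, not existing fmcmc API):

add_to_mcmc_output <- function(x, name) {
  # Appends the calculated value under LAST_MCMC$user_output[[name]],
  # doing the rbind magic behind the scenes
  LAST_MCMC$user_output[[name]] <- rbind(LAST_MCMC$user_output[[name]], x)
  invisible(x)
}

retrieve_mcmc_output <- function(names = NULL) {
  # Returns all (or the named subset of) the stored user outputs
  out <- LAST_MCMC$user_output
  if (!is.null(names)) out <- out[names]
  out
}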

@gvegayon (Member)

Still haven't figured out the right names, but I think this is close to being closed (pun intended). See this new vignette, which will be included in version 0.5-0: https://uscbiostats.github.io/fmcmc/articles/advanced-features.html#saving-information-as-it-happens-1

@dmi3kno (Author) commented Aug 8, 2022

OK. I created a user function which I find useful and which I would love to see in fmcmc:

#' Add userdata to the fmcmc output
#'
#' Combines the list of data frames produced by fmcmc::get_userdata()
#' with the mcmc.list produced by fmcmc::MCMC(), ensuring that the
#' chains and iterations match.
#'
#' @param x mcmc.list to add userdata to. Note that the userdata
#' is taken from the environment (whatever fmcmc::get_userdata()
#' returns).
#'
#' @return Combined mcmc.list.
#' @export
#' @importFrom coda mcpar mcmc as.mcmc.list
#' @examples
add_userdata <- function(x) {
  ud <- fmcmc::get_userdata()
  stopifnot("Number of chains does not match" =
              (length(x) == length(ud)))
  res <- mapply(function(x, y) {
    # Iteration numbers recorded in the chain's row names
    iters_y <- as.numeric(rownames(x))
    min_iters_y <- min(iters_y)
    max_iters_y <- max(iters_y)
    mcpar_x <- coda::mcpar(x)
    mcpar_y <- c(min_iters_y, max_iters_y, 1)
    stopifnot("Iterations do not match" =
                isTRUE(all.equal(mcpar_x, mcpar_y)))
    # as.matrix() on a data frame ignores a dimnames argument,
    # so set the row names explicitly
    my <- as.matrix(y)
    rownames(my) <- iters_y
    coda::mcmc(cbind(x, my), start = min_iters_y, end = max_iters_y)
  }, x = x, y = ud, SIMPLIFY = FALSE)
  coda::as.mcmc.list(res)
}

As the documentation says, this function merges the userdata and the results from sampling into a single object. What do you think?
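
Usage would look roughly like this (a sketch; it assumes the log-density records values through fmcmc::set_userdata(), as in the new vignette, and the cv variable is just an example):

fun <- function(x) {
  # One row of userdata recorded per iteration
  set_userdata(cv = x[2] / x[1])
  sum(dnorm(D, x[1], x[2], log = TRUE))
}

ans <- MCMC(
  fun, initial = c(mu = 1, sigma = 1), nsteps = 2e3, nchains = 2,
  kernel = kernel_normal_reflective(scale = .1, ub = 10, lb = 0)
)

full <- add_userdata(ans) # mcmc.list with the userdata columns appended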

@gvegayon (Member) commented Aug 9, 2022

Thanks for the suggestion! I am not entirely sure about the name, but I think we could include it in the next version. Would you be able to make a pull request with this, including a test? LMK if you need help with it.
