Skip to content
New issue

Have a question about this project? # for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “#”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? # to your account

[Feature Request] Better documentation of the estimated parameters for Gaussian processes #1725

Open
jr-leary7 opened this issue Jan 15, 2025 · 9 comments

Comments

@jr-leary7
Copy link

Hello,

First of all, just want to say I greatly appreciate all the work on brms; it is fantastic software and I cannot imagine how much effort has gone into maintaining and improving it. You're doing a great job !!

My request is that the documentation of how Gaussian processes work in brms be fleshed out a bit. As it currently stands, I've visited pretty much every GP thread on the MC-Stan dev forums (and posted a few questions of my own), read the documentation of the gp() function in brms, and even bought Bayesian Data Analysis 3rd Edition and read the chapter on GPs (Ch. 20 if I remember correctly). Despite all this, I'm still rather confused as to what the syntax brms uses is. Some of the parameters make sense e.g., lscale is the estimated length-scale of the GP. Some are less clear, such as the parameters named following zgp_gpxy[i] (when fitting a multivariate GP with location variables named x and y). I assume the numeric indicator i is a function of the number of basis functions approximating the GP $k$, and the number of dimensions in the multivariate GP, but I'm unsure what z denotes or how to summarize the parameter information.

For context (not expecting you to solve my current problem here, as I'm aware the MC-Stan forums are the more appropriate place to ask for help), I am working with spatial transcriptomics data and I'm trying to model gene expression (a Negative-binomially distributed count) as a function of a multivariate GP (approximated as a Hilbert space by $k = 20$ basis functions) based on location in Euclidean space, with individual GPs fit for each gene (an unordered factor). Thus, the model I'm fitting is as follows below, assuming the data are stored in a variable called expr_df:

# load libraries 
library(brms) 
library(cmdstanr)
# set up formula and fit model 
model_formula <- bf(count ~ 1 + gp(x, y, by = gene, k = 20L, cov = "exp_quad", scale = TRUE), 
                    shape ~ 1)
brms_fit <- brm(model_formula, 
                data = expr_df,
                family = brms::negbinomial(link = "log", link_shape = "log"),
                chains = 1L,
                iter = 1500L,
                warmup = 250L,
                thin = 5L,
                cores = 1L,
                threads = 4L,
                normalize = FALSE, 
                silent = 1,
                backend = "cmdstanr",
                algorithm = "meanfield",
                stan_model_args = list(stanc_options = list("O1")), 
                seed = 312)

I hope this a reasonable request (and please let me know if it isn't !); the GP additions in the latest dev version of brms are awesome and I know I and others find them quite useful. I'm happy to contribute if possible r.e. sketching out other use cases, contributing code for examples, etc. if that would be helpful.

Thanks !

Jack
University of Florida

@paul-buerkner
Copy link
Owner

I think requests to extend the doc are very reasonable. I am not sure though when I will have time for it myself at the moment. If you have concrete thoughts on what to add based on what you had to figure out yourself from other sources, I would be happy to review a PR adding some more details to the docs.

@jr-leary7
Copy link
Author

jr-leary7 commented Jan 17, 2025

that makes a lot of sense - i know maintaining & expanding software while keeping up an academic career is a huge task. i think something along the lines of even just explaining what the estimated parameters are and how to interpret them would be very helpful, and the rest could be left to the forums. for example, adding a short table to the gp() docs like so (i'm leaving the parts unclear to me marked with a ?):

parameter name notation interpretation
lscale length-scale $\ell$ distance over which points in the input space are strongly correlated
sdgp standard deviation $\sigma$ typical magnitude of variation in the GP
zgp ? $z$ ?

some very basic guidance on prior choice for relevant parameters would be useful too -- i could certainly gather some of this info from the forums and the literature and compile a summary.

@paul-buerkner
Copy link
Owner

yeah, makes sense. This may be something I can quickly add before the next release. If you happen to have time for a PR beforehand (even just the parts that you know), feel free and I can then edit in the remaining parts.

@paul-buerkner paul-buerkner added this to the brms 2.23.0 milestone Jan 17, 2025
@jr-leary7
Copy link
Author

thanks so much ! where would be your preferred place for me to put the documentation ? the first thing that comes to mind is in the roxygen doc section of the gp() function, but i can do whatever works best for you.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Jan 17, 2025 via email

@jr-leary7
Copy link
Author

as shown above i added some details on each hyperparameter to the docs; let me know if anything looks amiss and i can correct it.

@jr-leary7
Copy link
Author

just realized i made a slight mistake in the code sent in the PR, i used $\rho$ to refer to the length-scale, which i should have denoted $\ell$, in lines 82-83 of formula-gp.R. This is because I had just a few minutes prior written a line about how the MC-Stan docs denote the length-scale as $\rho$ instead.

@paul-buerkner
Copy link
Owner

paul-buerkner commented Jan 20, 2025 via email

@jr-leary7
Copy link
Author

All done !

# for free to join this conversation on GitHub. Already have an account? # to comment
Projects
None yet
Development

No branches or pull requests

2 participants