Memory limits + enable fine-tuning for parallel execution of small batches #2
Thanks! I'll look into this and get back to you.
I think I have some useful interim notes here. What follows isn't a reprex or clear demo, just some ideas on where further code profiling/iteration in a full reprex might be most helpful. If you think I'm on the right track, I'll dive in further.

I've profiled my original issue on memory failures more carefully using a fork here (https://github.com/sheffe/forestError) that changes very little from current master. A couple of early conclusions:

- The results are definitely deterministic, so splitting batches of new predictions across cores to shrink the total size of the matrices works reasonably well at preventing memory failures. However, it's much slower to run, because we make a call to `countOOBCohabitantsTrainPar` for each test batch, and that's far more impactful to overall time cost than the decision to single-core everything.
- Even with careful batchwise iteration, I hit out-of-memory errors on problems that I consider moderately sized -- ~2M observations in a training set, 250 trees, and a pretty aggressively increased `min.node.size` to shrink the number of unique terminal nodes (`min.node.size` of 20, 50, and 100 all failed outside of small trial cases).

I'm working on a short-term project where this would be extremely handy, and I'm planning further experiments along these lines; a rough sketch of the batching idea is below.
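A minimal sketch of that batching strategy, assuming `quantForestError` from this package and `parallel::mclapply`; the helper name `predict_in_batches`, the chunk size, and the core count are illustrative choices, not anything the package provides.

```r
library(forestError)
library(parallel)

# Sketch: split X.test into chunks and run quantForestError on each chunk,
# relying on the results being deterministic once the forest is trained.
# chunk_size and n_cores are illustrative; mclapply forks, so this assumes a
# Unix-alike OS. Combining the per-chunk outputs is left to the caller, since
# it depends on which estimands are requested.
predict_in_batches <- function(forest, X.train, X.test, Y.train,
                               chunk_size = 10000, n_cores = 2) {
  idx_chunks <- split(seq_len(nrow(X.test)),
                      ceiling(seq_len(nrow(X.test)) / chunk_size))
  mclapply(idx_chunks, function(idx) {
    quantForestError(forest = forest, X.train = X.train,
                     X.test = X.test[idx, , drop = FALSE],
                     Y.train = Y.train, n.cores = 1)
  }, mc.cores = n_cores)
}
```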
Actually, this turned out to be more straightforward than I thought! The solution below gets to the same quantities, but eliminates the adjacency matrix (and the C++ that generates it) in favor of an edgelist representation of train/test that we can join (a table of tree/node/error for train, and tree/node/rowid for test). This previews a few extra tricks for speed and memory conservation, too.

I haven't written enough around the function to profile the two approaches against each other consistently, but you can try (for instance) rbind-ing a thousand noised-up copies of the training data and it still runs very quickly.

```r
library(magrittr)
library(data.table)
library(randomForest)
data(airquality)
# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]
# get number of observations and the response column index
n <- nrow(airquality)
response.col <- 1
# split data into training and test sets
train.ind <- sample(1:n, n * 0.9, replace = FALSE)
Xtrain <- airquality[train.ind, -response.col]
Ytrain <- airquality[train.ind, response.col]
Xtest <- airquality[-train.ind, -response.col]
Ytest <- airquality[-train.ind, response.col]
rf <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                 ntree = 500, keep.inbag = TRUE)
### renaming vars, as if we're inside the quantForestError function
forest <- rf
X.train <- Xtrain
X.test <- Xtest
Y.train <- Ytrain
test.pred.list <- predict(forest, X.test, nodes = TRUE)
# get terminal nodes of all observations
train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes")
test.terminal.nodes <- attr(test.pred.list, "nodes")
# get number of times each training observation appears in each tree
bag.count <- forest$inbag
# get the OOB prediction error of each training observation
oob.errors <- forest$y - forest$predicted
# get test observation predictions
attributes(test.pred.list) <- NULL
test.preds <- test.pred.list
# NA-out in-bag terminal nodes so only out-of-bag assignments survive the melt
train.terminal.nodes[bag.count != 0] <- NA

# melt the test terminal-node matrix into a long edgelist: one row per (test obs, tree)
long_test_nodes <-
  as.data.table(test.terminal.nodes)[, rowid_test := .I] %>%
  melt(id.vars = "rowid_test", measure.vars = patterns("\\d+"),
       variable.name = "tree", value.name = "terminal_node")
# melt the train terminal-node matrix into a long edgelist (OOB rows only, via
# na.rm = TRUE) and compute per-(tree, terminal node) OOB error statistics
long_train_nodes <-
  as.data.table(train.terminal.nodes) %>%
  .[, `:=`(rowid_train = .I,
           oob_error = oob.errors)] %>%
  melt(id.vars = c("oob_error", "rowid_train"),
       measure.vars = patterns("\\d+"),
       variable.name = "tree",
       value.name = "terminal_node",
       variable.factor = FALSE,
       na.rm = TRUE) %>%
  .[, .(node_mspe = mean(oob_error^2),
        node_bias = mean(oob_error),
        node_errs = list(oob_error),
        node_obsN = .N),
    by = c("tree", "terminal_node")]
# key both edgelists for a fast join on (tree, terminal_node)
setkey(long_test_nodes, tree, terminal_node)
setkey(long_train_nodes, tree, terminal_node)
# join each test (tree, node) pair to the matching train-node stats, then
# aggregate across trees for each test observation
oob_error_statistics <- long_train_nodes %>%
  .[long_test_nodes, .(rowid_test, node_mspe, node_bias, node_errs, node_obsN), by = .EACHI] %>%
  .[, .(wtd_mspe = weighted.mean(node_mspe, node_obsN, na.rm = TRUE),
        wtd_bias = weighted.mean(node_bias, node_obsN, na.rm = TRUE),
        all_errs = list(unlist(node_errs))),
    keyby = "rowid_test"]
```

(^ That last step creates the list-column `all_errs`, pooling each test row's cohabitant OOB errors across trees.)
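The `all_errs` list-column is presumably what would feed the interval and PDF/CDF estimates. A minimal sketch of how prediction intervals could be pulled from it, assuming the `test.preds` vector from above and an illustrative `alpha` of 0.05 (this sketches the idea of adding OOB error quantiles to the point prediction, not the package's own interval code):

```r
# Sketch: prediction intervals from the pooled OOB errors, in the spirit of
# adding error quantiles to the point prediction. alpha is an illustrative choice.
alpha <- 0.05
intervals <- oob_error_statistics[
  , .(rowid_test,
      pred  = test.preds[rowid_test],
      lower = test.preds[rowid_test] +
        sapply(all_errs, quantile, probs = alpha / 2, na.rm = TRUE),
      upper = test.preds[rowid_test] +
        sapply(all_errs, quantile, probs = 1 - alpha / 2, na.rm = TRUE))]
head(intervals)
```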
(Apologies for spamming this issue tonight -- coding in bursts around a screaming baby...)

I benchmarked this alternate implementation against two dataset sizes: 50 and 1,000 rbind-ed noised-up copies of `airquality` (roughly 5,500 and 110,000 rows). The median times in a microbenchmark for the 5,500-row sample scaled as we'd expect: with the large matrix memory allocation, the original implementation scales at roughly O(n^2), while the edgelist version scales roughly linearly. The real memory crunch happens with increasing counts of unique test observations and terminal nodes.

For the 110,000-row dataset, I was only able to run the 100-tree forest for both approaches. Median time for `quantForestError` was ~630 seconds; the data.table example ran in ~21 seconds. Over the 1,000-tree forest, the data.table example ran in 220 seconds, so the linear scaling trend continued.

At this point I've gone pretty far afield from your original code and don't want to send you a huge PR haring off in a new direction. Tomorrow I'll polish this idea further and push it up to a fork so you can explore at leisure if interested. I think it'll basically require introducing a new dependency to go the edgelist route -- I used `data.table` (plus `magrittr` for piping) here.

```r
library(magrittr)
library(data.table)
library(randomForest)
library(forestError)
data(airquality)
# remove observations with missing predictor variable values
airquality <- airquality[complete.cases(airquality), ]
# Add some random noise to 1000 copies of the data
set.seed(1234)
airquality_big <- lapply(1:1000, function(x){
  as.data.table(airquality) %>%
    .[, lapply(.SD, function(x) x + rnorm(length(x), 0, 4)),
      .SDcols = names(airquality)]
}) %>%
  data.table::rbindlist() %>%
  as.data.frame()
# get number of observations and the response column index
n <- nrow(airquality_big)
response.col <- 1
# split data into training and test sets
train.ind <- sample(1:n, n * 0.7, replace = FALSE)
Xtrain <- airquality_big[train.ind, -response.col]
Ytrain <- airquality_big[train.ind, response.col]
Xtest <- airquality_big[-train.ind, -response.col]
Ytest <- airquality_big[-train.ind, response.col]
rf1 <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                  ntree = 100, keep.inbag = TRUE)
rf2 <- randomForest::randomForest(Xtrain, Ytrain, nodesize = 5,
                                  ntree = 1000, keep.inbag = TRUE)
altimplementation <- function(forest, X.train, X.test, Y.train){

  ### Identical to forestError implementation ###
  test.pred.list <- predict(forest, X.test, nodes = TRUE)

  # get terminal nodes of all observations
  train.terminal.nodes <- attr(predict(forest, X.train, nodes = TRUE), "nodes")
  test.terminal.nodes <- attr(test.pred.list, "nodes")

  # get number of times each training observation appears in each tree
  bag.count <- forest$inbag

  # get the OOB prediction error of each training observation
  oob.errors <- forest$y - forest$predicted

  # get test observation predictions
  attributes(test.pred.list) <- NULL
  test.preds <- test.pred.list

  ### Start new code ###
  # NA-out the inbag set before melting to long
  train.terminal.nodes[bag.count != 0] <- NA

  long_test_nodes <-
    as.data.table(test.terminal.nodes)[, rowid_test := .I] %>%
    melt(id.vars = "rowid_test", measure.vars = patterns("\\d+"),
         variable.name = "tree", value.name = "terminal_node")

  long_train_nodes <-
    as.data.table(train.terminal.nodes) %>%
    .[, `:=`(oob_error = oob.errors)] %>%
    melt(id.vars = c("oob_error"),
         measure.vars = patterns("\\d+"),
         variable.name = "tree",
         value.name = "terminal_node",
         variable.factor = FALSE,
         na.rm = TRUE) %>%
    .[, .(node_mspe = mean(oob_error^2),
          node_bias = mean(oob_error),
          node_errs = list(oob_error),
          node_obsN = .N),
      by = c("tree", "terminal_node")]

  setkey(long_test_nodes, tree, terminal_node)
  setkey(long_train_nodes, tree, terminal_node)

  oob_error_statistics <- long_train_nodes %>%
    .[long_test_nodes, .(rowid_test, node_mspe, node_bias, node_errs, node_obsN), by = .EACHI] %>%
    .[, .(wtd_mspe = weighted.mean(node_mspe, node_obsN, na.rm = TRUE),
          wtd_bias = weighted.mean(node_bias, node_obsN, na.rm = TRUE),
          all_errs = list(unlist(node_errs))),
      keyby = "rowid_test"]

  oob_error_statistics
}
data.table::setDTthreads(threads = 1)
microbenchmark::microbenchmark(
  altimplementation(forest = rf1, X.train = Xtrain, X.test = Xtest, Y.train = Ytrain),
  quantForestError(forest = rf1, X.train = Xtrain, X.test = Xtest, Y.train = Ytrain, what = c("bias", "mspe"), n.cores = 1),
  times = 5
)
# quantForestError breaks here, so commented out
microbenchmark::microbenchmark(
  altimplementation(forest = rf2, X.train = Xtrain, X.test = Xtest, Y.train = Ytrain),
  # quantForestError(forest = rf2, X.train = Xtrain, X.test = Xtest, Y.train = Ytrain, what = c("bias", "mspe"), n.cores = 1),
  times = 5
)
```
Here's a working implementation on my fork of everything except the PDF/CDF function returns. I included some tweaks (multiple prediction-interval alphas) specific to my own use cases. As I noted, I'm not planning to send you a huge PR 'til I know what's helpful.
This looks great! I've reviewed your fork and would be happy to work on merging this with the main branch if you open a pull request and allow me to make commits to it. Of course, you would probably also want to keep a separate copy of your version, since it's specifically tailored to your needs.

At a high level, do you see any obvious opportunities for parallelization of the computations in your version? Allowing for parallelization, as well as returning the PDF and CDF functions, are two features I plan to add if possible before merging this version with the main branch.

As for the original quantForestError function, I've also noticed that the parallelization for some reason doesn't yield substantial improvements in computation time in some cases, but this seems to be a pretty nuanced issue. I haven't been able to fully characterize the situations in which parallelization helps vs. the situations in which it doesn't, but a main determinant of whether it improves the runtime appears to be the number of trees in the forest. I'm looking into it, but the issue may naturally resolve itself or otherwise become moot under your edgelist approach. I can do some testing on it.

Thanks a bunch for these improvements!
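One possible direction, sketched here under the assumption that the per-(tree, terminal node) aggregation is a dominant cost: the train-side melt/aggregate step is independent across trees, so it could be chunked over blocks of tree columns with `parallel::mclapply` and the pieces stacked with `rbindlist` before the join. The block size and core count below are illustrative, and `train.terminal.nodes` and `oob.errors` are assumed to exist as in the earlier snippets. The test-side melt and the final join could be chunked over test rows in the same way.

```r
library(data.table)
library(magrittr)
library(parallel)

# Sketch: compute per-(tree, terminal node) OOB error statistics in parallel
# over blocks of tree columns, then stack the results. Forking via mclapply is
# copy-on-write on Unix-alikes, so the large terminal-node matrix is not
# duplicated up front. Assumes train.terminal.nodes (with in-bag entries
# already set to NA) and oob.errors exist as in the snippets above.
tree_blocks <- split(colnames(train.terminal.nodes),
                     ceiling(seq_len(ncol(train.terminal.nodes)) / 100))

long_train_nodes <- rbindlist(mclapply(tree_blocks, function(cols) {
  as.data.table(train.terminal.nodes[, cols, drop = FALSE]) %>%
    .[, oob_error := oob.errors] %>%
    melt(id.vars = "oob_error",
         variable.name = "tree", value.name = "terminal_node",
         variable.factor = FALSE, na.rm = TRUE) %>%
    .[, .(node_mspe = mean(oob_error^2),
          node_bias = mean(oob_error),
          node_errs = list(oob_error),
          node_obsN = .N),
      by = .(tree, terminal_node)]
}, mc.cores = 2))

setkey(long_train_nodes, tree, terminal_node)
```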
This is one of the most exciting RF papers I've read in a while. I've homebrewed a lot of conditional bias estimators in the past for industry applications, but nothing with a real theoretical foundation, and in early trials this is radically improving on my prior work. Thank you!
A practical issue I'm running into: this method is quite memory-intensive (by necessity) when run for large `X.test` samples and/or when `forest` has many unique terminal nodes (e.g. from a high number of `X.train` samples, many trees, or both). Calls to `quantForestError` that are too large result in `Error: vector memory exhausted (limit reached?)`.

I'm concluding from your paper and some experiments with the software that `quantForestError` results are deterministic once the input `forest` is trained, such that it is safe to split large dataframes of `X.test` into pieces to prevent overloading memory. It also seems to run in linear time with the number of samples in `X.test`. This should allow for convenient parallel execution of smaller calls by batch on `X.test`. However, `ranger::predict` by default uses all available cores (and I believe `randomForestSRC` does the same), which leads to problems in parallel execution of `quantForestError`.

A simple solution would be to enable a user input for `nCores` to tune this setting manually, defaulting to the underlying model class's default behavior. I wanted to raise the issue more generally, though, because the `vector memory exhausted` errors (or the candidate solution of parceling out the `X.test` workload) aren't necessarily obvious and might merit some documentation. I'd be happy to work on a vignette or something if useful.

The actual limits will depend on hardware (I'm running this on an iMac with 64GB RAM), but below is a code example. On my machine it doesn't fail until somewhere between 1k and 50k test rows; I didn't pin down an exact breakdown point. It should be possible to find a heuristic for when `X.test` needs to be broken up given knowledge of available RAM, but that's hard to do in the wild.
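A minimal sketch of the kind of call described, reusing the noised-copies trick from the benchmark code earlier in the thread; the copy counts, forest size, and the point at which memory runs out are illustrative assumptions about the setup, not the original report's exact values.

```r
library(randomForest)
library(forestError)
library(data.table)

data(airquality)
airquality <- airquality[complete.cases(airquality), ]

# Sketch: inflate airquality with noised-up copies to get a large test set, then
# hand the whole thing to quantForestError in one call. Copy counts and forest
# size are illustrative; pushed far enough on a given machine, the single call
# eventually fails with "Error: vector memory exhausted (limit reached?)",
# whereas splitting X.test into smaller batches avoids the spike.
set.seed(2020)
noised_copies <- function(df, k, sd = 4) {
  rbindlist(lapply(seq_len(k), function(i) {
    as.data.table(lapply(df, function(x) x + rnorm(length(x), 0, sd)))
  }))
}

train_big <- as.data.frame(noised_copies(airquality, 50))    # ~5.5k rows
test_big  <- as.data.frame(noised_copies(airquality, 500))   # ~55k rows

rf <- randomForest::randomForest(train_big[, -1], train_big[, 1],
                                 nodesize = 5, ntree = 500, keep.inbag = TRUE)

# One monolithic call over the full test set
err <- quantForestError(forest = rf,
                        X.train = train_big[, -1],
                        X.test  = test_big[, -1],
                        Y.train = train_big[, 1])
```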