I have a drake workflow with a long-running step that I want to build
using the RStudio Job launcher. But without additional work, running it
as an RStudio job invalidates targets that depend on
functions sourced into the global environment. Running drake::make()
or drake::r_make() from a standard environment after a run
inside the RStudio job launcher will again invalidate old
targets, including those just built by the job launcher.
The following demonstrates the issue but requires manual intervention due to the job launcher.
Ultimately, loading dependencies into a new environment instead of the global environment allows all import dependencies to be correctly discovered and tracked by drake, as demonstrated at the end.
Start with a clean project.
if (dir.exists(".drake")) unlink(".drake", recursive = TRUE)
if (file.exists("data/mtcars.rds")) unlink("data/mtcars.rds")
The project (based on the drake mtcars example) has
a couple of long-running targets, regression1 and regression2,
each fit to the small and large data sets.
## # A tibble: 16 x 2
## target command
## <chr> <expr>
## 1 data import_data(file_in("data/mtcars.csv")) …
## 2 report knitr::knit(knitr_in("report.Rmd"), file_out("report…
## 3 small simulate(48, data) …
## 4 large simulate(64, data) …
## 5 regression1_small reg1(small) …
## 6 regression1_large reg1(large) …
## 7 regression2_large reg2(large) …
## 8 regression2_small reg2(small) …
## 9 summ_regression1_… suppressWarnings(summary(regression1_large$residuals…
## 10 summ_regression1_… suppressWarnings(summary(regression1_small$residuals…
## 11 summ_regression2_… suppressWarnings(summary(regression2_large$residuals…
## 12 summ_regression2_… suppressWarnings(summary(regression2_small$residuals…
## 13 coef_regression1_… suppressWarnings(summary(regression1_large))$coeffic…
## 14 coef_regression1_… suppressWarnings(summary(regression1_small))$coeffic…
## 15 coef_regression2_… suppressWarnings(summary(regression2_large))$coeffic…
## 16 coef_regression2_… suppressWarnings(summary(regression2_small))$coeffic…
In this example, the regression targets build quickly, but in my real-world use case they could be long-running model fits or cross-validation steps that potentially take hours. Because I’d rather not lock up my console for that long, I would like to build these targets in the background using the RStudio Job launcher. To do this, I’ve created a small script called _drake_rstudio-job.R that simply contains:
make(plan, targets = c(
  "regression1_small", "regression1_large",
  "regression2_small", "regression2_large"
))
Starting from a clean project:
## [1] "coef_regression1_large" "coef_regression1_small" "coef_regression2_large"
## [4] "coef_regression2_small" "data" "large"
## [7] "regression1_large" "regression1_small" "regression2_large"
## [10] "regression2_small" "report" "small"
## [13] "summ_regression1_large" "summ_regression1_small" "summ_regression2_large"
## [16] "summ_regression2_small"
Run _drake_rstudio-job.R in an RStudio background job.
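For reference, a job like this can also be started from the console instead of the Jobs pane. The following is only a sketch: it assumes the rstudioapi package is installed, and run_in_job() is a hypothetical helper name that is not part of this project. Outside RStudio it prints a message and launches nothing.

```r
# Hypothetical helper (not part of the original project): launch a script as
# an RStudio background job when an RStudio session is available.
run_in_job <- function(script = "_drake_rstudio-job.R") {
  in_rstudio <- requireNamespace("rstudioapi", quietly = TRUE) &&
    rstudioapi::isAvailable()
  if (!in_rstudio) {
    message("Not running inside RStudio; no job launched for ", script)
    return(invisible(FALSE))
  }
  # Use the project directory so the job finds the same .drake cache
  rstudioapi::jobRunScript(script, workingDir = getwd())
  invisible(TRUE)
}

launched <- run_in_job()
```

The return value only reports whether a job was submitted, not whether the targets built successfully.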
target data
target large
target small
target regression1_large
target regression2_large
target regression1_small
target regression2_small
And then save the cache log manually.
cl <- drake::drake_cache_log()
readr::write_tsv(cl, "cache_log/01a-after-RStudio-job.log")
hash | type | name |
a7afcfc52a180687 | target | data |
a44bdd51b1e985e3 | import | import_data |
7b6504d4c0dcceab | target | large |
503a24ae76dad431 | import | knitr::knit |
e48820305a44c4d2 | import | file data/mtcars.csv |
7a456ee58df699be | import | file report.Rmd |
db84c9f752635a13 | import | random_rows |
21935c86f12692e2 | import | reg1 |
69ade4b78f15a3f9 | import | reg2 |
e1501ed9d62b846e | target | regression1_large |
2a400716e73eac8f | target | regression1_small |
fe8ac4561279fb68 | target | regression2_large |
c0f94be64233efe8 | target | regression2_small |
9b31e33768015e24 | import | simulate |
78414659cd5e9997 | target | small |
Run _drake_rstudio-job.R in an RStudio background job again.
All targets are already up to date.
Back in the RStudio console, run r_make().
target data
target large
target small
target regression1_large
target regression2_large
target regression1_small
target regression2_small
target summ_regression1_large
target coef_regression1_large
target summ_regression2_large
target coef_regression2_large
target summ_regression1_small
target coef_regression1_small
target coef_regression2_small
target summ_regression2_small
target report
Notice that the targets we made in the RStudio job were re-made!
cl <- drake::drake_cache_log()
readr::write_tsv(cl, "cache_log/01b-after-r_make.log")
But none of the hashes changed between the RStudio background job run
and the r_make() run:
bg <- readr::read_tsv("cache_log/01a-after-RStudio-job.log")
cs <- readr::read_tsv("cache_log/01b-after-r_make.log")
both <- dplyr::full_join(bg, cs, by = c("type", "name"), suffix = c(".bg", ".console"))
both <- dplyr::select(both, type, name, dplyr::everything())
type | name | hash.bg | hash.console |
target | data | a7afcfc52a180687 | a7afcfc52a180687 |
import | import_data | a44bdd51b1e985e3 | a44bdd51b1e985e3 |
target | large | 7b6504d4c0dcceab | 7b6504d4c0dcceab |
import | knitr::knit | 503a24ae76dad431 | 503a24ae76dad431 |
import | file data/mtcars.csv | e48820305a44c4d2 | e48820305a44c4d2 |
import | file report.Rmd | 7a456ee58df699be | 7a456ee58df699be |
import | random_rows | db84c9f752635a13 | db84c9f752635a13 |
import | reg1 | 21935c86f12692e2 | 21935c86f12692e2 |
import | reg2 | 69ade4b78f15a3f9 | 69ade4b78f15a3f9 |
target | regression1_large | e1501ed9d62b846e | e1501ed9d62b846e |
target | regression1_small | 2a400716e73eac8f | 2a400716e73eac8f |
target | regression2_large | fe8ac4561279fb68 | fe8ac4561279fb68 |
target | regression2_small | c0f94be64233efe8 | c0f94be64233efe8 |
import | simulate | 9b31e33768015e24 | 9b31e33768015e24 |
target | small | 78414659cd5e9997 | 78414659cd5e9997 |
target | coef_regression1_large | NA | 4e6e4d4c0bcda263 |
target | coef_regression1_small | NA | 5ad25785a84e0cac |
target | coef_regression2_large | NA | b2662785d55b28c1 |
target | coef_regression2_small | NA | 23c66b36be56905b |
target | file report.md | NA | fa9a77f4ba63574e |
target | report | NA | f3bb623354b52fa1 |
target | summ_regression1_large | NA | 065f93613f1cb914 |
target | summ_regression1_small | NA | 2362318068e2692d |
target | summ_regression2_large | NA | c2c95619e31d7fa8 |
target | summ_regression2_small | NA | c7dc974fa996335f |
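The "no hashes changed" observation can also be checked programmatically rather than by eye. The sketch below uses base R's merge() on a handful of rows copied from the two logs above, so it runs without the log files; with the real logs, the bg and cs data frames read earlier would take the place of these hand-built ones.

```r
# A few rows copied from the two cache logs shown above
bg <- data.frame(
  type = c("target", "import", "target"),
  name = c("data", "reg1", "regression1_large"),
  hash = c("a7afcfc52a180687", "21935c86f12692e2", "e1501ed9d62b846e"),
  stringsAsFactors = FALSE
)
cs <- data.frame(
  type = c("target", "import", "target", "target"),
  name = c("data", "reg1", "regression1_large", "report"),
  hash = c("a7afcfc52a180687", "21935c86f12692e2", "e1501ed9d62b846e",
           "f3bb623354b52fa1"),
  stringsAsFactors = FALSE
)

# Keep every entry from both logs; entries missing from one side get NA
both <- merge(bg, cs, by = c("type", "name"), all = TRUE,
              suffixes = c(".bg", ".console"))

# Entries present in both logs whose hash differs
in_both <- !is.na(both$hash.bg) & !is.na(both$hash.console)
changed <- both[in_both & both$hash.bg != both$hash.console, ]

# Entries that exist only after the console run
new_only <- both[is.na(both$hash.bg), "name"]

nrow(changed)
new_only
```

An empty changed data frame means every rebuilt target ended up with the same hash it already had, which is what makes the rebuild look unnecessary.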
If we had paused between the background job and the second make step, we would have seen the following:
deps_profile("data", config)
## # A tibble: 4 x 4
## hash changed old_hash new_hash
## <chr> <lgl> <chr> <chr>
## 1 command FALSE 40c2ded1562d6fda 40c2ded1562d6fda
## 2 depend TRUE "" 4f18907a711e6c41
## 3 file_in FALSE a0775797ef1a5066 a0775797ef1a5066
## 4 file_out FALSE "" ""
deps_target("data", config)
## # A tibble: 2 x 2
## name type
## <chr> <chr>
## 1 import_data globals
## 2 data/mtcars.csv file_in
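To pick out the triggering row without reading the whole profile, one could subset on the changed column. This sketch rebuilds the profile shown above as a plain data frame so that it runs without drake or a cache; in a real session the object returned by deps_profile() would be used instead.

```r
# Values copied from the deps_profile("data", config) output above
profile <- data.frame(
  hash = c("command", "depend", "file_in", "file_out"),
  changed = c(FALSE, TRUE, FALSE, FALSE),
  old_hash = c("40c2ded1562d6fda", "", "a0775797ef1a5066", ""),
  new_hash = c("40c2ded1562d6fda", "4f18907a711e6c41", "a0775797ef1a5066", ""),
  stringsAsFactors = FALSE
)

# Rows that drake flags as changed
flagged <- profile[profile$changed, ]
flagged
```

Only the depend row is flagged, and its old hash is empty, while the command and file_in hashes are identical before and after.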
To see what I was expecting to happen, reset the project and repeat the process without using the RStudio Job Launcher.
## [1] "coef_regression1_large" "coef_regression1_small" "coef_regression2_large"
## [4] "coef_regression2_small" "data" "large"
## [7] "regression1_large" "regression1_small" "regression2_large"
## [10] "regression2_small" "report" "small"
## [13] "summ_regression1_large" "summ_regression1_small" "summ_regression2_large"
## [16] "summ_regression2_small"
I’ll use callr::rscript() as a stand-in for the RStudio job, since this is how I expected the job launcher to behave.
callr::rscript("_drake_rstudio-job.R", show = TRUE)
target data
target large
target small
target regression1_large
target regression2_large
target regression1_small
target regression2_small
target coef_regression2_small
target summ_regression1_large
target summ_regression1_small
target summ_regression2_large
target summ_regression2_small
target coef_regression1_large
target coef_regression1_small
target coef_regression2_large
target report
I think the cause of the target invalidation is related to some global environment shenanigans that happen in the RStudio job launcher.
The script ls_GlobalEnv.R lists the global environment objects after sourcing _drake.R. Running this script from the R console returns the following.
# In fresh session
callr::rscript("ls_GlobalEnv.R", show = TRUE)
[1] ".Random.seed" "config" "import_data" "plan" "r_files"
[6] "random_rows" "reg1" "reg2" "simulate"
But running the same script in a Local RStudio Job indicates that there are additional objects added to the global environment by the RStudio Job launcher.
# Must be run manually in RStudio
rsjob <- new.env()
rstudioapi::jobRunScript("ls_GlobalEnv.R", workingDir = getwd(), exportEnv = "rsjob")
[1] ".Random.seed" "config" "emitProgress"
[4] "import_data" "plan" "r_files"
[7] "random_rows" "reg1" "reg2"
[10] "simulate" "sourceWithProgress"
In particular, emitProgress and sourceWithProgress appear only in the job's global environment.
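The difference between the two listings can be computed directly with setdiff(). The vectors below are copied by hand from the two outputs above, so this is a sketch of the comparison rather than a live query of either session.

```r
# Objects listed when the script runs via callr::rscript()
console_objs <- c(".Random.seed", "config", "import_data", "plan", "r_files",
                  "random_rows", "reg1", "reg2", "simulate")

# Objects listed when the same script runs as a local RStudio job
job_objs <- c(".Random.seed", "config", "emitProgress", "import_data", "plan",
              "r_files", "random_rows", "reg1", "reg2", "simulate",
              "sourceWithProgress")

# Names present in the job's global environment but not in the console run
extra_in_job <- setdiff(job_objs, console_objs)
extra_in_job
```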
A possible solution (workaround?) is demonstrated in _drake_env.R, where a specific environment is created to hold the project’s function dependencies.
drake_env <- new.env()
r_files <- fs::dir_ls("R")
purrr::walk(r_files, sys.source, envir = drake_env)
We then need to point drake to this environment in drake_config(), etc., using the envir argument.
config <- drake_config(plan, envir = drake_env)
Having done this, the plan is now made reproducibly using RStudio Jobs.
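The isolation that sys.source() provides can be seen in a small base R sketch. It uses a throwaway temp file and a made-up function, double_it(), as stand-ins for the project's R/ files; neither name comes from the project.

```r
# Write a throwaway script defining one function (stand-in for a file in R/)
script <- tempfile(fileext = ".R")
writeLines("double_it <- function(x) 2 * x", script)

# Evaluate the script in its own environment rather than the global one
demo_env <- new.env()
sys.source(script, envir = demo_env)

# The function lives in demo_env only
ls(demo_env)
exists("double_it", envir = globalenv(), inherits = FALSE)
demo_env$double_it(21)
```

Because the functions are defined in a dedicated environment, anything a launcher adds to the global environment is kept apart from the objects that the envir argument exposes to drake.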
Run _drake_env.R in an RStudio background job
target data
target large
target small
target regression1_large
target regression2_large
target regression1_small
target regression2_small
and again save the state of the drake cache after this step (not shown, but stored here).
cl <- drake::drake_cache_log()
readr::write_tsv(cl, "cache_log/02a-after-RStudio-job.log")
Running make again will now correctly build only the unbuilt targets.
callr::r_safe(function() {
make(config = config)
}, show = TRUE)
target coef_regression2_small
target summ_regression1_large
target summ_regression1_small
target summ_regression2_large
target summ_regression2_small
target coef_regression1_large
target coef_regression1_small
target coef_regression2_large
target report
cl <- drake::drake_cache_log()
readr::write_tsv(cl, "cache_log/02b-after-make.log")