---
output:
  md_document:
    variant: markdown_github
---
[Commits](https://github.com/rmflight/waitcopy/commits/master)
[CRAN](https://cran.r-project.org/package=waitcopy)
[Travis CI](https://travis-ci.org/rmflight/waitcopy)
[Codecov](https://codecov.io/github/rmflight/waitcopy?branch=master)
[License: MIT](http://choosealicense.com/licenses/mit/)
[ORCID](http://orcid.org/0000-0001-8141-7788)
# waitcopy
Copy files only during particular times of day, and record metadata about each copied file.
## Useful Definitions
- **filepath**: the full path to a given file, e.g. `/home/user/file1.png`
- **basename**: the file name portion of a filepath, e.g. `file1.png` in `/home/user/file1.png`
- **dirname**: the directory portion of a filepath, e.g. `/home/user` in `/home/user/file1.png`
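These terms correspond to base R's `basename()` and `dirname()` functions. A quick illustration, using the example path from the definitions above:

```r
# Example path from the definitions above (it does not need to exist on disk)
filepath <- "/home/user/file1.png"

file_base <- basename(filepath)  # the file name portion
file_dir <- dirname(filepath)    # the directory portion

file_base
file_dir

# file.path() reassembles the two pieces into the original filepath
identical(file.path(file_dir, file_base), filepath)
```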
## Description
Provides the function `wait_copy`, which:
- copies files only during a set time interval
- waits a specified amount of time between file copies
- creates a JSON file with some limited metadata about each file
- removes special characters from the file name
### Why??
Imagine someone you work with has a hard drive that you need data from, but that
hard drive is only accessible via the network, mounted via SAMBA. There are
potentially duplicate files with the same base name, files with the same file name
that are actually different, and the file path provides some meta-information about the
sample. In addition, the file names have odd characters in them (spaces, colons, etc.)
that make them a pain to work with from the command line on Linux, so you'd prefer
if they weren't there.
The files are small, so even copying over the network is fast, but if you copy
too many too quickly during the day, the people who are local to this shared
resource will complain that you are hitting it too often.
### The Solution
So ideally, you want to copy the files only during certain hours, wait a little
bit between each copy operation, check for duplicates (via names and md5 hashing),
strip the file name of special characters, and note where the file originated.
`waitcopy` provides these capabilities.
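As a rough illustration of the name-cleaning step, the sketch below replaces anything other than letters, digits, dots, underscores and hyphens in the basename. The function `clean_name_sketch`, its rules, and the example path are assumptions made for this example; this is not the cleaning function that `waitcopy` ships with.

```r
# Hypothetical helper, not part of waitcopy: replace each run of "odd"
# characters in the basename with a single underscore, leaving the
# directory portion of the path untouched.
clean_name_sketch <- function(filepath) {
  cleaned <- gsub("[^A-Za-z0-9._-]+", "_", basename(filepath))
  file.path(dirname(filepath), cleaned)
}

# Made-up path with spaces and a colon in the file name
clean_name_sketch("/mnt/share/run 1/sample A: rep 2.raw")
```

Only the file name changes here; the directory keeps its space, because the cleaning is restricted to the basename.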
## How it Works
Given a file to copy and a location to copy it to, `wait_copy` does the following:
* strip special characters and spaces from the file name (if asked)
* copy the file to a temp location
* save the original path to the file it was being copied from
* calculate the MD5 hash of the file
* check master json data of MD5 hashes and file names
* if MD5 is new, add the new file name, original file path, and MD5 to the master
json file
* if MD5 is not new, **add** the original file path to the matching file entry
in the master json file.
* if a matching file name is found but with a different MD5 hash, append the
first 8 digits of the MD5 hash to the file name, and add it to the json data
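The MD5 bookkeeping in the list above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the function `register_file`, the entry layout (`file`, `md5`, `original_paths`), and the choice to put the hash fragment before the file extension are all hypothetical, and the package's own implementation and JSON layout may differ. In practice the hash itself could be computed with `tools::md5sum()`.

```r
# Hypothetical sketch, not waitcopy's implementation.
# `meta` is a list of entries, each a list(file, md5, original_paths).
register_file <- function(meta, file_name, md5, original_path) {
  known_md5 <- vapply(meta, function(x) x$md5, character(1))
  known_names <- vapply(meta, function(x) x$file, character(1))

  if (md5 %in% known_md5) {
    # Same content seen before: only record the additional original path
    idx <- match(md5, known_md5)
    meta[[idx]]$original_paths <- c(meta[[idx]]$original_paths, original_path)
    return(meta)
  }

  if (file_name %in% known_names) {
    # Same name, different content: append the first 8 characters of the MD5
    # (assumption: the fragment goes before the extension)
    ext <- tools::file_ext(file_name)
    stem <- tools::file_path_sans_ext(file_name)
    file_name <- paste0(stem, "_", substr(md5, 1, 8),
                        if (nzchar(ext)) paste0(".", ext) else "")
  }

  # New content: add a new entry
  c(meta, list(list(file = file_name, md5 = md5,
                    original_paths = original_path)))
}

# Made-up hashes and paths, for illustration only
meta <- list()
meta <- register_file(meta, "a.raw", "0123456789abcdef0123456789abcdef", "/src/x/a.raw")
meta <- register_file(meta, "a.raw", "0123456789abcdef0123456789abcdef", "/src/y/a.raw")
meta <- register_file(meta, "a.raw", "fedcba9876543210fedcba9876543210", "/src/z/a.raw")

vapply(meta, function(x) x$file, character(1))
meta[[1]]$original_paths
```

The second call is a true duplicate, so it only adds a path to the first entry; the third call has the same name but a different hash, so it is stored under a modified file name.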
## Installation
```r
# install.packages("devtools")
devtools::install_github("MoseleyBioinformaticsLab/waitcopy")
```
## Example Usage
### Worked Example
Let's imagine that we want to copy a set of files during a set time, and one of
the files is duplicated (but we don't know that before we start).
```{r get_files}
library(waitcopy)
library(lubridate)
# files are in the extdata directory of waitcopy
testloc <- system.file("extdata", "set1", package = "waitcopy")
file_list <- dir(testloc, pattern = "raw", full.names = TRUE)
file_list
```
We will set up a **temp** directory to copy them to:
```{r temp_directory}
temp_dir <- tempfile(pattern = "copyfiles-test-1")
dir.create(temp_dir)
dir(temp_dir)
```
And then let's set the copy window to start **20s** from now.
```{r set_copy_time}
curr_time <- waitcopy:::get_now_in_local()
curr_today <- waitcopy:::get_today_in_local()
now_minus_today <- difftime(curr_time, curr_today, units = "s")
beg_time <- seconds(now_minus_today + 20)
end_time <- seconds(now_minus_today + 3600)
beg_time
end_time
```
And now let's copy! Because the start time is in the **near** future, we set the `wait_check`
parameter to a low value of only 10 seconds. Normally this is set to 30 minutes (1800 seconds),
on the assumption that the copy window starts in the far future.
```{r copy_it}
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"),
          start_time = beg_time, stop_time = end_time, wait_check = 10, pause_file = 0)
```
Let's look at how many files were copied and the contents of the JSON metadata.
```{r copied_files_json}
copied_files <- dir(temp_dir)
copied_files
meta_json <- jsonlite::fromJSON(file.path(temp_dir, "all_meta.json"), simplifyVector = FALSE)
jsonlite::toJSON(meta_json, auto_unbox = TRUE, pretty = TRUE)
```
### Alternatives (not run)
#### Default
If you just want to copy between 8pm and 6am every day, for however long it takes:
```r
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"))
```
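A window such as 8pm to 6am wraps past midnight, so a simple "start <= now <= stop" test is not enough. The sketch below shows one way to test whether a time of day falls inside such a window. The helper `in_copy_window` is hypothetical and is not part of `waitcopy`; whether the package handles the wrap-around this way is an assumption.

```r
# Hypothetical helper, not part of waitcopy: is `now` inside the copy window?
# All arguments are seconds since local midnight.
in_copy_window <- function(now, start, stop) {
  if (start <= stop) {
    # Window within a single day, e.g. 10:00 to 13:00
    now >= start && now <= stop
  } else {
    # Window wraps past midnight, e.g. 20:00 to 06:00
    now >= start || now <= stop
  }
}

hour <- 3600
in_copy_window(23 * hour, start = 20 * hour, stop = 6 * hour)   # late evening
in_copy_window(12 * hour, start = 20 * hour, stop = 6 * hour)   # midday
in_copy_window(11 * hour, start = 10 * hour, stop = 13 * hour)  # same-day window
```

The first and third calls fall inside their windows; the midday call does not.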
#### Change Start or End Time in Hours
What if the best time to copy files was from 10am until 1pm (13:00)??
```r
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"),
          start_time = hours(10), stop_time = hours(13))
```
#### Don't Set a Time Limit
```r
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"),
          time_limit = FALSE)
```
#### Stop Checking The Time
If you want the function to give up after checking the time a limited number of
times, change the `n_check` argument. To stop checking after 3 tries:
```r
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"),
          n_check = 3)
```
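To make the interplay between `wait_check` and `n_check` concrete, here is a minimal sketch of a polling loop that checks a condition, sleeps between checks, and gives up after a fixed number of tries. The function `wait_until_ready`, its `ready` argument, and the `Inf` default for `n_check` are assumptions for illustration only; this is not how `waitcopy` is necessarily implemented.

```r
# Hypothetical sketch, not waitcopy's implementation: call `ready()` up to
# `n_check` times, sleeping `wait_check` seconds between attempts.
wait_until_ready <- function(ready, wait_check = 1800, n_check = Inf) {
  n_tried <- 0
  while (n_tried < n_check) {
    if (ready()) {
      return(TRUE)
    }
    n_tried <- n_tried + 1
    if (n_tried < n_check) {
      Sys.sleep(wait_check)
    }
  }
  FALSE  # gave up after n_check unsuccessful checks
}

# A stand-in check that never succeeds, so the loop gives up after 3 tries;
# wait_check = 0 keeps the example from actually sleeping
gave_up <- wait_until_ready(function() FALSE, wait_check = 0, n_check = 3)
gave_up
```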
#### Use Different Renaming Function
An alternative way to handle nasty file names would be to use the `make.names` function:
```r
wait_copy(file_list, temp_dir, json_meta = file.path(temp_dir, "all_meta.json"),
          clean_file_fun = make.names)
```
Note that this is only applied to the `basename` of the file path, i.e. the actual
file-name after removing the path in front of the file-name.
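To see what `make.names` does to a file name when it is applied to the basename only, the following sketch mimics that step outside of `wait_copy`. The paths are made up for illustration.

```r
# Hypothetical file paths, for illustration only
paths <- c("/data/run 1/sample A:1.raw", "/data/run 1/2nd sample.raw")

# Apply make.names() to the basename only, then reattach the directory
cleaned <- file.path(dirname(paths), make.names(basename(paths)))
cleaned
```

Note that `make.names` replaces invalid characters with `.` and prepends `X` to names that start with a digit, so the results can differ noticeably from the original file names.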