db_create bug at the 22nd file #14

Closed
JanPerret opened this issue Dec 10, 2018 · 15 comments · Fixed by #24
@JanPerret

Hi,

I've got an issue with the db_create function and I can't figure out whether it comes from my computer or from the function itself. I want to create one database with all GenBank data concerning animals, and a second one with the plants and fungi. I tried both, but each time I run db_create, execution stops with an error message when the 22nd file is added to the database. Here is the code I ran for the animals database, and the end of the error message I got (I saved only the end of it the last time I ran it). I've run the function several times with different min_length and max_length values and it ended the same way each time.

I got the same message when I tried to create the plants and fungi database (program stops when adding the 22nd file), but I tried to create a DB for only the rodents or only the "other mammals" and it went well without any problem, and the resulting database was functional for both (even if there are more than 22 files in these parts of GenBank).

Does anyone have an idea where this problem could come from? I'm not very experienced, so the reason may be obvious, but I don't see it.

Thank you for your help !

Executed program

library(restez)

# create a new folder for the database
restez_path <- file.path(getwd(), 'GenBank_animals')

# create files
dir.create(restez_path)

# set the restez path
restez_path_set(restez_path)

# download all parts of GenBank concerning animals
db_download(preselection = '9 11 12 14 15')

# create the database
restez_connect()
db_create(min_length = 0, max_length = 5000)

Error message

(end of the output only)
... 'gbinv15.seq.gz' (20/351)
... 'gbinv16.seq.gz' (21/351)
... 'gbinv17.seq.gz' (22/351)
: Invalid argument
simpleError in gsub("\n", " ", res, fixed = TRUE): Unable to execute statement 'ROLLBACK'.
Server says 'SQLException:sql.trans:2DM30!ROLLBACK: not allowed in auto commit mode'.>
 
Error in gsub("\n", " ", res, fixed = TRUE) : 
  input string 1 is invalid UTF-8
Session info

- Session info -------------------------------------------------------------------------------------

setting  value
version  R version 3.4.3 (2017-11-30)
os       Windows >= 8 x64
system   x86_64, mingw32
ui       RStudio
language (EN)
collate  French_France.1252
ctype    French_France.1252
tz       Europe/Paris
date     2018-12-10

- Packages ----------------------------------------------------------------------------------------
    package * version date lib source
    assertthat 0.2.0 2017-04-11 [1] CRAN (R 3.4.4)
    backports 1.1.2 2017-12-13 [1] CRAN (R 3.4.4)
    base64enc 0.1-3 2015-07-28 [1] CRAN (R 3.4.4)
    bindr 0.1.1 2018-03-13 [1] CRAN (R 3.4.4)
    bindrcpp 0.2.2 2018-03-29 [1] CRAN (R 3.4.4)
    bitops 1.0-6 2013-08-17 [1] CRAN (R 3.4.4)
    callr 3.0.0 2018-08-24 [1] CRAN (R 3.4.4)
    cli 1.0.1 2018-09-25 [1] CRAN (R 3.4.4)
    codetools 0.2-15 2016-10-05 [1] CRAN (R 3.4.4)
    crayon 1.3.4 2017-09-16 [1] CRAN (R 3.4.4)
    DBI 1.0.0 2018-05-02 [1] CRAN (R 3.4.4)
    desc 1.2.0 2018-05-01 [1] CRAN (R 3.4.4)
    devtools 2.0.1 2018-10-26 [1] CRAN (R 3.4.4)
    digest 0.6.18 2018-10-10 [1] CRAN (R 3.4.4)
    dplyr 0.7.8 2018-11-10 [1] CRAN (R 3.4.4)
    fs 1.2.6 2018-08-23 [1] CRAN (R 3.4.4)
    glue 1.3.0 2018-07-17 [1] CRAN (R 3.4.4)
    magrittr 1.5 2014-11-22 [1] CRAN (R 3.4.4)
    memoise 1.1.0 2017-04-21 [1] CRAN (R 3.4.4)
    MonetDBLite 0.6.0 2018-07-27 [1] CRAN (R 3.4.4)
    pillar 1.3.0 2018-07-14 [1] CRAN (R 3.4.4)
    pkgbuild 1.0.2 2018-10-16 [1] CRAN (R 3.4.4)
    pkgconfig 2.0.2 2018-08-16 [1] CRAN (R 3.4.4)
    pkgload 1.0.2 2018-10-29 [1] CRAN (R 3.4.4)
    prettyunits 1.0.2 2015-07-13 [1] CRAN (R 3.4.4)
    processx 3.2.1 2018-12-05 [1] CRAN (R 3.4.4)
    ps 1.2.1 2018-11-06 [1] CRAN (R 3.4.4)
    purrr 0.2.5 2018-05-29 [1] CRAN (R 3.4.4)
    R6 2.3.0 2018-10-04 [1] CRAN (R 3.4.4)
    Rcpp 1.0.0 2018-11-07 [1] CRAN (R 3.4.4)
    RCurl 1.95-4.11 2018-07-15 [1] CRAN (R 3.4.4)
    remotes 2.0.2 2018-10-30 [1] CRAN (R 3.4.4)
    restez * 1.0.0 2018-11-26 [1] CRAN (R 3.4.4)
    rlang 0.3.0.1 2018-10-25 [1] CRAN (R 3.4.4)
    rprojroot 1.3-2 2018-01-03 [1] CRAN (R 3.4.4)
    rstudioapi 0.8 2018-10-02 [1] CRAN (R 3.4.4)
    sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 3.4.4)
    tibble 1.4.2 2018-01-22 [1] CRAN (R 3.4.4)
    tidyselect 0.2.5 2018-10-11 [1] CRAN (R 3.4.4)
    usethis 1.4.0 2018-08-14 [1] CRAN (R 3.4.4)
    withr 2.1.2 2018-03-15 [1] CRAN (R 3.4.4)

[1] C:/Users/Perret/Documents/R/win-library/3.4
[2] C:/Program Files/R/R-3.4.3/library

@DomBennett
Contributor

Hi Jan,

Thanks for raising this issue. I shall investigate the problem.

It is likely to do with the format of a record in gbinv17.seq.gz. Do you know at which file your plant and fungi run failed? I'm guessing it was a fungal file.

Dom

@DomBennett DomBennett self-assigned this Dec 10, 2018
@DomBennett DomBennett added the bug Something isn't working label Dec 10, 2018
@JanPerret
Author

Hi Dom,

Thanks for your quick answer. For the plant and fungi run it failed at file gbpln118.seq.gz.

... 'gbpln114.seq.gz' (18/232)
... 'gbpln115.seq.gz' (19/232)
... 'gbpln116.seq.gz' (20/232)
... 'gbpln117.seq.gz' (21/232)
... 'gbpln118.seq.gz' (22/232)
: Invalid argument
simpleError in gsub("\n", " ", res, fixed = TRUE): Unable to execute statement 'ROLLBACK'.
Server says 'SQLException:sql.trans:2DM30!ROLLBACK: not allowed in auto commit mode'.>
Error in gsub("\n", " ", res, fixed = TRUE) : 
  input string 1 is invalid UTF-8

But actually I just realised that another time the program stopped at the 24th position, so it is very likely that the problem has nothing to do with the 22nd position but rather with the format of some files, as you say.

... 'gbpln117.seq.gz' (21/232)
... 'gbpln118.seq.gz' (22/232)
... 'gbpln119.seq.gz' (23/232)
... 'gbpln12.seq.gz' (24/232)
: Invalid argument
simpleError in gsub("\n", " ", res, fixed = TRUE): Unable to execute statement 'ROLLBACK'.
Server says 'SQLException:sql.trans:2DM30!ROLLBACK: not allowed in auto commit mode'.>
Error in gsub("\n", " ", res, fixed = TRUE) : 
  input string 1 is invalid UTF-8

If I remember correctly, the only differences between the two executions were the computer (office vs. home computer) and the value of the max_length argument. For the first one it was max_length = 10000 and for the second max_length = 6000. Maybe a badly formatted record in gbpln118.seq.gz was excluded by lowering max_length from 10000 to 6000?

Jan
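
To make the max_length hypothesis concrete, here is a minimal sketch of how a length filter could let a run get past a file once a long record is excluded. The record names, the lengths, and the helper `keep_by_length()` are hypothetical, and inclusive bounds are an assumption; this is not restez's actual filtering code.

```r
# Hypothetical record lengths (in bases) for three made-up records.
seq_lengths <- c(rec_a = 1200, rec_b = 5500, rec_c = 8000)

# Hypothetical helper: keep records whose length lies within
# [min_length, max_length] (inclusive bounds are an assumption).
keep_by_length <- function(lengths, min_length = 0, max_length = Inf) {
  lengths[lengths >= min_length & lengths <= max_length]
}

# With max_length = 10000 all three records are kept ...
kept_10000 <- names(keep_by_length(seq_lengths, max_length = 10000))

# ... while max_length = 6000 drops rec_c, so a problem located in
# rec_c would no longer be reached.
kept_6000 <- names(keep_by_length(seq_lengths, max_length = 6000))
```

If the failing record were longer than 6000 but shorter than 10000 bases, the second run would skip it and fail later, which would match the two logs above.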

@JanPerret
Author

I just tried to re-run the DB creation after updating R (from version 3.4.3 to 3.5.1) to see if it would change anything. It stopped again at the same file as last time for the plants and fungi, but I don't get exactly the same error message. I don't know if this helps.

> db_create(min_length = 0, max_length = 6000)
Adding 232 file(s) to the database ...
... 'gbpln1.seq.gz' (1/232)
... 'gbpln10.seq.gz' (2/232)
... 'gbpln100.seq.gz' (3/232)
... 'gbpln101.seq.gz' (4/232)
... 'gbpln102.seq.gz' (5/232)
... 'gbpln103.seq.gz' (6/232)
... 'gbpln104.seq.gz' (7/232)
... 'gbpln105.seq.gz' (8/232)
... 'gbpln106.seq.gz' (9/232)
... 'gbpln107.seq.gz' (10/232)
... 'gbpln108.seq.gz' (11/232)
... 'gbpln109.seq.gz' (12/232)
... 'gbpln11.seq.gz' (13/232)
... 'gbpln110.seq.gz' (14/232)
... 'gbpln111.seq.gz' (15/232)
... 'gbpln112.seq.gz' (16/232)
... 'gbpln113.seq.gz' (17/232)
... 'gbpln114.seq.gz' (18/232)
... 'gbpln115.seq.gz' (19/232)
... 'gbpln116.seq.gz' (20/232)
... 'gbpln117.seq.gz' (21/232)
... 'gbpln118.seq.gz' (22/232)
... 'gbpln119.seq.gz' (23/232)
... 'gbpln12.seq.gz' (24/232)
: Invalid argument
simpleError in gsub("\n", " ", res, fixed = TRUE): input string 1 is invalid UTF-8>
Error in gsub("\n", " ", res, fixed = TRUE) : 
  input string 1 is invalid UTF-8

Jan
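
The listing above shows the files being processed in lexicographic rather than numeric order ('gbpln119.seq.gz' comes before 'gbpln12.seq.gz'), so the position at which a run stops does not correspond to the file number. A small sketch of the difference, using file names taken from the logs; the regular expression used to pull out the number is only an illustration:

```r
# File names as they appear in the logs above.
files <- c("gbpln12.seq.gz", "gbpln2.seq.gz", "gbpln119.seq.gz", "gbpln1.seq.gz")

# Lexicographic order; method = "radix" sorts as in the C locale,
# independently of the session's collation settings.
lexicographic <- sort(files, method = "radix")

# Numeric order, obtained by extracting the file number first.
file_number <- as.integer(sub("^gbpln([0-9]+)\\.seq\\.gz$", "\\1", files))
numeric_order <- files[order(file_number)]
```

Here `lexicographic` places gbpln119 before gbpln12, as in the output above, whereas `numeric_order` runs 1, 2, 12, 119.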

@DomBennett
Contributor

Hi Jan,

Thanks for your detailed reporting. It looks like the problem is a little more complicated than I first thought. The error is coming from the MonetDBLite package and may be related to Windows: those problem files (gbinv17, gbpln118 and gbpln12) work fine for me (I use Unix).

Possibilities:

  • Non-UTF-8 encoding of record types (GenBank records seem to be latin1)
  • Corrupted download

To rule out the second option, you could try running db_create() again but deleting the problem files from your downloads/ (gbinv17, gbpln118 and gbpln12). Just navigate to your restez path and delete them there and try re-running the downloads. Does the problem re-occur for a different file?

Dom
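
For the encoding possibility, a minimal sketch of how invalid UTF-8 could be detected and re-encoded in base R. The two example strings are made up, and treating the input as latin1 is the assumption from the list above, not something confirmed for these files.

```r
# Build a made-up string containing a latin1 byte (0xfc, "u" with umlaut),
# which is not valid UTF-8 on its own.
latin1_bytes <- as.raw(c(0x4d, 0xfc, 0x6c, 0x6c, 0x65, 0x72))
records <- c("LOCUS       AB000001", rawToChar(latin1_bytes))

# validUTF8() flags which elements are valid UTF-8.
is_valid <- validUTF8(records)

# Re-encode only the invalid elements, assuming they are latin1.
repaired <- records
repaired[!is_valid] <- iconv(records[!is_valid], from = "latin1", to = "UTF-8")

# The same kind of call that fails in the error message now has valid input.
cleaned <- gsub("\n", " ", repaired, fixed = TRUE)
```

Running `validUTF8()` on the text read from a failing file would show whether this possibility applies at all before any re-encoding is attempted.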

@JanPerret
Author

Hi Dom,

I downloaded the gbinv17 file again and ran the db_create function; it led to the same result:

db_create(min_length = 0, max_length = 4000)
Adding 351 file(s) to the database ...
... 'gbinv1.seq.gz' (1/351)
... 'gbinv10.seq.gz' (2/351)
... 'gbinv100.seq.gz' (3/351)
... 'gbinv101.seq.gz' (4/351)
... 'gbinv102.seq.gz' (5/351)
... 'gbinv103.seq.gz' (6/351)
... 'gbinv104.seq.gz' (7/351)
... 'gbinv105.seq.gz' (8/351)
... 'gbinv106.seq.gz' (9/351)
... 'gbinv107.seq.gz' (10/351)
... 'gbinv108.seq.gz' (11/351)
... 'gbinv109.seq.gz' (12/351)
... 'gbinv11.seq.gz' (13/351)
... 'gbinv110.seq.gz' (14/351)
... 'gbinv111.seq.gz' (15/351)
... 'gbinv112.seq.gz' (16/351)
... 'gbinv12.seq.gz' (17/351)
... 'gbinv13.seq.gz' (18/351)
... 'gbinv14.seq.gz' (19/351)
... 'gbinv15.seq.gz' (20/351)
... 'gbinv16.seq.gz' (21/351)
... 'gbinv17.seq.gz' (22/351)
: Invalid argument
simpleError in gsub("\n", " ", res, fixed = TRUE): Unable to execute statement 'ROLLBACK'.
Server says 'SQLException:sql.trans:2DM30!ROLLBACK: not allowed in auto commit mode'.>
 
Error in gsub("\n", " ", res, fixed = TRUE) : 
  input string 1 is invalid UTF-8

Jan

@DomBennett
Contributor

Sorry Jan, I meant to ask you to delete gbinv17, gbpln118 and gbpln12 entirely from the downloads/ folder and re-run db_create without re-downloading them. I want to know whether the problem will occur with other files as well.

Thanks for your efforts,
Dom
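
A sketch of the test described above, with one variation: it moves the suspect files aside instead of deleting them, so they can be restored later. The helper `quarantine_files()` and both directory arguments are hypothetical; the real location of the downloads/ folder inside the restez path is not shown in this thread and has to be supplied by the user. The demo runs in a temporary directory only.

```r
# Hypothetical helper: move the named files out of downloads_dir into
# quarantine_dir and return the names of the files that were moved.
quarantine_files <- function(downloads_dir, quarantine_dir, file_names) {
  dir.create(quarantine_dir, showWarnings = FALSE, recursive = TRUE)
  from <- file.path(downloads_dir, file_names)
  present <- file.exists(from)
  file.rename(from[present], file.path(quarantine_dir, file_names[present]))
  invisible(file_names[present])
}

# Demo on empty placeholder files in a temporary directory.
demo_dir <- file.path(tempdir(), "downloads_demo")
held_dir <- file.path(tempdir(), "quarantine_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("gbinv16.seq.gz", "gbinv17.seq.gz")))

# gbpln118.seq.gz is absent from the demo folder and is simply skipped.
moved <- quarantine_files(demo_dir, held_dir,
                          c("gbinv17.seq.gz", "gbpln118.seq.gz"))
```

After moving the files, db_create would be re-run without calling db_download again.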

@DomBennett
Contributor

Hi Jan,

On running your script, I was able to recreate your error on a Windows computer. It doesn't seem to be due to anything specific on your end. I will try and find out what is causing it in Windows.

Dom

@JanPerret
Author

Hi Dom,

My bad, I read your response too quickly yesterday and had no time to write back before now. This morning I deleted the file gbinv17 and ran db_create; it stopped at the file gbinv18. So I deleted this file as well and ran the function again; it stopped at gbinv19.

If it works for you on Unix, I think I am going to try it from an Ubuntu live session. I'll tell you if it works. Thanks a lot for your help, and let me know if you find a solution to run it on Windows!

Jan

@capoony

capoony commented Aug 10, 2021

Dear @DomBennett and @jeroen ,

thanks for this nice piece of code! Unfortunately, I ran into problems similar to those described above. More specifically, several seq.gz files appear to contain entries that cannot be parsed correctly.

For example, I isolated the file "gbbct568.seq.gz" from the "Bacterial" GenBank database (v. 244). When creating the SQL database, I get the following error:

> library(restez)
> restez_path_set("/media/scratch/GenBank/faulty")
> db_create()
Adding 1 file(s) to the database ...
... 'gbbct568.seq.gz' (1/1)
Error in .local(conn, name, value, ...) :
  Failed to insert data: SQLException:assert:M0M29!INSERT INTO: PRIMARY KEY constraint 'nucleotide.nucleotide_accession_pkey' violated
Fehler: callr subprocess failed: Failed to insert data: SQLException:assert:M0M29!INSERT INTO: PRIMARY KEY constraint 'nucleotide.nucleotide_accession_pkey' violated
Type .Last.error.trace to see where the error occurred
>

In addition, I receive the following error when creating the database for "gbpln193.seq.gz" from the "plant" database.

> library(restez)
> restez_path_set("/media/scratch/GenBank/faulty")
> db_create()
Adding 1 file(s) to the database ...
... 'gbpln193.seq.gz' (1/1)
Error in gsub(pattern = "([0-9]|\\s+|\n|/)", replacement = "", x = seqrecpart) :
  'Calloc' konnte keinen Speicher (18446744071562067968 von 1 Bytes) zuteilen
Fehler: callr subprocess failed: 'Calloc' konnte keinen Speicher (18446744071562067968 von 1 Bytes) zuteilen
Type .Last.error.trace to see where the error occurred
>

Notably, the same error is raised when using the subsequent files (*94.seq.gz, *95.seq.gz, etc.) as well. I realized that all of the problematic files from the plant database consist of only a single entry. Could this explain the issue?

I would very much appreciate your help with this. We need to do thousands of database searches and your approach would be an enormous time-saver!

Thanks a lot,

Martin
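
The PRIMARY KEY error for 'gbbct568.seq.gz' is what a database reports when the same key is inserted twice. Whether that file really yields a repeated accession is not established in this thread; the sketch below only shows how duplicated accessions could be spotted in a table of parsed records before insertion. The data frame and its column names are made up.

```r
# Made-up parsed records; the accession AB000001 appears twice.
records <- data.frame(
  accession = c("AB000001", "AB000002", "AB000001"),
  length = c(1200L, 5500L, 1200L),
  stringsAsFactors = FALSE
)

# Accessions that occur more than once.
dup_accessions <- unique(records$accession[duplicated(records$accession)])

# Keep only the first occurrence of each accession.
deduplicated <- records[!duplicated(records$accession), ]
```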

@joelnitta
Contributor

joelnitta commented May 9, 2022

I am having the same problem with the plants DB.

With this script:

library(restez)

# Specify location to download GenBank database
restez_path_set("/data_raw")

# Download plant database
db_download(preselection = 1)

# Create database
restez_connect()
db_create(min_length = 10)  

I get this error (only showing end of output):

... 'gbpln209.seq.gz' (123/882)                                                                                                                                                                                          
... 'gbpln21.seq.gz' (124/882)                                                                                                                                                                                           
... 'gbpln210.seq.gz' (125/882)                                                                                                                                                                                          
Error in gsub(pattern = "([0-9]|\\s+|\n|/)", replacement = "", x = seqrecpart) :                                                                                                                                         
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)                                                                                                                                                   
Error: callr subprocess failed: 'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)                                                                                                                     
Type .Last.error.trace to see where the error occurred 

The error message is in a different language from @capoony above, but appears to be the same, down to the long string of digits after allocate memory.
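
A side note on that long string of digits: it equals 2^64 - 2^31, which is the value -2^31 takes when it is reinterpreted as an unsigned 64-bit size. This is only an inference from the arithmetic, not a confirmed diagnosis, but it would be consistent with a length overflowing a signed 32-bit integer somewhere below gsub(). The check can be done in R, because 2^64 - 2^31 is exactly representable as a double:

```r
# The size reported by 'Calloc' in both error messages, kept as a string
# so that no precision is lost when comparing.
reported <- "18446744071562067968"

# -2^31 reinterpreted as an unsigned 64-bit integer is 2^64 - 2^31.
wrapped <- 2^64 - 2^31

# Format without scientific notation and compare with the reported value.
wrapped_text <- sprintf("%.0f", wrapped)
matches <- identical(wrapped_text, reported)
```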

Session info:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] restez_1.0.2

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1      tidyselect_1.1.0    R6_2.4.1           
 [4] rlang_1.0.2         fansi_0.4.1         blob_1.2.1         
 [7] dplyr_1.0.7         tools_4.1.1         utf8_1.1.4         
[10] cli_3.3.0           DBI_1.1.0           ellipsis_0.3.2     
[13] assertthat_0.2.1    conflicted_1.0.4    digest_0.6.25      
[16] tibble_3.1.5        lifecycle_1.0.1     crayon_1.3.4       
[19] processx_3.5.2      purrr_0.3.4         BiocManager_1.30.16
[22] callr_3.7.0         vctrs_0.3.8         ps_1.7.0           
[25] codetools_0.2-16    curl_4.3            glue_1.6.2         
[28] MonetDBLite_0.6.1   compiler_4.1.1      pillar_1.6.3       
[31] generics_0.0.2      renv_0.14.0         pkgconfig_2.0.3    

jeroen added a commit that referenced this issue May 9, 2022
to address memory problems in #14
@jeroen
Member

jeroen commented May 9, 2022

I'm not an expert but it looks like it leaks memory. I have tried to fix the downloading mechanism. Could you try to install the new version from https://ropensci.r-universe.dev/ui#package:restez and try again?

@maelle
Member

maelle commented May 9, 2022

For info, we're looking for a new maintainer / a new maintainer team for this package, see #23. If you read this and are a restez user, feel free to volunteer, we'd be happy to help.

@joelnitta
Contributor

@jeroen Thanks! I tried with the new version of restez (v1.0.3).

Unrelated to this problem, the download took much longer (~3 days instead of overnight; I didn't log times but it was an obvious difference).

I still get the same error, at the same file:

... 'gbpln21.seq.gz' (124/881)
... 'gbpln210.seq.gz' (125/881)
Error in gsub(pattern = "([0-9]|\\s+|\n|/)", replacement = "", x = seqrecpart) : 
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

This time I captured the output of traceback() for you:

> traceback()
12: gsub(pattern = "([0-9]|\\s+|\n|/)", replacement = "", x = seqrecpart)
11: extract_clean_sequence(x)
10: FUN(X[[i]], ...)
9: vapply(X = seqrecparts, FUN = function(x) {
       nchar(extract_clean_sequence(x))
   }, FUN.VALUE = integer(1))
8: unname(vapply(X = seqrecparts, FUN = function(x) {
       nchar(extract_clean_sequence(x))
   }, FUN.VALUE = integer(1)))
7: gb_df_generate(records = records, min_length = min_length, max_length = max_length)
6: gb_build(dpth = dpth, seq_files = seq_files, max_length = max_length, 
       min_length = min_length)
5: db_create(min_length = 10) at setup_gb.R#11
4: eval(ei, envir)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("R/setup_gb.R")

It seems there is something in that file (gbpln210.seq.gz) that gsub() is dying on.

Session info:

> sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base

other attached packages:
[1] restez_1.0.3

loaded via a namespace (and not attached):
 [1] magrittr_2.0.1      tidyselect_1.1.0    R6_2.4.1
 [4] rlang_1.0.2         fansi_0.4.1         blob_1.2.1
 [7] dplyr_1.0.7         tools_4.1.1         sessioninfo_1.2.2
[10] utf8_1.1.4          cli_3.3.0           DBI_1.1.0
[13] ellipsis_0.3.2      assertthat_0.2.1    conflicted_1.0.4
[16] digest_0.6.25       tibble_3.1.5        lifecycle_1.0.1
[19] crayon_1.3.4        BiocManager_1.30.16 purrr_0.3.4
[22] vctrs_0.3.8         codetools_0.2-16    curl_4.3
[25] glue_1.6.2          MonetDBLite_0.6.1   compiler_4.1.1
[28] pillar_1.6.3        generics_0.0.2      renv_0.14.0
[31] pkgconfig_2.0.3

joelnitta added a commit to joelnitta/restez that referenced this issue May 16, 2022
@joelnitta
Contributor

joelnitta commented May 19, 2022

It turns out it was a very long DNA sequence in a single file causing the problem: gbpln210.seq.gz is a single sequence with 1198270666 characters. This causes the error in gsub().

I've issued a PR with a fix: #24
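
The actual change is in #24 and is not reproduced here. Purely as an illustration of the idea, the following sketch guards the gsub() call seen in the traceback by checking the size of the input first. The function name `clean_sequence_guarded()` and the `max_chars` threshold are hypothetical; only the regular expression is taken from the traceback.

```r
# Hypothetical guard around the sequence-cleaning step: skip records
# whose raw text exceeds max_chars bytes instead of passing them to gsub().
clean_sequence_guarded <- function(seqrecpart, max_chars = 1e8) {
  if (nchar(seqrecpart, type = "bytes") > max_chars) {
    return(NA_character_)
  }
  # Same pattern as in the traceback: strip digits, whitespace,
  # newlines and slashes, leaving only the sequence letters.
  gsub(pattern = "([0-9]|\\s+|\n|/)", replacement = "", x = seqrecpart)
}

# A short made-up record is cleaned as usual ...
small <- clean_sequence_guarded("        1 acgt acgt\n       11 ttga\n//")

# ... while a record above the threshold is skipped and returned as NA.
skipped <- clean_sequence_guarded("acgt", max_chars = 2)
```

A caller would then drop the NA entries, or records could be filtered by length before this step so that oversized sequences never reach gsub().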

@jeroen
Member

jeroen commented May 19, 2022

Thanks! I think this looks OK.
The original maintainer is not active, but I have given you maintainer privileges so you can merge the PR when it is ready!
