Optimize subsetting in genlight objects #48

Merged: 11 commits, Apr 20, 2015

@zkamvar (Collaborator) commented Apr 20, 2015

Subsetting genlight objects had a bottleneck when subsetting the underlying SNPbin objects. For objects with 1 million SNPs, it took about a second to subset 10 SNPs from the object (and ~5 seconds to subset 999,999 SNPs). For bootstrapping or sliding-window analyses, that would be painfully slow.

This was because the SNPbin object was being rebuilt for every subset. I added an internal function that subsets the raw vector directly: it converts the vector to bits, subsets the bits, and then uses packBits() to pack them back into a raw vector.
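
As a rough illustration of the bit-level approach described above, here is a minimal base R sketch. The helper name `subset_packed_bits` and its arguments are hypothetical and are not the internal function added in this PR. It also assumes plain binary SNPs with no missing data, which is simpler than what a SNPbin object stores.

```r
# Hypothetical helper (not the adegenet internal): subset bit-packed binary
# SNPs without unpacking them into a full integer vector and rebuilding.
subset_packed_bits <- function(packed, n_loc, i) {
  # rawToBits() expands each byte into 8 raw values (00 or 01),
  # least-significant bit first; drop the padding bits past n_loc
  bits <- rawToBits(packed)[seq_len(n_loc)]
  kept <- bits[i]
  # packBits(type = "raw") needs a multiple of 8 bits, so pad with zeros
  pad <- (8 - length(kept) %% 8) %% 8
  packBits(c(kept, raw(pad)), type = "raw")
}

# Toy data: 10 binary SNPs packed into 2 bytes
snps <- c(1L, 0L, 0L, 1L, 1L, 0L, 1L, 0L, 1L, 1L)
pad <- (8 - length(snps) %% 8) %% 8
packed <- packBits(as.raw(c(snps, integer(pad))), type = "raw")

idx <- c(2, 4, 9, 10)
sub <- subset_packed_bits(packed, length(snps), idx)

# Unpack the result to check it against ordinary vector subsetting
unpacked <- as.integer(rawToBits(sub))[seq_along(idx)]
identical(unpacked, snps[idx])
```

The caller has to track the number of loci separately, because the padding bits in the last byte are indistinguishable from real zeros.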

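I also removed the creation of a new genlight object in favor of simply subsetting the slots. This way, classes that inherit from genlight keep their own class when subset.

The speedup grows with the number of loci: judging by the median timings below, it is roughly 8x at 100,000 and 1,000,000 loci, about 3x at 10,000 loci, and just under 2x at 1,000 loci.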

For benchmarking, microbenchmark was used with 4 data sets, each containing 50 samples with 1% missing data (based on the example in the genlight documentation):

| data | nLoc |
|------|-----------|
| x | 1,000 |
| y | 10,000 |
| z | 100,000 |
| zz | 1,000,000 |

50 random loci were used for subsetting:

> dput(the_loci)
c(68L, 366L, 773L, 609L, 196L, 180L, 420L, 675L, 125L, 138L, 
49L, 606L, 13L, 979L, 544L, 576L, 106L, 108L, 854L, 731L, 439L, 
161L, 137L, 990L, 66L, 516L, 922L, 247L, 25L, 868L, 151L, 447L, 
58L, 313L, 874L, 167L, 523L, 801L, 299L, 132L, 776L, 870L, 585L, 
110L, 240L, 381L, 114L, 403L, 97L, 197L)

Subset and reconstruct (old method):

Unit: milliseconds
           expr        min         lq       mean     median         uq        max neval
  x[, the_loci]   12.69427   14.01981   15.37758   14.94392   16.15417   23.03208   100
  y[, the_loci]   25.41475   28.27175   30.67594   29.52679   31.66637   48.77981   100
  z[, the_loci]  187.44007  202.46392  209.83472  207.75600  216.46626  257.66354   100
 zz[, the_loci] 1774.08317 1815.80018 1854.20525 1838.57798 1882.47164 2085.26544   100

Subset directly:

Unit: milliseconds
           expr        min         lq       mean     median         uq       max neval
  x[, the_loci]   7.173006   7.653467   8.361506   8.133443   8.884091  11.70068   100
  y[, the_loci]   7.970834   8.411625   9.207977   8.938945   9.796977  13.50356   100
  z[, the_loci]  22.982876  25.815514  30.168462  27.029380  28.965536  61.29198   100
 zz[, the_loci] 184.400632 216.074103 221.699854 222.185526 226.223172 380.18579   100
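
To make the comparison between the two tables concrete, the speedup per data set can be computed from the median columns. This is only arithmetic on the numbers reported above; the variable names are mine.

```r
# Median timings in milliseconds, copied from the two tables above
old_median <- c(x = 14.94392, y = 29.52679, z = 207.75600, zz = 1838.57798)
new_median <- c(x = 8.133443, y = 8.938945, z = 27.029380, zz = 222.185526)

# Ratio of old to new median time for each data set
speedup <- round(old_median / new_median, 1)
speedup
#   x   y   z  zz
# 1.8 3.3 7.7 8.3
```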

I have added for loops that should preserve the class of the object passed to the "[" method.

The "[" method for genlight and SNPbin objects would create a new object every time they were subset. This breaks inheritance: everything returned from that method is an object of class "genlight" or "SNPbin", so a new class that extends genlight or SNPbin cannot use callNextMethod() in its own "[" method, because the object that comes back no longer has the class of the one that went in.

Preserving the class, combined with the direct subsetting of the SNPbin object, speeds up subsetting by up to roughly an order of magnitude.
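
The inheritance problem can be sketched with two toy S4 classes. `Parent`, `Child`, and their slots are hypothetical stand-ins for genlight and a class that extends it; this is not the adegenet code, only an illustration of why modifying slots in place keeps callNextMethod() usable.

```r
library(methods)

# Toy classes: Child extends Parent and adds a slot parallel to `values`
setClass("Parent", representation(values = "numeric"))
setClass("Child", contains = "Parent", representation(tags = "character"))

# Subset by modifying the slot of the incoming object, so that whatever
# class came in (Parent or a subclass) is also the class that goes out
setMethod("[", "Parent", function(x, i, j, ..., drop = TRUE) {
  x@values <- x@values[i]
  x
})

# The subclass handles its own slot and delegates the rest
setMethod("[", "Child", function(x, i, j, ..., drop = TRUE) {
  tags <- x@tags[i]
  x <- callNextMethod()  # still a Child, because Parent's method kept the class
  x@tags <- tags
  x
})

child <- new("Child", values = c(1, 2, 3), tags = c("a", "b", "c"))
sub <- child[2:3]
is(sub, "Child")

# For contrast, rebuilding from scratch always yields a plain Parent,
# which has no `tags` slot for the subclass method to fill in
rebuilt <- new("Parent", values = child@values[2:3])
is(rebuilt, "Child")
```

If the `Parent` method returned `new("Parent", ...)` instead, the assignment to `x@tags` in the `Child` method would fail, which is the failure mode described above.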

thibautjombart added a commit that referenced this pull request Apr 20, 2015
Optimize subsetting in genlight objects
@thibautjombart merged commit e9a7ae0 into master Apr 20, 2015

@thibautjombart (Owner) commented:

I think this is properly awesome. Well done and big thanks! =D

@thibautjombart deleted the genlight-optimize branch April 20, 2015 18:58
@zkamvar mentioned this pull request Jul 31, 2015