Report figshare data version in notebook output #44

awm33 · 2017-01-11T01:19:33Z

Track which version of the data (figshare version or cancer-data commit SHA) was used for a classifier.

awm33 · 2017-01-22T21:19:33Z

@dhimmel and/or @cgreene Do you have any thoughts on the best way to handle versioning within the data loader? We currently use master from the cancer-data repo and a hard-coded URL for the mutation data stored on figshare.

dhimmel · 2017-01-23T20:40:48Z

The data from figshare has versions. Therefore, it'd be ideal to specify a version and then download everything we need corresponding to that version. This is what machine-learning currently does.
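A minimal sketch of what "specify a version, then download everything for that version" could look like. It assumes the public figshare v2 API exposes article versions at `/articles/{article_id}/versions/{version}` and returns a `files` list whose entries carry `name` and `download_url`; check both against the figshare API documentation before relying on them. The function names are hypothetical, and the article ID is left as a parameter because it is not given in this thread.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the public figshare v2 API.
FIGSHARE_API = "https://api.figshare.com/v2"


def version_url(article_id, version):
    """Build the API URL for one specific version of a figshare article."""
    return f"{FIGSHARE_API}/articles/{article_id}/versions/{version}"


def file_urls(article_metadata):
    """Map file name -> download URL from an article-version JSON record."""
    return {f["name"]: f["download_url"] for f in article_metadata["files"]}


def fetch_file_urls(article_id, version):
    """Query figshare (needs network access) for the files of one version."""
    with urlopen(version_url(article_id, version)) as response:
        return file_urls(json.load(response))
```

With something like this, the loader would take the version number as configuration, and every file it downloads would come from that one version rather than from a hard-coded file URL.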

The cognoml package has code for retrieving figshare data (currently using a class, previously via functions). We were hoping to move figshare logic to cognoma/figshare (although we never decided what exactly to do).

What data is needed from GitHub? We should just upload that to figshare so it can use the common versioning system.

cgreene · 2018-04-03T17:46:13Z

@dhimmel / @gwaygenomics : is this complete? I think that the ml-workers appear to be downloading whatever the latest figshare version is. Does that get reported to the users?

dhimmel · 2018-04-03T19:30:23Z

> Does that get reported to the users?

I don't think it does. I am not sure whether core-service is even storing which figshare version is loaded. The source code for downloading the data is:

core-service/api/management/commands/acquiredata.py

Lines 21 to 39 in b9b2e4f

    
    disease_path = os.path.join(options['path'], 'diseases.tsv')
    if not os.path.exists(disease_path):
        disease_url = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/download/diseases.tsv'
        urlretrieve(disease_url, disease_path)

    sample_path = os.path.join(options['path'], 'samples.tsv')
    if not os.path.exists(sample_path):
        sample_url = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv'
        urlretrieve(sample_url, sample_path)

    gene_path = os.path.join(options['path'], 'genes.tsv')
    if not os.path.exists(gene_path):
        gene_url = 'https://raw.githubusercontent.com/cognoma/genes/master/data/genes.tsv'
        urlretrieve(gene_url, gene_path)

    mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
    if not os.path.exists(mutation_path):
        mutation_url = 'https://ndownloader.figshare.com/files/7311953'
        urlretrieve(mutation_url, mutation_path)

So it's using the latest from GitHub for all files besides mutation-matrix.tsv.bz2, for which it hard-codes a link to the figshare version 6 file. Instead, I think we should have core-service specify a specific figshare version and GitHub commit. If we note the core-service commit hash in the output notebook, would this be sufficient to look up the data versions? (Assuming the database is reloaded whenever the core-service codebase gets updated... not sure.)
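One way to pin the GitHub files is to put a commit SHA where `master` appears in the raw URLs, since raw.githubusercontent.com accepts a commit in place of a branch name. The sketch below is illustrative only: `pinned_github_url` and `build_sources` are hypothetical helpers, the commit values are supplied by the caller, and the repository paths are copied from the snippet above.

```python
RAW_GITHUB = "https://raw.githubusercontent.com"


def pinned_github_url(repo, commit, path):
    """Raw-file URL for `path` in `repo`, fixed at a specific commit."""
    return f"{RAW_GITHUB}/{repo}/{commit}/{path}"


def build_sources(cancer_data_commit, genes_commit, mutation_url):
    """Return file name -> URL for every input, with nothing tracking master."""
    return {
        "diseases.tsv": pinned_github_url(
            "cognoma/cancer-data", cancer_data_commit, "download/diseases.tsv"),
        "samples.tsv": pinned_github_url(
            "cognoma/cancer-data", cancer_data_commit, "data/samples.tsv"),
        "genes.tsv": pinned_github_url(
            "cognoma/genes", genes_commit, "data/genes.tsv"),
        # The figshare file URL is passed in; it already identifies one file.
        "mutation-matrix.tsv.bz2": mutation_url,
    }
```

Keeping the mapping in one place means the same dictionary can drive the downloads and be stored alongside the loaded data, so the versions in use are recorded rather than inferred from the core-service commit.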

BTW the figshare data has been downloaded 41,471 times. Either people are using this a lot or, more likely, we're requesting it an insane number of times 😸

cgreene · 2018-05-20T12:42:13Z

If we could reconstruct those URLs and put them into the notebook template, that's probably the best way. We'd like users to be able to reproduce the analysis and I think this key ingredient (the exact right data) is missing.
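A rough sketch of how the data URLs might be written into a notebook, assuming the notebook is assembled as nbformat v4 JSON, where a markdown cell is a dict with `cell_type`, `metadata`, and `source`. The helper name `data_sources_cell` and the example URL are hypothetical; the actual notebook template used by the ml-workers may be structured differently.

```python
def data_sources_cell(sources):
    """Build an nbformat-v4-style markdown cell listing each input file's URL.

    `sources` maps a file name to the exact URL it was downloaded from.
    """
    lines = ["## Data sources\n", "\n"]
    # Sort by file name so the cell is identical across runs.
    lines += [f"- `{name}`: {url}\n" for name, url in sorted(sources.items())]
    return {"cell_type": "markdown", "metadata": {}, "source": lines}


# Example with a placeholder URL.
cell = data_sources_cell({"genes.tsv": "https://example.org/genes.tsv"})
```

Appending such a cell to the notebook's `cells` list would give users the exact inputs needed to rerun the analysis.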

awm33 added the task label Jan 16, 2017

dhimmel changed the title ~~Version Control Data~~ Report figshare data version in notebook output Apr 3, 2018

cgreene added the backlog label May 20, 2018
