Report figshare data version in notebook output #44

Open
awm33 opened this issue Jan 11, 2017 · 5 comments

Comments

@awm33
Member

awm33 commented Jan 11, 2017

Track which version of the data (figshare version or cancer-data commit SHA) was used for a classifier.

@awm33 awm33 added the task label Jan 16, 2017
@awm33
Member Author

awm33 commented Jan 22, 2017

@dhimmel and/or @cgreene Do you have any thoughts on the best way to handle versioning within the data loader? We currently use master from the cancer-data repo and a hard-coded URL for the mutation data stored on figshare.

@dhimmel
Member

dhimmel commented Jan 23, 2017

The data from figshare has versions. Therefore, it'd be ideal to specify a version and then download everything we need corresponding to that version. This is what machine-learning currently does.

The cognoml package has code for retrieving figshare data (currently using a class, previously via functions). We were hoping to move figshare logic to cognoma/figshare (although we never decided what exactly to do).

What data is needed from GitHub? We should just upload that to figshare so it can use the common versioning system.
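A minimal sketch of looking up the files for one pinned figshare version, as suggested above. It assumes the public figshare API v2 exposes article versions at `/articles/{article_id}/versions/{version}` and that the response carries a `files` list with `name` and `download_url` fields; both should be checked against the figshare API documentation. The function names and the article ID in the usage note are hypothetical and are not taken from cognoml.

```python
import json
from urllib.request import urlopen

# Assumed base URL of the public figshare API (v2).
FIGSHARE_API = "https://api.figshare.com/v2"


def version_endpoint(article_id, version):
    """Build the API URL describing one specific version of a figshare article."""
    return f"{FIGSHARE_API}/articles/{article_id}/versions/{version}"


def file_urls(article_json):
    """Map file name -> download URL from a version's JSON description."""
    return {f["name"]: f["download_url"] for f in article_json.get("files", [])}


def fetch_file_urls(article_id, version):
    """Query figshare (network access required) for the files of a pinned version."""
    with urlopen(version_endpoint(article_id, version)) as response:
        return file_urls(json.load(response))
```

Calling `fetch_file_urls(article_id, 6)` with the real article ID would then return the download URLs for version 6 only, regardless of which version is the latest.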

@cgreene
Member

cgreene commented Apr 3, 2018

@dhimmel / @gwaygenomics : is this complete? I think that the ml-workers appear to be downloading whatever the latest figshare version is. Does that get reported to the users?

@dhimmel dhimmel changed the title from "Version Control Data" to "Report figshare data version in notebook output" Apr 3, 2018
@dhimmel
Member

dhimmel commented Apr 3, 2018

> Does that get reported to the users?

I don't think it does. I am not sure whether core-service is even storing which figshare version is loaded. The source code for downloading the data is:

disease_path = os.path.join(options['path'], 'diseases.tsv')
if not os.path.exists(disease_path):
    disease_url = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/download/diseases.tsv'
    urlretrieve(disease_url, disease_path)
sample_path = os.path.join(options['path'], 'samples.tsv')
if not os.path.exists(sample_path):
    sample_url = 'https://raw.githubusercontent.com/cognoma/cancer-data/master/data/samples.tsv'
    urlretrieve(sample_url, sample_path)
gene_path = os.path.join(options['path'], 'genes.tsv')
if not os.path.exists(gene_path):
    gene_url = 'https://raw.githubusercontent.com/cognoma/genes/master/data/genes.tsv'
    urlretrieve(gene_url, gene_path)
mutation_path = os.path.join(options['path'], 'mutation-matrix.tsv.bz2')
if not os.path.exists(mutation_path):
    mutation_url = 'https://ndownloader.figshare.com/files/7311953'
    urlretrieve(mutation_url, mutation_path)

So it's using the latest from GitHub for all files besides mutation-matrix.tsv.bz2, for which it hard-codes a link to the figshare version 6 file. Instead, I think we should have core-service specify a specific figshare version and GitHub commit. If we note the core-service commit hash in the output notebook, would this be sufficient to look up the data versions? (This assumes the database is reloaded whenever the core-service codebase gets updated... not sure.)
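One way the GitHub side of that pinning could look, as a sketch only: swap the `master` segment of each raw.githubusercontent.com URL for a commit hash, then download whatever is missing. The helper names `pin_raw_url` and `download_missing` and the commit value in the usage note are hypothetical, not part of core-service.

```python
import os
from urllib.request import urlretrieve


def pin_raw_url(url, commit, branch="master"):
    """Replace the branch segment of a raw.githubusercontent.com URL with a commit."""
    marker = f"/{branch}/"
    if marker not in url:
        raise ValueError(f"no {marker!r} segment in {url}")
    # Replace only the first occurrence so file paths containing the branch
    # name further along are left untouched.
    return url.replace(marker, f"/{commit}/", 1)


def download_missing(sources, path, retrieve=urlretrieve):
    """Download each file not already present; return the names that were fetched."""
    fetched = []
    for name, url in sources.items():
        target = os.path.join(path, name)
        if not os.path.exists(target):
            retrieve(url, target)
            fetched.append(name)
    return fetched
```

For example, `pin_raw_url(sample_url, "<commit-sha>")` yields a URL that keeps pointing at the same samples.tsv even after master moves on. The `sources` mapping passed to `download_missing` then doubles as a record of exactly which URLs were used.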

BTW the figshare data has been downloaded 41,471 times. Either people are using this a lot or (more likely) we're requesting it an insane number of times 😸

@cgreene
Member

cgreene commented May 20, 2018

If we could reconstruct those URLs and put them into the notebook template, that's probably the best way. We'd like users to be able to reproduce the analysis and I think this key ingredient (the exact right data) is missing.
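A sketch of how those URLs might be written into the notebook, assuming the template is assembled as notebook-format (nbformat v4) JSON, where a markdown cell is a dict with `cell_type`, `metadata`, and `source` keys. The function name `provenance_cell` and the "Data sources" heading are hypothetical.

```python
def provenance_cell(sources):
    """Return an nbformat-v4-style markdown cell listing each data file and its URL."""
    lines = ["## Data sources\n", "\n"]
    # Sort by file name so the cell is stable across runs.
    lines += [f"- `{name}`: {url}\n" for name, url in sorted(sources.items())]
    return {"cell_type": "markdown", "metadata": {}, "source": lines}
```

Appending the returned dict to the notebook's `cells` list would give users the exact URLs needed to re-download the data behind a classifier.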
