event: '2015 Brainhack at OHBM'
title: 'Sharing Data in the Cloud'
author:
- initials: DO surname: O'Connor firstname: David email: david.oconnor@childmind.org affiliation: aff1, aff2 corref: aff1
- initials: DJC surname: Clark firstname: Daniel J. email: daniel.clark@childmind.org affiliation: aff2
- initials: MPM surname: Milham firstname: Michael P. email: michael.milham@childmind.org affiliation: aff1, aff2
- initials: RCC surname: Craddock firstname: R. Cameron email: ccraddock@nki.rfmh.org affiliation: aff1, aff2
affiliations:
- id: aff1 orgname: 'Center for Biomedical Imaging and Neuromodulation, Nathan Kline Institute for Psychiatric Research' street: 140 Old Orangeburg Rd postcode: 10962 city: Orangeburg state: New York country: USA
- id: aff2 orgname: 'Center for the Developing Brain, Child Mind Institute' street: 445 Park Ave postcode: 10022 city: New York state: New York country: USA
url: https://github.com/DaveOC90/INDI-Organization-Scripts
coi: None
acknow: The authors would like to thank the organizers and attendees of the OHBM Brainhack in Hawaii. This project was made possible by the S3 public bucket generously provided by Amazon Web Services.
contrib: DO performed quality control and uploaded the data. DJC wrote code to interact with AWS, and preprocessed and uploaded data. MPM and RCC led the data collection and sharing projects. All of the authors contributed to writing the project report.
bibliography: brainhack-report
gigascience-ref: \href{http://gigadb.org/dataset/100233}{doi:10.5524/100233} ...
#Introduction
Cloud computing resources, such as Amazon Web Services\footnote{\url{http://aws.amazon.com}} (AWS), provide pay-as-you-go access to high-performance computing resources and dependable data storage for large-scale analyses of neuroimaging data \cite{Clark2015}. These resources are particularly attractive for researchers at small universities and in developing countries who lack the means to maintain their own high-performance computing systems. The objective of this project is to upload data from the 1000 Functional Connectomes Project (FCP) \cite{biswal2010} and International Neuroimaging Data-sharing Initiative (INDI) \cite{mennes2013} grass-roots data sharing initiatives into a public S3 bucket that has been generously provided by AWS. Hosting the data in the bucket makes them more quickly accessible for AWS-based analyses, and also improves the speed and availability of access for analyses performed outside of the cloud. To begin with, we focused on the following collections:
- \begin{sloppypar} The \emph{Autism Brain Imaging Data Exchange} \emph{(ABIDE)} consists of structural MRI and resting state functional MRI from 1113 individuals (164 F, 948 M, 6-64 years old, 539 with autism spectrum disorders, 573 typical controls) aggregated from 20 different studies \cite{dimartino2014}. \end{sloppypar}
- The ADHD-200 contains structural MRI and resting state functional MRI from 973 individuals (352 F, 594 M, 7-21 years old, 362 with attention deficit hyperactivity disorder (ADHD), 585 typically developing controls) collected from 8 sites \cite{Milham2012}.
- The Consortium for Reliability and Reproducibility (CoRR) consists of 3,357 structural MRI, 5,093 resting state fMRI, 1,302 diffusion MRI, and 300 cerebral blood flow scans from 1,629 subjects (673 F, 956 M, 6-84 years old, all typical controls) acquired in a variety of test-retest designs at 35 sites \cite{zuo2014}.
- The Enhanced Nathan Kline Institute - Rockland Sample (ENKI-RS) consists of structural MRI, resting state functional MRI, diffusion MRI, cerebral blood flow, and a variety of task functional MRI scans and deep phenotyping on over 700 participants from across the lifespan and a variety of phenotypes acquired at a single site \cite{nooner2012}. The acquisition of this collection is ongoing.
- The Addiction Connectome Preprocessed Initiative (ACPI)\footnote{\url{http://fcon_1000.projects.nitrc.org/indi/ACPI/html/index.html}} consists of 216 structural MRI and 252 functional MRI scans from 192 subjects (44 F, 148 M, 18-50 years old) from three datasets generated by NIDA investigators.
#Approach
Data for the ADHD-200, ABIDE, CoRR, and Rockland Sample collections are currently downloadable from NITRC\footnote{\url{http://fcon_1000.projects.nitrc.org/}} as a series of large (>2 GB) tar files. Uploading the data involved downloading and extracting the data from these tar files, organizing the individual images into the standardized INDI format\footnote{\url{http://fcon_1000.projects.nitrc.org/indi/indi_data_contribution_guide.pdf}}, and then uploading the data to S3. To facilitate this process, we developed an S3 upload script in Python using the Boto AWS software development kit\footnote{\url{https://aws.amazon.com/sdk-for-python/}}. We also developed a download script in Python that provides basic query functionality for selecting the data to download from a spreadsheet describing the data.
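The spreadsheet-driven selection step described above can be illustrated with a minimal sketch. This is not the project's download script: the column names (`subject`, `site`, `age`, `s3_key`), the function name `select_s3_keys`, and the example rows are all hypothetical and chosen only to show how a simple query could narrow a listing down to the files a user wants.

```python
import csv
import io


def select_s3_keys(csv_text, site=None, min_age=None, max_age=None):
    """Return the S3 keys of spreadsheet rows that match the given filters.

    The column names used here ("site", "age", "s3_key") are hypothetical;
    a real spreadsheet may be laid out differently.
    """
    keys = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        age = float(row["age"])
        if site is not None and row["site"] != site:
            continue
        if min_age is not None and age < min_age:
            continue
        if max_age is not None and age > max_age:
            continue
        keys.append(row["s3_key"])
    return keys


# Hypothetical spreadsheet contents, for illustration only.
EXAMPLE_CSV = """subject,site,age,s3_key
sub001,SiteA,12.5,data/Projects/ExampleProject/RawData/SiteA/sub001/anat.nii.gz
sub002,SiteB,30.0,data/Projects/ExampleProject/RawData/SiteB/sub002/anat.nii.gz
sub003,SiteA,45.0,data/Projects/ExampleProject/RawData/SiteA/sub003/anat.nii.gz
"""

if __name__ == "__main__":
    # Select participants from SiteA who are at most 20 years old.
    for key in select_s3_keys(EXAMPLE_CSV, site="SiteA", max_age=20):
        print(key)
```

The selected keys could then be passed to whichever S3 client is used for the actual transfer.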
#Results
The entirety of the CoRR, ABIDE, ACPI, and ADHD-200 data collections and ENKI-RS data for 427 individuals were uploaded during the OHBM Hackathon event. The data are available as individual files so that they can be easily indexed by database infrastructures such as COINS \cite{landis2016}, LORIS \cite{Das2011}, and others. This also makes it easy for users to download just the data that they want. The data in the bucket can be browsed and downloaded using GUI-based S3 file transfer software such as Cyberduck\footnote{\url{http://cyberduck.org}} (see Figure 1), or using the Boto Python library\footnote{\url{https://github.com/FCP-INDI/INDI-Tools}}. One can connect to the bucket using the configuration shown in Figure 1. The data are structured as follows: bucketname/data/Projects/ProjectName/DataType. For example, you can access raw data from the ENKI-RS, as shown in Figure 1, by specifying the following path in Cyberduck: \url{https://s3.amazonaws.com/fcp-indi/data/Projects/RocklandSample/RawData}.
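As a rough illustration of the bucketname/data/Projects/ProjectName/DataType layout, the sketch below builds a key prefix and a path-style URL, and lists a few objects under a prefix. It makes several assumptions: it uses `boto3`, the current AWS SDK for Python, rather than the Boto release used for this project; it assumes the bucket permits anonymous (unsigned) listing; and the helper names `project_prefix`, `object_url`, and `list_keys` are hypothetical.

```python
BUCKET = "fcp-indi"
ENDPOINT = "https://s3.amazonaws.com"


def project_prefix(project, data_type):
    """Build the key prefix for one data type of one project."""
    return "data/Projects/{}/{}".format(project, data_type)


def object_url(key, bucket=BUCKET):
    """Build a path-style URL for a key in the bucket."""
    return "{}/{}/{}".format(ENDPOINT, bucket, key)


def list_keys(prefix, bucket=BUCKET, limit=10):
    """List up to `limit` object keys under `prefix`.

    Assumes the bucket allows anonymous access. boto3 is imported here,
    inside the function, so that the helpers above work without it.
    """
    import boto3
    from botocore import UNSIGNED
    from botocore.client import Config

    client = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
            if len(keys) >= limit:
                return keys
    return keys


if __name__ == "__main__":
    prefix = project_prefix("RocklandSample", "RawData")
    print(object_url(prefix))
    # Requires network access and boto3; uncomment to try it.
    # for key in list_keys(prefix, limit=5):
    #     print(key)
```

Keeping the listing behind a `limit` avoids paging through an entire collection when only a quick look at the directory structure is needed.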
\begin{figure}[h!] \includegraphics[width=.47\textwidth]{cyberduck_screenshot.png} \caption{\label{centfig} Connecting to the data repository.} \end{figure}
Uploading data shared through the FCP and INDI initiatives improves their accessibility for cloud-based and local computation. Future efforts for this project will include uploading the remainder of the FCP and INDI data and organizing the data in the new Brain Imaging Data Structure (BIDS) format \cite{Gorgolewski2015}.