P2791™ Draft Standard for Bioinformatics Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication
Sponsor
Standards Committee
of the
IEEE Engineering in Medicine and Biology Society
Approved <Date Approved>
IEEE-SA Standards Board
Copyright © 2018 by The Institute of Electrical and Electronics Engineers, Inc.
Three Park Avenue
New York, New York 10016-5997, USA
All rights reserved.
This document is an unapproved draft of a proposed IEEE Standard. As such, this document is subject to change. USE AT YOUR OWN RISK! IEEE copyright statements SHALL NOT BE REMOVED from draft or approved IEEE standards, or modified in any way. Because this is an unapproved draft, this document must not be utilized for any conformance/compliance purposes. Permission is hereby granted for officers from each IEEE Standards Working Group or Committee to reproduce the draft document developed by that Working Group for purposes of international standardization consideration. IEEE Standards Department must be informed of the submission for consideration prior to any reproduction for international standardization consideration (stds.ipr@ieee.org). Prior to adoption of this document, in whole or in part, by another standards development organization, permission must first be obtained from the IEEE Standards Department (stds.ipr@ieee.org). When requesting permission, IEEE Standards Department will require a copy of the standard development organization's document highlighting the use of IEEE content. Other entities seeking permission to reproduce this document, in whole or in part, must also obtain permission from the IEEE Standards Department.
IEEE Standards Department
445 Hoes Lane
Piscataway, NJ 08854, USA
This standard establishes accurate and secure communication of bioinformatics protocols and data in order to facilitate bioinformatics workflow-related exchange and communication between regulatory agencies, pharmaceutical companies, bioinformatics platform providers and researchers. Accurate communication helps ensure responsibility and reproducibility, verify bioinformatics protocols, track provenance information and promote interoperability. This standard also defines the assurance program for evaluating and certifying products against those requirements.
genomics, next generation sequencing, high throughput sequencing, massively parallel sequencing, NGS, HTS, MPS, workflow, pipeline, bioinformatics, analysis, regulatory
IEEE documents are made available for use subject to important notices and legal disclaimers. These notices and disclaimers, or a reference to this page, appear in all standards and may be found under the heading “Important Notices and Disclaimers Concerning IEEE Standards Documents.” They can also be obtained on request from IEEE or viewed at http://standards.ieee.org/ipr/disclaimers.html.
IEEE Standards documents (standards, recommended practices, and guides), both full-use and trial-use, are developed within IEEE Societies and the Standards Coordinating Committees of the IEEE Standards Association (“IEEE-SA”) Standards Board. IEEE (“the Institute”) develops its standards through a consensus development process, approved by the American National Standards Institute (“ANSI”), which brings together volunteers representing varied viewpoints and interests to achieve the final product. IEEE Standards are documents developed through scientific, academic, and industry-based technical working groups. Volunteers in IEEE working groups are not necessarily members of the Institute and participate without compensation from IEEE. While IEEE administers the process and establishes rules to promote fairness in the consensus development process, IEEE does not independently evaluate, test, or verify the accuracy of any of the information or the soundness of any judgments contained in its standards. IEEE Standards do not guarantee or ensure safety, security, health, or environmental protection, or ensure against interference with or from other devices or networks. Implementers and users of IEEE Standards documents are responsible for determining and complying with all appropriate safety, security, environmental, health, and interference protection practices and all applicable laws and regulations. IEEE does not warrant or represent the accuracy or content of the material contained in its standards, and expressly disclaims all warranties (express, implied and statutory) not included in this or any other document relating to the standard, including, but not limited to, the warranties of: merchantability; fitness for a particular purpose; non-infringement; and quality, accuracy, effectiveness, currency, or completeness of material. In addition, IEEE disclaims any and all conditions relating to: results; and workmanlike effort. 
IEEE standards documents are supplied “AS IS” and “WITH ALL FAULTS.” Use of an IEEE standard is wholly voluntary. The existence of an IEEE standard does not imply that there are no other ways to produce, test, measure, purchase, market, or provide other goods and services related to the scope of the IEEE standard. Furthermore, the viewpoint expressed at the time a standard is approved and issued is subject to change brought about through developments in the state of the art and comments received from users of the standard. In publishing and making its standards available, IEEE is not suggesting or rendering professional or other services for, or on behalf of, any person or entity nor is IEEE undertaking to perform any duty owed by any other person or entity to another. Any person utilizing any IEEE Standards document, should rely upon his or her own independent judgment in the exercise of reasonable care in any given circumstances or, as appropriate, seek the advice of a competent professional in determining the appropriateness of a given IEEE standard. IN NO EVENT SHALL IEEE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO: PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE PUBLICATION, USE OF, OR RELIANCE UPON ANY STANDARD, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE AND REGARDLESS OF WHETHER SUCH DAMAGE WAS FORESEEABLE.
The IEEE consensus development process involves the review of documents in English only. In the event that an IEEE standard is translated, only the English version published by IEEE should be considered the approved IEEE standard.
A statement, written or oral, that is not processed in accordance with the IEEE-SA Standards Board Operations Manual shall not be considered or inferred to be the official position of IEEE or any of its committees and shall not be considered to be, or be relied upon as, a formal position of IEEE. At lectures, symposia, seminars, or educational courses, an individual presenting information on IEEE standards shall make it clear that his or her views should be considered the personal views of that individual rather than the formal position of IEEE.
Comments for revision of IEEE Standards documents are welcome from any interested party, regardless of membership affiliation with IEEE. However, IEEE does not provide consulting information or advice pertaining to IEEE Standards documents. Suggestions for changes in documents should be in the form of a proposed change of text, together with appropriate supporting comments. Since IEEE standards represent a consensus of concerned interests, it is important that any responses to comments and questions also receive the concurrence of a balance of interests. For this reason, IEEE and the members of its societies and Standards Coordinating Committees are not able to provide an instant response to comments or questions except in those cases where the matter has previously been addressed. For the same reason, IEEE does not respond to interpretation requests. Any person who would like to participate in revisions to an IEEE standard is welcome to join the relevant IEEE working group.
Comments on standards should be submitted to the following address:
Secretary, IEEE-SA Standards Board
445 Hoes Lane
Piscataway, NJ 08854 USA
Users of IEEE Standards documents should consult all applicable laws and regulations. Compliance with the provisions of any IEEE Standards document does not imply compliance to any applicable regulatory requirements. Implementers of the standard are responsible for observing or referring to the applicable regulatory requirements. IEEE does not, by the publication of its standards, intend to urge action that is not in compliance with applicable laws, and these documents may not be construed as doing so.
IEEE draft and approved standards are copyrighted by IEEE under U.S. and international copyright laws. They are made available by IEEE and are adopted for a wide variety of both public and private uses. These include both use, by reference, in laws and regulations, and use in private self-regulation, standardization, and the promotion of engineering practices and methods. By making these documents available for use and adoption by public authorities and private users, IEEE does not waive any rights in copyright to the documents.
Subject to payment of the appropriate fee, IEEE will grant users a limited, non-exclusive license to photocopy portions of any individual standard for company or organizational internal use or individual, non-commercial use only. To arrange for payment of licensing fees, please contact Copyright Clearance Center, Customer Service, 222 Rosewood Drive, Danvers, MA 01923 USA; +1 978 750 8400. Permission to photocopy portions of any individual standard for educational classroom use can also be obtained through the Copyright Clearance Center.
Users of IEEE Standards documents should be aware that these documents may be superseded at any time by the issuance of new editions or may be amended from time to time through the issuance of amendments, corrigenda, or errata. A current IEEE document at any point in time consists of the current edition of the document together with any amendments, corrigenda, or errata then in effect. Every IEEE standard is subjected to review at least every ten years. When a document is more than ten years old and has not undergone a revision process, it is reasonable to conclude that its contents, although still of some value, do not wholly reflect the present state of the art. Users are cautioned to check to determine that they have the latest edition of any IEEE standard. In order to determine whether a given document is the current edition and whether it has been amended through the issuance of amendments, corrigenda, or errata, visit IEEE Xplore at http://ieeexplore.ieee.org/ or contact IEEE at the address listed previously. For more information about the IEEE-SA or IEEE’s standards development process, visit the IEEE-SA Website at http://standards.ieee.org.
Errata, if any, for all IEEE standards can be accessed on the IEEE-SA Website at the following URL: http://standards.ieee.org/findstds/errata/index.html. Users are encouraged to check this URL for errata periodically.
Attention is called to the possibility that implementation of this standard may require use of subject matter covered by patent rights. By publication of this standard, no position is taken by the IEEE with respect to the existence or validity of any patent rights in connection therewith. If a patent holder or patent applicant has filed a statement of assurance via an Accepted Letter of Assurance, then the statement is listed on the IEEE-SA Website at http://standards.ieee.org/about/sasb/patcom/patents.html. Letters of Assurance may indicate whether the Submitter is willing or unwilling to grant licenses under patent rights without compensation or under reasonable rates, with reasonable terms and conditions that are demonstrably free of any unfair discrimination to applicants desiring to obtain such licenses. Essential Patent Claims may exist for which a Letter of Assurance has not been received. The IEEE is not responsible for identifying Essential Patent Claims for which a license may be required, for conducting inquiries into the legal validity or scope of Patents Claims, or determining whether any licensing terms or conditions provided in connection with submission of a Letter of Assurance, if any, or in any licensing agreements are reasonable or non-discriminatory. Users of this standard are expressly advised that determination of the validity of any patent rights, and the risk of infringement of such rights, is entirely their own responsibility. Further information may be obtained from the IEEE Standards Association.
At the time this draft standard was completed, the P2791 Working Group had the following membership:
Raja Mazumder, Chair
Vahan Simonyan, Vice Chair
Ogan Abaan, Jonas Almeida, Gil Alterovitz, Payal Banerjee, Amanda Bell, Surajit Bhattacharya, Lee Black, Ben Busby, Kristy Cloyd-Warwick, Ryan Connor, Michael Crusoe, Dennis Dean, Paul Duncan, Josep Gelpi, Carole Goble, Jeremy Goecks, Jonathan Jacobs, Robel Kahsay, Jonathon Keeney, Charles Hadley King, Jonathan LoTempio, Xeandong Meng, David Michaels, Hiroki Morizono, Rahi Navelkar, Asa Oudes, Janisha Patel, John Penn, Megan Pottersbusch, Jonathan Pryke, Stian Soiland-Reyes, Dan Taylor, Jason Travis, Paul Walsh, Jianchao Yao
The following members of the individual/entity balloting committee voted on this standard. Balloters may have voted for approval, disapproval, or abstention.
[To be supplied by IEEE]
Balloter1
Balloter2
Balloter3
Balloter4
Balloter5
Balloter6
Balloter7
Balloter8
Balloter9
When the IEEE-SA Standards Board approved this standard on <Date Approved>, it had the following membership:
[To be supplied by IEEE]
<Name>, Chair
<Name>, Vice Chair
<Name>, Past Chair
Konstantinos Karachalios, Secretary
SBMember1
SBMember2
SBMember3
SBMember4
SBMember5
SBMember6
SBMember7
SBMember8
SBMember9
*Member Emeritus
This introduction is not part of P2791/D1, Draft Standard for Bioinformatics Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication.
BioCompute standardizes bioinformatics workflows in the genomic analysis space. BioCompute addresses the tremendous variability and uncertainty in communicating bioinformatics workflows and data related to analysis as a result of high throughput sequencing (HTS). The need to resolve issues in communication was felt particularly strongly between the United States Food and Drug Administration (FDA) and the entities that submit any work to the FDA for regulatory analysis that includes an HTS component. A plan for BioCompute and the initial goals of the project were drafted in a collaboration between the George Washington University and the FDA in 2014. The project has grown since then to include publications, workshops, applied use cases, and a large community of participants and collaborators. The standard is intended:
- to be both human and machine readable,
- to be applied to genomic analysis workflows, and
- to be able to capture all details related to a workflow in such a way as to facilitate efficient communication and improve reproducibility and interoperability.
Every effort is made to accommodate any tool, platform or script, and to be adaptable to future developments in this field under a unified set of descriptions to standardize and streamline the representations of such complex bioinformatics processes.
BioCompute is a standard and a BioCompute Object (BCO) is an instance of that standard. High throughput sequencing (HTS), also referred to as next-generation sequencing (NGS) or massively parallel sequencing (MPS), has increased the pace at which we generate, compute and share genomic data in biomedical sciences. As a result, scientists, clinicians and regulators are now faced with a new data paradigm that is less portable, more complex and, most of all, poorly standardized. The BCO uses a simple JSON format to encode important information on the execution of computational pipelines, or for the creation of knowledge bases. BioCompute can be process oriented (for software pipelines) and/or product oriented (for knowledge bases). Accordingly, the Error Domain can include information to support quality assurance (QA) and/or quality control (QC). The goal of using a BCO is to streamline communication of these otherwise difficult to elucidate details between stakeholders in academia, industry and regulatory agencies. Encapsulating HTS data processing in a BCO will facilitate swift communications between the FDA and other stakeholders who seek regulatory review/approval, thereby reducing the burden and time to decision.
The US Food and Drug Administration (FDA) and George Washington University (GW) have partnered to establish a framework for community-based standards development and harmonization of HTS computations and data formats. Standardized HTS data processing descriptions and data formats will promote interoperability and simplify the verification of the bioinformatics protocols applied against data. To do this, a schema has been developed to represent instances of computational analysis as a BCO. A BCO includes:
- Information about parameters and versions of the executable programs in a pipeline
- Reference to input and output test data for verification of the pipeline
- A usability domain
- Keywords
- A list of agents involved along with other important metadata, such as their specific contribution
Knowledge of input data is intended to be captured according to existing efforts, including MIRAGE, MIAPE, and STRENDA, and to be in accordance with Minimum Information Standards. In addition to all the information captured in the BCO, the BCO itself must be independent of the execution environment, whether it is a local or a cloud-based infrastructure.
- 1. Overview
- 1.1 General
- 1.2 Scope
- 1.3 Purpose
- 2. Normative references
- 3. Definitions, acronyms, and abbreviations
- 3.1 Acronyms and abbreviations
- 4. BioCompute Standard
- 4.1 General
- Annex A (informative) Bibliography
Draft Standard for Bioinformatics Analyses Generated by High-Throughput Sequencing (HTS) to Facilitate Communication
The BioCompute standard captures relevant information from a high throughput sequencing workflow in order to enable a user to understand and interpret the workflow efficiently and with high confidence. BioCompute is a standard that is particularly well adapted to regulatory review. Pursuant to this, workflow steps and prerequisites to execute workflow steps are recorded in detail in the BioCompute standard. Information is recorded using key/value pairs in JavaScript Object Notation (JSON), adhering to the JSON Schema. Key/value pairs are organized by domains:
- The Provenance Domain - tracks metadata
- The Usability Domain - tracks what was done
- The Extension Domain - provides user-defined fields
- The Description Domain - captures a description of external resources, pipeline steps, and the relationships of I/O objects
- The Execution Domain - describes information needed for deployment, software configuration and running applications in a dependent environment
- The Parametric Domain - captures all parameters that customize a computational flow
- The Input and Output Domain - contains a list of global input and output files
- The Error Domain - describes errors, including the limits of detectability, false positives, false negatives, statistical confidence of outcomes, and description of errors (i.e., empirical or algorithmic).
This standard establishes accurate and secure communication of bioinformatics protocols in order to facilitate bioinformatics workflow related exchange and communication between regulatory agencies, pharmaceutical companies, bioinformatics platform providers and researchers. Accurate communication helps ensure responsibility, verify bioinformatics protocol, track provenance information and promote interoperability.
The standard allows for the cross-platform communication of complex computations from inception to manufacturing of medical products and services, resulting in decreased costs of drug discovery and review, and accelerated delivery of treatment to patients.
The following referenced documents are indispensable for the application of this document (i.e., they must be understood and used, so each referenced document is cited in text and its relationship to this document is explained). For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments or corrigenda) applies.
- BioCompute Schema
- JSON Schema
- description_domain
- execution_domain
- io_domain
- parametric_domain
- provenance_domain
- usability_domain
For the purposes of this document, the following terms and definitions apply. The IEEE Standards Dictionary Online should be consulted for terms not defined in this clause.
- BCO: BioCompute Object
- JSON: JavaScript Object Notation
- FHIR: Fast Healthcare Interoperability Resources
- SCM: Source Control Management
A BCO is a text file written as a JSON data structure that shall consist of all domains required by the BioCompute Schema. A BCO shall be written in JSON Schema, and therefore invokes all of the requirements of the JSON Schema. The minimum requirement to execute the standard is the fully organized BCO containing all domains in proper JSON Schema format. Pursuant to JSON Schema, the required fields are listed at the top of the BCO.
The fully organized BCO file is hosted in the schemas folder, along with related files. All the files in the schemas folder are linked together (using JSON pointers as described by the JSON Schema), being referenced by the biocomputeobject.json file. For development purposes, these files are used to track changes, but some are not required to adhere to the standard. Those required for a complete BCO are biocomputeobject.json, description_domain.json, execution_domain.json, io_domain.json, parametric_domain.json, provenance_domain.json, and usability_domain.json. The error_domain.json file is an optional domain that further describes a bioinformatics workflow, and the extension_domain is an optional domain that contains user-defined fields.
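As an informative illustration only, the required and optional domains named above can be checked on a parsed BCO with a few lines of Python. This is a minimal sketch: it assumes that each domain appears as a top-level key named after its schema file, and the function name missing_required_domains and the sample object are hypothetical.

```python
# Assumption: each domain is a top-level key named after its schema file stem.
REQUIRED_DOMAINS = (
    "description_domain",
    "execution_domain",
    "io_domain",
    "parametric_domain",
    "provenance_domain",
    "usability_domain",
)
OPTIONAL_DOMAINS = ("error_domain", "extension_domain")


def missing_required_domains(bco: dict) -> list:
    """Return the required domain keys that are absent from a parsed BCO."""
    return [name for name in REQUIRED_DOMAINS if name not in bco]


# Hypothetical, deliberately incomplete object: the IO Domain is left out.
draft_bco = {
    "provenance_domain": {},
    "usability_domain": [],
    "description_domain": {},
    "execution_domain": {},
    "parametric_domain": {},
}
print(missing_required_domains(draft_bco))
```

An empty result indicates only that the domains are present; it says nothing about whether their contents satisfy the schema.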
The top three lines of a BCO (bco_spec_version, bco_id, and checksum) are metadata that describe the BCO. These lines are external to all domains. The checksum is calculated on all following lines.
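The text above does not name a hash algorithm or a canonical serialization, so the following is only a sketch of one possible reading: the checksum covers everything in the BCO except the three metadata fields. SHA-256, sorted keys with compact separators, and all sample values are assumptions made for illustration.

```python
import hashlib
import json

# The three top-level metadata fields that sit outside every domain.
METADATA_KEYS = ("bco_spec_version", "bco_id", "checksum")


def compute_checksum(bco: dict) -> str:
    """Hash every part of the BCO except the three metadata fields."""
    body = {k: v for k, v in bco.items() if k not in METADATA_KEYS}
    # Assumption: sorted keys and compact separators give a stable byte string.
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    # Assumption: SHA-256; the source text does not specify the algorithm.
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def checksum_matches(bco: dict) -> bool:
    """Compare the stored checksum with one recomputed from the domains."""
    return bco.get("checksum") == compute_checksum(bco)


# Hypothetical object with placeholder values.
bco = {
    "bco_spec_version": "example-spec-version",
    "bco_id": "example-id",
    "checksum": "",
    "usability_domain": ["Placeholder description of the workflow."],
}
bco["checksum"] = compute_checksum(bco)
print(checksum_matches(bco))
```

Under this reading, editing any domain invalidates the stored checksum, while changing the identifier or specification version does not.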
Files in the schemas folder are kept separate for organization. References in the BCO schema ($ref) to these files should be replaced with the proper domain from the appropriate file. For example, line 141 (“$ref”: provenance_domain.json) is a reference to the structure specified in the provenance_domain.json file. The BCO Schema builds on the JSON Schema by adding domains in a way that facilitates the communication of bioinformatics workflows. A description of the domain files follows. In addition, two examples have been generated by the community of users, as well as a tool to automate the creation of a file using the BCO schema standard.
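Replacing a $ref with the content of the referenced file can be sketched as a recursive walk over the parsed schema. The example below is informative only: the schema fragments are hypothetical stand-ins held in memory rather than the real files, and the helper does not handle URI fragments, remote references, or circular references.

```python
def inline_refs(node, schema_files):
    """Recursively replace {"$ref": "<file>"} nodes with that file's content."""
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref in schema_files:
            # The referenced file may itself contain references.
            return inline_refs(schema_files[ref], schema_files)
        return {key: inline_refs(value, schema_files) for key, value in node.items()}
    if isinstance(node, list):
        return [inline_refs(item, schema_files) for item in node]
    return node


# Hypothetical stand-ins for files in the schemas folder.
schema_files = {
    "provenance_domain.json": {"type": "object", "required": ["name"]},
}
top_level = {
    "type": "object",
    "properties": {
        "provenance_domain": {"$ref": "provenance_domain.json"},
        "usability_domain": {"type": "array"},
    },
}

resolved = inline_refs(top_level, schema_files)
print(resolved["properties"]["provenance_domain"])
```

References to files that are not supplied are left untouched, so a partially resolved schema remains inspectable.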
The Description Domain of a BioCompute Object contains a description of external resources, pipeline steps, and the relationship of I/O objects.
The Error Domain contains information related to the bounds of detection (such as the minimum sequence depth and minimum sequence coverage), and statistical analyses of the pipeline (such as the false negative and false positive rates). Fields in the Error Domain can be determined algorithmically (by repeatedly invoking the pipeline with the same data) or empirically (by invoking the pipeline with different data, often synthetically generated data).
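As an informative example of the statistical values mentioned above, false positive and false negative rates can be derived from counts obtained by comparing pipeline results against a known truth set, such as synthetically generated data. The function name and the output key names below are hypothetical and are not taken from the schema.

```python
def error_rates(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute false positive and false negative rates from confusion counts."""
    negatives = fp + tn  # items that are truly negative
    positives = tp + fn  # items that are truly positive
    if negatives == 0 or positives == 0:
        raise ValueError("the truth set needs both positive and negative items")
    return {
        "false_positive_rate": fp / negatives,
        "false_negative_rate": fn / positives,
    }


# Hypothetical counts from one empirical run against a synthetic truth set.
rates = error_rates(tp=90, fp=5, tn=95, fn=10)
print(rates)
```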
The Execution Domain of a BioCompute Object contains information needed for deployment, software configuration, and running applications in a dependent environment. This may include scripts, drivers, environment variables, and other software prerequisites.
The IO Domain of a BioCompute Object is a list of global input and output files that may exist on a local machine or on another machine. It does not include references to intermediate files.
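One way to separate global files from intermediate files is sketched below, for illustration only. It assumes a heuristic that this standard does not prescribe: a file is a global input if no step produces it, and a global output if no step consumes it. The step records and file names are hypothetical.

```python
def global_io(steps):
    """Split files into global inputs and outputs, dropping intermediates."""
    consumed, produced = set(), set()
    for step in steps:
        consumed.update(step["inputs"])
        produced.update(step["outputs"])
    # Heuristic (assumption): intermediates are both produced and consumed.
    global_inputs = sorted(consumed - produced)
    global_outputs = sorted(produced - consumed)
    return global_inputs, global_outputs


# Hypothetical two-step pipeline; aligned.bam is an intermediate file.
steps = [
    {"inputs": ["reads.fastq", "reference.fasta"], "outputs": ["aligned.bam"]},
    {"inputs": ["aligned.bam", "reference.fasta"], "outputs": ["variants.vcf"]},
]

inputs, outputs = global_io(steps)
print(inputs, outputs)
```

A pipeline that deliberately publishes a file that later steps also consume would need that file added to the outputs by hand, since the heuristic treats it as intermediate.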
The Parametric Domain of a BioCompute Object includes any parameters used in a workflow. This is typically used only in the context of parameters changed from default settings for ease of understanding.
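Recording only the parameters that differ from a tool's defaults can be sketched as a simple dictionary comparison. The example is informative: the parameter names and values are hypothetical, and a parameter with no known default is treated as changed.

```python
_MISSING = object()  # sentinel for parameters that have no known default


def changed_parameters(defaults: dict, used: dict) -> dict:
    """Return the parameters whose values differ from the tool's defaults."""
    return {
        name: value
        for name, value in used.items()
        if defaults.get(name, _MISSING) != value
    }


# Hypothetical defaults and settings for one pipeline step.
defaults = {"seed_length": 19, "min_quality": 20, "threads": 1}
used = {"seed_length": 19, "min_quality": 30, "threads": 1, "trim_adapters": True}

print(changed_parameters(defaults, used))
```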
The Provenance Domain contains metadata related to the BCO and is not used for computation. It is used to track the flow of data from original source to final computation, and includes contributors, reviewers, and versioning.
The Usability Domain of a BioCompute Object is a plain language description of what was done in the workflow. This is not used for computation, and should align with the actual steps described elsewhere in the BCO. The Usability Domain conveys the purpose of the BCO, and improves searchability of the BCO. It is recommended that a novel use of the BCO result in the creation of a new entry with a new Usability Domain.
The Extension Domain allows a user to define additional fields and is optional. A separate folder called extension_domain exists within the schemas folder. Two Extension Domain example files exist in the extension_domain folder that describe how a BCO can include a reference to FHIR (Extension Domain example: FHIR) and/or to SCM (Extension Domain example: SCM).
Additional helpful resources have been created, including a Community User Guide for Best Practices. This document describes ways in which the schema has been used and is known to be effective, using these to derive best practices. In addition, a repository of examples exists, which includes the use of the optional Error Domain. A BCO Editor tool has also been generated. The BCO Editor is an example implementation of the schema, and can be used to create and edit BCOs. Finally, a script to validate that documents have been created according to the BCO schema is also available for use. This Python tool will check a document to ensure that it has been created according to the current BCO Schema.
Bibliographical references are resources that provide additional or helpful material but do not need to be understood or used to implement this standard. Reference to these resources is made for informational use only.
Community User Guide for Best Practices