From 20d780ba1b51d1ca95d39eff31302015d000dfdf Mon Sep 17 00:00:00 2001
From: "github-actions[bot]" <github-actions[bot]@users.noreply.github.com>
Date: Mon, 23 Oct 2023 15:18:21 +0000
Subject: [PATCH] Deployed 5c0717c to dev with MkDocs 1.3.0 and mike 1.1.2

---
 dev/4.Syntax/index.html           |   4 ++--
 dev/5.Core_identifiers/index.html |   8 ++++----
 dev/index.html                    |   2 +-
 dev/search/search_index.json      |   2 +-
 dev/sitemap.xml.gz                | Bin 204 -> 204 bytes
 5 files changed, 8 insertions(+), 8 deletions(-)
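
Note on section 5.1 of the specification text carried in this patch: the intrinsic identifier of a content object is described there as the SHA1 of the ASCII string "blob", an ASCII space, the content length in decimal digits, a NULL byte, and the raw bytes. The rule can be sketched in Python as below. This is a minimal illustration, not the reference implementation: the helper name `content_swhid` is hypothetical, and plain `hashlib.sha1` is used, so the collision detection that section 3.6 calls for is assumed away.

```python
import hashlib


def content_swhid(data: bytes) -> str:
    """Return the core SWHID of a content object (hypothetical helper name)."""
    # Header per section 5.1: "blob", a space, the decimal length, a NULL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    # Plain SHA1 only; the collision detection of section 3.6 is NOT applied here.
    digest = hashlib.sha1(header + data).hexdigest()  # lowercase hex, 40 characters
    return f"swh:1:cnt:{digest}"


if __name__ == "__main__":
    # Empty content: matches the well-known Git empty-blob object id.
    print(content_swhid(b""))  # swh:1:cnt:e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

No file name or other metadata enters the hash, which is why two files with identical bytes share one identifier.
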
diff --git a/dev/4.Syntax/index.html b/dev/4.Syntax/index.html
index 78fa2d1..731aaf3 100644
--- a/dev/4.Syntax/index.html
+++ b/dev/4.Syntax/index.html
@@ -114,8 +114,8 @@
                 <h1 id="4-syntax">4 Syntax</h1>
 <p>A SWHID consists of two separate parts, a mandatory <em>core identifier</em> that can
 identify any software artifact (or "object"), and an optional list of
-<em>qualifiers</em> that allows to specify the context where the object is meant to be
-seen and point to a subpart of the object itself.</p>
+<em>qualifiers</em> that allows specification of the context where the object is meant to be
+seen and points to a subpart of the object itself.</p>
 <p>Syntactically, SWHIDs are generated by the <code>&lt;identifier&gt;</code> entry point in
 the following grammar:</p>
 <pre class="codehilite"><code class="language-bnf">&lt;identifier&gt; ::= &lt;core_identifier&gt; [ &lt;qualifiers&gt; ] ;
diff --git a/dev/5.Core_identifiers/index.html b/dev/5.Core_identifiers/index.html
index 2c4ef98..dd41b7f 100644
--- a/dev/5.Core_identifiers/index.html
+++ b/dev/5.Core_identifiers/index.html
@@ -141,7 +141,7 @@ <h1 id="5-core-identifiers">5 Core identifiers</h1>
 </ul>
 <p>The fourth field is the <em>intrinsic identifier</em> of the object.
 This is a hex-encoded (using lowercase ASCII characters) hash value
-computed by the content and relevant metadata of the object.</p>
+computed from the content and relevant metadata of the object.</p>
 <h2 id="51-contents">5.1 Contents</h2>
 <p>A <em>content</em> is an uninterpreted byte sequence, typically, the content of a file.
 For this type of object the intrinsic identifier is the <code>sha1_git</code> hash of it,
@@ -185,12 +185,12 @@ <h2 id="52-directories">5.2 Directories</h2>
 <li>a NULL byte,</li>
 <li>and the previously obtained serialization.</li>
 </ul>
-<p>As an example, <code>swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505</code> is the
+<p>As an example, <code>swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505</code>
 is the SWHID computed from
-<a href="https://archive.softwareheritage.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505">a directory containing the source code of the darktable photography application</a> as
+<a href="https://archive.softwareheritage.org/swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505">a directory containing the source code of the darktable photography application</a> at
 a given point in time of its development on May 4th 2017.</p>
 <h2 id="53-revisions">5.3 Revisions</h2>
-<p>Software development within a specific project is essentially a time-indexed series of copies of a single “root” directory that contains the entire project source code. Software evolves when a developer modifies the content of one or more files in that directory and record their changes.</p>
+<p>Software development within a specific project is essentially a time-indexed series of copies of a single “root” directory that contains the entire project source code. Software evolves when a developer modifies the content of one or more files in that directory and records their changes.</p>
 <p>Each recorded copy of the root directory is known as a “revision”. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc).</p>
 <p>The supported metadata is as follows:</p>
 <ul>
diff --git a/dev/index.html b/dev/index.html
index 0972a82..7252830 100644
--- a/dev/index.html
+++ b/dev/index.html
@@ -170,5 +170,5 @@ <h1 id="the-swhid-specification-version-11">The SWHID Specification Version 1.1<
 
 <!--
 MkDocs version : 1.3.0
-Build Date UTC : 2023-10-23 13:47:38.614365+00:00
+Build Date UTC : 2023-10-23 15:18:21.877787+00:00
 -->
diff --git a/dev/search/search_index.json b/dev/search/search_index.json
index ae4bf85..2b9aa16 100644
--- a/dev/search/search_index.json
+++ b/dev/search/search_index.json
@@ -1 +1 @@
-{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"The SWHID Specification Version 1.1 Copyright \u00a9 2022-2023 SWHID Contributors. This work is licensed under the Community Specification License 1.0. With thanks to Alexios Zavras, Jean-Francois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli for their contributions and assistance.","title":"Copyright"},{"location":"#the-swhid-specification-version-11","text":"Copyright \u00a9 2022-2023 SWHID Contributors. This work is licensed under the Community Specification License 1.0. With thanks to Alexios Zavras, Jean-Francois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli for their contributions and assistance.","title":"The SWHID Specification Version 1.1"},{"location":"0.Foreword/","text":"Foreword ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization. The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives ). Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. 
ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see https://www.iso.org/patents ). Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement. For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see https://www.iso.org/iso/foreword.html . This document was prepared by XXX. Any feedback or questions on this document should be directed to the user's national standards body. A complete listing of these bodies can be found at https://www.iso.org/members.html .","title":"Foreword"},{"location":"0.Foreword/#foreword","text":"ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization. The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. 
This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives ). Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see https://www.iso.org/patents ). Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement. For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see https://www.iso.org/iso/foreword.html . This document was prepared by XXX. Any feedback or questions on this document should be directed to the user's national standards body. A complete listing of these bodies can be found at https://www.iso.org/members.html .","title":"Foreword"},{"location":"0.Introduction/","text":"Introduction Modern software relies heavily on open source components that are developed collaboratively in a distributed setting, and that are assembled to create complex systems that evolve at a fast pace. This has strengthened the need to precisely track, ensure availability, and guarantee integrity of the components that go into a given system for a variety of stakeholders. Academia needs to ensure that research results are reproducible, industry needs to improve the traceability of the software supply chain, developer communities need tools to cope with the increasing complexity. 
A key building block for addressing this issue is a system of intrinsic identifiers that allows to precisely pinpoint the exact version of any software artifact, at all levels of granularity, without relying on any central registry or naming authority. With this specification, the SWHID working group makes such a system of intrinsic identifiers, originally developed for the Software Heritage universal source code archive, available to all stakeholders. For the sake of clarity, we will use examples drawn directly from the Software Heritage archive, but notice that systems for persistent archival of software artifacts, as well as resolution of SWHIDs are out of the scope of this specification, and the SWHID specification does not require in any way the use of Software Heritage.","title":"Introduction"},{"location":"0.Introduction/#introduction","text":"Modern software relies heavily on open source components that are developed collaboratively in a distributed setting, and that are assembled to create complex systems that evolve at a fast pace. This has strengthened the need to precisely track, ensure availability, and guarantee integrity of the components that go into a given system for a variety of stakeholders. Academia needs to ensure that research results are reproducible, industry needs to improve the traceability of the software supply chain, developer communities need tools to cope with the increasing complexity. A key building block for addressing this issue is a system of intrinsic identifiers that allows to precisely pinpoint the exact version of any software artifact, at all levels of granularity, without relying on any central registry or naming authority. With this specification, the SWHID working group makes such a system of intrinsic identifiers, originally developed for the Software Heritage universal source code archive, available to all stakeholders. 
For the sake of clarity, we will use examples drawn directly from the Software Heritage archive, but notice that systems for persistent archival of software artifacts, as well as resolution of SWHIDs are out of the scope of this specification, and the SWHID specification does not require in any way the use of Software Heritage.","title":"Introduction"},{"location":"1.Scope/","text":"1 Scope This SoftWare Hash IDentifier (SWHID) specification defines a standard data format for referencing digital artifacts that fit in the data model of modern distributed version control systems. This includes the typical tree-like structure of a filesystem hierarchy, but also special nodes to track revisions and releases, as well as the full status of a version control system, with all its development branches. A key property of SWHIDs is that they can be computed using cryptographically strong functions directly from the digital objects they refer to, by anyone that has access to a copy of them. This enables decentralised and independent verification of integrity, without relying on a registry or a central authority. The computation of the SWHID identifiers is based on Merkle Acyclic Directed Graphs, a natural generalization of Merkle trees. The resolution of SWHIDs, i.e. the process of obtaining a copy of a digital artifact corresponding to a given SWHID, is out of the scope of this specification.","title":"Clause 1: Scope"},{"location":"1.Scope/#1-scope","text":"This SoftWare Hash IDentifier (SWHID) specification defines a standard data format for referencing digital artifacts that fit in the data model of modern distributed version control systems. This includes the typical tree-like structure of a filesystem hierarchy, but also special nodes to track revisions and releases, as well as the full status of a version control system, with all its development branches. 
A key property of SWHIDs is that they can be computed using cryptographically strong functions directly from the digital objects they refer to, by anyone that has access to a copy of them. This enables decentralised and independent verification of integrity, without relying on a registry or a central authority. The computation of the SWHID identifiers is based on Merkle Acyclic Directed Graphs, a natural generalization of Merkle trees. The resolution of SWHIDs, i.e. the process of obtaining a copy of a digital artifact corresponding to a given SWHID, is out of the scope of this specification.","title":"1 Scope"},{"location":"2.Normative_references/","text":"2 Normative references The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies. RFC-3174, US Secure Hash Algorithm 1 (SHA1) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3174 RFC-3986, Uniform Resource Identifier (URI): Generic Syntax , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3986 RFC-3987, Internationalized Resource Identifiers (IRIs) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3987 RFC-5234, Augmented BNF for Syntax Specifications: ABNF , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc5234","title":"Clause 2: Normative references"},{"location":"2.Normative_references/#2-normative-references","text":"The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies. 
RFC-3174, US Secure Hash Algorithm 1 (SHA1) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3174 RFC-3986, Uniform Resource Identifier (URI): Generic Syntax , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3986 RFC-3987, Internationalized Resource Identifiers (IRIs) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3987 RFC-5234, Augmented BNF for Syntax Specifications: ABNF , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc5234","title":"2 Normative references"},{"location":"3.Terms_and_definitions/","text":"3 Terms and definitions For the purposes of this document, the following terms and definitions apply. ISO and IEC maintain terminological databases for use in standardization at the following addresses: ISO Online browsing platform: available at https://www.iso.org/obp IEC Electropedia: available at http://www.electropedia.org/ 3.1 branch In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the \"main\" or \"master\" branch). It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch. 3.2 git Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server. 3.3 hierarchical file system A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically (in a structure often visualized as a tree). 
It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure. 3.4 intrinsic identifier An identifier that can be computed directly from the object that it identifies, without needing a registry. Typical examples are cryptographically strong hashes. 3.5 repository In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, documentation, etc. It often includes metadata about the stored items, such as version number, author, date of the last modification, etc. Repositories can be local or remote and are managed by version control systems like Git. 3.6 SHA1 SHA-1 (short for \"Secure Hash Algorithm 1\", also stylized as \" SHA1 \") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value. The returned value is called SHA1 checksum , or simply SHA1 when there is no risk of ambiguity between the function and the returned value. A detailed description of how to compute SHA1 is available in RFC-3174. In the wake of the Shattered attack of 2017 (see paper: Stevens2017Shattered ), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums. It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: Stevens2013Counter ). As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to avoid that such files are referenced using only SHA1 checksums. 
For the purpose of this specification document, the SHA1 function is therefore considered to be a partial function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in Stevens2013Counter and the reference implementation of it available at https://github.com/cr-marcstevens/sha1collisiondetection (version stable-v1.0.3 , corresponding to Git commit ID 38096fc021ac5b8f8207c7e926f11feb6b5eb17c ). When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it. Note that in most cases SHA1 in this specification are computed on objects after adding specific headers to them, making \"trivial\" collision-prone files still perfectly valid and hence referenceable using SWHIDs. 3.7 version control system A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and branching and merging of code. Examples include Git, Mercurial, and Subversion.","title":"Clause 3: Terms and definitions"},{"location":"3.Terms_and_definitions/#3-terms-and-definitions","text":"For the purposes of this document, the following terms and definitions apply. ISO and IEC maintain terminological databases for use in standardization at the following addresses: ISO Online browsing platform: available at https://www.iso.org/obp IEC Electropedia: available at http://www.electropedia.org/","title":"3 Terms and definitions"},{"location":"3.Terms_and_definitions/#31-branch","text":"In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the \"main\" or \"master\" branch). 
It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch.","title":"3.1 branch"},{"location":"3.Terms_and_definitions/#32-git","text":"Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server.","title":"3.2 git"},{"location":"3.Terms_and_definitions/#33-hierarchical-file-system","text":"A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically (in a structure often visualized as a tree). It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.","title":"3.3 hierarchical file system"},{"location":"3.Terms_and_definitions/#34-intrinsic-identifier","text":"An identifier that can be computed directly from the object that it identifies, without needing a registry. Typical examples are cryptographically strong hashes.","title":"3.4 intrinsic identifier"},{"location":"3.Terms_and_definitions/#35-repository","text":"In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, documentation, etc. It often includes metadata about the stored items, such as version number, author, date of the last modification, etc. 
Repositories can be local or remote and are managed by version control systems like Git.","title":"3.5 repository"},{"location":"3.Terms_and_definitions/#36-sha1","text":"SHA-1 (short for \"Secure Hash Algorithm 1\", also stylized as \" SHA1 \") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value. The returned value is called SHA1 checksum , or simply SHA1 when there is no risk of ambiguity between the function and the returned value. A detailed description of how to compute SHA1 is available in RFC-3174. In the wake of the Shattered attack of 2017 (see paper: Stevens2017Shattered ), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums. It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: Stevens2013Counter ). As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to avoid that such files are referenced using only SHA1 checksums. For the purpose of this specification document, the SHA1 function is therefore considered to be a partial function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in Stevens2013Counter and the reference implementation of it available at https://github.com/cr-marcstevens/sha1collisiondetection (version stable-v1.0.3 , corresponding to Git commit ID 38096fc021ac5b8f8207c7e926f11feb6b5eb17c ). When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it. 
Note that in most cases SHA1 in this specification are computed on objects after adding specific headers to them, making \"trivial\" collision-prone files still perfectly valid and hence referenceable using SWHIDs.","title":"3.6 SHA1"},{"location":"3.Terms_and_definitions/#37-version-control-system","text":"A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and branching and merging of code. Examples include Git, Mercurial, and Subversion.","title":"3.7 version control system"},{"location":"4.Syntax/","text":"4 Syntax A SWHID consists of two separate parts, a mandatory core identifier that can identify any software artifact (or \"object\"), and an optional list of qualifiers that allows to specify the context where the object is meant to be seen and point to a subpart of the object itself. 
Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar: <identifier> ::= <core_identifier> [ <qualifiers> ] ; <core_identifier> ::= \"swh\" \":\" <scheme_version> \":\" <object_type> \":\" <object_id> ; <scheme_version> ::= \"1\" ; <object_type> ::= \"snp\" (* snapshot *) | \"rel\" (* release *) | \"rev\" (* revision *) | \"dir\" (* directory *) | \"cnt\" (* content *) ; <object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *) <dec_digit> ::= \"0\" | \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" ; <hex_digit> ::= <dec_digit> | \"a\" | \"b\" | \"c\" | \"d\" | \"e\" | \"f\" ; <qualifiers> ::= \";\" <qualifier> [ <qualifiers> ] ; <qualifier> ::= <context_qualifier> | <fragment_qualifier> ; <context_qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> ; <origin_ctxt> ::= \"origin\" \"=\" <url_escaped> ; <visit_ctxt> ::= \"visit\" \"=\" <identifier_core> ; <anchor_ctxt> ::= \"anchor\" \"=\" <identifier_core> ; <path_ctxt> ::= \"path\" \"=\" <path_absolute_escaped> ; <fragment_qualifier> ::= \"lines\" \"=\" <range> | \"bytes\" \"=\" <range> ; <range> ::= <number> [\"-\" <number>] ; <number> ::= <dec_digit> + ; <url_escaped> ::= (* RFC 3987 IRI *) <path_absolute_escaped> ::= (* RFC 3987 absolute path *) The last two symbols are defined as: <path_absolute_escaped> is an ipath-absolute from RFC-3987; and <url_escaped> is an IRI as defined in RFC-3987. In both of these, all occurrences of ; (and % , as required by the RFC) have been percent-encoded (as %3B and %25 respectively). 
Other characters may be percent-encoded, e.g., to improve readability and/or embeddability of SWHID in other contexts.","title":"Clause 4: Syntax"},{"location":"4.Syntax/#4-syntax","text":"A SWHID consists of two separate parts, a mandatory core identifier that can identify any software artifact (or \"object\"), and an optional list of qualifiers that allows to specify the context where the object is meant to be seen and point to a subpart of the object itself. Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar: <identifier> ::= <core_identifier> [ <qualifiers> ] ; <core_identifier> ::= \"swh\" \":\" <scheme_version> \":\" <object_type> \":\" <object_id> ; <scheme_version> ::= \"1\" ; <object_type> ::= \"snp\" (* snapshot *) | \"rel\" (* release *) | \"rev\" (* revision *) | \"dir\" (* directory *) | \"cnt\" (* content *) ; <object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *) <dec_digit> ::= \"0\" | \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" ; <hex_digit> ::= <dec_digit> | \"a\" | \"b\" | \"c\" | \"d\" | \"e\" | \"f\" ; <qualifiers> ::= \";\" <qualifier> [ <qualifiers> ] ; <qualifier> ::= <context_qualifier> | <fragment_qualifier> ; <context_qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> ; <origin_ctxt> ::= \"origin\" \"=\" <url_escaped> ; <visit_ctxt> ::= \"visit\" \"=\" <identifier_core> ; <anchor_ctxt> ::= \"anchor\" \"=\" <identifier_core> ; <path_ctxt> ::= \"path\" \"=\" <path_absolute_escaped> ; <fragment_qualifier> ::= \"lines\" \"=\" <range> | \"bytes\" \"=\" <range> ; <range> ::= <number> [\"-\" <number>] ; <number> ::= <dec_digit> + ; <url_escaped> ::= (* RFC 3987 IRI *) <path_absolute_escaped> ::= (* RFC 3987 absolute path *) The last two symbols are defined as: <path_absolute_escaped> is an ipath-absolute from RFC-3987; and <url_escaped> is an IRI as defined in RFC-3987. 
In both of these, all occurrences of ; (and % , as required by the RFC) have been percent-encoded (as %3B and %25 respectively). Other characters may be percent-encoded, e.g., to improve readability and/or embeddability of SWHID in other contexts.","title":"4 Syntax"},{"location":"5.Core_identifiers/","text":"5 Core identifiers A core SWHID identifier is composed of four fields, separated by a colon : . The first field is the type of the identifier and it is defined to be swh . The second field is the version of the identifier scheme and for this version of the specification it is defined to be 1 . The third field is a tag corresponding to the type of object identified: cnt for contents (see 5.1) dir for directories (see 5.2) rev for revisions (see 5.3) rel for releases (see 5.4) snp for snapshots (see 5.5) The fourth field is the intrinsic identifier of the object. This is a hex-encoded (using lowercase ASCII characters) hash value computed by the content and relevant metadata of the object. 5.1 Contents A content is an uninterpreted byte sequence, typically, the content of a file. For this type of object the intrinsic identifier is the sha1_git hash of it, i.e. the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"blob\" (4 bytes), an ASCII space, the length of the content as ASCII-encoded decimal digits, a NULL byte, and the actual content of the file. No metadata is used for this type of object (in particular, notice that there is no file name mentioned here). As an example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 is the SWHID computed from the full text of the GPL3 license 5.2 Directories Directories are data structures commonly used in hierarchical file systems to group together files and other directories, and to hold relevant metadata about them, in the form of directory entries. 
This version of the SWHID standard adopts the same convention as the popular git version control system, and only takes into account as metadata the name of the directory entries (as a sequence of arbitrary bytes, excluding ASCII '/' and the NULL byte) and a simplified representation of the access rights. The names of entries in a directory must be distinct from one another. In order to compute the intrinsic identifier of a directory, it is necessary to compute first the SWHID of each object listed in the directory. Then one proceeds to create a serialization of the directory as follows: sort the directory entries using the following algorithm for each entry pointing to a directory , append an ASCII '/' to its name sort all entries using the natural byte order of their (modified) name for each entry, with a given name (unmodified), add a sequence of bytes composed of the normalized access rights, encoded as a sequence of ASCII-encoded octal digits ('100644' for regular files, '100755' for executable files, '120000' for symbolic links, '40000' for directories), an ASCII space, the name as a raw string of bytes, a NULL byte, the intrinsic identifier of the content or directory , encoded as a sequence of 20 bytes. The intrinsic identifier of the directory is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tree\" (4 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 is the SWHID computed from a directory containing the source code of the darktable photography application at a given point in time of its development on May 4th 2017. 5.3 Revisions Software development within a specific project is essentially a time-indexed series of copies of a single \u201croot\u201d directory that contains the entire project source code. 
Software evolves when a developer modifies the content of one or more files in that directory and records their changes. Each recorded copy of the root directory is known as a \u201crevision\u201d. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc.). The supported metadata is as follows: author (arbitrary byte sequence, mandatory): generally contains the name and email address of the author of the revision. author timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was authored. author timezone offset (arbitrary byte sequence): UTC offset at which the revision was authored, usually an ASCII-encoded [+/-]HHMM specification. committer (arbitrary byte sequence, mandatory): generally contains the name and email address of the committer of the revision. committer timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was committed. committer timezone offset (arbitrary byte sequence): UTC offset at which the revision was committed, usually an ASCII-encoded [+/-]HHMM specification. directory (mandatory): the root directory recorded by the revision parent revisions (ordered list of revisions): the immediately preceding revisions in the development timeline. Can be empty for an initial revision, and have multiple revisions when multiple branches of history are being merged. extra headers (ordered list of byte key/value pairs): arbitrary additional metadata attached to the revision. 
The key must not contain the ASCII bytes for the space or LF characters; commonly used keys are a string of non-whitespace printable ASCII characters, such as \"encoding\" (where the value is interpreted as the encoding of the message field) or \"gpgsig\" (where the value is interpreted as an OpenPGP signature of the metadata of the revision). message: the message describing the revision In order to compute the intrinsic identifier of a revision, it is necessary to first compute the intrinsic identifier of the root directory recorded by the revision, as well as the intrinsic identifier of all parent revisions (recursively). The serialization of the revision is a sequence of lines in the following order: the reference to the root directory: the ASCII string \"tree\" (4 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the directory (40 ASCII bytes), a LF; for each parent revision, in the order they've been provided, a reference to that revision: the ASCII string \"parent\" (6 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the parent revision (40 ASCII bytes), a LF; the author line: the ASCII string \"author\" (6 bytes), an ASCII space, the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the author timestamp, an ASCII space, the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the committer line: the ASCII string \"committer\" (9 bytes), an ASCII space, the string of bytes provided for the committer name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the committer timestamp, an ASCII space, the string of bytes provided for the committer timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the extra header lines; 
for each provided key/value pair, in the order they have been provided: the key, an ASCII space, the value, with each LF replaced by LF followed by an ASCII space, a LF; if the message is defined: an extra LF (the message is separated from the header with two LFs), the commit message as a raw string of bytes. The intrinsic identifier of the revision is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"commit\" (6 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d is the SWHID computed from a commit in the development history of Darktable , dated 16 January 2017, that added undo/redo support for masks. 5.4 Releases Some revisions get selected by developers as denoting important project milestones known as \u201creleases\u201d. Each release points to the last commit in project history corresponding to the release and carries metadata: release name and version, release message, cryptographic signatures, etc. If they're not attached to development history (e.g. if they've been imported from bare tarballs), releases can also point directly to a root directory instead of a full revision with metadata. The supported metadata is as follows: - name (arbitrary byte sequence, mandatory): a name identifying the release - author (arbitrary byte sequence): generally contains the name and email address of the author of the release. - author timestamp (decimal timestamp from the Unix epoch): the date at which the release was authored. - author timezone offset (arbitrary byte sequence): UTC offset at which the release was authored, usually an ASCII-encoded [+/-]HHMM specification. 
- target object (mandatory): a reference to another object, which can be either a revision, a directory or less commonly a content or another release - message: the message describing the release In order to compute the intrinsic identifier of a release, it is necessary to first compute the intrinsic identifier of the targeted object. The serialization of the release is a sequence of lines in the following order: the reference to the target object: the ASCII string \"object\" (6 bytes) an ASCII space the ASCII-encoded hexadecimal intrinsic identifier of the target object (40 ASCII bytes) a LF the ASCII string \"type\" (4 bytes) an ASCII space an ASCII string referencing the type of the target object ( \"commit\" for a revision, \"tree\" for a directory, \"tag\" for another release, \"blob\" for a content object) a LF the name of the release: the ASCII string \"tag\" (3 bytes) an ASCII space the string of bytes provided for the release name, with each LF replaced by LF followed by an ASCII space a LF if there is an author, the author line: the ASCII string \"tagger\" (6 bytes) an ASCII space the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space an ASCII space the ASCII-encoded decimal representation of the author timestamp an ASCII space the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space a LF if the message is defined: an extra LF (the message is separated from the header with two LFs) the commit message as a raw string of bytes The intrinsic identifier of the release is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tag\" (3 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. 
As an example, swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f is the SWHID computed from the Darktable release 2.3.0 , dated 24 December 2016. 5.5 Snapshots Any kind of software origin offers multiple pointers to the \u201ccurrent\u201d state of a development project. In the case of VCS this is reflected by branches (e.g., master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (e.g., stable, development, etc.). A \u201csnapshot\u201d of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a FOSS distribution. Practically, a snapshot is a list of named branches pointing at objects of any of the known types (content, directory, revision, release or snapshot). A branch can also be an alias to another (named) branch, for instance the default \"HEAD\" branch can point at another, more specific, \"refs/heads/main\" branch. To compute the intrinsic identifier of a snapshot, one must first compute the intrinsic identifier of all objects referenced by the snapshot. 
Then one proceeds to create a serialization of the snapshot as follows: sort the snapshot branches using the natural byte order of their name for each branch, with a given name , add a sequence of bytes composed of the type of the branch target: \"content\" , \"directory\" , \"revision\" , \"release\" or \"snapshot\" for each corresponding object type \"alias\" for branches referencing another branch; an ASCII space the branch name (as raw bytes) a NULL byte the length of the target identifier, as an ascii-encoded decimal number ( \"20\" for intrinsic identifiers, the length of the name of the target branch for branch aliases) an ASCII colon ( \":\" ) the identifier of the target object pointed at by the branch: for contents, directories, revisions, releases or snapshots: their intrinsic identifier as a string of 20 bytes for branch aliases, the name of the target branch (as a string of bytes) for dangling branches, the empty string Note that, akin to the serialization of directories, there is no separator between entries. Because of alias branches, target identifiers are of arbitrary length and are length-encoded to avoid ambiguity. The intrinsic identifier of the snapshot is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"snapshot\" (8 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 is the SWHID computed from a snapshot of the entire Darktable Git repository as it was on 4 May 2017 on GitHub. Note on compatibility with Git SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the way the current version of Git proceeds for computing identifiers for its objects. 
The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type. Git compatibility is practical, but incidental and is not guaranteed to be maintained in future versions of this standard, nor for different versions of Git.","title":"Clause 5: Core Identifiers"},{"location":"5.Core_identifiers/#5-core-identifiers","text":"A core SWHID identifier is composed of four fields, separated by a colon : . The first field is the type of the identifier and it is defined to be swh . The second field is the version of the identifier scheme and for this version of the specification it is defined to be 1 . The third field is a tag corresponding to the type of object identified: cnt for contents (see 5.1) dir for directories (see 5.2) rev for revisions (see 5.3) rel for releases (see 5.4) snp for snapshots (see 5.5) The fourth field is the intrinsic identifier of the object. This is a hex-encoded (using lowercase ASCII characters) hash value computed by the content and relevant metadata of the object.","title":"5 Core identifiers"},{"location":"5.Core_identifiers/#51-contents","text":"A content is an uninterpreted byte sequence, typically, the content of a file. For this type of object the intrinsic identifier is the sha1_git hash of it, i.e. the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"blob\" (4 bytes), an ASCII space, the length of the content as ASCII-encoded decimal digits, a NULL byte, and the actual content of the file. No metadata is used for this type of object (in particular, notice that there is no file name mentioned here). 
As an example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 is the SWHID computed from the full text of the GPL3 license","title":"5.1 Contents"},{"location":"5.Core_identifiers/#52-directories","text":"Directories are data structures commonly used in hierarchical file systems to group together files and other directories, and to hold relevant metadata about them, in the form of directory entries. This version of the SWHID standard adopts the same convention as the popular git version control system, and only takes into account as metadata the name of the directory entries (as a sequence of arbitrary bytes, excluding ASCII '/' and the NULL byte) and a simplified representation of the access rights. The names of entries in a directory must be distinct from one another. In order to compute the intrinsic identifier of a directory, it is necessary to compute first the SWHID of each object listed in the directory. Then one proceeds to create a serialization of the directory as follows: sort the directory entries using the following algorithm for each entry pointing to a directory , append an ASCII '/' to its name sort all entries using the natural byte order of their (modified) name for each entry, with a given name (unmodified), add a sequence of bytes composed of the normalized access rights, encoded as a sequence of ASCII-encoded octal digits ('100644' for regular files, '100755' for executable files, '120000' for symbolic links, '40000' for directories), an ASCII space, the name as a raw string of bytes, a NULL byte, the intrinsic identifier of the content or directory , encoded as a sequence of 20 bytes. The intrinsic identifier of the directory is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tree\" (4 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. 
As an example, swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 is the SWHID computed from a directory containing the source code of the darktable photography application at a given point in time of its development on May 4th 2017.","title":"5.2 Directories"},{"location":"5.Core_identifiers/#53-revisions","text":"Software development within a specific project is essentially a time-indexed series of copies of a single \u201croot\u201d directory that contains the entire project source code. Software evolves when a developer modifies the content of one or more files in that directory and records their changes. Each recorded copy of the root directory is known as a \u201crevision\u201d. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc.). The supported metadata is as follows: author (arbitrary byte sequence, mandatory): generally contains the name and email address of the author of the revision. author timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was authored. author timezone offset (arbitrary byte sequence): UTC offset at which the revision was authored, usually an ASCII-encoded [+/-]HHMM specification. committer (arbitrary byte sequence, mandatory): generally contains the name and email address of the committer of the revision. committer timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was committed. committer timezone offset (arbitrary byte sequence): UTC offset at which the revision was committed, usually an ASCII-encoded [+/-]HHMM specification. directory (mandatory): the root directory recorded by the revision parent revisions (ordered list of revisions): the immediately preceding revisions in the development timeline. 
Can be empty for an initial revision, and have multiple revisions when multiple branches of history are being merged. extra headers (ordered list of byte key/value pairs): arbitrary additional metadata attached to the revision. The key must not contain the ASCII bytes for the space or LF characters; commonly used keys are a string of non-whitespace printable ASCII characters, such as \"encoding\" (where the value is interpreted as the encoding of the message field) or \"gpgsig\" (where the value is interpreted as an OpenPGP signature of the metadata of the revision). message: the message describing the revision In order to compute the intrinsic identifier of a revision, it is necessary to first compute the intrinsic identifier of the root directory recorded by the revision, as well as the intrinsic identifier of all parent revisions (recursively). The serialization of the revision is a sequence of lines in the following order: the reference to the root directory: the ASCII string \"tree\" (4 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the directory (40 ASCII bytes), a LF; for each parent revision, in the order they've been provided, a reference to that revision: the ASCII string \"parent\" (6 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the parent revision (40 ASCII bytes), a LF; the author line: the ASCII string \"author\" (6 bytes), an ASCII space, the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the author timestamp, an ASCII space, the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the committer line: the ASCII string \"committer\" (9 bytes), an ASCII space, the string of bytes provided for the committer name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the 
ASCII-encoded decimal representation of the committer timestamp, an ASCII space, the string of bytes provided for the committer timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the extra header lines; for each provided key/value pair, in the order they have been provided: the key, an ASCII space, the value, with each LF replaced by LF followed by an ASCII space, a LF; if the message is defined: an extra LF (the message is separated from the header with two LFs), the commit message as a raw string of bytes. The intrinsic identifier of the revision is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"commit\" (6 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d is the SWHID computed from a commit in the development history of Darktable , dated 16 January 2017, that added undo/redo support for masks.","title":"5.3 Revisions"},{"location":"5.Core_identifiers/#54-releases","text":"Some revisions get selected by developers as denoting important project milestones known as \u201creleases\u201d. Each release points to the last commit in project history corresponding to the release and carries metadata: release name and version, release message, cryptographic signatures, etc. If they're not attached to development history (e.g. if they've been imported from bare tarballs), releases can also point directly to a root directory instead of a full revision with metadata. The supported metadata is as follows: - name (arbitrary byte sequence, mandatory): a name identifying the release - author (arbitrary byte sequence): generally contains the name and email address of the author of the release. - author timestamp (decimal timestamp from the Unix epoch): the date at which the release was authored. 
- author timezone offset (arbitrary byte sequence): UTC offset at which the release was authored, usually an ASCII-encoded [+/-]HHMM specification. - target object (mandatory): a reference to another object, which can be either a revision, a directory or less commonly a content or another release - message: the message describing the release In order to compute the intrinsic identifier of a release, it is necessary to first compute the intrinsic identifier of the targeted object. The serialization of the release is a sequence of lines in the following order: the reference to the target object: the ASCII string \"object\" (6 bytes) an ASCII space the ASCII-encoded hexadecimal intrinsic identifier of the target object (40 ASCII bytes) a LF the ASCII string \"type\" (4 bytes) an ASCII space an ASCII string referencing the type of the target object ( \"commit\" for a revision, \"tree\" for a directory, \"tag\" for another release, \"blob\" for a content object) a LF the name of the release: the ASCII string \"tag\" (3 bytes) an ASCII space the string of bytes provided for the release name, with each LF replaced by LF followed by an ASCII space a LF if there is an author, the author line: the ASCII string \"tagger\" (6 bytes) an ASCII space the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space an ASCII space the ASCII-encoded decimal representation of the author timestamp an ASCII space the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space a LF if the message is defined: an extra LF (the message is separated from the header with two LFs) the commit message as a raw string of bytes The intrinsic identifier of the release is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tag\" (3 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously 
obtained serialization. As an example, swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f is the SWHID computed from the Darktable release 2.3.0 , dated 24 December 2016.","title":"5.4 Releases"},{"location":"5.Core_identifiers/#55-snapshots","text":"Any kind of software origin offers multiple pointers to the \u201ccurrent\u201d state of a development project. In the case of VCS this is reflected by branches (e.g., master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (e.g., stable, development, etc.). A \u201csnapshot\u201d of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a FOSS distribution. Practically, a snapshot is a list of named branches pointing at objects of any of the known types (content, directory, revision, release or snapshot). A branch can also be an alias to another (named) branch, for instance the default \"HEAD\" branch can point at another, more specific, \"refs/heads/main\" branch. To compute the intrinsic identifier of a snapshot, one must first compute the intrinsic identifier of all objects referenced by the snapshot. 
Then one proceeds to create a serialization of the snapshot as follows: sort the snapshot branches using the natural byte order of their name for each branch, with a given name , add a sequence of bytes composed of the type of the branch target: \"content\" , \"directory\" , \"revision\" , \"release\" or \"snapshot\" for each corresponding object type \"alias\" for branches referencing another branch; an ASCII space the branch name (as raw bytes) a NULL byte the length of the target identifier, as an ascii-encoded decimal number ( \"20\" for intrinsic identifiers, the length of the name of the target branch for branch aliases) an ASCII colon ( \":\" ) the identifier of the target object pointed at by the branch: for contents, directories, revisions, releases or snapshots: their intrinsic identifier as a string of 20 bytes for branch aliases, the name of the target branch (as a string of bytes) for dangling branches, the empty string Note that, akin to the serialization of directories, there is no separator between entries. Because of alias branches, target identifiers are of arbitrary length and are length-encoded to avoid ambiguity. The intrinsic identifier of the snapshot is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"snapshot\" (8 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 is the SWHID computed from a snapshot of the entire Darktable Git repository as it was on 4 May 2017 on GitHub.","title":"5.5 Snapshots"},{"location":"5.Core_identifiers/#note-on-compatibility-with-git","text":"SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the way the current version of Git proceeds for computing identifiers for its objects. 
The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type. Git compatibility is practical, but incidental and is not guaranteed to be maintained in future versions of this standard, nor for different versions of Git.","title":"Note on compatibility with Git"},{"location":"6.Qualified_identifiers/","text":"6 Qualified identifiers Qualifiers A qualified, or full, SWHID is composed of a core SWHID identifier, and a sequence of qualifiers. Qualifiers may be: fragment qualifiers (see 6.1), that identify subparts of a software artifact; or context qualifiers (see 6.2), that provide additional context on the software artifact. Each qualifier is specified as a key-value pair, using an = character as a separator. Qualifiers are separated from the core identifier and from each other by using a ; character. Some qualifiers are valid for specific object types, and the validity of some qualifiers depends on the presence of other qualifiers. A conformant implementation MUST NOT generate invalid qualifiers or qualifier combinations and MUST ignore them if present, as detailed in the following sections. 6.1 Fragment qualifiers There are two fragment qualifiers, lines and bytes . Each fragment qualifier MUST appear at most once. Fragment qualifiers are only valid for objects of type content. Each valid SWHID must have at most one fragment qualifier. A conformant implementation MAY accept a SWHID that violates this constraint, by ignoring the lines qualifier when the bytes qualifier is present. 6.1.1 Lines qualifier A \"line\" in the context of a file content refers to a sequence of characters that ends with a line break. This line can contain text, code, or any other form of data. In this specification, the line break is the ASCII LF character. 
The \"lines\" qualifier allows designation of a line range inside a content. The range can be a single line number, or a pair of line numbers separated by the ASCII - character. Line numbers start from 1, and the range is inclusive, i.e. the fragment includes both the lines numbered as the start and end of the range. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15 designates the function generate_input_stream that is found at lines 9 to 15 of the content with core SWHID swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b . Notice that the notion of \"line number\" is not always meaningful: the content may be a binary file, or a file that uses non-standard line termination character(s). 6.1.2 Bytes qualifier To overcome the limitations of the lines qualifier, the bytes qualifier allows designation of a byte range inside a content. The range can be a single byte number, or a pair of byte numbers separated by - . Byte numbers start from 0, and the range is inclusive, i.e. the fragment includes both the bytes numbered as the start and end of the range. If the range is a single byte number, it designates the byte at that specific position. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;bytes=154-315 designates the same function generate_input_stream as in the example above, but does not rely on any convention about line numbers. 6.2 Context qualifiers There are four context qualifiers, origin , visit , path and anchor . Each context qualifier MUST appear at most once. 6.2.1 Origin qualifier This qualifier allows declaration of the software origin where the object has been found or observed, as a URI. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git This qualifier may be helpful to get hold of the full repository where a content has been found, but there is no guarantee of success, as an origin can change or disappear over time (as is the case in the example above, since gitorious.org was shut down in 2015). 6.2.2 Visit qualifier This qualifier allows addition of the core SWHID identifier of the snapshot of the repository where the object has been found or observed. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 . This qualifier is only valid when the origin qualifier is also present. Otherwise, it MUST be ignored. 6.2.3 Path qualifier This qualifier allows declaration of the absolute file path , from the root directory associated to the anchor node , to the object designated by the core SWHID identifier; when the anchor denotes a directory, a revision or a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the first directory reachable from the HEAD branch, and undefined if such a reference is missing. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 , and that it is named simplefarm.ml in the directory Simplefarm contained in the directory Examples contained in the root directory associated to the revision with core SWHID swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 . This qualifier is only valid when the object type is not content. Otherwise, it MUST be ignored. 6.2.4 Anchor qualifier This qualifier is used in conjunction with the path qualifier. It allows identification of a node in the Merkle DAG relative to which a path to the object is specified, as the core identifier of a directory, a revision, a release or a snapshot. See the example provided for the path qualifier. This qualifier is only valid when the path qualifier is also present. Otherwise, it MUST be ignored. 6.3 Comparing qualified SWHIDs One can determine whether two software artifacts are identical (bit by bit) by comparing their core SWHIDs, ignoring all qualifiers. If the core SWHIDs are equal, the software artifacts they represent are identical. To determine if two SWHIDs represent the same software artifact (or fragment thereof) in the same context, one must also compare their qualifiers. Two SWHIDs are considered equivalent in context if: They both have the same set of qualifiers. The values of these qualifiers are identical. For instance, if both SWHIDs have an anchor qualifier, the core SWHID values of these qualifiers are identical. 
Similarly, if both have a lines qualifier, their values are identical. Note that the order of the qualifiers does not matter for comparison purposes. 6.4 Recommendations We recommend equipping identifiers meant for sharing with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the following canonical order: origin , visit , anchor , path , lines or bytes . By adhering to this order, it becomes easier to visually inspect and compare SWHIDs, especially when dealing with a large number of identifiers. Here is an example: swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15 Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.","title":"Clause 6: Qualified Identifiers"},{"location":"6.Qualified_identifiers/#6-qualified-identifiers","text":"","title":"6 Qualified identifiers"},{"location":"6.Qualified_identifiers/#qualifiers","text":"A qualified, or full, SWHID is composed of a core SWHID identifier, and a sequence of qualifiers. Qualifiers may be: fragment qualifiers (see 6.1), that identify subparts of a software artifact; or context qualifiers (see 6.2), that provide additional context on the software artifact. Each qualifier is specified as a key-value pair, using an = character as a separator. Qualifiers are separated from the core identifier and from each other by using a ; character. Some qualifiers are valid for specific object types, and the validity of some qualifiers depends on the presence of other qualifiers. 
Conformant implementation MUST not generate invalid qualifiers or qualifier combinations and MUST ignore them if present, as detailed in the following sections.","title":"Qualifiers"},{"location":"6.Qualified_identifiers/#61-fragment-qualifiers","text":"There are two fragment qualifiers, lines and bytes . Each fragment qualifier MUST appear at most once. Fragment qualifiers are only valid for objects of type content. Each valid SWHID must have at most one fragment qualifier. A conformant implementation MAY accept a SWHID that violates this constraint, by ignoring the lines qualifier when the bytes qualifier is present.","title":"6.1 Fragment qualifiers"},{"location":"6.Qualified_identifiers/#611-lines-qualifier","text":"A \"line\" in the context of a file content refers to a sequence of characters that ends with a line break. This line can contain text, code, or any other form of data. In this specification, the line break is the ASCII LF character. The \"lines\" qualifier allows to designate a line range inside a content. The range can be a single line number, or a pair of line numbers separated by the ASCII - character. Line numbers start from 1, and the range is inclusive, i.e. the fragment includes both the lines numbered as the start and end of the range. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15 designates the function generate_input_stream that is found at lines 9 to 15 of the content with core SWHID swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b . Notice that the notion of \"line number\" is not always meaningful: the content may be a binary file, or a file that uses non standard line termination character(s).","title":"6.1.1 Lines qualifier"},{"location":"6.Qualified_identifiers/#612-bytes-qualifier","text":"To overcome the limitations of the lines qualifier, the bytes qualifier allows designation of a byte range inside a content. The range can be a single byte number, or a pair of byte numbers separated by - . 
Byte numbers start from 0, and the range is inclusive, i.e. the fragment includes both the bytes numbered as the start and end of the range. If the range is a single byte number, it designates the byte at that specific position. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;bytes=154-315 designates the same function generate_input_stream as in the example above, but does not rely on any convention about line numbers.","title":"6.1.2 Bytes qualifier"},{"location":"6.Qualified_identifiers/#62-context-qualifiers","text":"There are four context qualifiers, origin , visit , path and anchor . Each context qualifier MUST appear at most once.","title":"6.2 Context qualifiers"},{"location":"6.Qualified_identifiers/#621-origin-qualifier","text":"This qualifier allows declaration of the software origin where the object has been found or observed, as an URI. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git This qualifier may be helpful to get hold of the full repository where a content has been found, but there is no guarantee of success, as an origin can change or disappear over time (as is the case in the example above, since gitorious.org was shut down in 2015).","title":"6.2.1 Origin qualifier"},{"location":"6.Qualified_identifiers/#622-visit-qualifier","text":"This qualifier allows addition of the core SWHID identifier of the snapshot of the repository where the object has been found or observed. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 . This qualifier is only valid when the origin qualifier is also present. Otherwise, it MUST be ignored.","title":"6.2.2 Visit qualifier"},{"location":"6.Qualified_identifiers/#623-path-qualifier","text":"This qualifier allows declaration of the absolute file path , from the root directory associated to the anchor node , to the object designated by the core SWHID identifier; when the anchor denotes a directory, a revision or a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the first directory reachable from the HEAD branch, and undefined if such a reference is missing. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 , and that it is named simplefarm.ml in the directory Simplefarm contained in the directory Examples contained in the root directory associated to the revision with core SWHID swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 . This qualifier is only valid when the object type is not content. 
Otherwise, it MUST be ignored.","title":"6.2.3 Path qualifier"},{"location":"6.Qualified_identifiers/#624-anchor-qualifier","text":"This qualifier is used in conjunction with the path qualifier. It allows identification of a node in the Merkle DAG relative to which a path to the object is specified, as the core identifier of a directory, a revision, a release or a snapshot. See the example provided for the path qualifier. This qualifier is only valid when the path qualifier is also present. Otherwise, it MUST be ignored.","title":"6.2.4 Anchor qualifier"},{"location":"6.Qualified_identifiers/#63-comparing-qualified-swhids","text":"One can determine whether two software artifacts are identical (bit by bit) by comparing their core SWHIDs, ignoring all qualifiers. If the core SWHIDs are equal, the software artifacts they represent are identical. To determine if two SWHIDs represent the same software artifact (or fragment thereof) in the same context, one must also compare their qualifiers. Two SWHIDs are considered equivalent in context if: They both have the same set of qualifiers. The values of these qualifiers are identical. For instance, if both SWHIDs have an anchor qualifier, the core SWHID values of these qualifiers are identical. Similarly, if both have a lines qualifier, their values are identical. Note that the order of the qualifiers does not matter for comparison purposes.","title":"6.3 Comparing qualified SWHIDs"},{"location":"6.Qualified_identifiers/#64-recommendations","text":"We recommend equipping identifiers meant for sharing with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the following canonical order: origin , visit , anchor , path , lines or bytes . By adhering to this order, it becomes easier to visually inspect and compare SWHIDs, especially when dealing with a large number of identifiers. 
Here is an example: swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15 Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.","title":"6.4 Recommendations"},{"location":"A.Conformance/","text":"Annex A Conformance (Informative) A.1 Current and Previous Versions This edition has the version number 1.1 as part of its title. Version 1.0 was the first edition of the SWHID Specification as a Publicly Available Standard, and earlier editions of the specification were published by the Software Heritage. Differences between this edition and earlier ones are reported in the text; see also [1] . A.2 Obsolete features Over the life of a standard, some older approaches can become obsolete and are dropped from subsequent editions, possibly with a replacement approach being provided. Such action involves deprecating those outdated features. This edition identifies all currently deprecated features.","title":"Annex A: Conformance"},{"location":"A.Conformance/#annex-a-conformance-informative","text":"","title":"Annex A Conformance (Informative)"},{"location":"A.Conformance/#a1-current-and-previous-versions","text":"This edition has the version number 1.1 as part of its title. Version 1.0 was the first edition of the SWHID Specification as a Publicly Available Standard, and earlier editions of the specification were published by the Software Heritage. 
Differences between this edition and earlier ones are reported in the text; see also [1] .","title":"A.1 Current and Previous Versions"},{"location":"A.Conformance/#a2-obsolete-features","text":"Over the life of a standard, some older approaches can become obsolete and are dropped from subsequent editions, possibly with a replacement approach being provided. Such action involves deprecating those outdated features. This edition identifies all currently deprecated features.","title":"A.2 Obsolete features"},{"location":"B.Bibliography/","text":"Annex B Bibliography (Informative) The following documents are useful references for implementers and users of this document: [1] SoftWare Heritage persistent IDentifiers ; SoftWare Heritage, https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html [Stevens2013Counter] Marc Stevens. Counter-cryptanalysis. In Advances in Cryptology, CRYPTO 2013: 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (pp. 129-146). Springer Berlin Heidelberg. Open access preprint: https://eprint.iacr.org/2013/358 [Stevens2017Shattered] Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov. The First Collision for Full SHA-1. In Advances in Cryptology, CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20\u201324, 2017, Proceedings, Part I 37 (pp. 570-596). Springer International Publishing. Open access preprint: https://eprint.iacr.org/2017/190","title":"Annex B: Bibliography"},{"location":"B.Bibliography/#annex-b-bibliography-informative","text":"The following documents are useful references for implementers and users of this document: [1] SoftWare Heritage persistent IDentifiers ; SoftWare Heritage, https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html [Stevens2013Counter] Marc Stevens. Counter-cryptanalysis. 
In Advances in Cryptology, CRYPTO 2013: 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (pp. 129-146). Springer Berlin Heidelberg. Open access preprint: https://eprint.iacr.org/2013/358 [Stevens2017Shattered] Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov. The First Collision for Full SHA-1. In Advances in Cryptology, CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20\u201324, 2017, Proceedings, Part I 37 (pp. 570-596). Springer International Publishing. Open access preprint: https://eprint.iacr.org/2017/190","title":"Annex B Bibliography (Informative)"}]}
\ No newline at end of file
+{"config":{"indexing":"full","lang":["en"],"min_search_length":3,"prebuild_index":false,"separator":"[\\s\\-]+"},"docs":[{"location":"","text":"The SWHID Specification Version 1.1 Copyright \u00a9 2022-2023 SWHID Contributors. This work is licensed under the Community Specification License 1.0. With thanks to Alexios Zavras, Jean-Francois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli for their contributions and assistance.","title":"Copyright"},{"location":"#the-swhid-specification-version-11","text":"Copyright \u00a9 2022-2023 SWHID Contributors. This work is licensed under the Community Specification License 1.0. With thanks to Alexios Zavras, Jean-Francois Abramatic, Roberto Di Cosmo, and Stefano Zacchiroli for their contributions and assistance.","title":"The SWHID Specification Version 1.1"},{"location":"0.Foreword/","text":"Foreword ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization. The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives ). Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. 
ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see https://www.iso.org/patents ). Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement. For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see https://www.iso.org/iso/foreword.html . This document was prepared by XXX. Any feedback or questions on this document should be directed to the user's national standards body. A complete listing of these bodies can be found at https://www.iso.org/members.html .","title":"Foreword"},{"location":"0.Foreword/#foreword","text":"ISO (the International Organization for Standardization) is a worldwide federation of national standards bodies (ISO member bodies). The work of preparing International Standards is normally carried out through ISO technical committees. Each member body interested in a subject for which a technical committee has been established has the right to be represented on that committee. International organizations, governmental and non-governmental, in liaison with ISO, also take part in the work. ISO collaborates closely with the International Electrotechnical Commission (IEC) on all matters of electrotechnical standardization. The procedures used to develop this document and those intended for its further maintenance are described in the ISO/IEC Directives, Part 1. In particular, the different approval criteria needed for the different types of ISO documents should be noted. 
This document was drafted in accordance with the editorial rules of the ISO/IEC Directives, Part 2 (see https://www.iso.org/directives ). Attention is drawn to the possibility that some of the elements of this document may be the subject of patent rights. ISO shall not be held responsible for identifying any or all such patent rights. Details of any patent rights identified during the development of the document will be in the Introduction and/or on the ISO list of patent declarations received (see https://www.iso.org/patents ). Any trade name used in this document is information given for the convenience of users and does not constitute an endorsement. For an explanation of the voluntary nature of standards, the meaning of ISO specific terms and expressions related to conformity assessment, as well as information about ISO's adherence to the World Trade Organization (WTO) principles in the Technical Barriers to Trade (TBT), see https://www.iso.org/iso/foreword.html . This document was prepared by XXX. Any feedback or questions on this document should be directed to the user's national standards body. A complete listing of these bodies can be found at https://www.iso.org/members.html .","title":"Foreword"},{"location":"0.Introduction/","text":"Introduction Modern software relies heavily on open source components that are developed collaboratively in a distributed setting, and that are assembled to create complex systems that evolve at a fast pace. This has strengthened the need to precisely track, ensure availability, and guarantee integrity of the components that go into a given system for a variety of stakeholders. Academia needs to ensure that research results are reproducible, industry needs to improve the traceability of the software supply chain, developer communities need tools to cope with the increasing complexity. 
A key building block for addressing this issue is a system of intrinsic identifiers that makes it possible to pinpoint the exact version of any software artifact, at all levels of granularity, without relying on any central registry or naming authority. With this specification, the SWHID working group makes such a system of intrinsic identifiers, originally developed for the Software Heritage universal source code archive, available to all stakeholders. For the sake of clarity, we will use examples drawn directly from the Software Heritage archive, but notice that systems for persistent archival of software artifacts, as well as resolution of SWHIDs are out of the scope of this specification, and the SWHID specification does not require in any way the use of Software Heritage.","title":"Introduction"},{"location":"0.Introduction/#introduction","text":"Modern software relies heavily on open source components that are developed collaboratively in a distributed setting, and that are assembled to create complex systems that evolve at a fast pace. This has strengthened the need to precisely track, ensure availability, and guarantee integrity of the components that go into a given system for a variety of stakeholders. Academia needs to ensure that research results are reproducible, industry needs to improve the traceability of the software supply chain, developer communities need tools to cope with the increasing complexity. A key building block for addressing this issue is a system of intrinsic identifiers that makes it possible to pinpoint the exact version of any software artifact, at all levels of granularity, without relying on any central registry or naming authority. With this specification, the SWHID working group makes such a system of intrinsic identifiers, originally developed for the Software Heritage universal source code archive, available to all stakeholders. 
For the sake of clarity, we will use examples drawn directly from the Software Heritage archive, but notice that systems for persistent archival of software artifacts, as well as resolution of SWHIDs are out of the scope of this specification, and the SWHID specification does not require in any way the use of Software Heritage.","title":"Introduction"},{"location":"1.Scope/","text":"1 Scope This SoftWare Hash IDentifier (SWHID) specification defines a standard data format for referencing digital artifacts that fit in the data model of modern distributed version control systems. This includes the typical tree-like structure of a filesystem hierarchy, but also special nodes to track revisions and releases, as well as the full status of a version control system, with all its development branches. A key property of SWHIDs is that they can be computed using cryptographically strong functions directly from the digital objects they refer to, by anyone that has access to a copy of them. This enables decentralised and independent verification of integrity, without relying on a registry or a central authority. The computation of the SWHID identifiers is based on Merkle Acyclic Directed Graphs, a natural generalization of Merkle trees. The resolution of SWHIDs, i.e. the process of obtaining a copy of a digital artifact corresponding to a given SWHID, is out of the scope of this specification.","title":"Clause 1: Scope"},{"location":"1.Scope/#1-scope","text":"This SoftWare Hash IDentifier (SWHID) specification defines a standard data format for referencing digital artifacts that fit in the data model of modern distributed version control systems. This includes the typical tree-like structure of a filesystem hierarchy, but also special nodes to track revisions and releases, as well as the full status of a version control system, with all its development branches. 
A key property of SWHIDs is that they can be computed using cryptographically strong functions directly from the digital objects they refer to, by anyone that has access to a copy of them. This enables decentralised and independent verification of integrity, without relying on a registry or a central authority. The computation of the SWHID identifiers is based on Merkle Acyclic Directed Graphs, a natural generalization of Merkle trees. The resolution of SWHIDs, i.e. the process of obtaining a copy of a digital artifact corresponding to a given SWHID, is out of the scope of this specification.","title":"1 Scope"},{"location":"2.Normative_references/","text":"2 Normative references The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies. RFC-3174, US Secure Hash Algorithm 1 (SHA1) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3174 RFC-3986, Uniform Resource Identifier (URI): Generic Syntax , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3986 RFC-3987, Internationalized Resource Identifiers (IRIs) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3987 RFC-5234, Augmented BNF for Syntax Specifications: ABNF , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc5234","title":"Clause 2: Normative references"},{"location":"2.Normative_references/#2-normative-references","text":"The following documents are referred to in the text in such a way that some or all of their content constitutes requirements of this document. For dated references, only the edition cited applies. For undated references, the latest edition of the referenced document (including any amendments) applies. 
RFC-3174, US Secure Hash Algorithm 1 (SHA1) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3174 RFC-3986, Uniform Resource Identifier (URI): Generic Syntax , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3986 RFC-3987, Internationalized Resource Identifiers (IRIs) , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc3987 RFC-5234, Augmented BNF for Syntax Specifications: ABNF , The Internet Society Network Working Group, https://tools.ietf.org/html/rfc5234","title":"2 Normative references"},{"location":"3.Terms_and_definitions/","text":"3 Terms and definitions For the purposes of this document, the following terms and definitions apply. ISO and IEC maintain terminological databases for use in standardization at the following addresses: ISO Online browsing platform: available at https://www.iso.org/obp IEC Electropedia: available at http://www.electropedia.org/ 3.1 branch In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the \"main\" or \"master\" branch). It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch. 3.2 git Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server. 3.3 hierarchical file system A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically (in a structure often visualized as a tree). 
It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure. 3.4 intrinsic identifier An identifier that can be computed directly from the object that it identifies, without needing a registry. Typical examples are cryptographically strong hashes. 3.5 repository In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, documentation, etc. It often includes metadata about the stored items, such as version number, author, date of the last modification, etc. Repositories can be local or remote and are managed by version control systems like Git. 3.6 SHA1 SHA-1 (short for \"Secure Hash Algorithm 1\", also stylized as \" SHA1 \") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value. The returned value is called the SHA1 checksum , or simply SHA1 when there is no risk of ambiguity between the function and the returned value. A detailed description of how to compute SHA1 is available in RFC-3174. In the wake of the Shattered attack of 2017 (see paper: Stevens2017Shattered ), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums. It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: Stevens2013Counter ). As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to prevent such files from being referenced using only SHA1 checksums. 
For the purpose of this specification document, the SHA1 function is therefore considered to be a partial function that returns a value only when a Shattered-style collision is not detectable using the techniques described in Stevens2013Counter and its reference implementation, available at https://github.com/cr-marcstevens/sha1collisiondetection (version stable-v1.0.3 , corresponding to Git commit ID 38096fc021ac5b8f8207c7e926f11feb6b5eb17c ). When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it. Note that in most cases SHA1 checksums in this specification are computed on objects after adding specific headers to them, making \"trivial\" collision-prone files still perfectly valid and hence referenceable using SWHIDs. 3.7 version control system A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and branching and merging of code. Examples include Git, Mercurial, and Subversion.","title":"Clause 3: Terms and definitions"},{"location":"3.Terms_and_definitions/#3-terms-and-definitions","text":"For the purposes of this document, the following terms and definitions apply. ISO and IEC maintain terminological databases for use in standardization at the following addresses: ISO Online browsing platform: available at https://www.iso.org/obp IEC Electropedia: available at http://www.electropedia.org/","title":"3 Terms and definitions"},{"location":"3.Terms_and_definitions/#31-branch","text":"In the context of version control systems, a branch is a parallel line of development that stems from the main line (commonly known as the \"main\" or \"master\" branch). 
It allows developers to isolate their work for a particular feature or bug fix without affecting the main line of development. Once the work is complete and tested, it can be merged back into the main branch.","title":"3.1 branch"},{"location":"3.Terms_and_definitions/#32-git","text":"Git is a distributed version control system created by Linus Torvalds in 2005. It allows teams of programmers to work on the same code base without overwriting each other's changes. Git is known for its speed, data integrity, and support for distributed, non-linear workflows. Each Git directory on every computer is a full-fledged repository with complete history and version tracking abilities, independent of network access or a central server.","title":"3.2 git"},{"location":"3.Terms_and_definitions/#33-hierarchical-file-system","text":"A hierarchical file system is a method of organizing and managing files in a computer where data is stored hierarchically (in a structure often visualized as a tree). It uses directories (or 'folders') to organize files into a tree structure. Each directory can contain more files and directories, thus forming a hierarchical structure.","title":"3.3 hierarchical file system"},{"location":"3.Terms_and_definitions/#34-intrinsic-identifier","text":"An identifier that can be computed directly from the object that it identifies, without needing a registry. Typical examples are cryptographically strong hashes.","title":"3.4 intrinsic identifier"},{"location":"3.Terms_and_definitions/#35-repository","text":"In the context of version control systems, a repository is a storage location for software development artifacts including but not limited to source code, build scripts, documentation, etc. It often includes metadata about the stored items, such as version number, author, date of the last modification, etc. 
Repositories can be local or remote and are managed by version control systems like Git.","title":"3.5 repository"},{"location":"3.Terms_and_definitions/#36-sha1","text":"SHA-1 (short for \"Secure Hash Algorithm 1\", also stylized as \" SHA1 \") is a hash function that takes as input a sequence of bytes and produces a 160-bit (20-byte) hash value. The returned value is called SHA1 checksum , or simply SHA1 when there is no risk of ambiguity between the function and the returned value. A detailed description of how to compute SHA1 is available in RFC-3174. In the wake of the Shattered attack of 2017 (see paper: Stevens2017Shattered ), it is now possible to produce collision-prone files that are different but return the same SHA1 checksums. It is however possible to detect, during SHA1 computation, such SHA1-colliding files using counter-cryptanalysis (see paper: Stevens2013Counter ). As collision-prone files are problematic from the point of view of unequivocal identification and integrity verification, the SWHID standard takes measures to avoid that such files are referenced using only SHA1 checksums. For the purpose of this specification document, the SHA1 function is therefore considered to be a partial function, that only returns a value when a Shattered-style collision is not detectable using the techniques described in Stevens2013Counter and the reference implementation of it available at https://github.com/cr-marcstevens/sha1collisiondetection (version stable-v1.0.3 , corresponding to Git commit ID 38096fc021ac5b8f8207c7e926f11feb6b5eb17c ). When such a collision is detected during SHA1 computation, no SHA1 can be obtained for the object in question and hence, depending on the context, a valid SWHID might not exist for it. 
Note that in most cases SHA1 checksums in this specification are computed on objects after adding specific headers to them, making \"trivial\" collision-prone files still perfectly valid and hence referenceable using SWHIDs.","title":"3.6 SHA1"},{"location":"3.Terms_and_definitions/#37-version-control-system","text":"A version control system (VCS), also known as source control or revision control, is a software tool that helps manage different versions of software development artifacts. It keeps track of all changes made to the code, allows multiple developers to work on the same codebase, and provides mechanisms for merging changes, reverting changes, and branching and merging of code. Examples include Git, Mercurial, and Subversion.","title":"3.7 version control system"},{"location":"4.Syntax/","text":"4 Syntax A SWHID consists of two separate parts, a mandatory core identifier that can identify any software artifact (or \"object\"), and an optional list of qualifiers that allows specification of the context where the object is meant to be seen and points to a subpart of the object itself. 
Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar: <identifier> ::= <core_identifier> [ <qualifiers> ] ; <core_identifier> ::= \"swh\" \":\" <scheme_version> \":\" <object_type> \":\" <object_id> ; <scheme_version> ::= \"1\" ; <object_type> ::= \"snp\" (* snapshot *) | \"rel\" (* release *) | \"rev\" (* revision *) | \"dir\" (* directory *) | \"cnt\" (* content *) ; <object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *) <dec_digit> ::= \"0\" | \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" ; <hex_digit> ::= <dec_digit> | \"a\" | \"b\" | \"c\" | \"d\" | \"e\" | \"f\" ; <qualifiers> ::= \";\" <qualifier> [ <qualifiers> ] ; <qualifier> ::= <context_qualifier> | <fragment_qualifier> ; <context_qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> ; <origin_ctxt> ::= \"origin\" \"=\" <url_escaped> ; <visit_ctxt> ::= \"visit\" \"=\" <core_identifier> ; <anchor_ctxt> ::= \"anchor\" \"=\" <core_identifier> ; <path_ctxt> ::= \"path\" \"=\" <path_absolute_escaped> ; <fragment_qualifier> ::= \"lines\" \"=\" <range> | \"bytes\" \"=\" <range> ; <range> ::= <number> [\"-\" <number>] ; <number> ::= <dec_digit> + ; <url_escaped> ::= (* RFC 3987 IRI *) <path_absolute_escaped> ::= (* RFC 3987 absolute path *) The last two symbols are defined as: <path_absolute_escaped> is an ipath-absolute from RFC-3987; and <url_escaped> is an IRI as defined in RFC-3987. In both of these, all occurrences of ; (and % , as required by the RFC) have been percent-encoded (as %3B and %25 respectively). 
Other characters may be percent-encoded, e.g., to improve readability and/or embeddability of SWHID in other contexts.","title":"Clause 4: Syntax"},{"location":"4.Syntax/#4-syntax","text":"A SWHID consists of two separate parts, a mandatory core identifier that can identify any software artifact (or \"object\"), and an optional list of qualifiers that allows specification of the context where the object is meant to be seen and points to a subpart of the object itself. Syntactically, SWHIDs are generated by the <identifier> entry point in the following grammar: <identifier> ::= <core_identifier> [ <qualifiers> ] ; <core_identifier> ::= \"swh\" \":\" <scheme_version> \":\" <object_type> \":\" <object_id> ; <scheme_version> ::= \"1\" ; <object_type> ::= \"snp\" (* snapshot *) | \"rel\" (* release *) | \"rev\" (* revision *) | \"dir\" (* directory *) | \"cnt\" (* content *) ; <object_id> ::= 40 * <hex_digit> ; (* intrinsic object id, as hex-encoded SHA1 *) <dec_digit> ::= \"0\" | \"1\" | \"2\" | \"3\" | \"4\" | \"5\" | \"6\" | \"7\" | \"8\" | \"9\" ; <hex_digit> ::= <dec_digit> | \"a\" | \"b\" | \"c\" | \"d\" | \"e\" | \"f\" ; <qualifiers> ::= \";\" <qualifier> [ <qualifiers> ] ; <qualifier> ::= <context_qualifier> | <fragment_qualifier> ; <context_qualifier> ::= <origin_ctxt> | <visit_ctxt> | <anchor_ctxt> | <path_ctxt> ; <origin_ctxt> ::= \"origin\" \"=\" <url_escaped> ; <visit_ctxt> ::= \"visit\" \"=\" <core_identifier> ; <anchor_ctxt> ::= \"anchor\" \"=\" <core_identifier> ; <path_ctxt> ::= \"path\" \"=\" <path_absolute_escaped> ; <fragment_qualifier> ::= \"lines\" \"=\" <range> | \"bytes\" \"=\" <range> ; <range> ::= <number> [\"-\" <number>] ; <number> ::= <dec_digit> + ; <url_escaped> ::= (* RFC 3987 IRI *) <path_absolute_escaped> ::= (* RFC 3987 absolute path *) The last two symbols are defined as: <path_absolute_escaped> is an ipath-absolute from RFC-3987; and <url_escaped> is an IRI as defined in RFC-3987. 
In both of these, all occurrences of ; (and % , as required by the RFC) have been percent-encoded (as %3B and %25 respectively). Other characters may be percent-encoded, e.g., to improve readability and/or embeddability of SWHID in other contexts.","title":"4 Syntax"},{"location":"5.Core_identifiers/","text":"5 Core identifiers A core SWHID identifier is composed of four fields, separated by a colon : . The first field is the type of the identifier and it is defined to be swh . The second field is the version of the identifier scheme and for this version of the specification it is defined to be 1 . The third field is a tag corresponding to the type of object identified: cnt for contents (see 5.1) dir for directories (see 5.2) rev for revisions (see 5.3) rel for releases (see 5.4) snp for snapshots (see 5.5) The fourth field is the intrinsic identifier of the object. This is a hex-encoded (using lowercase ASCII characters) hash value computed from the content and relevant metadata of the object. 5.1 Contents A content is an uninterpreted byte sequence, typically, the content of a file. For this type of object the intrinsic identifier is the sha1_git hash of it, i.e. the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"blob\" (4 bytes), an ASCII space, the length of the content as ASCII-encoded decimal digits, a NULL byte, and the actual content of the file. No metadata is used for this type of object (in particular, notice that there is no file name mentioned here). As an example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 is the SWHID computed from the full text of the GPL3 license. 5.2 Directories Directories are data structures commonly used in hierarchical file systems to group together files and other directories, and to hold relevant metadata about them, in the form of directory entries. 
This version of the SWHID standard adopts the same convention as the popular git version control system, and only takes into account as metadata the name of the directory entries (as a sequence of arbitrary bytes, excluding ASCII '/' and the NULL byte) and a simplified representation of the access rights. The names of entries in a directory must be distinct from one another. In order to compute the intrinsic identifier of a directory, it is necessary to compute first the SWHID of each object listed in the directory. Then one proceeds to create a serialization of the directory as follows: sort the directory entries using the following algorithm for each entry pointing to a directory , append an ASCII '/' to its name sort all entries using the natural byte order of their (modified) name for each entry, with a given name (unmodified), add a sequence of bytes composed of the normalized access rights, encoded as a sequence of ASCII-encoded octal digits ('100644' for regular files, '100755' for executable files, '120000' for symbolic links, '40000' for directories), an ASCII space, the name as a raw string of bytes, a NULL byte, the intrinsic identifier of the content or directory , encoded as a sequence of 20 bytes. The intrinsic identifier of the directory is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tree\" (4 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 is the SWHID computed from a directory containing the source code of the darktable photography application at a given point in time of its development on May 4th 2017. 5.3 Revisions Software development within a specific project is essentially a time-indexed series of copies of a single \u201croot\u201d directory that contains the entire project source code. 
Software evolves when a developer modifies the content of one or more files in that directory and records their changes. Each recorded copy of the root directory is known as a \u201crevision\u201d. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc). The supported metadata is as follows: author (arbitrary byte sequence, mandatory): generally contains the name and email address of the author of the revision. author timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was authored. author timezone offset (arbitrary byte sequence): UTC offset at which the revision was authored, usually an ASCII-encoded [+/-]HHMM specification. committer (arbitrary byte sequence, mandatory): generally contains the name and email address of the committer of the revision. committer timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was committed. committer timezone offset (arbitrary byte sequence): UTC offset at which the revision was committed, usually an ASCII-encoded [+/-]HHMM specification. directory (mandatory): the root directory recorded by the revision parent revisions (ordered list of revisions): the immediately preceding revisions in the development timeline. Can be empty for an initial revision, and have multiple revisions when multiple branches of history are being merged. extra headers (ordered list of byte key/value pairs): arbitrary additional metadata attached to the revision. 
The key must not contain the ASCII bytes for the space or LF characters; commonly used keys are a string of non-whitespace printable ASCII characters, such as \"encoding\" (where the value is interpreted as the encoding of the message field) or \"gpgsig\" (where the value is interpreted as an OpenPGP signature of the metadata of the revision). message: the message describing the revision In order to compute the intrinsic identifier of a revision, it is necessary to first compute the intrinsic identifier of the root directory recorded by the revision, as well as the intrinsic identifier of all parent revisions (recursively). The serialization of the revision is a sequence of lines in the following order: the reference to the root directory: the ASCII string \"tree\" (4 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the directory (40 ASCII bytes), a LF; for each parent revision, in the order they've been provided, a reference to that revision: the ASCII string \"parent\" (6 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the parent revision (40 ASCII bytes), a LF; the author line: the ASCII string \"author\" (6 bytes), an ASCII space, the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the author timestamp, an ASCII space, the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the committer line: the ASCII string \"committer\" (9 bytes), an ASCII space, the string of bytes provided for the committer name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the committer timestamp, an ASCII space, the string of bytes provided for the committer timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the extra header lines; 
for each provided key/value pair, in the order they have been provided: the key, an ASCII space, the value, with each LF replaced by LF followed by an ASCII space, a LF; if the message is defined: an extra LF (the message is separated from the header with two LFs), the commit message as a raw string of bytes. The intrinsic identifier of the revision is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"commit\" (6 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d is the SWHID computed from a commit in the development history of Darktable , dated 16 January 2017, that added undo/redo support for masks. 5.4 Releases Some revisions get selected by developers as denoting important project milestones known as \u201creleases\u201d. Each release points to the last commit in project history corresponding to the release and carries metadata: release name and version, release message, cryptographic signatures, etc. If they're not attached to development history (e.g. if they've been imported from bare tarballs), releases can also point directly to a root directory instead of a full revision with metadata. The supported metadata is as follows: - name (arbitrary byte sequence, mandatory): a name identifying the release - author (arbitrary byte sequence): generally contains the name and email address of the author of the release. - author timestamp (decimal timestamp from the Unix epoch): the date at which the release was authored. - author timezone offset (arbitrary byte sequence): UTC offset at which the release was authored, usually an ASCII-encoded [+/-]HHMM specification. 
- target object (mandatory): a reference to another object, which can be either a revision, a directory or less commonly a content or another release - message: the message describing the release In order to compute the intrinsic identifier of a release, it is necessary to first compute the intrinsic identifier of the targeted object. The serialization of the release is a sequence of lines in the following order: the reference to the target object: the ASCII string \"object\" (6 bytes) an ASCII space the ASCII-encoded hexadecimal intrinsic identifier of the target object (40 ASCII bytes) a LF the ASCII string \"type\" (4 bytes) an ASCII space an ASCII string referencing the type of the target object ( \"commit\" for a revision, \"tree\" for a directory, \"tag\" for another release, \"blob\" for a content object) a LF the name of the release: the ASCII string \"tag\" (3 bytes) an ASCII space the string of bytes provided for the release name, with each LF replaced by LF followed by an ASCII space a LF if there is an author, the author line: the ASCII string \"tagger\" (6 bytes) an ASCII space the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space an ASCII space the ASCII-encoded decimal representation of the author timestamp an ASCII space the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space a LF if the message is defined: an extra LF (the message is separated from the header with two LFs) the commit message as a raw string of bytes The intrinsic identifier of the release is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tag\" (3 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. 
As an example, swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f is the SWHID computed from the Darktable release 2.3.0 , dated 24 December 2016. 5.5 Snapshots Any kind of software origin offers multiple pointers to the \u201ccurrent\u201d state of a development project. In the case of VCS this is reflected by branches (e.g., master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (e.g., stable, development, etc.). A \u201csnapshot\u201d of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a FOSS distribution. Practically, a snapshot is a list of named branches pointing at objects of any of the known types (content, directory, revision, release or snapshot). A branch can also be an alias to another (named) branch, for instance the default \"HEAD\" branch can point at another, more specific, \"refs/heads/main\" branch. To compute the intrinsic identifier of a snapshot, one must first compute the intrinsic identifier of all objects referenced by the snapshot. 
Then one proceeds to create a serialization of the snapshot as follows: sort the snapshot branches using the natural byte order of their name for each branch, with a given name , add a sequence of bytes composed of the type of the branch target: \"content\" , \"directory\" , \"revision\" , \"release\" or \"snapshot\" for each corresponding object type \"alias\" for branches referencing another branch; an ASCII space the branch name (as raw bytes) a NULL byte the length of the target identifier, as an ascii-encoded decimal number ( \"20\" for intrinsic identifiers, the length of the name of the target branch for branch aliases) an ASCII colon ( \":\" ) the identifier of the target object pointed at by the branch: for contents, directories, revisions, releases or snapshots: their intrinsic identifier as a string of 20 bytes for branch aliases, the name of the target branch (as a string of bytes) for dangling branches, the empty string Note that, akin to the serialization of directories, there is no separator between entries. Because of alias branches, target identifiers are of arbitrary length and are length-encoded to avoid ambiguity. The intrinsic identifier of the snapshot is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"snapshot\" (8 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 is the SWHID computed from a snapshot of the entire Darktable Git repository as it was on 4 May 2017 on GitHub. Note on compatibility with Git SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the way the current version of Git proceeds for computing identifiers for its objects. 
The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type. Git compatibility is practical, but incidental and is not guaranteed to be maintained in future versions of this standard, nor for different versions of Git.","title":"Clause 5: Core Identifiers"},{"location":"5.Core_identifiers/#5-core-identifiers","text":"A core SWHID identifier is composed of four fields, separated by a colon : . The first field is the type of the identifier and it is defined to be swh . The second field is the version of the identifier scheme and for this version of the specification it is defined to be 1 . The third field is a tag corresponding to the type of object identified: cnt for contents (see 5.1) dir for directories (see 5.2) rev for revisions (see 5.3) rel for releases (see 5.4) snp for snapshots (see 5.5) The fourth field is the intrinsic identifier of the object. This is a hex-encoded (using lowercase ASCII characters) hash value computed from the content and relevant metadata of the object.","title":"5 Core identifiers"},{"location":"5.Core_identifiers/#51-contents","text":"A content is an uninterpreted byte sequence, typically, the content of a file. For this type of object the intrinsic identifier is the sha1_git hash of it, i.e. the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"blob\" (4 bytes), an ASCII space, the length of the content as ASCII-encoded decimal digits, a NULL byte, and the actual content of the file. No metadata is used for this type of object (in particular, notice that there is no file name mentioned here). 
As an example, swh:1:cnt:94a9ed024d3859793618152ea559a168bbcbb5e2 is the SWHID computed from the full text of the GPL3 license.","title":"5.1 Contents"},{"location":"5.Core_identifiers/#52-directories","text":"Directories are data structures commonly used in hierarchical file systems to group together files and other directories, and to hold relevant metadata about them, in the form of directory entries. This version of the SWHID standard adopts the same convention as the popular git version control system, and only takes into account as metadata the name of the directory entries (as a sequence of arbitrary bytes, excluding ASCII '/' and the NULL byte) and a simplified representation of the access rights. The names of entries in a directory must be distinct from one another. In order to compute the intrinsic identifier of a directory, it is necessary to compute first the SWHID of each object listed in the directory. Then one proceeds to create a serialization of the directory as follows: sort the directory entries using the following algorithm for each entry pointing to a directory , append an ASCII '/' to its name sort all entries using the natural byte order of their (modified) name for each entry, with a given name (unmodified), add a sequence of bytes composed of the normalized access rights, encoded as a sequence of ASCII-encoded octal digits ('100644' for regular files, '100755' for executable files, '120000' for symbolic links, '40000' for directories), an ASCII space, the name as a raw string of bytes, a NULL byte, the intrinsic identifier of the content or directory , encoded as a sequence of 20 bytes. The intrinsic identifier of the directory is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tree\" (4 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. 
As an example, swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505 is the SWHID computed from a directory containing the source code of the darktable photography application at a given point in time of its development on May 4th 2017.","title":"5.2 Directories"},{"location":"5.Core_identifiers/#53-revisions","text":"Software development within a specific project is essentially a time-indexed series of copies of a single \u201croot\u201d directory that contains the entire project source code. Software evolves when a developer modifies the content of one or more files in that directory and records their changes. Each recorded copy of the root directory is known as a \u201crevision\u201d. It points to a single fully-determined directory and is equipped with arbitrary metadata. Some of those are added manually by the developer (e.g., revision message), others are automatically synthesized (timestamps, parent revision(s), etc). The supported metadata is as follows: author (arbitrary byte sequence, mandatory): generally contains the name and email address of the author of the revision. author timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was authored. author timezone offset (arbitrary byte sequence): UTC offset at which the revision was authored, usually an ASCII-encoded [+/-]HHMM specification. committer (arbitrary byte sequence, mandatory): generally contains the name and email address of the committer of the revision. committer timestamp (decimal timestamp from the Unix epoch, mandatory): the date at which the revision was committed. committer timezone offset (arbitrary byte sequence): UTC offset at which the revision was committed, usually an ASCII-encoded [+/-]HHMM specification. directory (mandatory): the root directory recorded by the revision parent revisions (ordered list of revisions): the immediately preceding revisions in the development timeline. 
Can be empty for an initial revision, and have multiple revisions when multiple branches of history are being merged. extra headers (ordered list of byte key/value pairs): arbitrary additional metadata attached to the revision. The key must not contain the ASCII bytes for the space or LF characters; commonly used keys are a string of non-whitespace printable ASCII characters, such as \"encoding\" (where the value is interpreted as the encoding of the message field) or \"gpgsig\" (where the value is interpreted as an OpenPGP signature of the metadata of the revision). message: the message describing the revision In order to compute the intrinsic identifier of a revision, it is necessary to first compute the intrinsic identifier of the root directory recorded by the revision, as well as the intrinsic identifier of all parent revisions (recursively). The serialization of the revision is a sequence of lines in the following order: the reference to the root directory: the ASCII string \"tree\" (4 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the directory (40 ASCII bytes), a LF; for each parent revision, in the order they've been provided, a reference to that revision: the ASCII string \"parent\" (6 bytes), an ASCII space, the ASCII-encoded hexadecimal intrinsic identifier of the parent revision (40 ASCII bytes), a LF; the author line: the ASCII string \"author\" (6 bytes), an ASCII space, the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the ASCII-encoded decimal representation of the author timestamp, an ASCII space, the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the committer line: the ASCII string \"committer\" (9 bytes), an ASCII space, the string of bytes provided for the committer name and email, with each LF replaced by LF followed by an ASCII space, an ASCII space, the 
ASCII-encoded decimal representation of the committer timestamp, an ASCII space, the string of bytes provided for the committer timezone offset, with each LF replaced by LF followed by an ASCII space, a LF; the extra header lines; for each provided key/value pair, in the order they have been provided: the key, an ASCII space, the value, with each LF replaced by LF followed by an ASCII space, a LF; if the message is defined: an extra LF (the message is separated from the header with two LFs), the commit message as a raw string of bytes. The intrinsic identifier of the revision is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"commit\" (6 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:rev:309cf2674ee7a0749978cf8265ab91a60aea0f7d is the SWHID computed from a commit in the development history of Darktable , dated 16 January 2017, that added undo/redo support for masks.","title":"5.3 Revisions"},{"location":"5.Core_identifiers/#54-releases","text":"Some revisions get selected by developers as denoting important project milestones known as \u201creleases\u201d. Each release points to the last commit in project history corresponding to the release and carries metadata: release name and version, release message, cryptographic signatures, etc. If they're not attached to development history (e.g. if they've been imported from bare tarballs), releases can also point directly to a root directory instead of a full revision with metadata. The supported metadata is as follows: - name (arbitrary byte sequence, mandatory): a name identifying the release - author (arbitrary byte sequence): generally contains the name and email address of the author of the release. - author timestamp (decimal timestamp from the Unix epoch): the date at which the release was authored. 
- author timezone offset (arbitrary byte sequence): UTC offset at which the release was authored, usually an ASCII-encoded [+/-]HHMM specification. - target object (mandatory): a reference to another object, which can be either a revision, a directory or less commonly a content or another release - message: the message describing the release In order to compute the intrinsic identifier of a release, it is necessary to first compute the intrinsic identifier of the targeted object. The serialization of the release is a sequence of lines in the following order: the reference to the target object: the ASCII string \"object\" (6 bytes) an ASCII space the ASCII-encoded hexadecimal intrinsic identifier of the target object (40 ASCII bytes) a LF the ASCII string \"type\" (4 bytes) an ASCII space an ASCII string referencing the type of the target object ( \"commit\" for a revision, \"tree\" for a directory, \"tag\" for another release, \"blob\" for a content object) a LF the name of the release: the ASCII string \"tag\" (3 bytes) an ASCII space the string of bytes provided for the release name, with each LF replaced by LF followed by an ASCII space a LF if there is an author, the author line: the ASCII string \"tagger\" (6 bytes) an ASCII space the string of bytes provided for the author name and email, with each LF replaced by LF followed by an ASCII space an ASCII space the ASCII-encoded decimal representation of the author timestamp an ASCII space the string of bytes provided for the author timezone offset, with each LF replaced by LF followed by an ASCII space a LF if the message is defined: an extra LF (the message is separated from the header with two LFs) the commit message as a raw string of bytes The intrinsic identifier of the release is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"tag\" (3 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously 
obtained serialization. As an example, swh:1:rel:22ece559cc7cc2364edc5e5593d63ae8bd229f9f is the SWHID computed from the Darktable release 2.3.0 , dated 24 December 2016.","title":"5.4 Releases"},{"location":"5.Core_identifiers/#55-snapshots","text":"Any kind of software origin offers multiple pointers to the \u201ccurrent\u201d state of a development project. In the case of VCS this is reflected by branches (e.g., master, development, but also so called feature branches dedicated to extending the software in a specific direction); in the case of package distributions by notions such as suites that correspond to different maturity levels of individual packages (e.g., stable, development, etc.). A \u201csnapshot\u201d of a given software origin records all entry points found there and where each of them was pointing at the time. For example, a snapshot object might track the commit where the master branch was pointing to at any given time, as well as the most recent release of a given package in the stable suite of a FOSS distribution. Practically, a snapshot is a list of named branches pointing at objects of any of the known types (content, directory, revision, release or snapshot). A branch can also be an alias to another (named) branch, for instance the default \"HEAD\" branch can point at another, more specific, \"refs/heads/main\" branch. To compute the intrinsic identifier of a snapshot, one must first compute the intrinsic identifier of all objects referenced by the snapshot. 
Then one proceeds to create a serialization of the snapshot as follows: sort the snapshot branches using the natural byte order of their name for each branch, with a given name , add a sequence of bytes composed of the type of the branch target: \"content\" , \"directory\" , \"revision\" , \"release\" or \"snapshot\" for each corresponding object type \"alias\" for branches referencing another branch; an ASCII space the branch name (as raw bytes) a NULL byte the length of the target identifier, as an ascii-encoded decimal number ( \"20\" for intrinsic identifiers, the length of the name of the target branch for branch aliases) an ASCII colon ( \":\" ) the identifier of the target object pointed at by the branch: for contents, directories, revisions, releases or snapshots: their intrinsic identifier as a string of 20 bytes for branch aliases, the name of the target branch (as a string of bytes) for dangling branches, the empty string Note that, akin to the serialization of directories, there is no separator between entries. Because of alias branches, target identifiers are of arbitrary length and are length-encoded to avoid ambiguity. The intrinsic identifier of the snapshot is the SHA1 of the byte sequence obtained by juxtaposing the ASCII string \"snapshot\" (8 bytes), an ASCII space, the length of the previously obtained serialization as ASCII-encoded decimal digits, a NULL byte, and the previously obtained serialization. As an example, swh:1:snp:c7c108084bc0bf3d81436bf980b46e98bd338453 is the SWHID computed from a snapshot of the entire Darktable Git repository as it was on 4 May 2017 on GitHub.","title":"5.5 Snapshots"},{"location":"5.Core_identifiers/#note-on-compatibility-with-git","text":"SWHIDs for contents, directories, revisions, and releases are, at present, compatible with the way the current version of Git proceeds for computing identifiers for its objects. 
The <object_id> part of a SWHID for a content object is the Git blob identifier of any file with the same content; for a revision it is the Git commit identifier for the same revision, etc. This is not the case for snapshot identifiers, as Git does not have a corresponding object type. Git compatibility is practical, but incidental and is not guaranteed to be maintained in future versions of this standard, nor for different versions of Git.","title":"Note on compatibility with Git"},{"location":"6.Qualified_identifiers/","text":"6 Qualified identifiers Qualifiers A qualified, or full, SWHID is composed of a core SWHID identifier, and a sequence of qualifiers. Qualifiers may be: fragment qualifiers (see 6.1), that identify subparts of a software artifact; or context qualifiers (see 6.2), that provide additional context on the software artifact. Each qualifier is specified as a key-value pair, using an = character as a separator. Qualifiers are separated from the core identifier and from each other by using a ; character. Some qualifiers are valid for specific object types, and the validity of some qualifiers depends on the presence of other qualifiers. Conformant implementations MUST NOT generate invalid qualifiers or qualifier combinations and MUST ignore them if present, as detailed in the following sections. 6.1 Fragment qualifiers There are two fragment qualifiers, lines and bytes . Each fragment qualifier MUST appear at most once. Fragment qualifiers are only valid for objects of type content. Each valid SWHID must have at most one fragment qualifier. A conformant implementation MAY accept a SWHID that violates this constraint, by ignoring the lines qualifier when the bytes qualifier is present. 6.1.1 Lines qualifier A \"line\" in the context of a file content refers to a sequence of characters that ends with a line break. This line can contain text, code, or any other form of data. In this specification, the line break is the ASCII LF character. 
The \"lines\" qualifier allows designation of a line range inside a content. The range can be a single line number, or a pair of line numbers separated by the ASCII - character. Line numbers start from 1, and the range is inclusive, i.e. the fragment includes both the lines numbered as the start and end of the range. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15 designates the function generate_input_stream that is found at lines 9 to 15 of the content with core SWHID swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b . Notice that the notion of \"line number\" is not always meaningful: the content may be a binary file, or a file that uses non-standard line termination character(s). 6.1.2 Bytes qualifier To overcome the limitations of the lines qualifier, the bytes qualifier allows designation of a byte range inside a content. The range can be a single byte number, or a pair of byte numbers separated by - . Byte numbers start from 0, and the range is inclusive, i.e. the fragment includes both the bytes numbered as the start and end of the range. If the range is a single byte number, it designates the byte at that specific position. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;bytes=154-315 designates the same function generate_input_stream as in the example above, but does not rely on any convention about line numbers. 6.2 Context qualifiers There are four context qualifiers, origin , visit , path and anchor . Each context qualifier MUST appear at most once. 6.2.1 Origin qualifier This qualifier allows declaration of the software origin where the object has been found or observed, as a URI. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git This qualifier may be helpful to get hold of the full repository where a content has been found, but there is no guarantee of success, as an origin can change or disappear over time (as is the case in the example above, since gitorious.org was shut down in 2015). 6.2.2 Visit qualifier This qualifier allows addition of the core SWHID identifier of the snapshot of the repository where the object has been found or observed. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 . This qualifier is only valid when the origin qualifier is also present. Otherwise, it MUST be ignored. 6.2.3 Path qualifier This qualifier allows declaration of the absolute file path , from the root directory associated to the anchor node , to the object designated by the core SWHID identifier; when the anchor denotes a directory, a revision or a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the first directory reachable from the HEAD branch, and undefined if such a reference is missing. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 , and that it is named simplefarm.ml in the directory Simplefarm contained in the directory Examples contained in the root directory associated to the revision with core SWHID swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 . This qualifier is only valid when the object type is not content. Otherwise, it MUST be ignored. 6.2.4 Anchor qualifier This qualifier is used in conjunction with the path qualifier. It allows identification of a node in the Merkle DAG relative to which a path to the object is specified, as the core identifier of a directory, a revision, a release or a snapshot. See the example provided for the path qualifier. This qualifier is only valid when the path qualifier is also present. Otherwise, it MUST be ignored. 6.3 Comparing qualified SWHIDs One can determine whether two software artifacts are identical (bit by bit) by comparing their core SWHIDs, ignoring all qualifiers. If the core SWHIDs are equal, the software artifacts they represent are identical. To determine if two SWHIDs represent the same software artifact (or fragment thereof) in the same context, one must also compare their qualifiers. Two SWHIDs are considered equivalent in context if: They both have the same set of qualifiers. The values of these qualifiers are identical. For instance, if both SWHIDs have an anchor qualifier, the core SWHID values of these qualifiers are identical. 
Similarly, if both have a lines qualifier, their values are identical. Note that the order of the qualifiers does not matter for comparison purposes. 6.4 Recommendations We recommend equipping identifiers meant for sharing with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the following canonical order: origin , visit , anchor , path , lines or bytes . By adhering to this order, it becomes easier to visually inspect and compare SWHIDs, especially when dealing with a large number of identifiers. Here is an example: swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15 Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.","title":"Clause 6: Qualified Identifiers"},{"location":"6.Qualified_identifiers/#6-qualified-identifiers","text":"","title":"6 Qualified identifiers"},{"location":"6.Qualified_identifiers/#qualifiers","text":"A qualified, or full, SWHID is composed of a core SWHID identifier, and a sequence of qualifiers. Qualifiers may be: fragment qualifiers (see 6.1), that identify subparts of a software artifact; or context qualifiers (see 6.2), that provide additional context on the software artifact. Each qualifier is specified as a key-value pair, using an = character as a separator. Qualifiers are separated from the core identifier and from each other by using a ; character. Some qualifiers are valid for specific object types, and the validity of some qualifiers depends on the presence of other qualifiers. 
Conformant implementations MUST NOT generate invalid qualifiers or qualifier combinations and MUST ignore them if present, as detailed in the following sections.","title":"Qualifiers"},{"location":"6.Qualified_identifiers/#61-fragment-qualifiers","text":"There are two fragment qualifiers, lines and bytes . Each fragment qualifier MUST appear at most once. Fragment qualifiers are only valid for objects of type content. Each valid SWHID must have at most one fragment qualifier. A conformant implementation MAY accept a SWHID that violates this constraint, by ignoring the lines qualifier when the bytes qualifier is present.","title":"6.1 Fragment qualifiers"},{"location":"6.Qualified_identifiers/#611-lines-qualifier","text":"A \"line\" in the context of a file content refers to a sequence of characters that ends with a line break. This line can contain text, code, or any other form of data. In this specification, the line break is the ASCII LF character. The \"lines\" qualifier allows designation of a line range inside a content. The range can be a single line number, or a pair of line numbers separated by the ASCII - character. Line numbers start from 1, and the range is inclusive, i.e. the fragment includes both the lines numbered as the start and end of the range. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;lines=9-15 designates the function generate_input_stream that is found at lines 9 to 15 of the content with core SWHID swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b . Notice that the notion of \"line number\" is not always meaningful: the content may be a binary file, or a file that uses non-standard line termination character(s).","title":"6.1.1 Lines qualifier"},{"location":"6.Qualified_identifiers/#612-bytes-qualifier","text":"To overcome the limitations of the lines qualifier, the bytes qualifier allows designation of a byte range inside a content. The range can be a single byte number, or a pair of byte numbers separated by - . 
Byte numbers start from 0, and the range is inclusive, i.e. the fragment includes both the bytes numbered as the start and end of the range. If the range is a single byte number, it designates the byte at that specific position. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;bytes=154-315 designates the same function generate_input_stream as in the example above, but does not rely on any convention about line numbers.","title":"6.1.2 Bytes qualifier"},{"location":"6.Qualified_identifiers/#62-context-qualifiers","text":"There are four context qualifiers, origin , visit , path and anchor . Each context qualifier MUST appear at most once.","title":"6.2 Context qualifiers"},{"location":"6.Qualified_identifiers/#621-origin-qualifier","text":"This qualifier allows declaration of the software origin where the object has been found or observed, as a URI. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git This qualifier may be helpful to get hold of the full repository where a content has been found, but there is no guarantee of success, as an origin can change or disappear over time (as is the case in the example above, since gitorious.org was shut down in 2015).","title":"6.2.1 Origin qualifier"},{"location":"6.Qualified_identifiers/#622-visit-qualifier","text":"This qualifier allows addition of the core SWHID identifier of the snapshot of the repository where the object has been found or observed. 
For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 . This qualifier is only valid when the origin qualifier is also present. Otherwise, it MUST be ignored.","title":"6.2.2 Visit qualifier"},{"location":"6.Qualified_identifiers/#623-path-qualifier","text":"This qualifier allows declaration of the absolute file path , from the root directory associated to the anchor node , to the object designated by the core SWHID identifier; when the anchor denotes a directory, a revision or a release, the root directory is uniquely determined; when the anchor denotes a snapshot, the root directory is the first directory reachable from the HEAD branch, and undefined if such a reference is missing. For example, swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml indicates that the content seen previously with the function generate_input_stream has been seen in the Git repository at https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git , when its full state had the SWHID core identifier swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9 , and that it is named simplefarm.ml in the directory Simplefarm contained in the directory Examples contained in the root directory associated to the revision with core SWHID swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0 . This qualifier is only valid when the object type is not content. 
Otherwise, it MUST be ignored.","title":"6.2.3 Path qualifier"},{"location":"6.Qualified_identifiers/#624-anchor-qualifier","text":"This qualifier is used in conjunction with the path qualifier. It allows identification of a node in the Merkle DAG relative to which a path to the object is specified, as the core identifier of a directory, a revision, a release or a snapshot. See the example provided for the path qualifier. This qualifier is only valid when the path qualifier is also present. Otherwise, it MUST be ignored.","title":"6.2.4 Anchor qualifier"},{"location":"6.Qualified_identifiers/#63-comparing-qualified-swhids","text":"One can determine whether two software artifacts are identical (bit by bit) by comparing their core SWHIDs, ignoring all qualifiers. If the core SWHIDs are equal, the software artifacts they represent are identical. To determine if two SWHIDs represent the same software artifact (or fragment thereof) in the same context, one must also compare their qualifiers. Two SWHIDs are considered equivalent in context if: They both have the same set of qualifiers. The values of these qualifiers are identical. For instance, if both SWHIDs have an anchor qualifier, the core SWHID values of these qualifiers are identical. Similarly, if both have a lines qualifier, their values are identical. Note that the order of the qualifiers does not matter for comparison purposes.","title":"6.3 Comparing qualified SWHIDs"},{"location":"6.Qualified_identifiers/#64-recommendations","text":"We recommend equipping identifiers meant for sharing with as many qualifiers as possible. While qualifiers may be listed in any order, it is good practice to present them in the following canonical order: origin , visit , anchor , path , lines or bytes . By adhering to this order, it becomes easier to visually inspect and compare SWHIDs, especially when dealing with a large number of identifiers. 
Here is an example: swh:1:cnt:4d99d2d18326621ccdd70f5ea66c2e2ac236ad8b;origin=https://gitorious.org/ocamlp3l/ocamlp3l_cvs.git;visit=swh:1:snp:d7f1b9eb7ccb596c2622c4780febaa02549830f9;anchor=swh:1:rev:2db189928c94d62a3b4757b3eec68f0a4d4113f0;path=/Examples/SimpleFarm/simplefarm.ml;lines=9-15 Redundant information should be omitted: for example, if the visit is present, and the path is relative to the snapshot indicated there, then the anchor qualifier is superfluous; similarly, if the path is empty, it may be omitted.","title":"6.4 Recommendations"},{"location":"A.Conformance/","text":"Annex A Conformance (Informative) A.1 Current and Previous Versions This edition has the version number 1.1 as part of its title. Version 1.0 was the first edition of the SWHID Specification as a Publicly Available Standard, and earlier editions of the specification were published by the Software Heritage. Differences between this edition and earlier ones are reported in the text; see also [1] . A.2 Obsolete features Over the life of a standard, some older approaches can become obsolete and are dropped from subsequent editions, possibly with a replacement approach being provided. Such action involves deprecating those outdated features. This edition identifies all currently deprecated features.","title":"Annex A: Conformance"},{"location":"A.Conformance/#annex-a-conformance-informative","text":"","title":"Annex A Conformance (Informative)"},{"location":"A.Conformance/#a1-current-and-previous-versions","text":"This edition has the version number 1.1 as part of its title. Version 1.0 was the first edition of the SWHID Specification as a Publicly Available Standard, and earlier editions of the specification were published by the Software Heritage. 
Differences between this edition and earlier ones are reported in the text; see also [1] .","title":"A.1 Current and Previous Versions"},{"location":"A.Conformance/#a2-obsolete-features","text":"Over the life of a standard, some older approaches can become obsolete and are dropped from subsequent editions, possibly with a replacement approach being provided. Such action involves deprecating those outdated features. This edition identifies all currently deprecated features.","title":"A.2 Obsolete features"},{"location":"B.Bibliography/","text":"Annex B Bibliography (Informative) The following documents are useful references for implementers and users of this document: [1] SoftWare Heritage persistent IDentifiers ; SoftWare Heritage, https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html [Stevens2013Counter] Marc Stevens. Counter-cryptanalysis. In Advances in Cryptology, CRYPTO 2013: 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (pp. 129-146). Springer Berlin Heidelberg. Open access preprint: https://eprint.iacr.org/2013/358 [Stevens2017Shattered] Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov. The First Collision for Full SHA-1. In Advances in Cryptology, CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20\u201324, 2017, Proceedings, Part I 37 (pp. 570-596). Springer International Publishing. Open access preprint: https://eprint.iacr.org/2017/190","title":"Annex B: Bibliography"},{"location":"B.Bibliography/#annex-b-bibliography-informative","text":"The following documents are useful references for implementers and users of this document: [1] SoftWare Heritage persistent IDentifiers ; SoftWare Heritage, https://docs.softwareheritage.org/devel/swh-model/persistent-identifiers.html [Stevens2013Counter] Marc Stevens. Counter-cryptanalysis. 
In Advances in Cryptology, CRYPTO 2013: 33rd Annual Cryptology Conference, Santa Barbara, CA, USA, August 18-22, 2013. Proceedings, Part I (pp. 129-146). Springer Berlin Heidelberg. Open access preprint: https://eprint.iacr.org/2013/358 [Stevens2017Shattered] Marc Stevens, Elie Bursztein, Pierre Karpman, Ange Albertini, Yarik Markov. The First Collision for Full SHA-1. In Advances in Cryptology, CRYPTO 2017: 37th Annual International Cryptology Conference, Santa Barbara, CA, USA, August 20\u201324, 2017, Proceedings, Part I 37 (pp. 570-596). Springer International Publishing. Open access preprint: https://eprint.iacr.org/2017/190","title":"Annex B Bibliography (Informative)"}]}
\ No newline at end of file
diff --git a/dev/sitemap.xml.gz b/dev/sitemap.xml.gz
index 6c1be301072e0e0467368e7de33c72ba563fcc63..e9e73c718e2131cf272a432540707098f6dee3d7 100644
GIT binary patch
delta 14
VcmX@Zc!rTpzMF$%Z{I|=V*n#x1p@#8

delta 14
VcmX@Zc!rTpzMF%is&XRRF#sXF1gii5