
Output Data Structure

fmorini edited this page Apr 25, 2022 · 7 revisions

General purpose

This page describes how the text content is transformed into JSON starting from the original metatags. It serves as a guide for whoever intends to adopt the same tagging syntax and logic, as well as for whoever wants to approach the problem of tagging and parsing unstructured text to obtain a structured and inter-operable dataset.


JSON Structure

The JSON is nested according to the structure of the original text:

|– Paragraph Level
    | – Entity level
        | – Time point level

A blurb of text is relevant when it is wrapped within a <div> with a unique id, in this case EssayBody. Paragraphs are blurbs of text contained within <p></p> tags, usually carrying a meaningful id such as P1, P2, etc. Entities represent single temporal entities and hold precise information on their characteristics. One entity can include multiple time points. The time point level specifies precise information on the entity's position. By design, an instant is represented by a single time point, while an interval can contain one or more time points.
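As a rough illustration of the nesting described above, the following Python sketch collects the text of every <span> per <p> inside the <div id="EssayBody">. It is not the project's parser: the class name EssayParser and the output shape (a list of dicts with "id" and "spans") are assumptions made for this example, and the metatag attributes themselves are deliberately not interpreted.

```python
from html.parser import HTMLParser


class EssayParser(HTMLParser):
    """Hypothetical helper: gathers <span> texts per <p> inside the root <div>."""

    def __init__(self, root_id="EssayBody"):
        super().__init__()
        self.root_id = root_id
        self.div_depth = 0    # > 0 while inside the root <div> (counts nested divs)
        self.in_p = False     # True while inside a <p> of the root <div>
        self.in_span = False  # True while inside a <span> of such a <p>
        self.paragraphs = []  # one {"id": ..., "spans": [...]} dict per <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Start counting at the root div, then track any nested divs.
            if self.div_depth or attrs.get("id") == self.root_id:
                self.div_depth += 1
        elif tag == "p" and self.div_depth:
            self.in_p = True
            self.paragraphs.append({"id": attrs.get("id"), "spans": []})
        elif tag == "span" and self.in_p:
            self.in_span = True
            self.paragraphs[-1]["spans"].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "p":
            self.in_p = False
        elif tag == "span":
            self.in_span = False

    def handle_data(self, data):
        # Only text inside a tracked <span> is kept.
        if self.in_span:
            self.paragraphs[-1]["spans"][-1] += data


parser = EssayParser()
parser.feed(
    '<div id="EssayBody">'
    '<p id="P1">About <span>2 billion years ago</span> life.</p>'
    '<p id="P2">No dates here.</p>'
    "</div>"
)
print(parser.paragraphs)
```

Text outside the root <div>, and text outside a <span>, is ignored, which mirrors the idea that only wrapped content is relevant.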

The output data will look something like:

[
  {
    "paragraphNumber": 1,
    "entities": [
      {
        "time:inXSDgYear": -1999997978,
        "rdfs:label": "2 Billion Years Ago",
        "ac:hasIndefiniteness": "1",
        "time:positionInText": 1
      },
      {...}, {...}]
  }, {...}, {...}
]
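A consumer of this output usually needs to walk the three levels in order. The Python sketch below shows one way to do that. The function name iter_time_points is hypothetical. Because this page names the paragraph's array both "entities" (in the sample) and "instances" (in the property list), the sketch accepts either key, and it treats an entity without a "timePoints" array as carrying its time point properties inline, as the sample does. Both behaviours are assumptions, not documented guarantees.

```python
def iter_time_points(paragraphs):
    """Yield (paragraph_number, entity, time_point) for every time point."""
    for paragraph in paragraphs:
        number = paragraph.get("paragraphNumber")
        # Assumption: the entity array may be stored under either key.
        entities = paragraph.get("entities") or paragraph.get("instances") or []
        for entity in entities:
            if "timePoints" in entity:
                points = entity["timePoints"]
            else:
                # Assumption: time point properties sit directly on the entity.
                points = [entity]
            for point in points:
                yield number, entity, point


sample = [
    {
        "paragraphNumber": 1,
        "entities": [
            {
                "time:inXSDgYear": -1999997978,
                "rdfs:label": "2 Billion Years Ago",
                "ac:hasIndefiniteness": "1",
                "time:positionInText": 1,
            }
        ],
    }
]

for number, _entity, point in iter_time_points(sample):
    print(number, point["rdfs:label"], point["time:inXSDgYear"])
```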

Properties of the data

On a paragraph level:

  • paragraphNumber: Int | Indicates which paragraph the listed entities belong to
  • instances: Array | contains individual entities

On the entities level:

  • resource: String | Unique identifier for the entity
  • entityType: String | Indicates if entity is an interval or an instant
  • targets: Array | contains string identifiers of entities targeted by this entity
  • targetedBy: Array | contains string identifiers of entities that target this entity
  • textLabel: String | can be a custom label specified in the metatags or the content of the <span>.
  • timePoints: Array | contains individual time points with respective x (deep time) and y (time of the telling) positions.
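Since targetedBy is the inverse of targets, it can be derived rather than tagged by hand. The Python sketch below illustrates that relationship. The function name link_targeted_by and the identifiers "e1", "e2" and "e3" are hypothetical, and whether the actual parser computes targetedBy this way is not stated on this page.

```python
def link_targeted_by(entities):
    """Fill each entity's targetedBy from the targets of all other entities."""
    by_id = {entity["resource"]: entity for entity in entities}
    for entity in entities:
        entity["targetedBy"] = []
    for entity in entities:
        for target_id in entity.get("targets", []):
            target = by_id.get(target_id)
            if target is not None:  # skip identifiers that match no entity
                target["targetedBy"].append(entity["resource"])
    return entities


entities = [
    {"resource": "e1", "entityType": "instant", "targets": ["e2"]},
    {"resource": "e2", "entityType": "interval", "targets": []},
    {"resource": "e3", "entityType": "instant", "targets": ["e2", "missing"]},
]
link_targeted_by(entities)
print(entities[1]["targetedBy"])
```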

On the time point level:

  • "time:inXSDgYear": Int | Proxy negative or positive value encoding one year according to the Gregorian calendar (e.g.: -1999997978)
  • "rdfs:label": String | Human readable label for the time:inXSDgYear property (e.g.: "2 Billion Years Ago")
  • "ac:hasIndefiniteness": Float | Value from 0 to 1 encoding the uncertainty in the placement of an event (e.g.: 1)
  • "time:positionInText": Int | Index value of the element within text, determines where in the original artifact this element is positioned (e.g.: 54)
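The example values above are consistent with a simple subtraction: 2022 - 2,000,000,000 = -1,999,997,978. The Python sketch below builds a time point on that basis. The reference year 2022 is inferred from the example rather than documented, and the function names years_ago_to_gyear and make_time_point are hypothetical.

```python
def years_ago_to_gyear(years_ago, reference_year=2022):
    """Convert 'N years ago' into a signed year value.

    Assumption: the value is reference_year minus years_ago, which
    reproduces the example on this page when reference_year is 2022.
    """
    return reference_year - years_ago


def make_time_point(years_ago, label, indefiniteness, position, reference_year=2022):
    """Build a time point dict, rejecting indefiniteness outside 0..1."""
    if not 0.0 <= indefiniteness <= 1.0:
        raise ValueError("ac:hasIndefiniteness must be between 0 and 1")
    return {
        "time:inXSDgYear": years_ago_to_gyear(years_ago, reference_year),
        "rdfs:label": label,
        "ac:hasIndefiniteness": float(indefiniteness),
        "time:positionInText": position,
    }


point = make_time_point(2_000_000_000, "2 Billion Years Ago", 1, 54)
print(point["time:inXSDgYear"])  # -1999997978
```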

One data sample can be found here.
