This documentation is not meant to be the authoritative guide, but it is based partly on Wikibase/DataModel/JSON (which is itself partly incorrect and incomplete) and partly on countless hours spent sifting through the data, which should lend it some weight.
The full dump is a single JSON file, available from the Wikimedia website in several compression formats, with file names starting with latest-all.
The format is a mess, lengthy and hard to read, probably because stuff has been added on over the years, but that's how it is right now. Please see Improved Format for some suggestions on how to improve the format.
The file is a JSON array of all the items and properties, conveniently placed with one item or property per line (with UNIX line endings). This makes it considerably easier to read the file one line at a time, which, due to the excessive amount of data, is also very necessary.
As of around 1 July 2022, the compressed download (latest-all.json.bz2) is roughly 71GB and the decompressed file (latest-all.json) is close to 1.4TB, i.e. about 20 times larger. That the file is so compressible does say something about the inherent redundancy in the file.
Each line is a JSON object making up either an item or a property, as seen by the type property. From there on the formats differ, but only slightly. The basic format of the file is like this:
[
{ ... }, // item or property
{ ... }, // item or property
...
]
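Since the document is valid JSON, there is a comma at the end of each item/property line except the last one. This may be a challenge for JSON readers that process the file one line at a time, since the trailing comma has to be stripped before a line can be parsed on its own.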
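The line-by-line reading described above could be sketched in Python roughly like this. The function name iter_entities and the sample lines are made up for illustration; this is just one way to deal with the brackets and the trailing commas, not an official reader.

```python
import json
from typing import Iterable, Iterator


def iter_entities(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed entity (item or property) per dump line."""
    for line in lines:
        # Drop the newline and the trailing comma that separates array elements.
        line = line.strip().rstrip(",")
        # Skip the lines holding only the array brackets, and blank lines.
        if line in ("[", "]", ""):
            continue
        yield json.loads(line)


# A tiny in-memory stand-in for the real dump file (made-up content).
sample = [
    "[\n",
    '{"type": "item", "id": "Q1"},\n',
    '{"type": "property", "id": "P31", "datatype": "wikibase-item"}\n',
    "]\n",
]
entities = list(iter_entities(sample))
print([(e["type"], e["id"]) for e in entities])
```

For the real file, something like bz2.open("latest-all.json.bz2", "rt", encoding="utf-8") from the Python standard library should give a line iterator that can be passed straight to the function, so the 1.4TB never has to be held in memory or even decompressed to disk.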
A property is used to state something about an item.
{
"type": // string: values: "property"
"datatype": // string: values: see below
"id": // string: values: integer starting with P - e.g. "P31"
"labels": // object
"descriptions": // object
"aliases": // object
"claims": // object
"pageid": // number: internal MediaWiki page id
"ns": // number: internal MediaWiki namespace number
"title": // string: internal MediaWiki page title
"lastrevid": // number: An internal revision id from Wikidata
"modified": // datetime - last modified date of the property
}
An item is an actual object, thing or concept, e.g. Back to the Future, which is a movie, or London, which is a city.
The general structure for items looks like this. The order of properties is important:
{
"type": // string: values: "item"
"id": // string: values: integer starting with Q - e.g. "Q91540"
"labels": // object
"descriptions": // object
"aliases": // object
"claims": // object
"sitelinks": // object
"pageid": // number: internal MediaWiki page id
"ns": // number: internal MediaWiki namespace number
"title": // string: internal MediaWiki page title
"lastrevid": // number: An internal revision id from Wikidata
"modified": // datetime - last modified date of the item
}
NOTE: According to Wikibase/DataModel/JSON there is a modified property for an item (and property), which has finally become available from around 1 July 2022 (after years of waiting). Thanks! Unfortunately, along with it came the properties pageid, ns and title, which are all utterly useless but of course take up valuable space, bandwidth and reading time.
Language identifiers are used all over the place and can be looked up on Help:Wikimedia language codes/lists/all. More importantly, each language is in itself a Wikidata item with the property Wikimedia language code (P424).
For some reason, though, references to languages in labels, aliases, descriptions etc. never use the Qxxxx item identifier but always the Wikimedia language code.
The labels section is an object with a property for each language for which a label exists:
"labels": {
"en-gb": {
"language": "en-gb",
"value": "Northern Ireland"
},...
As can be seen, the language property is redundant.
A label is what the "thing" is mainly called in different languages.
Since JSON only allows one instance of each property only one label can exist for each language. This also means that having the value inside an object is really not necessary.
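To illustrate that the wrapping object is not really needed, here is a small Python sketch that flattens the labels section into a plain language-to-text mapping. The function name flatten_terms is my own invention, and the sample is the one shown above.

```python
def flatten_terms(terms: dict) -> dict:
    """Turn {"en-gb": {"language": "en-gb", "value": "X"}} into {"en-gb": "X"}."""
    # The inner "language" property repeats the key, so only "value" is kept.
    return {lang: entry["value"] for lang, entry in terms.items()}


labels = {
    "en-gb": {"language": "en-gb", "value": "Northern Ireland"},
}
print(flatten_terms(labels))
```

Nothing is lost by doing this, since the language code is already the key. The same function should work unchanged for the descriptions section, which has the same layout.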
The descriptions section is an object with a property for each language for which a description exists:
"descriptions": {
"en-gb": {
"language": "en-gb",
"value": "region in north-west Europe, part of the United Kingdom"
},...
As can be seen, the language property is redundant.
A description is a more thorough explanation than a label. There is not necessarily both a label and a description in the same language.
Since JSON only allows one instance of each property only one description can exist for each language.
The aliases section is an object with a property for each language for which aliases exist. In contrast to labels and descriptions there can be more than one alias for each language, for which an array is needed:
"aliases": {
"sco": [
{
"language": "sco",
"value": "Norlin Airlan"
},
{
"language": "sco",
"value": "Norlin Airlann"
}
],...
}
As can be seen, the language property is very redundant.
Aliases can be seen as alternative, secondary labels.
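Since aliases are arrays, reading them takes one more level than labels and descriptions. A minimal Python sketch, with flatten_aliases being a made-up name and the sample taken from above:

```python
def flatten_aliases(aliases: dict) -> dict:
    """Turn the aliases section into {language code: [alias, alias, ...]}."""
    return {
        lang: [entry["value"] for entry in entries]
        for lang, entries in aliases.items()
    }


aliases = {
    "sco": [
        {"language": "sco", "value": "Norlin Airlan"},
        {"language": "sco", "value": "Norlin Airlann"},
    ],
}
print(flatten_aliases(aliases))
```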
Claims are used to tell what an item actually is. The claims section is an object with a property for each claim. Each claim is an array of anonymous objects which together make up the actual, individual claims, or statements.
Claims exist for both properties and items.
Claims are rather complex and difficult to describe, but here goes, starting with the general layout of the claims object.
Please note that the official documentation lists the properties in incorrect order which might surprise you when trying to read the JSON manually.
The correct order is as follows:
"claims": {
"P31": [ // The type of claim - here "instance of" - note the array start [.
{ // The statement - note the anonymous object
"mainsnak": {
"snaktype": string, // values: "value", "somevalue", "novalue"
"property": string, // always same as enclosing property value, e.g. "P31"
"datavalue": object, // type and layout depending on the datatype of the property
"datatype": string, // See discussion of data types and values below
},
"type": string, // Always "statement", see below
"qualifiers": object, // optional - see below
"qualifiers-order": array of string, // optional
"id": string, // unique (and long...) id for the claim
"rank": string, // "normal", "preferred", "deprecated"
"references": array of object // optional
},
{
... // Another statement for the same property
}
], // Array end
"P349": [...] // Another claim
}
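Walking through the layout above could look something like this in Python. The function name iter_statements and the sample data (including the statement ids) are made up; the sketch only assumes the layout described here.

```python
from typing import Iterator, Tuple


def iter_statements(claims: dict) -> Iterator[Tuple[str, str, str, object]]:
    """Yield (property id, rank, snaktype, value) for every statement."""
    for prop_id, statements in claims.items():
        for statement in statements:
            snak = statement["mainsnak"]
            # "datavalue" is only there when snaktype is "value", hence .get().
            value = snak.get("datavalue", {}).get("value")
            yield prop_id, statement["rank"], snak["snaktype"], value


# Stripped-down sample claims with made-up statement ids.
claims = {
    "P31": [
        {
            "mainsnak": {
                "snaktype": "value",
                "property": "P31",
                "datavalue": {
                    "value": {"entity-type": "item", "numeric-id": 515, "id": "Q515"},
                    "type": "wikibase-entityid",
                },
                "datatype": "wikibase-item",
            },
            "type": "statement",
            "id": "Q84$example-1",
            "rank": "normal",
        }
    ],
    "P40": [
        {
            "mainsnak": {"snaktype": "novalue", "property": "P40", "datatype": "wikibase-item"},
            "type": "statement",
            "id": "Q84$example-2",
            "rank": "normal",
        }
    ],
}
for row in iter_statements(claims):
    print(row)
```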
A qualifier qualifies a statement, e.g. by stating that the statement is only correct for a given time period.
Qualifiers look somewhat like claims:
"qualifiers": {
"P7141":[ // The type of the qualifier - here "measure number" (musical notation)
{
"snaktype":"value", // Usually "value", but "somevalue" and "novalue" can occur here too
"property":"P7141", // Always same as enclosing property
"hash":"6e8532d38c30dde9077dcc09ea752af75420e3a1",
"datavalue":{
"value":"92\u2013107","type":"string"
},
"datatype":"string"
}]
},
"qualifiers-order":[
"P7141"
]
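A minimal Python sketch of reading the qualifiers of one statement in the order given by qualifiers-order. The function name iter_qualifiers is made up, and falling back to the key order of qualifiers when qualifiers-order is missing is my own assumption.

```python
from typing import Iterator, Tuple


def iter_qualifiers(statement: dict) -> Iterator[Tuple[str, object]]:
    """Yield (property id, value) for each qualifier snak of a statement."""
    qualifiers = statement.get("qualifiers", {})
    # Assumption: use the key order of "qualifiers" if no explicit order is given.
    order = statement.get("qualifiers-order", list(qualifiers))
    for prop_id in order:
        for snak in qualifiers.get(prop_id, []):
            # "datavalue" may be missing, so use .get() instead of indexing.
            yield prop_id, snak.get("datavalue", {}).get("value")


# Only the qualifier-related parts of a statement are included here.
statement = {
    "qualifiers": {
        "P7141": [
            {
                "snaktype": "value",
                "property": "P7141",
                "datavalue": {"value": "92\u2013107", "type": "string"},
                "datatype": "string",
            }
        ]
    },
    "qualifiers-order": ["P7141"],
}
print(list(iter_qualifiers(statement)))
```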
References:
"references":[{
"hash":"fa278ebfc458360e5aed63d5058cca83c46134f1",
"snaks":{
"P143":[{
"snaktype":"value",
"property":"P143",
"datavalue":{
"value":{
"entity-type":"item",
"numeric-id":328,
"id":"Q328"
},
"type":"wikibase-entityid"
},
"datatype":"wikibase-item"
}]
}
}]
Sitelinks:
"sitelinks":{
"commonswiki":{
"site":"commonswiki",
"title":"Club Deportivo Mirand\u00e9s",
"badges":[]
},
"enwiki":{
"site":"enwiki",
"title":"CD Mirand\u00e9s",
"badges":[]
}
}
Each individual claim (the anonymous object) is a statement about the item, expressed through the property it is filed under. E.g. when the property is P31 (instance of), the statement says that the item is an instance of whatever the statement's datavalue contains, e.g. a city or a movie.
The property describes what type of claim we are talking about. Each claim furthermore has a datavalue which holds the actual value of the claim.
Note that each claim is actually an array of individual statements for that property. This is useful e.g. for songs on an album, where the property P658 (tracklist) is an array of statements, one for each track. Note that the order of the array is probably not significant; most likely a qualifier (in this case P1545 (series ordinal)) is used for stating the order.
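If the order really is carried by P1545, the statements could be sorted along these lines. This is only a sketch: the function names are made up, the sample statements are stripped down and invented, and it assumes that the series ordinal is stored as a string that is usually a plain number.

```python
from typing import Optional


def series_ordinal(statement: dict) -> Optional[str]:
    """Return the value of the first P1545 qualifier, or None if there is none."""
    for snak in statement.get("qualifiers", {}).get("P1545", []):
        if snak.get("snaktype") == "value":
            return snak["datavalue"]["value"]
    return None


def sort_by_series_ordinal(statements: list) -> list:
    """Sort by numeric ordinal; statements without one keep their order at the end."""
    def key(statement: dict):
        ordinal = series_ordinal(statement)
        if ordinal is not None and ordinal.isdigit():
            return (0, int(ordinal))
        return (1, 0)
    return sorted(statements, key=key)


def track(track_id: str, ordinal: Optional[str]) -> dict:
    """Build a stripped-down, made-up P658 statement for the example."""
    statement = {"mainsnak": {"snaktype": "value", "property": "P658",
                              "datavalue": {"value": {"id": track_id},
                                            "type": "wikibase-entityid"}}}
    if ordinal is not None:
        statement["qualifiers"] = {"P1545": [
            {"snaktype": "value", "property": "P1545",
             "datavalue": {"value": ordinal, "type": "string"}}]}
    return statement


tracks = [track("Q3", "3"), track("Q9", None), track("Q1", "1"), track("Q2", "2")]
ordered = sort_by_series_ordinal(tracks)
print([s["mainsnak"]["datavalue"]["value"]["id"] for s in ordered])
```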
"novalue":
When a claim deliberately does not have a value, e.g. (to use Wikidata's example) Angela Merkel has no children.
Note! Even though there is no "datavalue" property the "datatype" property is still present.
"somevalue":
Used when a claim probably should have a value, but we don't know what it is.
Even though there is no "datavalue" property the "datatype" property is still present.
"value":
In this case the claim does have an actual value, available in the "datavalue" property.
The datavalue property is always an object with the following layout:
"datavalue": {
"value": object or string
"type": string
}
The type property is one of the following values:
- globecoordinate
- monolingualtext
- quantity
- string
- time
- wikibase-entityid
The type is redundant, see discussion below:
The datatype does, naturally, tell something about the format of the data represented - unfortunately it is not very useful when sequentially parsing datavalue, since the datatype property comes after the datavalue property...
Each datatype is also represented in Wikidata as an item with the relation instance of (P31) -> Wikibase datatype (Q19798645).
All datatypes are listed on Wikidata along with other useful information.
There is a consistent relation between the type of the datavalue and the datatype (in the mainsnak) as seen by this table:
datatype (of snak) | type of datavalue | Friendly name |
---|---|---|
commonsMedia | string | Commons media file |
external-id | string | External identifier |
geo-shape | string | Geographic shape |
globe-coordinate | globecoordinate | Geographic coordinates |
math | string | Mathematical expression |
monolingualtext | monolingualtext | Monolingual text |
musical-notation | string | Musical notation |
quantity | quantity | Quantity |
string | string | String |
tabular-data | string | Tabular data |
time | time | Point in time |
url | string | URL |
wikibase-form | wikibase-entityid | Form |
wikibase-item | wikibase-entityid | Item |
wikibase-lexeme | wikibase-entityid | Lexeme |
wikibase-property | wikibase-entityid | Property |
wikibase-sense | wikibase-entityid | Sense |
Note about wikibase-entityid: According to the Wikidata documentation not all entries have a numeric-id. The meaning of this is not clear and does sound rather strange to me, but what I think is meant by this is that there are different wikibase-entityid types and not all of those have a numeric-id.
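One way to cope with this when reading a wikibase-entityid value is to prefer the id property and only fall back to numeric-id when needed. This is a sketch under my own assumptions: the function name entity_id is made up, the fallback only knows about items and properties, and the form id used in the usage example is just an illustration.

```python
# Assumption: only items and properties get an id built from "numeric-id";
# other entity types are expected to carry an "id" of their own.
PREFIXES = {"item": "Q", "property": "P"}


def entity_id(value: dict) -> str:
    """Return the entity id of a wikibase-entityid value."""
    if "id" in value:
        return value["id"]
    return PREFIXES[value["entity-type"]] + str(value["numeric-id"])


print(entity_id({"entity-type": "item", "numeric-id": 328, "id": "Q328"}))
print(entity_id({"entity-type": "item", "numeric-id": 328}))
print(entity_id({"entity-type": "form", "id": "L1-F1"}))
```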
Note: The consistent relation between type and datatype basically makes the type property redundant.
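The table can be written down as a simple lookup, which also makes it possible to check the claimed consistency while reading the dump. A Python sketch, where DATAVALUE_TYPE and is_consistent are names I made up and the mapping is copied from the table above:

```python
# datatype (of snak) -> type of datavalue, as listed in the table above.
DATAVALUE_TYPE = {
    "commonsMedia": "string",
    "external-id": "string",
    "geo-shape": "string",
    "globe-coordinate": "globecoordinate",
    "math": "string",
    "monolingualtext": "monolingualtext",
    "musical-notation": "string",
    "quantity": "quantity",
    "string": "string",
    "tabular-data": "string",
    "time": "time",
    "url": "string",
    "wikibase-form": "wikibase-entityid",
    "wikibase-item": "wikibase-entityid",
    "wikibase-lexeme": "wikibase-entityid",
    "wikibase-property": "wikibase-entityid",
    "wikibase-sense": "wikibase-entityid",
}


def is_consistent(snak: dict) -> bool:
    """Check that the datavalue type matches what the datatype implies."""
    datavalue = snak.get("datavalue")
    if datavalue is None:
        # "novalue" and "somevalue" snaks have nothing to compare.
        return True
    return DATAVALUE_TYPE.get(snak["datatype"]) == datavalue["type"]


snak = {
    "snaktype": "value",
    "property": "P7141",
    "datavalue": {"value": "92\u2013107", "type": "string"},
    "datatype": "string",
}
print(is_consistent(snak))
```

Running such a check over the whole dump would be a cheap way to confirm (or disprove) that the type property never carries information that datatype does not already give.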