Store slim table tags in hstore #692

pnorman · 2017-02-28T01:14:47Z

I did some testing, and the representation of tags in a hstore is more efficient than our current text[] array of alternating keys and values. For testing latter can be turned into the former with hstore(tags).

Doing this on the slim table of a planet and summing the pg_column_size gives a size of 38GB for text[] representation and 30GB for hstore. The overall size of the current slim table is 94GB for the table and 190GB for the indexes so this is about a 5% space savings.

I considered jsonb, but it is the same size as hstore.

This change would require the hstore extension for osm2pgsql databases. I do not consider this an issue because hstore is a standard part of contrib, and many databases require it already.

In addition to being more efficient, if someone did want to use our internal slim tables to query by tag, it is much easier to do so with hstore.

The text was updated successfully, but these errors were encountered:

Komzpa · 2017-03-06T13:03:25Z

Internally jsonb and hstore are similar, except that jsonb is supported in core Postgres. Reasons to use hstore nowadays are about it being a legacy. If you're writing new code please use jsonb.

Komzpa · 2017-03-06T13:06:18Z

If you want to go extensions, there's between-row deduplicating jsonb drop-in replacement from the same people who made jsonb and hstore: https://github.com/postgrespro/zson

pnorman · 2017-03-06T13:13:54Z

The version requirements for jsonb would require two different implementations in osm2pgsql, one for 9.4+ and another for previous versions. This rules it out.

I do not see requiring the extension as an issue, given we require the postgis extension and many uses will require the hstore extension anyways.

This would be true regardless of if I felt if jsonb was a better option for new code. Absent good evidence on performance, I don't.

pnorman · 2017-03-06T13:15:06Z

As for zson, requiring an extension that's not in contrib and compiled by default (or part of postgis) is a barrier.

ppKrauss · 2018-10-27T12:47:27Z

Hi, I am using direct jsonb_object(text[]) to convert tags of slim tables, was fine. For geometry tables also easy, alter table USING hstore_to_jsonb_loose(tags) as commented here (instead ::jsonb default cast)... And it is not only "to being more efficient", is also to being more standard.

About other (above) comments, nowdays — 2018 and pg v9.3 end of life, the v9.4+ is the reference version — is time to use jsonb datatype... Even when equivalent (as @Komzpa say "Internally jsonb and hstore are similar"), jsonb is better. It is important an osm2pgsql option to user decision.

The jsonb datatype (a JSON-standard) is a de facto standard for PostgreSQL developer community:

if you're choosing a dynamic structure you should choose jsonb over hstore,
dba.stackexchange C. Ringer (2015)

jsonb-vs-hstore performance: hstore have equal performance only "especially if it's indexed with GIN", M. Marchel (2016)... But default osm2pgsql is without indexation, so hstore have low performance.

In most cases JSONB is likely what you want when looking for a NoSQL, schema-less, datatype (...)
(...) JSONB - In most cases;
hstore - Can work fine for text based key-value looks, but in general JSONB can still work great here, citusdata.com C. Kerstiens (2016).

The database format we have been using for the middle has some problems: * Node tags are not stored in the planet_osm_nodes table, because they are not needed for updates. But having access to those tags is useful in some cases, for instance when relations are processed after import. * Attributes (version, timestamp, changeset, uid, user) of ways and relations are stored as special pseudo-tags ("osm_*") (if --extra-attributes) is used. This has the potential of name clashes and format problems and makes the attributes difficult to access from the database. Attributes for nodes are never stored. * The way we store tags as array of text fields with keys and values intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use from the database. * The way relation members are stored is rather arcane. * When using --extra-attributes/-x the middle tables become huge (due to storage in pseudo-tags). This commit fixes all those problems introducing a new database structure: * Tags are stored in JSONB columns. * The nodes table gets a new "tags" column. * Attributes are optionally stored in normal typed database columns. The columns are only added when --extra-attributes is specified and the columns can be NULL if not used which makes the overhead tiny. * Relation members are now stored as JSONB as an array of objects, for example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using JSONB allows us to build the indexes needed to find all relations with certain members. * The format for way nodes has been kept as an array of bigints. The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with "osm_planet" as default prefix) has been kept, but we might want to change this and get rid of the prefix, schemas are a better mechanism and they have been available for a while. There is a new table PREFIX_users which contains a user id->name lookup table. The user name isn't stored in the other tables, just the id. This saves disk space and has the added benefit of updating the user name correctly if a user name changes. There are two new command line options: * --middle-database-format=NUM - (1 for old format, 2 for new format) * --middle-with-nodes - set this to store tagged nodes in the database even if a flat-node file is used. Untagged nodes are only stored in the database if there is no flat-node file. For the first time this new format allows you to have a database created by osm2pgsql that contains *all* the information in an OSM file, all nodes, ways, and relations with all their tags and attributes. Note that this commit currently doesn't have a mechanism for detecting which format a database has when doing updates. This will be added later. For the time being you have to use the same command line options for updates that were used on import. This commit doesn't properly do testing. If you want to run all the tests with the new database format, set `middle_database_format = 2` in options.hpp and recompile. This commit adds a new dependency on a [JSON library](https://github.com/nlohmann/json). Parsing JSON isn't something we want to do ourselves. This library has been around for a while, is available everywhere and is well supported with regular releases unless the RapidJSON library we were using before. Closes osm2pgsql-dev#692 Closes osm2pgsql-dev#1170 See osm2pgsql-dev#1502

The database format we have been using for the middle has some problems: * Node tags are not stored in the planet_osm_nodes table, because they are not needed for updates. But having access to those tags is useful in some cases, for instance when relations are processed after import. * Attributes (version, timestamp, changeset, uid, user) of ways and relations are stored as special pseudo-tags ("osm_*") (if --extra-attributes) is used. This has the potential of name clashes and format problems and makes the attributes difficult to access from the database. Attributes for nodes are never stored. * The way we store tags as array of text fields with keys and values intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use from the database. * The way relation members are stored is rather arcane. * When using --extra-attributes/-x the middle tables become huge (due to storage in pseudo-tags). This commit fixes all those problems introducing a new database structure: * Tags are stored in JSONB columns. * The nodes table gets a new "tags" column. * Attributes are optionally stored in normal typed database columns. The columns are only added when --extra-attributes is specified and the columns can be NULL if not used which makes the overhead tiny. * Relation members are now stored as JSONB as an array of objects, for example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using JSONB allows us to build the indexes needed to find all relations with certain members. * The format for way nodes has been kept as an array of bigints. The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with "osm_planet" as default prefix) has been kept, but we might want to change this and get rid of the prefix, schemas are a better mechanism and they have been available for a while. There is a new table PREFIX_users which contains a user id->name lookup table. The user name isn't stored in the other tables, just the id. This saves disk space and has the added benefit of updating the user name correctly if a user name changes. There are two new command line options: * --middle-database-format=FORMAT - 'legacy' (default) or 'new' * --middle-with-nodes - set this to store tagged nodes in the database even if a flat-node file is used. Untagged nodes are only stored in the database if there is no flat-node file. For the first time this new format allows you to have a database created by osm2pgsql that contains *all* the information in an OSM file, all nodes, ways, and relations with all their tags and attributes. A new property "db_format" is written to the osm2pgsql_properties table with the value "0" (non-slim import), "1" (slim import with legacy format) or "2" (slim import with new format). This is read in append mode and handled appropriately. This commit doesn't properly do testing. If you want to run all the tests with the new database format, you have to set `options.middle_database_format = 2` in osm2pgsql.cpp and recompile. We'll need a better solution for that. This commit adds a new dependency on a [JSON library](https://github.com/nlohmann/json). Parsing JSON isn't something we want to do ourselves. This library has been around for a while, is available everywhere and is well supported with regular releases unless the RapidJSON library we were using before. Closes osm2pgsql-dev#692 Closes osm2pgsql-dev#1170 See osm2pgsql-dev#1502

The database format we have been using for the middle has some problems: * Node tags are not stored in the planet_osm_nodes table, because they are not needed for updates. But having access to those tags is useful in some cases, for instance when relations are processed after import. * Attributes (version, timestamp, changeset, uid, user) of ways and relations are stored as special pseudo-tags ("osm_*") (if --extra-attributes) is used. This has the potential of name clashes and format problems and makes the attributes difficult to access from the database. Attributes for nodes are never stored. * The way we store tags as array of text fields with keys and values intermixed ([key1, value1, key2, value2, ...]) is cumbersome to use from the database. * The way relation members are stored is rather arcane. * When using --extra-attributes/-x the middle tables become huge (due to storage in pseudo-tags). This commit fixes all those problems introducing a new database structure: * Tags are stored in JSONB columns. * The nodes table gets a new "tags" column. * Attributes are optionally stored in normal typed database columns. The columns are only added when --extra-attributes is specified and the columns can be NULL if not used which makes the overhead tiny. * Relation members are now stored as JSONB as an array of objects, for example: [{"type": "W", "ref": 123, "role": "inner"}, ...]. Using JSONB allows us to build the indexes needed to find all relations with certain members. * The format for way nodes has been kept as an array of bigints. The names of the tables PREFIX_nodes, PREFIX_ways, and PREFIX_rels (with "osm_planet" as default prefix) has been kept, but we might want to change this and get rid of the prefix, schemas are a better mechanism and they have been available for a while. There is a new table PREFIX_users which contains a user id->name lookup table. The user name isn't stored in the other tables, just the id. This saves disk space and has the added benefit of updating the user name correctly if a user name changes. There are two new command line options: * --middle-database-format=FORMAT - 'legacy' (default) or 'new' * --middle-with-nodes - set this to store tagged nodes in the database even if a flat-node file is used. Untagged nodes are only stored in the database if there is no flat-node file. For the first time this new format allows you to have a database created by osm2pgsql that contains *all* the information in an OSM file, all nodes, ways, and relations with all their tags and attributes. A new property "db_format" is written to the osm2pgsql_properties table with the value "0" (non-slim import), "1" (slim import with legacy format) or "2" (slim import with new format). This is read in append mode and handled appropriately. This commit adds a new dependency on a [JSON library](https://github.com/nlohmann/json). Parsing JSON isn't something we want to do ourselves. This library has been around for a while, is available everywhere and is well supported with regular releases unless the RapidJSON library we were using before. Closes osm2pgsql-dev#692 Closes osm2pgsql-dev#1170 See osm2pgsql-dev#1502

joto mentioned this issue May 27, 2023

New database middle #1969

Merged

lonvia closed this as completed in #1969 Jul 3, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store slim table tags in hstore #692

Store slim table tags in hstore #692

pnorman commented Feb 28, 2017

Komzpa commented Mar 6, 2017

Komzpa commented Mar 6, 2017

pnorman commented Mar 6, 2017

pnorman commented Mar 6, 2017

ppKrauss commented Oct 27, 2018

Store slim table tags in hstore #692

Store slim table tags in hstore #692

Comments

pnorman commented Feb 28, 2017

Komzpa commented Mar 6, 2017

Komzpa commented Mar 6, 2017

pnorman commented Mar 6, 2017

pnorman commented Mar 6, 2017

ppKrauss commented Oct 27, 2018