From 339cec1f4eb278fc4bf621b726e62e222abb1bbf Mon Sep 17 00:00:00 2001 From: Gabor Szarnyas Date: Tue, 22 Oct 2024 15:11:24 +0200 Subject: [PATCH] JSON pages WIP --- _posts/2023-03-03-json.md | 226 +++--- data/todos.json | 1208 +++++++++++++++++++++++++++++- docs/data/json/json_functions.md | 5 +- docs/data/json/loading_json.md | 15 + docs/data/json/overview.md | 43 +- 5 files changed, 1374 insertions(+), 123 deletions(-) diff --git a/_posts/2023-03-03-json.md b/_posts/2023-03-03-json.md index 048e2a1e24..313ad09f89 100644 --- a/_posts/2023-03-03-json.md +++ b/_posts/2023-03-03-json.md @@ -19,7 +19,7 @@ These functions are similar to the JSON functionality provided by other database DuckDB uses [yyjson](https://github.com/ibireme/yyjson) internally to parse JSON, a high-performance JSON library written in ANSI C. Many thanks to the yyjson authors and contributors! Besides these functions, DuckDB is now able to read JSON directly! -This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors. +This is done by automatically detecting the types and column names, then converting the values within the JSON to DuckDB's vectors. The automated schema detection dramatically simplifies working with JSON data and subsequent queries on DuckDB's vectors are significantly faster! ## Reading JSON Automatically with DuckDB @@ -64,7 +64,7 @@ SELECT * FROM 'todos.json'; Now, finding out which user completed the most TODO items is as simple as: ```sql -SELECT userId, sum(completed::int) total_completed +SELECT userId, sum(completed::INTEGER) AS total_completed FROM 'todos.json' GROUP BY userId ORDER BY total_completed DESC @@ -83,7 +83,7 @@ DuckDB will read multiple files in parallel. ## Newline Delimited JSON -Not all JSON adheres to the format used in `todos.json`, which is an array of 'records'. +Not all JSON adheres to the format used in `todos.json`, which is an array of “records”. 
Newline-delimited JSON, or [NDJSON](http://ndjson.org), stores each row on a new line. DuckDB also supports reading (and writing!) this format. First, let's write our TODO list as NDJSON: @@ -111,7 +111,7 @@ This is specified with `nd` or the `lines` parameter: ```sql SELECT * FROM read_ndjson_auto('todos2.json'); -SELECT * FROM read_json_auto('todos2.json', lines='true'); +SELECT * FROM read_json_auto('todos2.json', lines = 'true'); ``` You can also set `lines='auto'` to auto-detect whether the JSON file is newline-delimited. @@ -124,8 +124,8 @@ The first `json_format` is `'array_of_records'`, while the second is `'records'` This can be specified like so: ```sql -SELECT * FROM read_json('todos.json', auto_detect=true, json_format='array_of_records'); -SELECT * FROM read_json('todos2.json', auto_detect=true, json_format='records'); +SELECT * FROM read_json('todos.json', format = 'array', records = true); -- json_format = 'array_of_records' +SELECT * FROM read_json('todos2.json', format = 'newline_delimited', records = true); -- json_format = 'records' ``` Other supported formats are `'values'` and `'array_of_values'`, which are similar to `'records'` and `'array_of_records'`. @@ -133,22 +133,25 @@ However, with these formats, each 'record' is not required to be a JSON object b ## Manual Schemas -What you may also have noticed is the `auto_detect` parameter. -This parameter tells DuckDB to infer the schema, i.e., determine the names and types of the returned columns. +DuckDB infers the schema, i.e., determines the names and types of the returned columns. 
These can manually be specified like so: ```sql -SELECT * FROM read_json('todos.json', - columns={userId: 'INT', id: 'INT', title: 'VARCHAR', completed: 'BOOLEAN'}, - json_format='array_of_records'); +SELECT * +FROM read_json('todos.json', + columns = {userId: 'INTEGER', id: 'INTEGER', title: 'VARCHAR', completed: 'BOOLEAN'}, + json_format = 'array_of_records' + ); -- TODO: format // records ``` You don't have to specify all fields, just the ones you're interested in: ```sql -SELECT * FROM read_json('todos.json', - columns={userId: 'INT', completed: 'BOOLEAN'}, - json_format='array_of_records'); +SELECT * +FROM read_json('todos.json', + columns = {userId: 'INTEGER', completed: 'BOOLEAN'}, + json_format = 'array_of_records' + ); ``` Now that we know how to use the new DuckDB JSON table functions let's dive into some analytics! @@ -191,9 +194,9 @@ To get a feel of what the data looks like, we run the following query: ```sql SELECT json_group_structure(json) FROM ( - SELECT * - FROM read_ndjson_objects('gharchive_gz/*.json.gz') - LIMIT 2048 + SELECT * + FROM read_ndjson_objects('gharchive_gz/*.json.gz') + LIMIT 2048 ); ``` @@ -238,7 +241,8 @@ I've left `"payload"` out because it consists of deeply nested JSON, and its for So, how many records are we dealing with exactly? 
Let's count it using DuckDB: ```sql -SELECT count(*) count FROM 'gharchive_gz/*.json.gz'; +SELECT count(*) AS count +FROM 'gharchive_gz/*.json.gz'; ``` | count | @@ -356,12 +360,12 @@ This is more activity than normal because most of the DuckDB developers were bus Now, let's see who was the most active: ```sql -SELECT actor.login, count(*) count +SELECT actor.login, count(*) AS count FROM events WHERE repo.name = 'duckdb/duckdb' AND type = 'PullRequestEvent' GROUP BY actor.login -ORDER BY count desc +ORDER BY count DESC LIMIT 5; ``` @@ -383,29 +387,29 @@ We've ignored it because the contents of this field are different based on the t We can see how they differ with the following query: ```sql -SELECT json_group_structure(payload) structure +SELECT json_group_structure(payload) AS structure FROM (SELECT * - FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', - actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'JSON', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - lines='true' - ) - WHERE type = 'WatchEvent' - LIMIT 2048 + FROM read_json( + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT(id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR)', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'JSON', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + lines = 'true' + ) + WHERE type = 'WatchEvent' + LIMIT 2048 ); ``` @@ -491,47 +495,49 @@ Note that because we are not auto-detecting the schema, we have to supply `times The key `"user"` must be surrounded by quotes 
because it is a reserved keyword in SQL: ```sql -CREATE TABLE pr_events as -SELECT * -FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', - actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'STRUCT( - action VARCHAR, - number UBIGINT, - pull_request STRUCT( - url VARCHAR, - id UBIGINT, - title VARCHAR, - "user" STRUCT( - login VARCHAR, - id UBIGINT - ), - body VARCHAR, - created_at TIMESTAMP, - updated_at TIMESTAMP, - assignee STRUCT(login VARCHAR, id UBIGINT), - assignees STRUCT(login VARCHAR, id UBIGINT)[] - ) - )', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - json_format='records', - lines='true', - timestampformat='%Y-%m-%dT%H:%M:%SZ' -) -WHERE type = 'PullRequestEvent'; +CREATE TABLE pr_events AS + SELECT * + FROM read_json( + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT( + id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR + )', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'STRUCT( + action VARCHAR, + number UBIGINT, + pull_request STRUCT( + url VARCHAR, + id UBIGINT, + title VARCHAR, + "user" STRUCT( + login VARCHAR, + id UBIGINT + ), + body VARCHAR, + created_at TIMESTAMP, + updated_at TIMESTAMP, + assignee STRUCT(login VARCHAR, id UBIGINT), + assignees STRUCT(login VARCHAR, id UBIGINT)[] + ) + )', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + json_format = 'records', + lines = 'true', + timestampformat = '%Y-%m-%dT%H:%M:%SZ' + ) + WHERE type = 'PullRequestEvent'; ``` This query completes in around 36s with an on-disk database 
(resulting size is 478MB) and 9s with an in-memory database. @@ -561,13 +567,13 @@ We can check who was assigned the most: ```sql WITH assignees AS ( - SELECT payload.pull_request.assignee.login assignee - FROM pr_events - UNION ALL - SELECT unnest(payload.pull_request.assignees).login assignee - FROM pr_events + SELECT payload.pull_request.assignee.login assignee + FROM pr_events + UNION ALL + SELECT unnest(payload.pull_request.assignees).login assignee + FROM pr_events ) -SELECT assignee, count(*) count +SELECT assignee, count(*) AS count FROM assignees WHERE assignee NOT NULL GROUP BY assignee @@ -596,25 +602,25 @@ If you don't want to specify the schema of a field, you can set the type as `'JS CREATE TABLE pr_events AS SELECT * FROM read_json( - 'gharchive_gz/*.json.gz', - columns={ - id: 'BIGINT', - type: 'VARCHAR', - actor: 'STRUCT(id UBIGINT, - login VARCHAR, - display_login VARCHAR, - gravatar_id VARCHAR, - url VARCHAR, - avatar_url VARCHAR)', - repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', - payload: 'JSON', - public: 'BOOLEAN', - created_at: 'TIMESTAMP', - org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' - }, - json_format='records', - lines='true', - timestampformat='%Y-%m-%dT%H:%M:%SZ' + 'gharchive_gz/*.json.gz', + columns = { + id: 'BIGINT', + type: 'VARCHAR', + actor: 'STRUCT(id UBIGINT, + login VARCHAR, + display_login VARCHAR, + gravatar_id VARCHAR, + url VARCHAR, + avatar_url VARCHAR)', + repo: 'STRUCT(id UBIGINT, name VARCHAR, url VARCHAR)', + payload: 'JSON', + public: 'BOOLEAN', + created_at: 'TIMESTAMP', + org: 'STRUCT(id UBIGINT, login VARCHAR, gravatar_id VARCHAR, url VARCHAR, avatar_url VARCHAR)' + }, + json_format = 'records', + lines = 'true', + timestampformat = '%Y-%m-%dT%H:%M:%SZ' ) WHERE type = 'PullRequestEvent'; ``` @@ -623,7 +629,7 @@ This will load the `"payload"` field as a JSON string, and we can use DuckDB's J For example: ```sql -SELECT DISTINCT payload->>'action' AS 
action, count(*) count +SELECT DISTINCT payload->>'action' AS action, count(*) AS count FROM pr_events GROUP BY action ORDER BY count DESC; @@ -646,7 +652,7 @@ As we can see, only a few pull requests have been reopened. DuckDB tries to be an easy-to-use tool that can read all kinds of data formats. In the 0.7.0 release, we have added support for reading JSON. JSON comes in many formats and all kinds of schemas. -DuckDB's rich support for nested types (`LIST`, `STRUCT`) allows it to fully 'shred' the JSON to a columnar format for more efficient analysis. +DuckDB's rich support for nested types (`LIST`, `STRUCT`) allows it to fully “shred” the JSON to a columnar format for more efficient analysis. We are excited to hear what you think about our new JSON functionality. If you have any questions or suggestions, please reach out to us on [Discord](https://discord.com/invite/tcvwpjfnZx) or [GitHub](https://github.com/duckdb/duckdb)! diff --git a/data/todos.json b/data/todos.json index 1a92942553..799b8322d2 100644 --- a/data/todos.json +++ b/data/todos.json @@ -1,6 +1,1202 @@ -{ - "userId": 3, - "id": 42, - "title": "rerum perferendis error quia ut eveniet", - "completed": false -} \ No newline at end of file +[ + { + "userId": 1, + "id": 1, + "title": "delectus aut autem", + "completed": false + }, + { + "userId": 1, + "id": 2, + "title": "quis ut nam facilis et officia qui", + "completed": false + }, + { + "userId": 1, + "id": 3, + "title": "fugiat veniam minus", + "completed": false + }, + { + "userId": 1, + "id": 4, + "title": "et porro tempora", + "completed": true + }, + { + "userId": 1, + "id": 5, + "title": "laboriosam mollitia et enim quasi adipisci quia provident illum", + "completed": false + }, + { + "userId": 1, + "id": 6, + "title": "qui ullam ratione quibusdam voluptatem quia omnis", + "completed": false + }, + { + "userId": 1, + "id": 7, + "title": "illo expedita consequatur quia in", + "completed": false + }, + { + "userId": 1, + "id": 8, + "title": "quo 
adipisci enim quam ut ab", + "completed": true + }, + { + "userId": 1, + "id": 9, + "title": "molestiae perspiciatis ipsa", + "completed": false + }, + { + "userId": 1, + "id": 10, + "title": "illo est ratione doloremque quia maiores aut", + "completed": true + }, + { + "userId": 1, + "id": 11, + "title": "vero rerum temporibus dolor", + "completed": true + }, + { + "userId": 1, + "id": 12, + "title": "ipsa repellendus fugit nisi", + "completed": true + }, + { + "userId": 1, + "id": 13, + "title": "et doloremque nulla", + "completed": false + }, + { + "userId": 1, + "id": 14, + "title": "repellendus sunt dolores architecto voluptatum", + "completed": true + }, + { + "userId": 1, + "id": 15, + "title": "ab voluptatum amet voluptas", + "completed": true + }, + { + "userId": 1, + "id": 16, + "title": "accusamus eos facilis sint et aut voluptatem", + "completed": true + }, + { + "userId": 1, + "id": 17, + "title": "quo laboriosam deleniti aut qui", + "completed": true + }, + { + "userId": 1, + "id": 18, + "title": "dolorum est consequatur ea mollitia in culpa", + "completed": false + }, + { + "userId": 1, + "id": 19, + "title": "molestiae ipsa aut voluptatibus pariatur dolor nihil", + "completed": true + }, + { + "userId": 1, + "id": 20, + "title": "ullam nobis libero sapiente ad optio sint", + "completed": true + }, + { + "userId": 2, + "id": 21, + "title": "suscipit repellat esse quibusdam voluptatem incidunt", + "completed": false + }, + { + "userId": 2, + "id": 22, + "title": "distinctio vitae autem nihil ut molestias quo", + "completed": true + }, + { + "userId": 2, + "id": 23, + "title": "et itaque necessitatibus maxime molestiae qui quas velit", + "completed": false + }, + { + "userId": 2, + "id": 24, + "title": "adipisci non ad dicta qui amet quaerat doloribus ea", + "completed": false + }, + { + "userId": 2, + "id": 25, + "title": "voluptas quo tenetur perspiciatis explicabo natus", + "completed": true + }, + { + "userId": 2, + "id": 26, + "title": "aliquam 
aut quasi", + "completed": true + }, + { + "userId": 2, + "id": 27, + "title": "veritatis pariatur delectus", + "completed": true + }, + { + "userId": 2, + "id": 28, + "title": "nesciunt totam sit blanditiis sit", + "completed": false + }, + { + "userId": 2, + "id": 29, + "title": "laborum aut in quam", + "completed": false + }, + { + "userId": 2, + "id": 30, + "title": "nemo perspiciatis repellat ut dolor libero commodi blanditiis omnis", + "completed": true + }, + { + "userId": 2, + "id": 31, + "title": "repudiandae totam in est sint facere fuga", + "completed": false + }, + { + "userId": 2, + "id": 32, + "title": "earum doloribus ea doloremque quis", + "completed": false + }, + { + "userId": 2, + "id": 33, + "title": "sint sit aut vero", + "completed": false + }, + { + "userId": 2, + "id": 34, + "title": "porro aut necessitatibus eaque distinctio", + "completed": false + }, + { + "userId": 2, + "id": 35, + "title": "repellendus veritatis molestias dicta incidunt", + "completed": true + }, + { + "userId": 2, + "id": 36, + "title": "excepturi deleniti adipisci voluptatem et neque optio illum ad", + "completed": true + }, + { + "userId": 2, + "id": 37, + "title": "sunt cum tempora", + "completed": false + }, + { + "userId": 2, + "id": 38, + "title": "totam quia non", + "completed": false + }, + { + "userId": 2, + "id": 39, + "title": "doloremque quibusdam asperiores libero corrupti illum qui omnis", + "completed": false + }, + { + "userId": 2, + "id": 40, + "title": "totam atque quo nesciunt", + "completed": true + }, + { + "userId": 3, + "id": 41, + "title": "aliquid amet impedit consequatur aspernatur placeat eaque fugiat suscipit", + "completed": false + }, + { + "userId": 3, + "id": 42, + "title": "rerum perferendis error quia ut eveniet", + "completed": false + }, + { + "userId": 3, + "id": 43, + "title": "tempore ut sint quis recusandae", + "completed": true + }, + { + "userId": 3, + "id": 44, + "title": "cum debitis quis accusamus doloremque ipsa natus 
sapiente omnis", + "completed": true + }, + { + "userId": 3, + "id": 45, + "title": "velit soluta adipisci molestias reiciendis harum", + "completed": false + }, + { + "userId": 3, + "id": 46, + "title": "vel voluptatem repellat nihil placeat corporis", + "completed": false + }, + { + "userId": 3, + "id": 47, + "title": "nam qui rerum fugiat accusamus", + "completed": false + }, + { + "userId": 3, + "id": 48, + "title": "sit reprehenderit omnis quia", + "completed": false + }, + { + "userId": 3, + "id": 49, + "title": "ut necessitatibus aut maiores debitis officia blanditiis velit et", + "completed": false + }, + { + "userId": 3, + "id": 50, + "title": "cupiditate necessitatibus ullam aut quis dolor voluptate", + "completed": true + }, + { + "userId": 3, + "id": 51, + "title": "distinctio exercitationem ab doloribus", + "completed": false + }, + { + "userId": 3, + "id": 52, + "title": "nesciunt dolorum quis recusandae ad pariatur ratione", + "completed": false + }, + { + "userId": 3, + "id": 53, + "title": "qui labore est occaecati recusandae aliquid quam", + "completed": false + }, + { + "userId": 3, + "id": 54, + "title": "quis et est ut voluptate quam dolor", + "completed": true + }, + { + "userId": 3, + "id": 55, + "title": "voluptatum omnis minima qui occaecati provident nulla voluptatem ratione", + "completed": true + }, + { + "userId": 3, + "id": 56, + "title": "deleniti ea temporibus enim", + "completed": true + }, + { + "userId": 3, + "id": 57, + "title": "pariatur et magnam ea doloribus similique voluptatem rerum quia", + "completed": false + }, + { + "userId": 3, + "id": 58, + "title": "est dicta totam qui explicabo doloribus qui dignissimos", + "completed": false + }, + { + "userId": 3, + "id": 59, + "title": "perspiciatis velit id laborum placeat iusto et aliquam odio", + "completed": false + }, + { + "userId": 3, + "id": 60, + "title": "et sequi qui architecto ut adipisci", + "completed": true + }, + { + "userId": 4, + "id": 61, + "title": "odit optio 
omnis qui sunt", + "completed": true + }, + { + "userId": 4, + "id": 62, + "title": "et placeat et tempore aspernatur sint numquam", + "completed": false + }, + { + "userId": 4, + "id": 63, + "title": "doloremque aut dolores quidem fuga qui nulla", + "completed": true + }, + { + "userId": 4, + "id": 64, + "title": "voluptas consequatur qui ut quia magnam nemo esse", + "completed": false + }, + { + "userId": 4, + "id": 65, + "title": "fugiat pariatur ratione ut asperiores necessitatibus magni", + "completed": false + }, + { + "userId": 4, + "id": 66, + "title": "rerum eum molestias autem voluptatum sit optio", + "completed": false + }, + { + "userId": 4, + "id": 67, + "title": "quia voluptatibus voluptatem quos similique maiores repellat", + "completed": false + }, + { + "userId": 4, + "id": 68, + "title": "aut id perspiciatis voluptatem iusto", + "completed": false + }, + { + "userId": 4, + "id": 69, + "title": "doloribus sint dolorum ab adipisci itaque dignissimos aliquam suscipit", + "completed": false + }, + { + "userId": 4, + "id": 70, + "title": "ut sequi accusantium et mollitia delectus sunt", + "completed": false + }, + { + "userId": 4, + "id": 71, + "title": "aut velit saepe ullam", + "completed": false + }, + { + "userId": 4, + "id": 72, + "title": "praesentium facilis facere quis harum voluptatibus voluptatem eum", + "completed": false + }, + { + "userId": 4, + "id": 73, + "title": "sint amet quia totam corporis qui exercitationem commodi", + "completed": true + }, + { + "userId": 4, + "id": 74, + "title": "expedita tempore nobis eveniet laborum maiores", + "completed": false + }, + { + "userId": 4, + "id": 75, + "title": "occaecati adipisci est possimus totam", + "completed": false + }, + { + "userId": 4, + "id": 76, + "title": "sequi dolorem sed", + "completed": true + }, + { + "userId": 4, + "id": 77, + "title": "maiores aut nesciunt delectus exercitationem vel assumenda eligendi at", + "completed": false + }, + { + "userId": 4, + "id": 78, + "title": 
"reiciendis est magnam amet nemo iste recusandae impedit quaerat", + "completed": false + }, + { + "userId": 4, + "id": 79, + "title": "eum ipsa maxime ut", + "completed": true + }, + { + "userId": 4, + "id": 80, + "title": "tempore molestias dolores rerum sequi voluptates ipsum consequatur", + "completed": true + }, + { + "userId": 5, + "id": 81, + "title": "suscipit qui totam", + "completed": true + }, + { + "userId": 5, + "id": 82, + "title": "voluptates eum voluptas et dicta", + "completed": false + }, + { + "userId": 5, + "id": 83, + "title": "quidem at rerum quis ex aut sit quam", + "completed": true + }, + { + "userId": 5, + "id": 84, + "title": "sunt veritatis ut voluptate", + "completed": false + }, + { + "userId": 5, + "id": 85, + "title": "et quia ad iste a", + "completed": true + }, + { + "userId": 5, + "id": 86, + "title": "incidunt ut saepe autem", + "completed": true + }, + { + "userId": 5, + "id": 87, + "title": "laudantium quae eligendi consequatur quia et vero autem", + "completed": true + }, + { + "userId": 5, + "id": 88, + "title": "vitae aut excepturi laboriosam sint aliquam et et accusantium", + "completed": false + }, + { + "userId": 5, + "id": 89, + "title": "sequi ut omnis et", + "completed": true + }, + { + "userId": 5, + "id": 90, + "title": "molestiae nisi accusantium tenetur dolorem et", + "completed": true + }, + { + "userId": 5, + "id": 91, + "title": "nulla quis consequatur saepe qui id expedita", + "completed": true + }, + { + "userId": 5, + "id": 92, + "title": "in omnis laboriosam", + "completed": true + }, + { + "userId": 5, + "id": 93, + "title": "odio iure consequatur molestiae quibusdam necessitatibus quia sint", + "completed": true + }, + { + "userId": 5, + "id": 94, + "title": "facilis modi saepe mollitia", + "completed": false + }, + { + "userId": 5, + "id": 95, + "title": "vel nihil et molestiae iusto assumenda nemo quo ut", + "completed": true + }, + { + "userId": 5, + "id": 96, + "title": "nobis suscipit ducimus enim 
asperiores voluptas", + "completed": false + }, + { + "userId": 5, + "id": 97, + "title": "dolorum laboriosam eos qui iure aliquam", + "completed": false + }, + { + "userId": 5, + "id": 98, + "title": "debitis accusantium ut quo facilis nihil quis sapiente necessitatibus", + "completed": true + }, + { + "userId": 5, + "id": 99, + "title": "neque voluptates ratione", + "completed": false + }, + { + "userId": 5, + "id": 100, + "title": "excepturi a et neque qui expedita vel voluptate", + "completed": false + }, + { + "userId": 6, + "id": 101, + "title": "explicabo enim cumque porro aperiam occaecati minima", + "completed": false + }, + { + "userId": 6, + "id": 102, + "title": "sed ab consequatur", + "completed": false + }, + { + "userId": 6, + "id": 103, + "title": "non sunt delectus illo nulla tenetur enim omnis", + "completed": false + }, + { + "userId": 6, + "id": 104, + "title": "excepturi non laudantium quo", + "completed": false + }, + { + "userId": 6, + "id": 105, + "title": "totam quia dolorem et illum repellat voluptas optio", + "completed": true + }, + { + "userId": 6, + "id": 106, + "title": "ad illo quis voluptatem temporibus", + "completed": true + }, + { + "userId": 6, + "id": 107, + "title": "praesentium facilis omnis laudantium fugit ad iusto nihil nesciunt", + "completed": false + }, + { + "userId": 6, + "id": 108, + "title": "a eos eaque nihil et exercitationem incidunt delectus", + "completed": true + }, + { + "userId": 6, + "id": 109, + "title": "autem temporibus harum quisquam in culpa", + "completed": true + }, + { + "userId": 6, + "id": 110, + "title": "aut aut ea corporis", + "completed": true + }, + { + "userId": 6, + "id": 111, + "title": "magni accusantium labore et id quis provident", + "completed": false + }, + { + "userId": 6, + "id": 112, + "title": "consectetur impedit quisquam qui deserunt non rerum consequuntur eius", + "completed": false + }, + { + "userId": 6, + "id": 113, + "title": "quia atque aliquam sunt impedit voluptatum 
rerum assumenda nisi", + "completed": false + }, + { + "userId": 6, + "id": 114, + "title": "cupiditate quos possimus corporis quisquam exercitationem beatae", + "completed": false + }, + { + "userId": 6, + "id": 115, + "title": "sed et ea eum", + "completed": false + }, + { + "userId": 6, + "id": 116, + "title": "ipsa dolores vel facilis ut", + "completed": true + }, + { + "userId": 6, + "id": 117, + "title": "sequi quae est et qui qui eveniet asperiores", + "completed": false + }, + { + "userId": 6, + "id": 118, + "title": "quia modi consequatur vero fugiat", + "completed": false + }, + { + "userId": 6, + "id": 119, + "title": "corporis ducimus ea perspiciatis iste", + "completed": false + }, + { + "userId": 6, + "id": 120, + "title": "dolorem laboriosam vel voluptas et aliquam quasi", + "completed": false + }, + { + "userId": 7, + "id": 121, + "title": "inventore aut nihil minima laudantium hic qui omnis", + "completed": true + }, + { + "userId": 7, + "id": 122, + "title": "provident aut nobis culpa", + "completed": true + }, + { + "userId": 7, + "id": 123, + "title": "esse et quis iste est earum aut impedit", + "completed": false + }, + { + "userId": 7, + "id": 124, + "title": "qui consectetur id", + "completed": false + }, + { + "userId": 7, + "id": 125, + "title": "aut quasi autem iste tempore illum possimus", + "completed": false + }, + { + "userId": 7, + "id": 126, + "title": "ut asperiores perspiciatis veniam ipsum rerum saepe", + "completed": true + }, + { + "userId": 7, + "id": 127, + "title": "voluptatem libero consectetur rerum ut", + "completed": true + }, + { + "userId": 7, + "id": 128, + "title": "eius omnis est qui voluptatem autem", + "completed": false + }, + { + "userId": 7, + "id": 129, + "title": "rerum culpa quis harum", + "completed": false + }, + { + "userId": 7, + "id": 130, + "title": "nulla aliquid eveniet harum laborum libero alias ut unde", + "completed": true + }, + { + "userId": 7, + "id": 131, + "title": "qui ea incidunt quis", + 
"completed": false + }, + { + "userId": 7, + "id": 132, + "title": "qui molestiae voluptatibus velit iure harum quisquam", + "completed": true + }, + { + "userId": 7, + "id": 133, + "title": "et labore eos enim rerum consequatur sunt", + "completed": true + }, + { + "userId": 7, + "id": 134, + "title": "molestiae doloribus et laborum quod ea", + "completed": false + }, + { + "userId": 7, + "id": 135, + "title": "facere ipsa nam eum voluptates reiciendis vero qui", + "completed": false + }, + { + "userId": 7, + "id": 136, + "title": "asperiores illo tempora fuga sed ut quasi adipisci", + "completed": false + }, + { + "userId": 7, + "id": 137, + "title": "qui sit non", + "completed": false + }, + { + "userId": 7, + "id": 138, + "title": "placeat minima consequatur rem qui ut", + "completed": true + }, + { + "userId": 7, + "id": 139, + "title": "consequatur doloribus id possimus voluptas a voluptatem", + "completed": false + }, + { + "userId": 7, + "id": 140, + "title": "aut consectetur in blanditiis deserunt quia sed laboriosam", + "completed": true + }, + { + "userId": 8, + "id": 141, + "title": "explicabo consectetur debitis voluptates quas quae culpa rerum non", + "completed": true + }, + { + "userId": 8, + "id": 142, + "title": "maiores accusantium architecto necessitatibus reiciendis ea aut", + "completed": true + }, + { + "userId": 8, + "id": 143, + "title": "eum non recusandae cupiditate animi", + "completed": false + }, + { + "userId": 8, + "id": 144, + "title": "ut eum exercitationem sint", + "completed": false + }, + { + "userId": 8, + "id": 145, + "title": "beatae qui ullam incidunt voluptatem non nisi aliquam", + "completed": false + }, + { + "userId": 8, + "id": 146, + "title": "molestiae suscipit ratione nihil odio libero impedit vero totam", + "completed": true + }, + { + "userId": 8, + "id": 147, + "title": "eum itaque quod reprehenderit et facilis dolor autem ut", + "completed": true + }, + { + "userId": 8, + "id": 148, + "title": "esse quas et quo 
quasi exercitationem", + "completed": false + }, + { + "userId": 8, + "id": 149, + "title": "animi voluptas quod perferendis est", + "completed": false + }, + { + "userId": 8, + "id": 150, + "title": "eos amet tempore laudantium fugit a", + "completed": false + }, + { + "userId": 8, + "id": 151, + "title": "accusamus adipisci dicta qui quo ea explicabo sed vero", + "completed": true + }, + { + "userId": 8, + "id": 152, + "title": "odit eligendi recusandae doloremque cumque non", + "completed": false + }, + { + "userId": 8, + "id": 153, + "title": "ea aperiam consequatur qui repellat eos", + "completed": false + }, + { + "userId": 8, + "id": 154, + "title": "rerum non ex sapiente", + "completed": true + }, + { + "userId": 8, + "id": 155, + "title": "voluptatem nobis consequatur et assumenda magnam", + "completed": true + }, + { + "userId": 8, + "id": 156, + "title": "nam quia quia nulla repellat assumenda quibusdam sit nobis", + "completed": true + }, + { + "userId": 8, + "id": 157, + "title": "dolorem veniam quisquam deserunt repellendus", + "completed": true + }, + { + "userId": 8, + "id": 158, + "title": "debitis vitae delectus et harum accusamus aut deleniti a", + "completed": true + }, + { + "userId": 8, + "id": 159, + "title": "debitis adipisci quibusdam aliquam sed dolore ea praesentium nobis", + "completed": true + }, + { + "userId": 8, + "id": 160, + "title": "et praesentium aliquam est", + "completed": false + }, + { + "userId": 9, + "id": 161, + "title": "ex hic consequuntur earum omnis alias ut occaecati culpa", + "completed": true + }, + { + "userId": 9, + "id": 162, + "title": "omnis laboriosam molestias animi sunt dolore", + "completed": true + }, + { + "userId": 9, + "id": 163, + "title": "natus corrupti maxime laudantium et voluptatem laboriosam odit", + "completed": false + }, + { + "userId": 9, + "id": 164, + "title": "reprehenderit quos aut aut consequatur est sed", + "completed": false + }, + { + "userId": 9, + "id": 165, + "title": "fugiat 
perferendis sed aut quidem", + "completed": false + }, + { + "userId": 9, + "id": 166, + "title": "quos quo possimus suscipit minima ut", + "completed": false + }, + { + "userId": 9, + "id": 167, + "title": "et quis minus quo a asperiores molestiae", + "completed": false + }, + { + "userId": 9, + "id": 168, + "title": "recusandae quia qui sunt libero", + "completed": false + }, + { + "userId": 9, + "id": 169, + "title": "ea odio perferendis officiis", + "completed": true + }, + { + "userId": 9, + "id": 170, + "title": "quisquam aliquam quia doloribus aut", + "completed": false + }, + { + "userId": 9, + "id": 171, + "title": "fugiat aut voluptatibus corrupti deleniti velit iste odio", + "completed": true + }, + { + "userId": 9, + "id": 172, + "title": "et provident amet rerum consectetur et voluptatum", + "completed": false + }, + { + "userId": 9, + "id": 173, + "title": "harum ad aperiam quis", + "completed": false + }, + { + "userId": 9, + "id": 174, + "title": "similique aut quo", + "completed": false + }, + { + "userId": 9, + "id": 175, + "title": "laudantium eius officia perferendis provident perspiciatis asperiores", + "completed": true + }, + { + "userId": 9, + "id": 176, + "title": "magni soluta corrupti ut maiores rem quidem", + "completed": false + }, + { + "userId": 9, + "id": 177, + "title": "et placeat temporibus voluptas est tempora quos quibusdam", + "completed": false + }, + { + "userId": 9, + "id": 178, + "title": "nesciunt itaque commodi tempore", + "completed": true + }, + { + "userId": 9, + "id": 179, + "title": "omnis consequuntur cupiditate impedit itaque ipsam quo", + "completed": true + }, + { + "userId": 9, + "id": 180, + "title": "debitis nisi et dolorem repellat et", + "completed": true + }, + { + "userId": 10, + "id": 181, + "title": "ut cupiditate sequi aliquam fuga maiores", + "completed": false + }, + { + "userId": 10, + "id": 182, + "title": "inventore saepe cumque et aut illum enim", + "completed": true + }, + { + "userId": 10, + 
"id": 183, + "title": "omnis nulla eum aliquam distinctio", + "completed": true + }, + { + "userId": 10, + "id": 184, + "title": "molestias modi perferendis perspiciatis", + "completed": false + }, + { + "userId": 10, + "id": 185, + "title": "voluptates dignissimos sed doloribus animi quaerat aut", + "completed": false + }, + { + "userId": 10, + "id": 186, + "title": "explicabo odio est et", + "completed": false + }, + { + "userId": 10, + "id": 187, + "title": "consequuntur animi possimus", + "completed": false + }, + { + "userId": 10, + "id": 188, + "title": "vel non beatae est", + "completed": true + }, + { + "userId": 10, + "id": 189, + "title": "culpa eius et voluptatem et", + "completed": true + }, + { + "userId": 10, + "id": 190, + "title": "accusamus sint iusto et voluptatem exercitationem", + "completed": true + }, + { + "userId": 10, + "id": 191, + "title": "temporibus atque distinctio omnis eius impedit tempore molestias pariatur", + "completed": true + }, + { + "userId": 10, + "id": 192, + "title": "ut quas possimus exercitationem sint voluptates", + "completed": false + }, + { + "userId": 10, + "id": 193, + "title": "rerum debitis voluptatem qui eveniet tempora distinctio a", + "completed": true + }, + { + "userId": 10, + "id": 194, + "title": "sed ut vero sit molestiae", + "completed": false + }, + { + "userId": 10, + "id": 195, + "title": "rerum ex veniam mollitia voluptatibus pariatur", + "completed": true + }, + { + "userId": 10, + "id": 196, + "title": "consequuntur aut ut fugit similique", + "completed": true + }, + { + "userId": 10, + "id": 197, + "title": "dignissimos quo nobis earum saepe", + "completed": true + }, + { + "userId": 10, + "id": 198, + "title": "quis eius est sint explicabo", + "completed": true + }, + { + "userId": 10, + "id": 199, + "title": "numquam repellendus a magnam", + "completed": true + }, + { + "userId": 10, + "id": 200, + "title": "ipsam aperiam voluptates qui", + "completed": false + } +] \ No newline at end of file 
diff --git a/docs/data/json/json_functions.md b/docs/data/json/json_functions.md index 6b18cd9a34..1fc1491f59 100644 --- a/docs/data/json/json_functions.md +++ b/docs/data/json/json_functions.md @@ -15,7 +15,10 @@ These functions supports the same two location notations as [JSON Scalar functio | `json_extract_string(json, path)` | `json_extract_path_text` | `->>` | Extracts `VARCHAR` from `json` at the given `path`. If `path` is a `LIST`, the result will be a `LIST` of `VARCHAR`. | | `json_value(json, path)` | | | Extracts `JSON` from `json` at the given `path`. If the `json` at the supplied path is not a scalar value, it will return `NULL`. | -Note that the equality comparison operator (`=`) has a higher precedence than the `->` JSON extract operator. Therefore, surround the uses of the `->` operator with parentheses when making equality comparisons. For example: +Note that the arrow operator `->`, which is used for JSON extracts, has a low precedence as it is also used in [lambda functions]({% link docs/sql/functions/lambda.md %}). + +Therefore, you need to surround the `->` operator with parentheses when expressing operations such as equality comparisons (`=`). 
+For example: ```sql SELECT ((JSON '{"field": 42}')->'field') = 42; diff --git a/docs/data/json/loading_json.md b/docs/data/json/loading_json.md index 2990700014..a813a3fa10 100644 --- a/docs/data/json/loading_json.md +++ b/docs/data/json/loading_json.md @@ -93,6 +93,21 @@ SELECT * FROM read_ndjson_objects('*.json.gz'); {"duck":43,"goose":[4,5,6],"swan":3.3} ``` +-- + +add columns for parameters + + + +read_json vs read_ndjson +read_*_objects vs vanilla reads + + +todo: add `map_inference_threshold` and `field_appearance_threshold` + +-- + + DuckDB also supports reading JSON as a table, using the following functions: | Function | Description | diff --git a/docs/data/json/overview.md b/docs/data/json/overview.md index 3e523a2bd4..910c8d31f1 100644 --- a/docs/data/json/overview.md +++ b/docs/data/json/overview.md @@ -10,6 +10,21 @@ DuckDB supports SQL functions that are useful for reading values from existing J JSON is supported with the `json` extension which is shipped with most DuckDB distributions and is auto-loaded on first use. If you would like to install or load it manually, please consult the [“Installing and Loading” page]({% link docs/data/json/installing_and_loading.md %}). + +TODO +DuckDB implements several interfaces for JSON extraction: + +[JSONPath](https://goessner.net/articles/JsonPath/), +[JSON Pointer](https://datatracker.ietf.org/doc/html/rfc6901) + +we support both of these with the arrow operator and the `json_extract` function call + +we use the PostgreSQL syntax, some functions from SQLite, and a few functions from other SQL systems + +list extract also works but it's 0-based + +dot syntax (`.`) + ## About JSON JSON is an open standard file format and data interchange format that uses human-readable text to store and transmit data objects consisting of attribute–value pairs and arrays (or other serializable values). 
@@ -21,19 +36,35 @@ While it is not a very efficient format for tabular data, it is very commonly us ## Examples +The following examples use [`todos.json`](https://duckdb.org/data/todos.json) generated by [JSONPlaceHolder](https://jsonplaceholder.typicode.com/). + ### Loading JSON ++ shredding: json object to row, fields to column + +```sql +FROM read_json('todos.json'); +``` + +`records = false`: no shredding but inference + + + +`read_json_objects` keeps things as-is +`read_json` shreds + + Read a JSON file from disk, auto-infer options: ```sql -SELECT * FROM 'todos.json'; +SELECT * FROM 'https://duckdb.org/data/todos.json'; ``` Use the `read_json` function with custom options: ```sql SELECT * -FROM read_json('todos.json', +FROM read_json('https://duckdb.org/data/todos.json', format = 'array', columns = {userId: 'UBIGINT', id: 'UBIGINT', @@ -44,21 +75,21 @@ FROM read_json('todos.json', Read a JSON file from stdin, auto-infer options: ```bash -cat data/json/todos.json | duckdb -c "SELECT * FROM read_json('/dev/stdin')" +curl https://duckdb.org/data/todos.json | duckdb -c "SELECT * FROM read_json('/dev/stdin')" ``` Read a JSON file into a table: ```sql CREATE TABLE todos (userId UBIGINT, id UBIGINT, title VARCHAR, completed BOOLEAN); -COPY todos FROM 'todos.json'; +COPY todos FROM 'https://duckdb.org/data/todos.json'; ``` Alternatively, create a table without specifying the schema manually with a [`CREATE TABLE ... AS SELECT` clause]({% link docs/sql/statements/create_table.md %}#create-table--as-select-ctas): ```sql CREATE TABLE todos AS - SELECT * FROM 'todos.json'; + SELECT * FROM 'https://duckdb.org/data/todos.json'; ``` ### Writing JSON @@ -66,7 +97,7 @@ CREATE TABLE todos AS Write the result of a query to a JSON file: ```sql -COPY (SELECT * FROM todos) TO 'todos.json'; +COPY (SELECT * FROM todos) TO 'https://duckdb.org/data/todos.json'; ``` ### JSON Data Type