Feature: read_parquet_mergetree #13

akvlad · 2024-09-30T15:21:18Z

read_parquet_mergetree

Description

The read_parquet_mergetree chsql function gives ClickHouse users a familiar interface by emulating aspects of the MergeTree engine's merge strategy. Its primary purpose is to merge multiple Parquet files efficiently on a specified primary sort key, without consuming excessive memory, and to produce a sorted file that supports fast range queries.

TL;DR: A memory-efficient Parquet file merge/compact feature with sorting capabilities.
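To illustrate why a merge can use far less memory than a full sort, here is a minimal k-way merge sketch. It is not the code from this PR: it assumes each input is already sorted by the key (the description does not state this), and `SortedRun` and `MergeSortedRuns` are hypothetical names that stand in for Parquet files and the real reader.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical stand-in for one input file: sort-key values in ascending order.
using SortedRun = std::vector<int>;

// Merge pre-sorted runs into one ascending sequence using a min-heap.
// The heap holds at most one cursor per run, so the working set grows with
// the number of inputs rather than with the total number of rows.
std::vector<int> MergeSortedRuns(const std::vector<SortedRun> &runs) {
    struct Cursor {
        int key;          // current key of this run
        std::size_t run;  // index of the run the key came from
        std::size_t pos;  // offset of the key inside that run
    };
    auto greater = [](const Cursor &a, const Cursor &b) { return a.key > b.key; };
    std::priority_queue<Cursor, std::vector<Cursor>, decltype(greater)> heap(greater);

    for (std::size_t r = 0; r < runs.size(); ++r) {
        // Assumption of this sketch: every input is already sorted.
        assert(std::is_sorted(runs[r].begin(), runs[r].end()));
        if (!runs[r].empty()) {
            heap.push({runs[r][0], r, 0});
        }
    }

    // A real implementation would stream rows to a writer; collecting them
    // in a vector keeps the sketch short.
    std::vector<int> out;
    while (!heap.empty()) {
        Cursor c = heap.top();
        heap.pop();
        out.push_back(c.key);
        if (c.pos + 1 < runs[c.run].size()) {
            heap.push({runs[c.run][c.pos + 1], c.run, c.pos + 1});
        }
    }
    return out;
}
```

With N input runs the heap never holds more than N cursors, which is the property that lets a merge avoid materializing the whole dataset the way a global ORDER BY may.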

Syntax

COPY (SELECT * FROM read_parquet_mergetree([PARQUET_FILE_ARRAY], {PRIMARY_SORT_KEY})) TO '{MERGED.PARQUET}'

Features

Merge data from multiple files, similar to how ClickHouse combines data from different parts
Use a specified sort key to order data, analogous to the primary key in ClickHouse MergeTree tables
Maintain sorted order within the merged dataset, facilitating fast range queries
Support glob patterns and wildcards in file array
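As a rough illustration of why sorted output helps range queries, the sketch below locates a key range in sorted data with two binary searches instead of a full scan. It is an in-memory analogy only, and `KeyRange` is a hypothetical helper. With Parquet, the comparable benefit typically comes from readers skipping row groups whose min/max statistics fall outside the requested range.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Return the half-open index range [first, last) of keys k with lo <= k <= hi.
// Requires sorted input; each bound is found in O(log n) comparisons.
std::pair<std::size_t, std::size_t> KeyRange(const std::vector<int> &sorted_keys,
                                             int lo, int hi) {
    assert(std::is_sorted(sorted_keys.begin(), sorted_keys.end()));
    auto first = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), lo);
    auto last = std::upper_bound(first, sorted_keys.end(), hi);
    return {static_cast<std::size_t>(first - sorted_keys.begin()),
            static_cast<std::size_t>(last - sorted_keys.begin())};
}
```

On unsorted data the same query would have to inspect every key, so keeping the merged file ordered by the sort key is what makes this shortcut possible.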

Parameters

PARQUET_FILE_ARRAY: An array of Parquet file paths to merge; glob patterns are supported
PRIMARY_SORT_KEY: Specifies the column(s) used as the primary sort key for merging and ordering data

Benchmark

COPY (SELECT * FROM read_parquet(['/folder/*.parquet']) ORDER BY some_key) TO 'sorted.parquet'
// USAGE: ~64GB RAM

COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key')) TO 'sorted.parquet'
// USAGE: ~800MB RAM

# Conflicts: # CMakeLists.txt # chsql/src/default_table_functions.cpp # src/chsql_extension.cpp

lmangani · 2024-10-01T09:16:20Z

Hey @carlopi any chance you or someone in the team knows how to get around the windows build error? 🙏

carlopi · 2024-10-01T09:38:04Z

Can you try to reduce the diff?

Or try to copy the setup of extensions like duckdb_delta.

akvlad · 2024-10-01T10:17:22Z

@carlopi Is it enough if I tell you that the real change is only in the file https://github.com/lmangani/duckdb-extension-clickhouse-sql/pull/13/files#diff-c5bffd6b887e2ced50224f44652dab784c9c7f7ab8c46a390410cc58490391ed ?

The other changes are just internal insignificant file moves.

Or do you need a separate PR with the function implementation?

carlopi · 2024-10-01T10:26:24Z

Then it's likely either a #pragma once is needed in chsql.hpp or maybe Chsql::Name & co can stay in the main header with the main extension mechanics, and the rest of the function registration should be moved to a secondary header.
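For readers unfamiliar with the two suggestions above, here is a minimal sketch of a header that is safe to include from several source files. The names are hypothetical and this is not the actual chsql.hpp; whether either technique addresses the Windows build error is not established in this thread. The sketch uses a classic include guard, which serves the same purpose as `#pragma once`.

```cpp
#ifndef CHSQL_NAMES_SKETCH_HPP
#define CHSQL_NAMES_SKETCH_HPP

#include <string>

namespace sketch {

// Hypothetical stand-in for the Chsql::Name-style helpers mentioned above.
struct Chsql {
    static std::string Name();
};

// `inline` allows this definition to appear in every translation unit that
// includes the header; the linker keeps a single copy instead of reporting
// a duplicate symbol.
inline std::string Chsql::Name() {
    return "chsql";
}

} // namespace sketch

#endif // CHSQL_NAMES_SKETCH_HPP
```

The include guard prevents redefinition within one translation unit, while `inline` (or moving the definition into a single .cpp file) prevents duplicate definitions across translation units. These are separate problems, so a header may need both.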

akvlad · 2024-10-01T10:33:01Z

@carlopi Aaah. It's about the Windows build problem.

From the MSVC++ linker logs I see that the linker somehow wants to link the chsql_extension.obj file more than once:

chsql_extension.lib(chsql_extension.obj) : error LNK2005: "public: virtual void __cdecl duckdb::ChsqlExtension::Load(class duckdb::DuckDB &)" (?Load@ChsqlExtension@duckdb@@UEAAXAEAVDuckDB@2@@Z) already defined in chsql_extension.obj [D:\a\duckdb-extension-clickhouse-sql\duckdb-extension-clickhouse-sql\build\release\extension\chsql\chsql_loadable_extension.vcxproj]

chsql_extension.lib(chsql_extension.obj )error LNK2005:  .... already defined in chsql_extension.obj

I have no idea why it wants to link the same .obj twice. Have you encountered a similar problem anywhere?

lmangani · 2024-10-07T23:00:41Z

screenshot of @akvlad kicking the windows builder where it hurts 😄

lmangani · 2024-10-11T13:20:07Z

amazing work @akvlad, let's merge and proceed with some field testing 🎉

akvlad and others added 8 commits September 27, 2024 18:45

parquet ordered scan feature;

425ba35

debug

07b5b50

Merge branch 'main' into feature/parquet_ordered_scan

cd18c83

# Conflicts: # CMakeLists.txt # chsql/src/default_table_functions.cpp # src/chsql_extension.cpp

merge

d487afa

Merge branch 'main' into feature/parquet_ordered_scan

fd67326

Update MainDistributionPipeline.yml

56edda8

Update MainDistributionPipeline.yml

8566e2b

Fix pipeline v1.1.1

6b4a439

lmangani changed the title ~~WIP Feature/parquet ordered scan~~ WIP Feature: read_mergetree Sep 30, 2024

akvlad and others added 3 commits September 30, 2024 21:17

debug

9937c54

very bad hack to support windows

da7fe81

Merge branch 'main' into feature/parquet_ordered_scan

c25655e

lmangani and others added 2 commits October 2, 2024 10:56

Merge branch 'main' into feature/parquet_ordered_scan

d9cf819

debug chsql windows build

4c68a3d

lmangani changed the title ~~WIP Feature: read_mergetree~~ WIP Feature: read_parquet_mergetree Oct 8, 2024

lmangani and others added 4 commits October 8, 2024 12:54

Rename function to read_parquet_mergetree

2ccd043

read_parquet_mergetree test

596ce0d

fix tests

519e6a3

move vcpkg

73bbd0a

lmangani changed the title ~~WIP Feature: read_parquet_mergetree~~ Feature: read_parquet_mergetree Oct 8, 2024

glob support

8ed51ea

lmangani merged commit 751d1a4 into main Oct 11, 2024
46 checks passed

lmangani deleted the feature/parquet_ordered_scan branch October 11, 2024 16:34
