Feature: read_parquet_mergetree #13

Merged 18 commits on Oct 11, 2024

akvlad
Collaborator

@akvlad akvlad commented Sep 30, 2024

read_parquet_mergetree

Description

The read_parquet_mergetree chsql function provides a familiar interface for ClickHouse users by emulating aspects of the MergeTree engine strategy. Its primary purpose is to merge multiple parquet files efficiently using a specified primary SORT key. It does so without consuming excessive memory, and the sorted output file supports fast range queries.

TL;DR: A memory-efficient parquet file merge/compact feature with sorting capabilities.
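
To illustrate why a merge can use far less memory than a full sort, here is a minimal Python sketch of a streaming k-way merge. It is not the extension's implementation: the PR does not describe the internal algorithm, and the sketch assumes each input is already sorted by the key. The function name merge_sorted_parts and the sample rows are hypothetical.

```python
import heapq
from typing import Iterable, Iterator

Row = dict  # hypothetical row representation: column name -> value


def merge_sorted_parts(parts: list[Iterable[Row]], key: str) -> Iterator[Row]:
    """Lazily merge row streams that are each already sorted by `key`.

    heapq.merge keeps only one pending row per input in memory,
    so memory use grows with the number of inputs, not the row count.
    """
    return heapq.merge(*parts, key=lambda row: row[key])


# Two small "parts", each sorted by some_key (assumed precondition).
part_a = [{"some_key": 1, "v": "a"}, {"some_key": 4, "v": "b"}]
part_b = [{"some_key": 2, "v": "c"}, {"some_key": 3, "v": "d"}]

merged = list(merge_sorted_parts([iter(part_a), iter(part_b)], "some_key"))
merged_keys = [row["some_key"] for row in merged]
```

In this toy run, merged_keys comes out in ascending order without either part being re-sorted as a whole.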

Syntax

COPY (SELECT * FROM read_parquet_mergetree([PARQUET_FILE_ARRAY], '{PRIMARY_SORT_KEY}')) TO '{MERGED.PARQUET}'

Features

  • Merge data from multiple files, similar to how ClickHouse combines data from different parts
  • Use a specified sort key to order data, analogous to the primary key in ClickHouse MergeTree tables
  • Maintain sorted order within the merged dataset, facilitating fast range queries
  • Support glob patterns and wildcards in file array
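
The range-query benefit of sorted output can be sketched in Python with binary search. This is only an analogy under stated assumptions: an in-memory list stands in for the sorted key column, and range_query is a hypothetical helper, not part of chsql. A Parquet reader would typically get a similar effect by skipping row groups based on their min/max statistics.

```python
from bisect import bisect_left, bisect_right


def range_query(sorted_keys: list[int], low: int, high: int) -> tuple[int, int]:
    """Return the [start, stop) slice bounds covering low <= key <= high.

    Works only because sorted_keys is in ascending order; each lookup
    is O(log n) instead of a full scan.
    """
    start = bisect_left(sorted_keys, low)
    stop = bisect_right(sorted_keys, high)
    return start, stop


keys = [1, 3, 3, 7, 9, 12]  # stand-in for a sorted key column
start, stop = range_query(keys, 3, 9)
selected = keys[start:stop]
```

Only the slice between start and stop needs to be read; everything outside it is skipped without inspection.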

Parameters

  • PARQUET_FILE_ARRAY: An array of file paths to merge; glob patterns and wildcards are supported
  • PRIMARY_SORT_KEY: Specifies the column(s) used as the primary sort key for merging and ordering data

Benchmark

COPY (SELECT * FROM read_parquet(['/folder/*.parquet']) ORDER BY some_key) TO 'sorted.parquet'
-- USAGE: ~64GB RAM
COPY (SELECT * FROM read_parquet_mergetree(['/folder/*.parquet'], 'some_key')) TO 'sorted.parquet'
-- USAGE: ~800MB RAM

@lmangani lmangani changed the title WIP Feature/parquet ordered scan WIP Feature: read_mergetree Sep 30, 2024
@lmangani
Collaborator

lmangani commented Oct 1, 2024

Hey @carlopi any chance you or someone in the team knows how to get around the windows build error? 🙏

@carlopi

carlopi commented Oct 1, 2024

Can you try to reduce the diff?

Or try to copy the setup of extensions like duckdb_delta.

@akvlad
Collaborator Author

akvlad commented Oct 1, 2024

@carlopi Is it enough if I tell you that the real change is only in the file https://github.com/lmangani/duckdb-extension-clickhouse-sql/pull/13/files#diff-c5bffd6b887e2ced50224f44652dab784c9c7f7ab8c46a390410cc58490391ed ?

The other changes are just internal insignificant file moves.

Or do you need a separate PR with the function implementation?

@carlopi

carlopi commented Oct 1, 2024

Then it's likely either a #pragma once is needed in chsql.hpp or maybe Chsql::Name & co can stay in the main header with the main extension mechanics, and the rest of the function registration should be moved to a secondary header.

@akvlad
Collaborator Author

akvlad commented Oct 1, 2024

@carlopi Aaah. It's about the Windows build problem.

From the MSVC++ linker logs I see that the linker somehow wants to link the chsql_extension.obj file more than once:

chsql_extension.lib(chsql_extension.obj) : error LNK2005: "public: virtual void __cdecl duckdb::ChsqlExtension::Load(class duckdb::DuckDB &)" (?Load@ChsqlExtension@duckdb@@UEAAXAEAVDuckDB@2@@Z) already defined in chsql_extension.obj [D:\a\duckdb-extension-clickhouse-sql\duckdb-extension-clickhouse-sql\build\release\extension\chsql\chsql_loadable_extension.vcxproj]
chsql_extension.lib(chsql_extension.obj )error LNK2005:  .... already defined in chsql_extension.obj

I have no idea why it wants to link the same .obj twice. Have you encountered a similar problem anywhere?

@lmangani
Collaborator

lmangani commented Oct 7, 2024

screenshot of @akvlad kicking the windows builder where it hurts 😄
image

@lmangani lmangani changed the title WIP Feature: read_mergetree WIP Feature: read_parquet_mergetree Oct 8, 2024
@lmangani lmangani changed the title WIP Feature: read_parquet_mergetree Feature: read_parquet_mergetree Oct 8, 2024
@lmangani
Collaborator

amazing work @akvlad let's merge and proceed with some field testing 🎉

@lmangani lmangani merged commit 751d1a4 into main Oct 11, 2024
46 checks passed
@lmangani lmangani deleted the feature/parquet_ordered_scan branch October 11, 2024 16:34
3 participants