Tracking: Load very large dataset into databend #7444

Closed · 4 tasks done · Tracked by #7592
BohuTANG opened this issue Sep 2, 2022 · 4 comments

Comments

BohuTANG (Member) commented Sep 2, 2022

Summary

Tasks:

Flow:

  1. Kafka (or other sources) sinks data to S3.
  2. A cron job copies all the files from S3 into databend.
  3. databend checks whether each file has already been loaded: if it has, the file is skipped; if not, it is loaded and its metadata is written to metasrv, and after a successful load the file is purged from S3 (a sketch of steps 2–3 follows this list).
  4. The cron runs again.
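
A minimal sketch of steps 2–3 in Python, assuming boto3 for S3 access, a generic DB-API cursor for databend, and a simple set-like registry standing in for the metasrv record; the bucket, prefix, stage, and table names are placeholders, not the actual implementation:

```python
# Hypothetical cron body: discover files in S3, skip files already loaded,
# COPY the rest into databend, record them, then purge them from S3.
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "ingest-bucket", "kafka-sink/"  # placeholder names


def run_once(databend_cursor, loaded_files):
    """One cron iteration; `loaded_files` stands in for the metasrv record."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key in loaded_files:
            continue  # step 3: already loaded, just skip
        # step 2/3: copy one staged file into the target table
        databend_cursor.execute(
            f"COPY INTO target_table FROM @ingest_stage "
            f"FILES=('{key}') FILE_FORMAT=(type='CSV')"
        )
        loaded_files.add(key)  # step 3: write the meta after a successful load
        s3.delete_object(Bucket=BUCKET, Key=key)  # purge the file from S3
```
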
BohuTANG pinned this issue Sep 2, 2022
flaneur2020 (Member) commented Sep 2, 2022

We could consider making the cron an external tool focused on file discovery in the S3 directory.

There are different approaches to the file discovery task:

  • Incremental LIST (using the start-after parameter of the S3 LIST API, which returns keys in lexicographic order); a sketch follows this list.
    • It's easy to configure and has no extra dependency, but it incurs some cost from polling the LIST API and takes longer to discover new files (on the order of minutes).
  • S3 event notifications via SQS.
    • It can discover new files more quickly, but the configuration is more complex and it requires a separate dependency on SQS.
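
A rough illustration of the incremental-LIST option, assuming boto3 and a checkpoint of the last key already processed; the bucket and prefix names are placeholders:

```python
# Poll S3 for keys that sort after the last key already processed;
# S3 returns keys in lexicographic order, so start-after acts as a cursor.
import boto3

s3 = boto3.client("s3")


def discover_new_files(bucket, prefix, start_after):
    """Return keys greater than `start_after` under `prefix`."""
    new_keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix, "StartAfter": start_after}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        new_keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]
    return new_keys


# e.g. persist the last processed key as the checkpoint for the next poll:
# new_keys = discover_new_files("ingest-bucket", "kafka-sink/", last_seen_key)
```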

BohuTANG unpinned this issue Sep 13, 2022
xudong963 (Member) commented Oct 26, 2022

FYI, I'm testing TPC-H performance on our cloud but am blocked at copying data into the table.

copy into lineitem from @tpch_data files=('tpch100_single/lineitem.tbl') file_format=(type='CSV' field_delimiter='|' record_delimiter='\n');

The specific situation is that lineitem.tbl contains 76 GB of data, and the copy into the lineitem table doesn't work.

The above SQL returns success, but no data is inserted. (Wouldn't it be better to return an error?)

cc @BohuTANG

I don't know why not even one row of data gets inserted.

I think we should make sure that large datasets can be copied in at all, even if it's slow.

BohuTANG (Member, Author):

I think some errors occurred but were not returned to the client.

You can try adding force=true, which re-loads the files even if they were marked as loaded before:
copy into lineitem from @tpch_data files=('tpch100_single/lineitem.tbl') file_format=(type='CSV' field_delimiter='|' record_delimiter='\n') force=true;

BohuTANG (Member, Author) commented Jun 1, 2024

All works great now.

BohuTANG closed this as completed Jun 1, 2024