Tracking: Load very large dataset into databend #7444

Closed · 4 tasks done · Tracked by #7592
BohuTANG opened this issue Sep 2, 2022 · 4 comments

Comments

BohuTANG (Member) commented Sep 2, 2022

Summary

Tasks:

Flow:

  1. Kafka (or other sources) sinks data to S3.
  2. A cron job copies all the files from S3 into databend.
  3. databend checks whether each file has already been loaded: if it has, the file is skipped; if not, it is loaded and its metadata is written to metasrv, and after a successful load the file is purged from S3 (a sketch of steps 2–3 follows this list).
  4. The cron runs again.
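
A minimal sketch of steps 2–3 in Python, assuming boto3 for S3 access, a generic DB-API cursor for databend, and a simple set-like registry standing in for the metasrv record; the bucket, prefix, stage, and table names are placeholders, not the actual implementation:

```python
# Hypothetical cron body: discover files in S3, skip files already loaded,
# COPY the rest into databend, record them, then purge them from S3.
import boto3

s3 = boto3.client("s3")
BUCKET, PREFIX = "ingest-bucket", "kafka-sink/"  # placeholder names


def run_once(databend_cursor, loaded_files):
    """One cron iteration; `loaded_files` stands in for the metasrv record."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    for obj in resp.get("Contents", []):
        key = obj["Key"]
        if key in loaded_files:
            continue  # step 3: already loaded, just skip
        # step 2/3: copy one staged file into the target table
        databend_cursor.execute(
            f"COPY INTO target_table FROM @ingest_stage "
            f"FILES=('{key}') FILE_FORMAT=(type='CSV')"
        )
        loaded_files.add(key)  # step 3: write the meta after a successful load
        s3.delete_object(Bucket=BUCKET, Key=key)  # purge the file from S3
```
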
BohuTANG pinned this issue Sep 2, 2022
flaneur2020 (Member) commented Sep 2, 2022

We could consider making the cron an external tool focused on file discovery in the S3 directory.

There are different approaches to the file discovery task:

  • Incremental LIST (using the start-after parameter of the S3 LIST API, which returns keys in lexicographic order); a sketch follows this list.
    • It's easy to configure and has no extra dependency, but it incurs some cost from polling the LIST API and takes longer to discover new files (on the order of minutes).
  • S3 event notifications via SQS.
    • It can discover new files more quickly, but the configuration is more complex and it requires a separate dependency on SQS.
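
A rough illustration of the incremental-LIST option, assuming boto3 and a checkpoint of the last key already processed; the bucket and prefix names are placeholders:

```python
# Poll S3 for keys that sort after the last key already processed;
# S3 returns keys in lexicographic order, so start-after acts as a cursor.
import boto3

s3 = boto3.client("s3")


def discover_new_files(bucket, prefix, start_after):
    """Return keys greater than `start_after` under `prefix`."""
    new_keys = []
    kwargs = {"Bucket": bucket, "Prefix": prefix, "StartAfter": start_after}
    while True:
        resp = s3.list_objects_v2(**kwargs)
        new_keys.extend(obj["Key"] for obj in resp.get("Contents", []))
        if not resp.get("IsTruncated"):
            break
        kwargs["ContinuationToken"] = resp["NextContinuationToken"]
    return new_keys


# e.g. persist the last processed key as the checkpoint for the next poll:
# new_keys = discover_new_files("ingest-bucket", "kafka-sink/", last_seen_key)
```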

BohuTANG unpinned this issue Sep 13, 2022
xudong963 (Member) commented Oct 26, 2022

FYI, I'm testing TPC-H performance on our cloud but am blocked at copying data into the table.

copy into lineitem from @tpch_data files=('tpch100_single/lineitem.tbl') file_format=(type='CSV' field_delimiter='|' record_delimiter='\n');

The specific situation is that lineitem.tbl contains 76 GB of data, and the copy into the lineitem table doesn't work.

The above SQL returns success, but no data is inserted. (Wouldn't it be better to return an error?)

cc @BohuTANG

I don't know why not even one row of data gets inserted.

I think we should make sure that large datasets can be copied in at all, even if it's slow.

BohuTANG (Member, Author):

I think some errors occurred but were not returned to the client.

You can try adding force=true, which re-loads the files even if they were marked as loaded before:
copy into lineitem from @tpch_data files=('tpch100_single/lineitem.tbl') file_format=(type='CSV' field_delimiter='|' record_delimiter='\n') force=true;

BohuTANG (Member, Author) commented Jun 1, 2024

All works great now.

BohuTANG closed this as completed Jun 1, 2024