
Release proposal: Nightly v1.0 #9604

Closed · 5 tasks done · Tracked by #9448
BohuTANG opened this issue Jan 15, 2023 · 9 comments
Labels: roadmap-track (Roadmap track issues), Tracking

Comments

@BohuTANG (Member) commented Jan 15, 2023

Summary

Release name: v1.0-nightly, get on the train now ✋
Let's make Databend more of a Lakehouse!

v1.0 (planned for release on March 5th)

| Task | Status | Comments |
| --- | --- | --- |
| (Query) Support Decimal data type #2931 | DONE | high-priority (release in v1.0) |
| (Query) Query external stage file (Parquet) #9847 | DONE | high-priority (release in v1.0) |
| (Query) Array functions #7931 | DONE | high-priority (release in v1.0) |
| (Query) Query Result Cache #10010 | DONE | high-priority (release in v1.0) |
| (Planner) CBO #9597 | DONE | high-priority (release in v1.0) |
| (Processor) Aggregation spilling #10273 | DONE | high-priority (release in v1.0) |
| (Storage) Alter table #9441 | DONE | high-priority (release in v1.0) |
| (Storage) Block data cache #9772 | DONE | high-priority (release in v1.0) |

Archive releases

Reference

What are Databend release channels?
Nightly v1.0 is part of our Roadmap 2023
Community website: https://databend.rs

BohuTANG added the roadmap-track and Tracking labels and pinned this issue on Jan 15, 2023
@xudong963 (Member)

Is there an expected time to release v1.0?

@BohuTANG (Member, Author)

> Is there an expected time to release v1.0?

The preliminary plan is to release in March, mainly focusing on alter table, update, and group by spill.

@tangguoqiang172528725

I hope you can simplify the way to insert data; it will help attract more users.

@BohuTANG (Member, Author)

Add Query Result Cache #10010

@haydenflinner

> I hope you can simplify the way to insert data; it will help attract more users.

It's already the easiest to insert into of all the similar products I've tried; how would you like to insert?

@BohuTANG Are there any plans for higher-performance client reads, such as streaming Arrow, Parquet, or some other high-performance format? I'm not familiar with other read protocols, such as ClickHouse's; I've just been using the MySQL connector. But it would be neat to have Databend in the middle while paying little overhead versus reading the raw Parquet files from S3.

@BohuTANG (Member, Author)

@haydenflinner

> But it would be neat to have Databend in the middle while paying little overhead versus reading the raw Parquet files from S3.

Databend supports an ignore_result suffix that tells the server not to send the result back to the client over the MySQL wire protocol.

For example:

select * from hits limit 20000;

20000 rows in set (0.53 sec)
Read 146370 rows, 101.91 MiB in 0.507 sec., 288.51 thousand rows/sec., 200.88 MiB/sec.

With ignore_result (the result is not sent to the client):

mysql> select * from hits limit 20000 ignore_result;
Empty set (0.26 sec)
Read 146370 rows, 101.91 MiB in 0.236 sec., 619.37 thousand rows/sec., 431.24 MiB/sec.

@haydenflinner

@BohuTANG That is neat, and it confirms my suspicion that the MySQL protocol is a bottleneck in some use cases. Parquet read speeds are in the GB/s, but even when telling the MySQL client not to handle the result, we get only hundreds of MiB/s. This matches the results in the paper I linked; see "Postgres++" versus "Postgres" in the final table of results.

If one wanted to use Databend as a simple intermediary between dataframes and S3 (more lakehouse style), Databend still provides a lot of value in interactive query handling, file size and metadata management, a far simpler interface, etc. But it presents a bottleneck when it comes to raw read speed. If I wanted to do, for example, df = pd.read_sql("select * from hits limit 1000000"), I think that would be about 10x slower than df = pd.read_parquet("local-download-of-hits.parquet"), but I suspect that is primarily due to MySQL protocol overhead; the rest of Databend is so fast I wouldn't expect it to get in the way much. I can file a ticket for this, don't let me derail the 1.0 thread, sorry 😄
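A rough sketch of that comparison, assuming a local Databend listening on its default MySQL port (3307) and a pre-downloaded Parquet copy of the hits table; the connection string, credentials, and file name are illustrative placeholders, not something from this thread:

```python
import time

import pandas as pd
from sqlalchemy import create_engine

# Databend speaks the MySQL wire protocol, so a stock MySQL driver/dialect works.
# Host, port, credentials, table name, and the local Parquet path are placeholders.
engine = create_engine("mysql+pymysql://root:@127.0.0.1:3307/default")

t0 = time.time()
df_sql = pd.read_sql("SELECT * FROM hits LIMIT 1000000", engine)
print(f"MySQL protocol: {len(df_sql)} rows in {time.time() - t0:.2f}s")

t0 = time.time()
df_pq = pd.read_parquet("local-download-of-hits.parquet")
print(f"local Parquet:  {len(df_pq)} rows in {time.time() - t0:.2f}s")
```

The gap between the two timings is roughly the cost of serializing rows over the MySQL protocol, which is the overhead being discussed here.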

@haydenflinner

I believe the modern open-source protocol most similar to what that paper describes is "Apache Arrow Flight".

@sundy-li (Member) commented Feb 25, 2023

> I believe the modern open-source protocol most similar to what that paper describes is "Apache Arrow Flight".

Yes, we plan to do this in #9832.

If the query result is small, the MySQL client works fine; OLTP result sets are usually small, so it's OK.

Otherwise, we should use other formats or protocols to handle large output (the MySQL client is really bad in this case).

You can use:

  1. The unload command, to export data to storage in Parquet/CSV formats: https://databend.rs/doc/unload-data/ (see the sketch after this list)
  2. The HTTP/ClickHouse handler, to export the data:
     curl 'http://default@localhost:8124/' --data-binary "select 1,2,3 from numbers(3) format TSV"
  3. Wait for the Flight SQL feature: that's the native client!
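For option 1, a rough sketch of issuing the unload from Python over the MySQL handler; the stage name my_stage and the exact COPY INTO options are assumptions based on the unload-data docs linked above, so check those docs for the authoritative syntax:

```python
import pymysql

# Connection parameters and the stage name `my_stage` are placeholders.
conn = pymysql.connect(host="127.0.0.1", port=3307, user="root", password="")
try:
    with conn.cursor() as cur:
        # Unload the query result into the stage as Parquet files; clients can
        # then fetch those files directly from object storage.
        cur.execute(
            "COPY INTO @my_stage FROM (SELECT * FROM hits) "
            "FILE_FORMAT = (TYPE = PARQUET)"
        )
finally:
    conn.close()
```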

This paper did not cover clickhouse-client, but AFAIK, clickhouse-client is the best client/protocol I have ever seen.
