Tracking issues of OpenDAL API changes #356

Xuanwo · 2024-04-27T10:10:09Z

Hi, iceberger! OpenDAL's coming v0.46 release will have API changes that could affect our project.

New Features

OpenDAL Reader now has concurrent support

OpenDAL Reader now has concurrent support that can read multiple chunks concurrently.

let r = op.reader_with("test.txt").concurrent(4).chunk(4 * 1024 * 1024).await?;
let buf = r.read(0..16 * 1024 * 1024).await?;

The buf here will be fetched in 4 concurrent requests.

To read non-contiguous buffers, please use our fetch API:

let r = op.reader_with("test.txt").concurrent(4).chunk(4 * 1024 * 1024).await?;
let bufs = r.fetch(vec![0..1024 * 1024, 1024..3 * 1024 * 1024]).await?;

OpenDAL will merge close ranges and read them concurrently.

The detailed upgrade guide could be found here.

OpenDAL v0.46 is not related yet so those changes are still possible to be altered. I will try my best to keep this issue update.

API Changes

I list the major changes that we need to take care:

OpenDAL Reader doens't impl `AsyncRead + AsyncSeek` anymore

OpenDAL's Reader now transformed into range based read.

let r = op.reader("test.txt").await?;
let buf = r.read(1024..2048).await?;

Users can transform into AsyncRead + AsyncSeek by using into_futures_async_read:

let r = op.reader("test.txt").await?;
let reader: FuturesAsyncReader = r.into_futures_async_read(0..4096);

But please note:

opendal::Reader adopts zero-cost abstraction, no extra bytes copy and allocation happened.
opendal::FuturesAsyncReader is the same as our old reader, it might have extra bytes copy.

OpenDAL Writer doens't impl `AsyncWrite` anymore

Just like Reader, opendal::Writer doesn't impl AsyncWrite anymore. Users could use opendal's native Buffer for both contiguous and non-contiguous buffers support.

let w = op.writer("test.txt").await?;

// Buffer can be created from continues bytes.
w.write("hello, world").await?;
// Buffer can also be created from non-continues bytes.
w.write(vec![Bytes::from("hello,"), Bytes::from("world!")]).await?;

// Make sure file has been written completely.
w.close().await?;

Users can transform into AsyncWrite by using into_futures_async_write:

let w = op.writer("test.txt").await?;
let writer: FuturesAsyncWriter = r.into_futures_async_write();

Tasks

Although it's possible to simply convert opendal's Reader and Writer into AsyncXxx-based structures, I aim to prepare Iceberg for the most efficient IO methods. In the near future, we will support compilation-based IO and vectorization. The AsyncXxx-based traits do not integrate well with these methods.

feat: Extract FileRead and FileWrite trait #364
Introduce concurrent read support
Introduce concurrent write support

I believe only read side needs to do some changes. write side should be simple to update.

I agree that we should refactor the Input/Output API for best performance, but we should be careful not to expose opendal's api to end user.

Yep, I have a plan to refactor the FileIO as a trait so that users can provide their own implemations via opendal or object_store.

I have met a use case that users will parse storage related input by themselves and don't uses iceberg's own args.

I'm ok with this change, but it's breaking. Let's see what others think. cc @Fokko @sdd @marvinlanhenke @tustvold

tustvold · 2024-04-28T13:39:41Z

I'm not seeing a concrete proposal on which to comment, but the broad theme of moving closer to the actual object store APIs is IMO a good idea.

That being said, I feel I should point out that such an abstraction already exists, is natively supported by both the Rust arrow and Datafusion projects, and already has OpenDAL bindings, namely object_store... Perhaps this is therefore already a solved problem?

I can understand why Iceberg Java felt the need to provide FileIO, as the ecosystem there is still very wedded to filesystem APIs, but the Rust ecosystem has largely avoided this mistake?

I dunno I am far from impartial, but the motivation for object_store is to provide this exact abstraction

marvinlanhenke · 2024-04-28T14:07:13Z

I'm not seeing a concrete proposal on which to comment, but the broad theme of moving closer to the actual object store APIs is IMO a good idea.

That being said, I feel I should point out that such an abstraction already exists, is natively supported by both the Rust arrow and Datafusion projects, and already has OpenDAL bindings, namely object_store... Perhaps this is therefore already a solved problem?

...with the disclaimer, that I have little to no experience regarding the intricacies of object store APIs;

I (also) think we should avoid "reinventing the wheel" here and use already existing solutions; with that in mind if FileIO becomes a trait and we can use an existing implementation e.g. opendal or object_store - I have little to no worries about this proposed change...

...hope this makes any sense at all.

tustvold · 2024-04-28T14:09:39Z

if FileIO becomes a trait and we can use an existing implementation

I guess I was suggesting to just not have a FileIO trait, and just use ObjectStore which is already an object-safe trait

liurenjie1024 · 2024-04-28T14:58:59Z

I think a FileIO trait would still be necessary for two reasons:

It's more familiar to iceberg community, which makes this easier to maintain and align with other implementations such as java/python.
Unlike object_store which provides direct interface to storage, FileIO could be chained, for example when we need to deal with encryption. This way we can hide complexity from end user.

Xuanwo · 2024-05-19T14:41:03Z

Closing now. I thought we would have huge breaking API changes, but it turned out to be a simple, small PR. I'll track the optimization in other issues instead.

Xuanwo mentioned this issue May 7, 2024

feat: Extract FileRead and FileWrite trait #364

Merged

Xuanwo closed this as completed May 19, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking issues of OpenDAL API changes #356

Tracking issues of OpenDAL API changes #356

Xuanwo commented Apr 27, 2024 •

edited

Loading

Xuanwo commented Apr 27, 2024

liurenjie1024 commented Apr 28, 2024

liurenjie1024 commented Apr 28, 2024

Xuanwo commented Apr 28, 2024 •

edited

Loading

liurenjie1024 commented Apr 28, 2024

tustvold commented Apr 28, 2024 •

edited

Loading

marvinlanhenke commented Apr 28, 2024

tustvold commented Apr 28, 2024

liurenjie1024 commented Apr 28, 2024

Xuanwo commented May 19, 2024

Tracking issues of OpenDAL API changes #356

Tracking issues of OpenDAL API changes #356

Comments

Xuanwo commented Apr 27, 2024 • edited Loading

New Features

OpenDAL Reader now has concurrent support

API Changes

OpenDAL Reader doens't impl AsyncRead + AsyncSeek anymore

OpenDAL Writer doens't impl AsyncWrite anymore

Tasks

Related

Xuanwo commented Apr 27, 2024

liurenjie1024 commented Apr 28, 2024

liurenjie1024 commented Apr 28, 2024

Xuanwo commented Apr 28, 2024 • edited Loading

liurenjie1024 commented Apr 28, 2024

tustvold commented Apr 28, 2024 • edited Loading

marvinlanhenke commented Apr 28, 2024

tustvold commented Apr 28, 2024

liurenjie1024 commented Apr 28, 2024

Xuanwo commented May 19, 2024

Xuanwo commented Apr 27, 2024 •

edited

Loading

OpenDAL Reader doens't impl `AsyncRead + AsyncSeek` anymore

OpenDAL Writer doens't impl `AsyncWrite` anymore

Xuanwo commented Apr 28, 2024 •

edited

Loading

tustvold commented Apr 28, 2024 •

edited

Loading