Protecting ArrayBuffers from the whatwg reference implementation #3
Comments
Indeed, I think I'll replace …

Although we can fix this for the polyfill, I think this can still cause issues in the future when using a native readable byte stream implementation. A native implementation would actually transfer the … In your …

The "intended" way to do zero-copy reads with a readable byte stream is to use …:

```js
const readable = new ReadableStream({
  type: 'bytes',
  async pull(controller) {
    const byobRequest = controller.byobRequest;
    if (byobRequest) {
      // using a BYOB reader
      const bytesRead = await reader.readInto(byobRequest.view);
      if (bytesRead === 0) {
        controller.close();
      } else {
        byobRequest.respond(bytesRead);
      }
    } else {
      // using a default reader
      const buf = await reader.read();
      if (!buf) {
        controller.close();
      } else {
        controller.enqueue(buf.slice()); // need to copy if `buf.buffer` could be re-used
      }
    }
  }
});
```

Alternatively, you can set …

Some off-topic remarks: …
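As an aside (my own sketch, not from the thread): on the consuming side, the zero-copy path falls out when a stream like the one above is read with a BYOB reader, since each read fills a caller-supplied buffer directly. Here `readable` is assumed to be the byte stream constructed above, in an environment whose streams implementation supports BYOB readers:

```ts
// Hypothetical consumer: drains `readable` with a BYOB reader, re-using one
// ArrayBuffer for every read instead of letting the stream allocate chunks.
async function drain(readable: ReadableStream<Uint8Array>): Promise<number> {
  const reader = readable.getReader({ mode: 'byob' });
  let view = new Uint8Array(new ArrayBuffer(65536));
  let total = 0;
  while (true) {
    const { done, value } = await reader.read(view);
    if (done) break;
    total += value.byteLength; // bytes are already in our buffer; nothing to copy
    // read() transferred our buffer; keep reading into the returned chunk's buffer
    view = new Uint8Array(value.buffer, 0, value.buffer.byteLength);
  }
  reader.releaseLock();
  return total;
}
```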
Unfortunately, for this to be truly zero-copy, you would need to use BYOB readers everywhere, which would require a massive rewrite... I guess that's not really an option. 😕

Perhaps an easier option would be to always copy the chunk before enqueuing it:

```ts
async function next(controller: ReadableStreamDefaultController<Uint8Array>) {
  let buf = await reader.read(controller.desiredSize || null);
  if (buf) {
    controller.enqueue(buf.slice());
  } else {
    controller.close();
  }
}
```

That way, you don't have to change the rest of the code, at the cost of one copy per chunk (given that the polyfill is fixed to not copy in …).
That's an even better fix! I tested that locally and it worked, but assumed failing the wpt tests was a non-starter, so didn't bother suggesting it.
Yes, that's the heart of the issue. I'm happy to conform to the semantics of whatever APIs I need to get the job done, so in my mind it's not so much about what I want (aside from ensuring zero-copy), and more about making the polyfill play nice with node.
Yes, but at the moment the only pooling that's happening is in node core, so if node ever gets a native ReadableByteStream implementation, they'll have to deal with this issue at that point anyway.
In that method …
Yes, I actually do this a few steps before in …

The Apache Arrow project is a streaming zero-copy dataframe standard for large analytics workloads. We compile many flavors of the JS library for both node and browsers, and we want to provide a simple and consistent public API for reading/writing data regardless of which environment you're in. We're sensitive to bundle size and would like to use the native APIs in each environment, so we accept and produce the environment's stream primitives at the edges and adapt them to Iterables and AsyncIterables for use internally (sketched below). We have a few generic io primitive impls, but they're intentionally slim wrappers that adapt the native io primitives to our Iterable/AsyncIterable protocol. We could split everything out and build special versions for each environment, but keeping it all in one place means there's just one npm package to consume, maximizes code reuse (I'm the main JS contributor at the moment, and I'm stretched pretty thin), and reduces the surface area/complexity for future maintainers.
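For illustration, here is roughly what such an adapter layer can look like (my own sketch, not Arrow's actual code; `fromDOMStream` is a name I made up): a whatwg ReadableStream is wrapped in an async generator so the rest of the library only ever sees an `AsyncIterable<Uint8Array>`.

```ts
// Hypothetical adapter from a whatwg ReadableStream to an AsyncIterable protocol.
// Node Readable streams are already async-iterable (Node 10+), so the node-side
// adapter can stay equally thin.
async function* fromDOMStream(stream: ReadableStream<Uint8Array>): AsyncIterable<Uint8Array> {
  const reader = stream.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      yield value;
    }
  } finally {
    reader.releaseLock();
  }
}
```

Internal code can then consume chunks with `for await (const chunk of fromDOMStream(stream)) { … }` regardless of which environment the stream came from.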
Yep, and this is where we reach the limits of our ability to cleanly bridge all these APIs. The "lowest-common-denominator" concept in all these APIs is buffer message-passing. I could bake the "byob" concept into our "bridge" primitives, but then I'd force a copy in node, and would ultimately just be rebuilding the whatwg reference impl. At the end of the day, forcing a copy anywhere (node or the browser) means we can't do things like this as flawlessly: …

Thanks for considering this change! Best,
In most cases I would agree, but in this case there's no way to have an efficient implementation that passes the tests. For the tests to pass, you have to make the original … I'm hoping the …
Wow, TIL that …

If I understand correctly, a Node …

Yeah, it would be cool if we could do that with readable byte streams... You could almost get it to work with:

```js
outer: while (true) {
  let pool = new ArrayBuffer(1024);
  let offset = 0;
  while (offset < pool.byteLength) {
    const { done, value } = await reader.read(new Uint8Array(pool, offset));
    if (done) {
      break outer;
    }
    pool = value.buffer;
    offset += value.byteLength;
    this.push(value);
  }
}
```

But then the next call to …
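(An aside of mine, not part of the thread.) The reason the snippet reassigns `pool = value.buffer` is that a BYOB `read(view)` transfers the buffer you pass in: the view you handed over is detached afterwards, and the same memory comes back attached to the returned chunk. Roughly:

```ts
// Hedged sketch of the transfer behavior a spec-compliant BYOB read has.
async function demo(readable: ReadableStream<Uint8Array>) {
  const reader = readable.getReader({ mode: 'byob' });
  const view = new Uint8Array(1024);
  const { value } = await reader.read(view);
  console.log(view.byteLength);          // 0    — `view`'s buffer was detached by the transfer
  console.log(value?.buffer.byteLength); // 1024 — same memory, now reachable only via `value`
}
```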
I see. Indeed, that's quite a challenge to get a common I/O API across all platforms.
That looks amazing! Good luck with the project! 👍
A commit referencing this issue was added: "Consolidated JS data handling refactor" (apache/arrow#3290, "The Great ArrowJS Refactor of 2018"). Among many other changes, it makes reading from node streams fully zero-copy, adds whatwg/streams support to the IPC readers/writers (ARROW-2839), and makes `MessageReader` enforce message body byte alignment — the work that surfaced bugs in node core (nodejs/node#24817) and in this DOM streams polyfill (MattiasBuelens/web-streams-polyfill#3).
Original issue description:

The whatwg ReadableByteStreamController reference implementation attempts to mimic the `ArrayBuffer.transfer` semantics by copying the underlying ArrayBuffer of any TypedArray that passes through it, and redefining the original's byteLength to be 0: …

To be fair, they state that the reference implementation shouldn't be used as a polyfill. But seeing as how it is being used as a polyfill, this behavior is problematic.

Obviously copying at every step is unacceptable for a production polyfill, but redefining `byteLength` also makes it impossible to use in node (without wrapping every stream to pre-copy each chunk) for interop or testing DOM-streams logic. Node often uses internal ArrayBuffer pools (like here in fs.ReadStream), so the `Object.defineProperty` call is modifying the ArrayBuffer pool, breaking unrelated operations dependent on that pool.

After thoroughly reading/stepping through the code paths in the reference implementation in each place `TransferArrayBuffer` is called to verify the returned buffer isn't modified further, and verifying the behavior with about a thousand (auto-generated) tests and one production use, I've come up with a workaround that solves both these issues:

Before calling `readableByteStreamController.enqueue()`, we can redefine the `buffer` property on the TypedArray to be an instance created via `Object.create()`, whose "slice" function is overridden to return the original, unmodified buffer. This ensures the value returned to `transferredIshVersion = O.slice();` will be the "real" underlying ArrayBuffer, and won't be sliced.

I could keep this hack in userland, but maybe adding something like this to the polyfill would be worthwhile for the following reasons:

* Every call to `controller.enqueue()` has to be guarded by a call to `protectArrayBufferFromWhatwgRefImpl()`, which makes writing a true zero-copy polyfill for FF that much harder.
* We end up calling `protectArrayBufferFromWhatwgRefImpl()` even in environments where the `web-streams-polyfill` isn't being used, like Chrome and Safari. Even though in my testing this wrapping doesn't affect the native stream implementations, if for some reason a user did want to call `chunk.buffer.slice()`, they'd be getting the real ArrayBuffer out, not a slice of it.
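For illustration only, here's a minimal sketch of the kind of userland guard described above (my reading of the description, not the actual `protectArrayBufferFromWhatwgRefImpl`): the chunk's `buffer` is shadowed with a stand-in object whose `slice()` returns the real ArrayBuffer, so the reference implementation's `TransferArrayBuffer` step neither copies the data nor zeroes the real buffer's `byteLength`.

```ts
// Hedged sketch, not the actual implementation referenced in this issue.
// The whatwg reference implementation "transfers" a chunk's buffer by calling
// buffer.slice() and then redefining byteLength to 0. Handing it a decoy object
// keeps the real ArrayBuffer (and any pool it belongs to) intact.
function protectArrayBufferFromWhatwgRefImpl(chunk: Uint8Array): Uint8Array {
  const realBuffer = chunk.buffer;
  const decoy = Object.create(realBuffer, {
    // "slicing" the decoy hands back the untouched original buffer
    slice: { value: () => realBuffer },
    // own byteLength, so the ref impl redefines it on the decoy, not on the pool
    byteLength: { value: realBuffer.byteLength, configurable: true },
  });
  // shadow the prototype getter so the controller sees the decoy instead
  Object.defineProperty(chunk, 'buffer', { value: decoy, configurable: true });
  return chunk;
}

// Hypothetical usage: guard each chunk right before enqueuing it.
// controller.enqueue(protectArrayBufferFromWhatwgRefImpl(chunk));
```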