Can WebAssembly.Instance recompile an already-compiled Module? #838
Would be nice for |
The specification doesn't define what "compilation" is, nor would it make sense for it to.
|
Agreed this would be a non-normative note at most. In SM, we are also currently intending for instantiation to never recompile so that there is a predictable compilation cost model for devs (in particular, so that devs can use But I do appreciate that separate compilation is fundamentally at odds with a variety of optimizations one might want to do to specialize generated code to ambient parameters. Fusing compilation and instantiation into one async op makes sense and is something we've considered in the past. The downside, of course, is that this inhibits explicit caching (there is no
The latter intrigues me and could provide a much nicer developer experience than using IDB and perhaps a chance to even further optimize caching (since the cache is specialized to purpose), but it's certainly a big feature and something we'd want to take some time to consider. |
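The separate-compilation model @lukewagner describes can be sketched with today's API. The byte array below is a hand-assembled stand-in (a trivial module exporting `f()` returning 42), not anything from this thread; the point is that only `compile` should be expensive under the "no recompile" cost model:

```javascript
// A minimal hand-assembled wasm module: exports one function `f` -> 42.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // \0asm magic + version 1
  0x01, 0x05, 0x01, 0x60, 0x00, 0x01, 0x7f,       // type section: () -> i32
  0x03, 0x02, 0x01, 0x00,                         // function section: func 0 has type 0
  0x07, 0x05, 0x01, 0x01, 0x66, 0x00, 0x00,       // export section: "f" -> func 0
  0x0a, 0x06, 0x01, 0x04, 0x00, 0x41, 0x2a, 0x0b, // code section: i32.const 42; end
]);

WebAssembly.compile(bytes).then((module) => {
  // Two instances share one compiled Module; instantiation is expected
  // to be cheap and predictable, with all the cost paid in compile().
  const a = new WebAssembly.Instance(module);
  const b = new WebAssembly.Instance(module);
  console.log(a.exports.f(), b.exports.f()); // 42 42
});
```

The same `Module` object is also what explicit caching would structured-clone into IndexedDB so later page loads can skip `compile` entirely.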
@rossberg-chromium I seem to have explained my purpose badly: I don't want to quibble over what the spec says. I'm trying to point out what seems like a serious surprise for developers, hiding under the API. A developer won't expect @lukewagner even with implicit or explicit caching we may have the same issue: how many Maybe I'm not trying to eliminate the possibility of re-compilation, I think we rather want a common idiomatic API usage which has perfect information w.r.t. |
@jfbastien With implicit/explicit caching that was provided the particular instantiation parameters (so |
There may be:
At this point I agree you don't need recompilation, but we're being a bit silly if we do slow bounds checks when we don't need to. As I said: I like this |
Well that's my point about having an enhanced cache that accepts instantiation parameters and bytecode: the cache is free to recompile if what it has cached doesn't match the instantiation parameters. So the steps would just be:
and after step 8, all future page loads will get cached fast or slow code. |
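A minimal sketch of that parameter-aware cache idea — purely hypothetical, no such browser API exists, and plain objects stand in for compiled code. The key point is that the cache is keyed on both the bytecode and the instantiation parameters, so it is free to recompile on a mismatch:

```javascript
// Hypothetical parameter-aware code cache: keyed on (bytecode, memory mode).
const codeCache = new Map();

function cachedCompile(bytecodeKey, memoryMode) {
  const key = `${bytecodeKey}:${memoryMode}`;  // e.g. "app.wasm:fast" vs "app.wasm:slow"
  if (!codeCache.has(key)) {
    // Stand-in for an actual compilation specialized to this memory mode.
    codeCache.set(key, { key, compiledFor: memoryMode });
  }
  return codeCache.get(key);
}

const fast  = cachedCompile('app.wasm', 'fast');
const again = cachedCompile('app.wasm', 'fast'); // cache hit: same code object
const slow  = cachedCompile('app.wasm', 'slow'); // mismatch: separate compilation
console.log(again === fast, slow === fast);      // true false
```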
@lukewagner First of all, you're proposing a mitigation that flies in the face of the stated goal of WebAssembly providing deterministic performance. The difference between slow and fast was last quoted as around 20%, so it would really stink if a spec that painstakingly aims for deterministic perf drops it on the floor because of an API quirk. I don't buy that the browser having a content-addressed cache is the right solution, because the spec already goes to a lot of trouble elsewhere to obviate the need for profile-recompile-cache optimizations. For example, we promisify compilation precisely so that the app can get reasonable behavior even if the code is not cached. If the way this is spec'd forces all of us to implement caches or other mitigations then we will have failed our goal of giving people a reasonably portable cost model. To me the issue is just this: one of the optimizations that we will all effectively have to do for competitive reasons (the 4GB virtual memory bounds checking, which I'll just call the 4GB hack) cannot be done in the current spec without sacrificing one of these things:
I think that this means that the spec will encourage vendors to converge to just allowing 4GB reservations anytime wasm memory is allocated, or to have cache/lazy-compile/profile optimizations to detect this. Finally, I don't understand the point about making any of this non-normative. This can be normative, because we could make the API preclude the possibility of the browser having to compile something without knowing what kind of memory it will have. I imagine that there are many ways to do this. For example, instantiating could return a promise and we could remove the separate compile step. This would make it clear that instantiation is the step that could take a while, which strongly implies to the client that this is the step that does the compilation. In such an API, the compiler always knows if the memory it's compiling for has the 4GB hack or not. It's sad that we're only noticing this now, but I'm surprised that you guys don't see this is a bigger issue. Is there some mitigation other than caching that I'm overlooking? |
@jfbastien in your motivating scenario, you pointed out that the module was authored to prefer fast memory. I'm assuming you're primarily interested in enabling the fast memory optimization when a particular module wants it, and might be OK with not doing it when the module doesn't want it (nothing bad with opportunistically stumbling upon it in that case, too, just trying to tease apart priorities). If so, how would these alternatives to caching or to async Instantiate feel:
|
For the issue of the "4GB hack" and mismatches between memory using it and code expecting it, would it make sense for compilation to internally emit two versions of the code? (Obviously this would use more memory, which is sad, but hopefully compile time wouldn't be much worse, the writer could generate both at once?) |
@mtrofin I don't think it makes sense to ask for 4GiB if you don't intend to use it. The virtual allocation is separate from the use intent, so I think we'd need to separate both. On 2.: it still isn't super helpful to the developer: if they use that variant and it fails, then what? @kripken I don't think double compilation is a good idea. |
@kripken I think that's what we would do without any other resolution to this issue. I want WebAssembly to be great in the case of casual browsing: you tell me about a cool thingy, send me the URL, I click it, and I amuse myself for a few minutes. That's what makes the web cool. But that means that many compiles will be of code that isn't cached, so compile time will play a big part in a user's battery life. So, double-compile makes me sad. |
That's not really practical, since many devices don't have 4GB of physical memory. Also, that's hard to spec.
I don't think we want double compiles. |
@pizlonator Thus far, we haven't considered designs which required different modes of codegen: we've just always allocated 4gb regions on 64-bit and observed this to succeed for many many thousands of memories on Linux, OSX and Windows. We have a conservative upper bound to prevent trivial total exhaustion of available address space which I expect will be sufficient to support the many-small-libraries use case. So I think the new constraint we're addressing here is that iOS has some virtual address space limitations which could reduce the number of 4gb allocations. So one observation is that a large portion of the bounds-check-elimination allowed by the 4gb hack can be avoided by just having a small guard region at the end of wasm memory. Our initial experiments show that basic analyses (nothing to do with loops, just eliminating checks on loads/stores w/ the same base pointer) can already eliminate roughly half of bounds checks. And probably this could get better. So the 4gb hack would be a more modest, and less necessary, speedup. Another idea I had earlier would be to pessimistically compile code with bounds checks (using elimination based on the guard page) and then nop them out when instantiating with a fast-mode-memory. Combined, the overhead could be pretty small compared to idealized fast-mode code. |
This isn't an iOS-specific problem. The issue is that if you allow a lot of such allocations then it poses a security risk because each such allocation reduces the efficacy of ASLR. So, I think that the VM should have the option of setting a very low limit for the number of 4GB spaces it allocates, but that implies that the fall-back path should not be too expensive (i.e. it shouldn't require recompile). What limit do you have on the number of 4GB memories that you would allocate? What do you do when you hit this limit - give up entirely, or recompile on instantiation?
I agree that analysis allows us to eliminate more checks, but the 4GB hack is the way to go if you want peak perf. Everyone wants peak perf, and I think it would be great to make it possible to get peak perf without also causing security problems, resource problems, and unexpected recompiles.
Code that has bounds checks is best off pinning a register for the memory size and pinning a register for memory base. Code that uses the 4GB hack only needs to pin a register for memory base. So, this isn't a great solution. Besides the annoyance of having to wrangle the spec and implementations, what are the downsides of combining compilation and instantiation into one promisified action? |
I'm not an expert on ASLR but, iiuc, even if we didn't have a conservative bound (that is, if we allowed you to keep allocating until
Well, since it's from asm.js days, only 1000. Then the memory allocation just throws. Maybe we'll need to bump this, but even with many super-modularized apps (with many separate wasm modules each) sharing the same process, I can't imagine we'd need too much more. I think
As I mentioned above, adding a |
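The failure mode described above — allocation just throws once the engine won't reserve another region — is observable from JS. The over-the-spec-maximum request below (more than 65536 64KiB pages, i.e. over 4GiB) is only a portable stand-in for hitting an engine-internal reservation limit:

```javascript
// Memory allocation fails by throwing rather than silently degrading.
// A request above the 4GiB architectural maximum is guaranteed to fail;
// engines typically report it as a RangeError.
let error = null;
try {
  new WebAssembly.Memory({ initial: 70000 }); // > 65536 pages: must fail
} catch (e) {
  error = e;
}
console.log(error !== null); // true
```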
New idea: what if we added an async version of That still leaves the question of what to do when a The IDB part is a bit more work, but that seems pretty close to ideal, performance-wise. WDYT? |
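For concreteness, the async instantiate being floated here is essentially what later shipped as the two-argument `WebAssembly.instantiate(bytes, imports)`: the engine sees the actual Memory at compile time and hands back both the Module (for caching) and the Instance. A sketch, using a hand-assembled module that imports one memory (the `env`/`mem` names are arbitrary):

```javascript
// Minimal hand-assembled module whose only content is an imported memory.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // \0asm magic + version 1
  0x02, 0x0c, 0x01,                               // import section, 1 entry
  0x03, 0x65, 0x6e, 0x76,                         // module name "env"
  0x03, 0x6d, 0x65, 0x6d,                         // import name "mem"
  0x02, 0x00, 0x01,                               // kind: memory, min 1 page
]);

const memory = new WebAssembly.Memory({ initial: 1 });
WebAssembly.instantiate(bytes, { env: { mem: memory } }).then((result) => {
  // The engine compiled with full knowledge of `memory`; the result
  // carries both the runnable Instance and the cacheable Module.
  console.log(result.module instanceof WebAssembly.Module,
              result.instance instanceof WebAssembly.Instance); // true true
});
```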
I think adding async I haven't thought about the caching enough to have a fully-formed opinion yet. |
ASLR affects both code and data. The point is to make it more expensive for an attacker to weasel his way into a data structure without chasing a pointer to it. If the attacker can exhaust memory, he definitely has more leverage.
1000 seems like a sensible limit. I'll ask around with security folks.
Right. I can see a few ways that such an API could be made to work. A cheesy but practical API would be to overload eval:
and then Instance has a getter: module = instance.module, where module is structured-cloneable. What do you think of this?
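Since the proposed eval-style overload doesn't exist, here is a rough approximation of its shape built from today's primitives. `wasmEval` is a hypothetical name; a real engine would fuse the two steps and could specialize codegen to the supplied imports:

```javascript
// Hypothetical one-step compile+instantiate, approximated with the
// existing two-step API. The resulting instance exposes the
// (structured-cloneable) Module via a getter, as proposed above.
function wasmEval(bytes, imports) {
  const module = new WebAssembly.Module(bytes);   // engine could fuse these steps
  const instance = new WebAssembly.Instance(module, imports);
  Object.defineProperty(instance, 'module', { get: () => module });
  return Promise.resolve(instance);
}

// Smallest possible module (just the header) to exercise the shape:
const empty = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]);
wasmEval(empty).then((instance) => {
  console.log(instance.module instanceof WebAssembly.Module); // true
});
```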
Intriguing. Relative to my idea above: Pro: yours is an easy to understand abstraction that is conceptually similar to what we say now. There are three areas where your proposal doesn't give the user as much control as mine:
So, I like mine better. That said, I think your proposal is a progression, so it definitely sounds good to me. |
This issue rests upon how often fragmentation makes allocation of fast

In the scenario where the user is browsing around the web and using lots of

Another possible strategy is to generate one version of the code with
|
It's not just ASLR: it's also pagetable / allocator / etc pollution. We all need to talk to our security folks as well as our kernel / systems folks. Or we can be up-front with developers about the limits each engine imposes on "fast" There are all these crutches we can use, such as nop or double compilation, but why even have crutches? |
@jfbastien I don't think PROT_NONE memory costs page table entries; I think there is a separate data structure that holds mappings from which the page table is lazily populated. @pizlonator I like that idea, and I can see that being what we encourage everyone to use by default in tutorials, toolchain, etc. It's also more succinct and easier to teach if you can simply ignore However, I think we shouldn't take away So if agreed on that, then what this boils down to is a purely additive proposal of the two One tweak, though: I think we shouldn't have a |
Having an async one step API as the recommended case for the typical monolithic app makes sense as the recommended pattern. Agreed with @lukewagner that both the all sync (inline compile) case covered by new Module + new Instance is useful. Adding the two eval variants proposed seem an ok way to introduce this. However I don't like the name, because it will be conflated in (security) folks mind with js eval (which it resembles in one way, but not in terms of scope capture). |
Hah, good point, eval does have a bit of a rep. +1 to |
What would the guideline to the developer be wrt when to use the async instantiate? |
@mtrofin To use |
@lukewagner This seems reasonable.
Agreed.
That's what it sounds like.
Sure feels like a data property is better. Or having WebAssembly.instantiate always return an instance, module pair. |
Is this correct: Suppose you |
@pizlonator Yeah, I can bikeshed it in my head multiple ways. I think I like returning the pair a little better since it'll probably lead to less people accidentally entraining an unused @mtrofin Recompilation can still be necessary when you pluck the |
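The hazard being discussed — plucking the module off an async instantiation and synchronously instantiating it against a different memory — looks like this. The module is a hand-assembled one that imports a single memory (`env`/`mem` are arbitrary names):

```javascript
// Minimal hand-assembled module importing one memory as env.mem.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // \0asm magic + version 1
  0x02, 0x0c, 0x01,                               // import section, 1 entry
  0x03, 0x65, 0x6e, 0x76,                         // "env"
  0x03, 0x6d, 0x65, 0x6d,                         // "mem"
  0x02, 0x00, 0x01,                               // memory, min 1 page
]);
const memA = new WebAssembly.Memory({ initial: 1 });
const memB = new WebAssembly.Memory({ initial: 1 });

WebAssembly.instantiate(bytes, { env: { mem: memA } }).then(({ module, instance }) => {
  // Spec-legal, but memB may not be the kind of memory this Module's code
  // was compiled against — this is where an engine might have to recompile
  // (or fall back to slower code).
  const second = new WebAssembly.Instance(module, { env: { mem: memB } });
  console.log(instance !== second); // true
});
```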
This thread is getting long, looks like it's converging but to be 100% sure we need to write up the code that we expect different users to write:
Anything else? It sounds like @lukewagner has ideas around imports which I don't fully grok. |
Oh, I didn't understand that having the imports available before the wasm XHR even begins is key here. That's a lot trickier then. So most of what I said before is wrong. For us to have the imports, we need to have downloaded, parsed and run all the JS glue. If we do all that before the wasm XHR even begins then it's a very different loading scheme and set of tradeoffs than we have now. In particular, for small to medium projects, maybe this wouldn't even be a speedup, I guess we'd have to measure this - if we haven't already? This would not be something simple for Unity to do. It would require significant changes to the code emcc emits, so that the JS glue can be run before the compiled code is ready. Perhaps we'd want to consider a new model of emitted JS, one JS file for the imports, one JS file for the rest? And it would be opt-in, so we wouldn't break anyone. Anyhow, there's a lot to consider, and without measurements it's hard to guess what is optimal. |
@kripken Not before the XHR but after it completes, and some time before we start running the script that wants to synchronous access to the exports object. I'd expect it could be as simple as putting that exports-using code in some callback called when the instantiation resolves. |
Hmm, I guess I still don't fully understand this then, sorry (in particular, I don't understand why the asyncness interacts with getting the imports - couldn't the benefit of having the imports also apply synchronously?). But it seems like the issues I mentioned above are still relevant even if this is after the XHR completes - that is, we have a single script now that generates the imports and also receives the exports. Splitting that up may lead to tradeoffs - we'll just need to measure when we have time. About putting the exports-using code in a callback, sadly it won't just work for the reasons mentioned earlier, but we could investigate as part of those measurements adding some new compilation mode. |
The signature of |
Talking offline, @lukewagner impressed upon me that the main issue here is having the Memory while compiling the module. Overall then, it seems that there are three issues that relate to how easy it is to use this new API in practice:
Given how the toolchain currently works, doing 2 + 3 is hard, as described above, because it will break existing users. We can do it, but we may not want it on by default, at least not initially - we'd need to consult community members etc. And doing it really well - with no new overhead - will require time and work (doing it quickly can be done by adding a layer of indirection on either the imports or exports). But some other options are trivially easy to use:
I think the last option makes sense. It basically means adding a sync version of the new API, making things more symmetrical with our existing synchronous compilation option. And I don't see a reason to tie the new optimization of knowing the Memory at compile time to asynchronous compilation? (async is great, but a separate issue?) |
I suggest meta objects that describe the imports (the memory) to break the constraint that the imports need to be allocated before compiling. Then the

It would also be very useful to be able to compile before allocating the linear memory on a memory-constrained device, also to be able to optimize compilation for memory allocated at zero in the linear address space, and perhaps for memory with guard zones, and to optimize for memory with a fixed maximum size that can be a power of two plus a spill zone. This meta object could also be used by a wasm user decoder/re-writer to emit code optimized for the negotiated memory characteristics. This is something that was suggested as necessary around the start of this project, many years ago now! |
@kripken While the sync APIs are, I believe, technically necessary, they definitely should be the not-recommended path or else we'll be introducing tons of unnecessary main-thread jank, a regression compared to asm.js + <script async>. Having a compilation API that takes the I think it totally makes sense to start with a new clean slate that is oriented toward the async APIs + caching. This could offer opportunities to fix some of the other inefficiencies we've noticed where load-time steps are executed sequentially when they could be done in parallel. It seems fine to have a compiler flag that emits output equivalent to the old sync asm.js style for the handful of users who depended on this precise output format. |
I've been meaning to provide a better example for why we need to explicitly discourage (or, as I've argued in the past, even disallow) synchronous compilation/instantiation. Hopefully this provides additional fodder for discussion.

Let's zoom out and talk about user experience for a second. Here's a comparison between loading the AngryBots demo asynchronously (via asm.js), left, and synchronously (via WebAssembly in V8), right. Synchronous compilation provides a terrible user experience and breaks progress bars or any other visual indication of application loading.

Whatever the effort, I think we need to work together to make sure that Emscripten, Unity, and other tools export asynchronously loading WebAssembly by default. cc @jechter @juj Not only that, I think we need to make it prohibitively difficult for developers to fall into the trap of writing synchronous loading code. We need to explicitly encourage WebAssembly.compile & WebAssembly.instantiate and discourage new WebAssembly.Module & new WebAssembly.Instance.

Let's dive a little deeper and see how bad that progress bar on the right is, for WebAssembly. Here's a trace captured with the DevTools performance panel: it takes 30 seconds on my MacBook Air to show any progress bar at all. What takes that time? Let's zoom into the ~20s after the wasm module has downloaded when the main thread is totally locked up: compilation is taking ~20s and instantiation is taking ~2s.

Note that in this case, the Unity loader is already calling the asynchronous WebAssembly.compile if it's supported, so the 20s is because V8 still does a synchronous compile under the hood. cc @titzer @flagxor this is something V8 really needs to fix. The 2s instantiation jank is due to the Unity loader code calling the synchronous new WebAssembly.Instance. This is what we really need to fix in both Unity code and Emscripten.

I think it's also worth mentioning that this is not merely the risk of a developer shooting themselves in the foot or not. The average web page includes dozens of third party scripts: those scripts all have the potential to include top-document-janking WebAssembly.

(If you'd like to explore the trace in more detail you can view the full timeline here: https://chromedevtools.github.io/timeline-viewer/?loadTimelineFromURL=https://www.dropbox.com/s/noqjwql0pq6c1wy/wasm?dl=0) |
I agree 100% that async is great :) But I'm saying it's orthogonal to the optimization of getting the Memory at the same time as compiling the Module. And that we can't trivially get async+that optimization - we can get that combo, but either at the cost of some overhead, or of time if we introduce a new flag for it and find a way to phase it in to be default. So if we want that Memory/compilation optimization now, on by default, and at full efficiency, then tying it to being also async is a problem. But if it either isn't urgent, or we are ok with it not being on by default, or we are ok with some new overhead, then no problem. |
Since developers are opting into wasm, I think it makes sense to take the opportunity to change the top-level structure a bit with a new default (with a way to opt-in to the old behavior if necessary). People using sync <script> tags after a sync instantiation script seems like an antipattern that we should migrate to something async anyway. |
@kripken I think you come to different conclusions than I, @lukewagner and @s3ththompson have because to you the most important part is to offer a seamless transition from asm.js for existing developers. Is this correct? I agree this is a valid concern. I weigh it lower because IMO with WebAssembly we're bringing way more developers in than asm.js had. I'd like to avoid burning early adopters, but IIUC existing developers are pretty flexible and want asynchronous compilation. If we assume that they want asynchronous compilation, and are willing to refactor some to get it, then all that's left is having memory at compilation time, which is what this API provides. It's highly desirable because it avoids a pitfall. Earlier you expressed a concern at the amount of work involved on the Emscripten side to support this. Do you still think this is an issue? |
FWIW, Unity 5.6 will use WebAssembly.instantiate.
jonas
|
@jfbastien Maybe I don't understand what you mean by "conclusions". I am mostly just presenting options here. I don't have a conclusion myself. If we want the Memory-at-Compile-time opt right now, I proposed a few options that can let us have it ASAP, trivially modifying our current sync compilation. Or if we want that opt now in an async way, I proposed an option ("1 + 3", i.e., not receiving all the imports at compile time, just the Memory) that can let us have it ASAP as well. Or if we want to focus now on an async version of that opt with the current API (not "1 + 3") then we can get that too, it will just take some careful planning because it would be a breaking change. I don't think it's an option for us here to decide to break existing users, so we would need to consult the community. Possibly there will be few concerns and we can just do it, in which case, how much effort depends on how much overhead we are willing to tolerate - if we cannot accept any overhead at all, then it might be a lot. Or possibly there will be concerns, which is my personal intuition - any breaking change must be done gradually - in that case it will take longer, we'd probably start with a new option and eventually plan to make it the default. Again, no conclusions from me. All the above are options. It really depends what you care about more: Memory-at-Compile-time, or async, or both; and the issue of tolerating new overhead vs not; and whether you want this urgently or can wait, etc. etc. I'm happy to help on the emscripten side with any of the above, however people decide. |
To be sure I understand:
Put another way: what's the ideal end-state for WebAssembly if we had infinite time and no "legacy"? I think we've designed for that ideal end-state, and you're pointing out hurdles on the way, but I'm still not sure that's the case. If so, then I'm not sure I have enough information to figure out if WebAssembly needs a stopgap to ease transition, or if temporary burden is acceptable for a subset of Emscripten users (and which users). |
About the overheads in some of the options: As described above the tricky thing is we must send the imports and receive the exports synchronously currently, and that we don't have the JS imports until the JS is ready (but we do have the Memory!). One way to get around that is to add a layer of indirection on either the imports or exports. For example, we could provide thunks instead of the true imports very early (in the HTML, before the JS), so we appear to provide them synchronously (but it's just the thunks, which are trivial to create even before we have the JS). Then later when we have the JS, we can update the thunks to point to the right place. This would add overhead of another JS call and a JS object lookup for each import call. Also some overhead in code size and startup. About breaking users: We want existing emscripten users to be able to flip a flag and get working wasm. Many users and partners are happy about that. It lets them compare asm.js and wasm and it lets projects they spent effort on porting benefit from wasm with no new porting effort. And it avoids the risk that if they also need to asyncify their startup code while porting to wasm, and something breaks, they might blame wasm unnecessarily. |
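The thunk layer described above can be sketched in a few lines (hypothetical names; real Emscripten glue would generate this). The trade-off is exactly as stated: one extra JS call plus one property lookup per import call:

```javascript
// Early (in the HTML, before the JS glue exists): hand wasm trivial
// thunks as imports, so the imports appear to be available synchronously.
const real = {};                          // populated later by the JS glue
const thunkImports = {
  env: { log: (x) => real.log(x) },       // thunk: extra call + object lookup
};

// ...the wasm module could be instantiated with `thunkImports` here...

// Later, once the JS glue has run: retarget the thunk at the real import.
let lastLogged;
real.log = (x) => { lastLogged = x; };

// An import call made by the wasm code now reaches the real function:
thunkImports.env.log(7);
console.log(lastLogged); // 7
```

The same trick works in the other direction (thunking the exports) at a similar per-call cost.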
That seems incidental and not something we should bake in forever with an API. Furthermore, for the reasons @s3ththompson explained, even ignoring the memory-at-compile-time issue, we need instantiation to be async anyway. So I agree with @jfbastien that we have our ideal end-state API here and that should not be what we compromise on here. About breaking users: if we offer a simple migration path (like an "onload" event/promise), it should be simple for users to migrate, with the carrot being all the major perf wins. I don't think we have any evidence that exact asm.js source backcompat is a hard requirement for the default configuration; we have many examples of quite the opposite: users willing to do whatever to achieve the optimal perf since that is, of course, the whole point. |
Talking to @lukewagner offline, we think the best thing is to start with enabling instantiate for wasm in the default recommended setting in emscripten, by adding an indirection layer (as described above, but on the exports). It will probably have negligible overhead for most use cases. So this gets us the benefits we all want here. Eventually, we can get rid of that small overhead, but it will be a breaking change that will require a bunch more planning and mailing list discussion etc., as there isn't an obviously good/right way to do it, we realized. But since we think the overhead of the indirection layer is very low, this part isn't urgent. |
@kripken great! It would be helpful to have a diagram showing different instances / JS / imports / exports / memory / tables. Do you want to move this discussion elsewhere (Emscripten / binaryen repo)? I have a mental picture of how I think C++ code should be organized, but it's pretty obvious by now that you don't have the same picture! You've got more experience there, so I'd love to learn from it and help however I can. |
@jfbastien: sure. I'm not clear yet on what you're looking for in a diagram, but yeah, maybe another place is better. For the implementation of this in emscripten there is the issue mentioned before, emscripten-core/emscripten#4711, also feel free to open another if that doesn't cover what you want. |
IIUC Emscripten uses this by default now. Closing. |
To follow up on @s3ththompson's comment, note that synchronous compilation and instantiation are useful. Notably, I recently ran into a situation where I wanted to synchronously load and compile a WebAssembly module in Node.js. When deciding whether to provide async, sync, or both APIs, remember that a browser context is not the only environment in which people want to use WebAssembly. A number of people, myself included, are interested in WebAssembly's potential as an efficient compile target combined with the Node.js runtime, akin to the JVM. While async import/exports may land in Node.js, synchronous import and export will remain the dominant pattern for the foreseeable future. In which case, the ability to load a WebAssembly module and synchronously compile and instantiate that module is highly useful. |
@kgryte I should have clarified that my comment pertained primarily to the browser as an execution context. We've landed on an API surface that still exposes the synchronous APIs. Browsers may impose a size limit on modules passed to the synchronous APIs (Chrome already does, for example), but that limit is configurable by the embedder and should not need to apply to Node. |
@s3ththompson Thanks for the clarification. |
Say I have this code:

Is `WebAssembly.Instance` allowed to block for a substantial amount of time? Could it for example recompile the `WebAssembly.Module`? In most cases I'd say no, but what if the already-compiled code doesn't particularly like the memory it receives? Say, because that memory is a slow-mode memory and the code was compiled assuming fast-mode? Maybe `memories[0]` was a fast-mode memory, but `memories[1023]` sure won't be.

What about this code instead:

Are those calls to `WebAssembly.Instance` allowed to cause recompilation?

Assuming the above makes sense, here are a few related questions:

- How can a developer know that a `WebAssembly.Module` is fast, and that a `WebAssembly.Memory` instance is suitable for such fast code? Right now the answer seems to be "try it and see if you can notice".
- How many `WebAssembly.Memory` instances are developers allowed before they get slow code (counting the implicit ones, e.g. as created by the second example)?
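A sketch of the kind of call in question — synchronously instantiating one compiled Module against several caller-chosen memories. This is illustrative only (a hand-assembled module importing a single memory; the `env`/`mem` names are arbitrary), not the issue's original snippet:

```javascript
// Minimal hand-assembled module whose only content is an imported memory.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // \0asm magic + version 1
  0x02, 0x0c, 0x01,                               // import section, 1 entry
  0x03, 0x65, 0x6e, 0x76,                         // "env"
  0x03, 0x6d, 0x65, 0x6d,                         // "mem"
  0x02, 0x00, 0x01,                               // memory, min 1 page
]);

const module = new WebAssembly.Module(bytes);     // compiled once, up front
const memories = [new WebAssembly.Memory({ initial: 1 }),
                  new WebAssembly.Memory({ initial: 1 })];

// Each call hands the same compiled Module a different Memory. The question
// is whether any of these constructor calls may recompile behind the scenes.
const instances = memories.map((mem) =>
  new WebAssembly.Instance(module, { env: { mem } }));
console.log(instances.length); // 2
```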