[AST] Add 'Fix' #6793
base: master
Conversation
({cpu: 410655782016
| mem: 1584672947})
Lol. Must be `Fix` impeding optimization; I can't think of anything else that could cause code to become slower.
Aggregate Multi Key

  n     Script size           CPU usage                Memory usage
  ----------------------------------------------------------------------
  -     1705  (10.4%)         3446371236  (34.5%)      422386  (3.0%)
  -     1698  (10.4%)         3439411236  (34.4%)      378886  (2.7%)
About 10% improvement for things in this file. Not a lot, but not too little either.
Very insignificant changes.
({cpu: 193142904
| mem: 819552})
2% for CPU and 3% for MEM is nothing.
({cpu: 77960900
| mem: 424300})
5.9% for CPU and 6.7% for MEM is better, but still not a lot.
| budget: ({cpu: 2999382
| mem: 14712})
| budget: ({cpu: 2215382
| mem: 9812})
26% CPU and 33.3% MEM.
| budget: ({cpu: 6960100
| mem: 43600})
| budget: ({cpu: 5600100
| mem: 35100})
19.5% CPU and 19.5% MEM.
| budget: ({cpu: 8124345
| mem: 48505})
| budget: ({cpu: 6684345
| mem: 39505})
17.7% CPU and 18.5% MEM.
({cpu: 128100
| mem: 900})
Warmup cost for the Z combinator is really high somehow, compared to the native fixpoint combinator.
({cpu: 5486960
| mem: 26920})
9.5% for CPU and 11.8% for MEM.
3410
Size differences are all insignificant, unsurprisingly.
Force-pushed from 5f99156 to 2da59cb
/benchmark nofib
Click here to check the status of your benchmark.
Comparing benchmark results of 'nofib' on '52664194d' (base) and '2da59cbe1' (PR)
looooooooooooooool
/benchmark nofib
Click here to check the status of your benchmark.
/benchmark lists
Comparing benchmark results of 'nofib' on '52664194d' (base) and '2da59cbe1' (PR)
Click here to check the status of your benchmark.
OK, so that was noise. It's kinda hard to believe that the new way of doing recursion is as efficient as the old one... But at the same time I don't think the benchmarks still use the old Z combinator, because the golden budget tests are updated. This is really funny: is the cost of doing recursion this insignificant for complex programs like the ones in nofib?
Comparing benchmark results of 'lists' on '52664194d' (base) and '2da59cbe1' (PR)
Click here to check the status of your benchmark.
/benchmark marlowe
/benchmark validation
/benchmark bls12-381-benchmarks
Comparing benchmark results of 'nofib' on '52664194d' (base) and '2da59cbe1' (PR)
Force-pushed from b16c9a6 to ab2fd81
What happens if we evaluate a reference to
What if we changed the definition of
I agree, this is pretty tricky. I'd be tempted to at least try a version that doesn't rely on the in-memory loop. I'd be interested to see if there is much difference in performance. Overall the benchmarks are a bit surprising. I was thinking there might be size benefits, but also no. Unless we can squeeze more performance out of it, I'm unsure if it's enough of a win. I guess we just don't do enough recursion for it to really bite 🤔
Very good point, thank you. It is indeed odd. But a bottom is a bottom, I guess; we can always say that we don't distinguish one bottom from another, like GHC does. Operationally it's kind of a mess though, I agree.
It'd be a bunch of pointless overhead, as far as this specific PR is concerned. I think looking inside of the body of a
Me too, but I couldn't come up with a solution quickly and ended up implementing a circular program instead 🙈
I'd also be curious, but if we agree that what we see doesn't warrant an extra AST node, then neither will another solution and we shouldn't waste time figuring it out.
Yeah, I feel the same way and @colll78 said something along these lines too.
Actually, I think it's a bug, so thanks for catching it! We don't have names, and even if we had them, that would be variable capture.

But currently referencing

I think what we need to do in

I'll try it out. Thanks!
I guess overall I would expect this to have basically identical semantics to our existing recursion combinators, and if it doesn't then that's a bit suspicious. Being strict does make the naive approach tricky. We can't just stick the recursively defined value into an environment, since it's not a value until we evaluate it, which we can't do until we've put it in the environment. Did we consider a version that is instead an encoding of Z? So we know that the body can be eta-expanded?
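For reference, a minimal sketch of the Z combinator with the eta-expansion that makes it work under call-by-value. This is plain Haskell purely for illustration (the `Self` wrapper is only there to type the self-application) and is not the actual on-chain encoding:

```haskell
-- Illustrative only: a call-by-value fixpoint (Z) via explicit self-application.
newtype Self a = Self { unSelf :: Self a -> a }

z :: ((a -> b) -> (a -> b)) -> (a -> b)
z f = w (Self w)
  where
    -- The eta-expansion (\a -> ...) delays the self-application until an
    -- argument arrives, which is what makes this work under strict evaluation.
    w s = f (\a -> unSelf s s a)

-- Example: factorial defined through 'z' rather than direct recursion.
factorial :: Integer -> Integer
factorial = z (\rec n -> if n <= 0 then 1 else n * rec (n - 1))

main :: IO ()
main = print (factorial 10)  -- 3628800
```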
Force-pushed from 834aa00 to b2b4864
Force-pushed from 412a66e to 97c1471
/benchmark nofib
/benchmark lists
Click here to check the status of your benchmark.
Comparing benchmark results of 'nofib' on '30c3db402' (base) and '97c1471b6' (PR)
Comparing benchmark results of 'lists' on '30c3db402' (base) and '97c1471b6' (PR)
Re: @michaelpj's comment above. Yes, it is common to require the body of the fix to be a value. Here are the relevant lines from a denotational semantics I wrote recently. Might be worth checking whether requiring the fixpoint body to be a lambda abstraction makes a difference in the benchmarks.

Intrinsically-typed term constructor (where

Fixpoint operator.

Denotational semantics
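(The original snippets aren't reproduced above; purely as an illustration of the usual shape of such a clause, not the actual definitions, a fixpoint whose body is required to be a lambda abstraction can be given a least-fixed-point denotation along these lines:)

$$
\llbracket \mathsf{fix}\ x.\ \lambda y.\, M \rrbracket_{\rho} \;=\; \mathrm{lfp}\bigl(\lambda d.\ \llbracket \lambda y.\, M \rrbracket_{\rho[x \mapsto d]}\bigr)
$$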
This is an experiment adding `Fix` to the AST to see how much faster we could recurse.

The main change is adding this constructor to the untyped AST:
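(The snippet isn't preserved in this rendering; roughly, it is a binder-style constructor of the following shape, with exact field names and order possibly differing:)

```haskell
-- Sketch of the shape only (existing constructors abbreviated; the new 'Fix'
-- binds the name used for the recursive call in its body):
data Term name uni fun ann
  = Var !ann !name
  | LamAbs !ann !name !(Term name uni fun ann)
  | Apply !ann !(Term name uni fun ann) !(Term name uni fun ann)
  -- ... the other existing constructors ...
  | Fix !ann !name !(Term name uni fun ann)
```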
which in the typed ASTs becomes
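(Again only the rough shape, not the exact code; in the typed setting the binder presumably also carries the type of the recursive call:)

```haskell
-- Sketch of the shape only, reusing the existing 'Type tyname uni ann':
data Term tyname name uni fun ann
  = Var !ann !name
  | LamAbs !ann !name !(Type tyname uni ann) !(Term tyname name uni fun ann)
  | Apply !ann !(Term tyname name uni fun ann) !(Term tyname name uni fun ann)
  -- ... the other existing constructors ...
  | Fix !ann !name !(Type tyname uni ann) !(Term tyname name uni fun ann)
```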
with the following typing:
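(The rule itself isn't preserved here; for a binder-style fix it is presumably the standard one:)

$$
\frac{\Gamma,\, x : A \vdash M : A}{\Gamma \vdash \mathsf{fix}\ x.\ M : A}
$$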
and the following operational semantics:
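(Reconstructed from the description that follows rather than copied verbatim: rule (1) evaluates the body of the fix, and once the body is a value, rule (2) substitutes the already-evaluated fixpoint for the name of the recursive call:)

$$
\frac{M \longmapsto M'}{\mathsf{fix}\ x.\ M \longmapsto \mathsf{fix}\ x.\ M'}\ (1)
\qquad\qquad
\mathsf{fix}\ x.\ V \longmapsto V[\,x := \mathsf{fix}\ x.\ V\,]\ (2)
$$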
I.e. we make `fix` a binder so that we can evaluate the body of a `fix` once (the (1) rule) and then substitute it for the name of the recursive call without recomputing it each time (the (2) rule).

In the CK machine this looks as
The reason why we match on `VLamAbs` is to make the semantics of the CK machine match the ones of the CEK machine, where we have to match on `VLamAbs` in order to update its environment. This is how it looks:
The tricky parts:

- In `computeCek` we save the current size of the environment in `FrameFix` and proceed to evaluate the body in the current environment extended with a `VBlackHole`. `VBlackHole` is a new kind of `CekValue` that we add to provide a value for the recursive call while we're evaluating the body of a fixpoint to a value. If while evaluating the body we reach `VBlackHole`, we know that this is infinite recursion and terminate the program early. The name is inspired by black holes in GHC, which are thunks being evaluated (and if evaluation of a thunk runs into a black hole, that is also a loop and the Haskell program terminates).
- We then update the `rec :=> VBlackHole` environment entry by replacing `VBlackHole` with an actual value for `rec`; for that I added `contUpdateZero` (similar to `contIndexZero`). At that point we might have added new entries to the environment, and so we need to calculate the De Bruijn index of `rec` in the environment by subtracting the old size of the environment from the new one.
- We handle `FrameFix` in `returnCek` without creating any new frames. We're able to achieve that because we encode recursion through a circular object: `env'` refers to `bodyV'` and vice versa -- this is an actual loop in memory. The idea here is to make the environment of `bodyV'` contain `bodyV'` itself, including its environment containing `bodyV'`, including its environment... (see the toy sketch below).
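To make the black-hole-plus-circular-environment idea concrete, here is a small self-contained toy model (plain Haskell with invented names, not the CEK machine itself): the fix body is evaluated with the recursive name bound to a black hole, the result is required to be a lambda, and the resulting closure is then knot-tied so that its environment contains the closure itself:

```haskell
{-# LANGUAGE LambdaCase #-}
module Main where

import qualified Data.Map as Map

-- A toy language with a native 'Fix' binder.
data Term
  = Var String
  | Lam String Term
  | App Term Term
  | Fix String Term          -- fix rec. body
  | Lit Integer
  | Sub Term Term            -- built-in subtraction (for the example)
  | Mul Term Term            -- built-in multiplication
  | IfZero Term Term Term    -- built-in zero test

type Env = Map.Map String Value

data Value
  = VLit Integer
  | VLam String Term Env     -- the Env field stays lazy, which is what allows the knot-tying below
  | VBlackHole               -- stands in for the recursive call while the fix body is being evaluated

eval :: Env -> Term -> Value
eval env = \case
  Var x        -> env Map.! x
  Lam x b      -> VLam x b env
  Lit n        -> VLit n
  Sub a b      -> arith (-) (eval env a) (eval env b)
  Mul a b      -> arith (*) (eval env a) (eval env b)
  IfZero c t e -> case eval env c of
    VLit 0 -> eval env t
    VLit _ -> eval env e
    _      -> error "IfZero: not an integer"
  App f a      -> apply (eval env f) (eval env a)
  Fix rec body ->
    -- Evaluate the body once, with the recursive name bound to a black hole.
    case eval (Map.insert rec VBlackHole env) body of
      VLam x b closEnv ->
        -- Tie the knot: the closure's environment maps 'rec' to the closure itself.
        let v = VLam x b (Map.insert rec v closEnv) in v
      VBlackHole -> error "fix: the body demanded its own value (infinite recursion)"
      _          -> error "fix: the body did not evaluate to a lambda"

apply :: Value -> Value -> Value
apply (VLam x b closEnv) arg = eval (Map.insert x arg closEnv) b
apply VBlackHole         _   = error "apply: hit a black hole (recursive call before the fix was a value)"
apply _                  _   = error "apply: not a function"

arith :: (Integer -> Integer -> Integer) -> Value -> Value -> Value
arith op (VLit a) (VLit b) = VLit (a `op` b)
arith _  _        _        = error "arith: not integers"

-- factorial = fix rec. \n -> if n == 0 then 1 else n * rec (n - 1)
factorial :: Term
factorial =
  Fix "rec" $ Lam "n" $
    IfZero (Var "n")
      (Lit 1)
      (Mul (Var "n") (App (Var "rec") (Sub (Var "n") (Lit 1))))

main :: IO ()
main = case eval Map.empty (App factorial (Lit 10)) of
  VLit n -> print n  -- 3628800
  _      -> error "unexpected result"
```

The PR builds the same kind of in-memory loop via `contUpdateZero` on the CEK environment; the Haskell-level laziness above is just the simplest way to show the resulting circular structure.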
This should be the fastest way of evaluating `Fix`, since it doesn't create any additional frames, we just literally evaluate the body of a fixpoint multiple times. But it comes with a big disadvantage of creating an infinite structure requiring very careful handling if we want to support it in `dischargeCekValue` or the `Show` instance (we'll need to either use `unsafePtrEquality` to check for pointer loops or make `VLamAbs` carry a different sort of name indicating that it's an infinitely unrolled fixpoint).

Yet, even with such a performance-focused design the results are very underwhelming: `nofib` isn't detectably faster (do we have other benchmarks that get recompiled every time?). The `lists` benchmarks are faster by up to 14.5%, which is normally a very welcome improvement, but given that we're adding an entire new AST node (with very non-trivial evaluation rules), is it worth it?

Note that we don't save much size either; this is what replacing the Z combinator with native recursion in a script amounts to:
![image](https://private-user-images.githubusercontent.com/10480926/405251813-da405357-8093-42e8-be64-4be8725ad661.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3Mzk1NzUwNDUsIm5iZiI6MTczOTU3NDc0NSwicGF0aCI6Ii8xMDQ4MDkyNi80MDUyNTE4MTMtZGE0MDUzNTctODA5My00MmU4LWJlNjQtNGJlODcyNWFkNjYxLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMTQlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjE0VDIzMTIyNVomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTgyMmU0M2ViMGNjOWE5MTFkZWRhMmNkNTk5NmM4OWQ5NWJhNTlmN2EwNjQ0ZTcwZmQzMWRjZDFkMmUxMGIxZjUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.o4gMqyuPSw1yLMLWf5O6pqRsgH0xvtXRldI4wtUWH28)