TRANSCODE function API model needs revamp #1916
Comments
Submitted by: BrianH First, some basics about the common uses of TRANSCODE. The most common current use is with the /next or /only options, with the results most often going to SET/any. Not SET, because TRANSCODE with those options tends to be in more low-level code where people are more careful with error triggering. With the /next or /only options, you almost always need the continuation too; ignoring it is rare since incremental translation is the main use of those options.
Using TRANSCODE to do a translation of more than just the first value is rare. A full source translation can currently be done more easily with TO BLOCK!. TRANSCODE/error allows incremental translation with possible recovery from errors, but that is a really difficult task that no one has taken on yet. Nonetheless, TRANSCODE/error really needs that continuation if you want to have a hope of recovering.
The big win will come with TRANSCODE/part (#1915), because that solves a real problem that TO BLOCK! can't without full source copy overhead. That probably doesn't need the continuation set since you know the offset ahead of time, and can get a reference to the source at that offset whenever you want. However, this is a case where the relative overhead of an intermediate wrapper-block is trivial.
Now, for the proposals, by the numbers. Proposal 3 is likely to have the least overhead in the function itself, closely followed by the rest. Proposal 2 optimizes for the most common usage patterns, but you would have to have /next, /only or /error all imply /cont, or else you'll have to specify /cont on most of the common cases, like you have to with proposal 4's /then option. Plus, there's the option processing overhead. And there's no decent name for the /cont option, afaict. You're better off without the options, sticking with fixed behavior. That leaves proposals 1 and 3, the intermediate-block tweak and the continuation-word argument, no options.
TRANSCODE is low-level enough that we can get away with a weird proposal like 3 if the overhead is lowered enough. Proposal 1 is the closest to being Rebol-like, leading to a greater likelihood that people can understand the code you would write to that option; it would require the fewest changes to the TRANSCODE function and none at all to existing code that runs on it, but bring us huge benefits when we implement the /part option. I'll say that I prefer proposal 1, with the willingness to switch to 3 if it brings us overhead reductions that are big enough to outweigh the feeling that you're programming in Pascal. I would be more than happy to find out that proposal 1 was more efficient than 3 though.
Submitted by: BrianH Severity of major because it's a behavioral change that may require code changes, how much code depending on which proposal we do. For proposal 1 or 2 no known existing code will need changing.
Submitted by: BrianH Upon reviewing the native code, it looks like the intermediate block options are looking worse. The values end up having to be passed to SET or SET/any block!, which has more overhead than SET or SET/any word! or set-word assignment. This means that the overhead of setting the word has been moved to somewhere less efficient. As for the word-passing proposals, it looks like the cost of processing the arguments of proposals 3 and 4 would be the same. Given how the code looks between 3 and 4, 4 looks a bit better, at the cost of one more slot in the stack frame.
Submitted by: fork I spent a fair bit of quality time with Red's lexer, which takes binary UTF-8 input and processes it in the PARSE dialect. So once I understood what TRANSCODE did, it jumped out to me that this was a generally useful thing for a PARSE of a binary! to be able to do. e.g. Just as it is useful to write:
parse ["Hello" 10 20] [string! copy value 2 integer!]
It could be useful to write:
value: []
parse rejoin [#{FFFF} to-binary "{Hello} 12-Dec-2012" #{0000}] [
    2 #{FF} transcode value 2 #{0000}
]
I can imagine scenarios where binary wire formats might have a bunch of stuff surrounding a little pocket of UTF-8 encoded Rebol that was slipped in with
What the dialect format should be in lieu of the /part or /only I'm not sure. (e.g. how to specify the refinements; one could use refinement syntax without actually doing refinement lookup, but I don't know the impact). I gave an example of
value: []
parse rejoin [#{FFFF} to-binary "{Hello} 12-Dec-2012" #{0000}] [
    2 #{FF}
    pos:
    (
        value: transcode/part pos
        newpos: last value
        take back tail value
    )
    :newpos
    #{0000}
]
Then
This looks to be the shape of what's going on, and it could be a nice general parse feature. I proposed that if there is some really common case of the shape of a block for the most common invocation of TRANSCODE, then PARSE could be finessed so it recognized that particular construction and short circuited the parse engine to optimize for it. e.g. if the loader/etc. had a very specific form of call like:
value: []
parse bin-input [transcode value 1 nextpos:]
The parse native could simply go "Hey, is the length of the rule block 4? Is the first symbol 'transcode? Is the second a word!? Is the third the integer 1? Is the fourth symbol a set-word?" If those things match it goes straight to work. You might even find it turns out faster doing that than running through the refinement lookup for the original function. So before rejecting this idea out of hand because it "won't perform" let's consider if the design makes sense... it looks a lot cleaner to me.
Submitted by: BrianH Sounds like a good idea in principle, Fork. The API model of the PARSE operation could use some work though, because TRANSCODE can do many things depending on its options, but each PARSE operation can only do one thing with no options because we can't use path expressions (for various reasons we don't need to go into here). So in order to get the full benefit of TRANSCODE, we'd need to add or extend multiple operations. And we need to consider that TRANSCODE currently is optimized to work on binaries and doesn't include a string parser at all. For TRANSCODE/next, we could just extend the datatype/typeset operations from block parsing to binary parsing, where the binary source would be interpreted as UTF-8 Rebol syntax. One whole value would be grabbed then type-checked against the datatype or typeset specified. R2 had something similar for matching some datatypes, but we could go all the way and make the SET, COPY, RETURN and QUOTE operations work here as well. SET would set the word to the constructed value, RETURN would return the constructed value, COPY would set the word to a copy of the matched portion of the source. QUOTE would transcode one value then compare it to the literal value provided using the block parse QUOTE rules. We wouldn't need TRANSCODE without options because SOME and ANY of our incremental transcode operations would integrate better. We wouldn't need TRANSCODE/error because we're already doing incremental parsing, so simply failing at the point where the match fails would be enough for us to backtrack and try an alternate parse rule. We could use Carl's proposed LIMIT operation to implement /part. The only tricky one would be TRANSCODE/only, but I think that we might be able to have INTO do this one. And to extend this to string parsing, we'd have to write a version of TRANSCODE that works on strings (which could have other benefits). Does this make sense? 
It would be a lot of work, but maybe some in the community would be up for it. We'd still need the TRANSCODE function itself, but the actual parsing code could be shared with PARSE.
Submitted by: BrianH Proposal to add the functionality of TRANSCODE to PARSE in #2035.
Ren-C's TRANSCODE returns the entire transcoded block by default. The /NEXT refinement shifts it to where it gives back a "multi-return pack" with a primary result of the remainder, and a secondary result of the transcoded value. When the call to TRANSCODE/NEXT returns null, then there is no more to transcode: https://forum.rebol.info/t/incomplete-transcodes-actually-an-optimization-problem/1940
I'll note that I think having TRANSCODE with no refinements return the entire block of transcoded data, and having /NEXT shift it to where it returns the position as primary and the transcode result as secondary, is legit. (e.g. TRANSCODE/NEXT may as well be a different function called TRANSCODE-NEXT) The same approach is used now for evaluation with EVAL/NEXT: https://forum.rebol.info/t/re-imagining-eval-next/767
Ren-C does not have TRANSCODE/PART at this time, and the above post explains how errors are handled for the moment.
Submitted by: BrianH
I've examined all of the code that currently uses TRANSCODE to determine behavioral patterns; it's a low-level function and we said that we'd reevaluate its model once we had more data. Based on this, it was notable that TRANSCODE was never used without the /next or /only options. On occasions where you could use TRANSCODE without those options, TO BLOCK! was used instead. I think I've figured out why.
Right now, TRANSCODE returns a block value, with the continuation (the source binary at the position after the decoded portion of the code) appended to the end of the block. If you are using TRANSCODE with /only or /next, that returned block only has one value in it, plus the continuation: always a two-value return block. And almost always that two-value block is passed to SET/ANY or SET [var1: var2:] (using set-words for FUNCT). In all cases, the return block is discarded.
Having the return value of TRANSCODE with those options be passed to SET block makes sense: fixed-length return blocks with particular values in predictable positions are what SET block was made for, and this is Rebol's most common high-level code multi-value return method. For high-level code, making an extra intermediate block is worth the convenience, even if it's immediately thrown away.
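As a rough illustration of that pattern, here is a minimal sketch. It assumes the current behavior described in this ticket (TRANSCODE/next returning a two-value block), and the words source, value and rest are only illustrative names:

```rebol
source: to binary! "10 20 30"

; TRANSCODE/next decodes one value and returns [value continuation]
set [value rest] transcode/next source

; value holds the first decoded value; rest is the source binary
; positioned after it, so it can be fed to the next incremental call
set [value rest] transcode/next rest
```

The intermediate two-value block exists only long enough for SET to take it apart.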
On the other hand, if you use TRANSCODE without the options you end up with a result that is not only not usable by SET, because of its unpredictable length, it's not usable at all with any convenience because there's an extra value tacked on the end of the block. To use the block, you have to save the last value in some variable (if you need it at all), and then do a CLEAR BACK TAIL on the rest of the block to make it useful. There's nothing convenient about that.
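A minimal sketch of that cleanup, again assuming the current return shape described in this ticket (source, result and rest are illustrative names):

```rebol
source: to binary! "10 20 30"

result: transcode source    ; decoded values with the continuation appended
rest: last result           ; save the continuation, if it is needed at all
clear back tail result      ; remove it so that result holds only the values
```

Every caller that wants the plain block of values has to repeat the last two steps.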
So, TRANSCODE returns a block which contains useful values, but is not itself useful when you use the /next or /only options, or is so unnecessarily awkward to use when you don't use the options that the function is never used without them. Not very Rebol-like.
TRANSCODE needs to return its value and continuation in a usable form and predictable location every time, and it needs to be as efficient about it as possible, both in development of code using TRANSCODE and in execution. This means not appending a continuation to the block of values when you don't use /only or /next, because that has to be undone before the value block is usable. It means thinking of the /next or /only returns as two values, rather than as a block with one value in it then another unrelated one added. Rebol-style multi-value return.
There are four proposed models below. Two with intermediate wrapper-blocks, suitable for use with SET; one of those with the wrapper-block being optional, only returning the values block when you don't specify the option. Two that are passed a word that is set to the continuation, or none if you want to ignore it, with no intermediate block needed; one of those with that word argument being an option. I'll reserve my opinions about which is better for the comments.
This will also make TRANSCODE/part practical (see #1915), which lowers the overhead of embedded scripts and Rebol template languages like RSP. I expect that will be the most common use of TRANSCODE outside of the LOAD infrastructure.
CC - Data [ Version: 2.101.0 Type: Bug Platform: All Category: Native Reproduce: Always Fixed-in:none ]