Vector cmp/mix intrinsics inhibit optimizations #44
Excellent question. I would love to eliminate these intrinsics, because, as you pointed out, they preclude optimizations and duplicate functionality of existing instructions. I made a few attempts to do this in the past, but ran into issues with the type legalizer. It's highly possible I've just missed something obvious, so it would be great to have someone else take a look. I can spend some time this evening describing the approaches I've taken and problems I've run into if that would be helpful context. |
A summary of what didn't work before would be very useful. |
This architecture uses scalar registers as masks. This differs from AVX, which has a separate set of mask registers. My first attempt was to make the scalar register class able to store the mask type (v16i1) as well as the integer type:
I made my compiler builtins for mix take int32s as parameters and comparisons return int32. Internally, I would bitcast between v16i1 and i32 types to match instruction patterns that took those types. My hope was that all arithmetic, loads, and stores would be done using i32s. Unfortunately, the optimizer would reorganize code so I would end up with arithmetic on v16i1. I was faced with the prospect of having to make all integer instruction patterns accept all combinations of v16i1 and i32. I couldn't find a way to make TableGen expand these cleanly without manually creating a ton of patterns (I already have patterns to match various combinations of vector/scalar operands). I also had to do a bunch of hacks to make pattern matching work properly, but I don't remember the details. I eventually gave up and backed that out to make everything use i32, which is how it works now. The change to back it out is here: ae7a490, which illustrates how I tried to implement it. There's also a semi-coherent thread here where I'm talking to myself about this on the LLVM mailing list :) http://lists.llvm.org/pipermail/llvm-dev/2013-July/063448.html |
On further reflection, I believe supporting v16i1 as a native type would only require one extra arithmetic instruction pattern: (set v16i1:$dest, (OpNode v16i1:$src1, v16i1:$src2)). I think I may have overestimated the complexity of that solution. |
Looking through my previous attempt, there were a lot of places where I was fighting the type system unnecessarily. For example, I defined vector comparison intrinsics that returned an i32 instead of v16i1:
I was fairly new to LLVM when I attempted to implement this the first time. With a few years of experience digging around in LLVM, this seems more straightforward. Everything in the IR should use v16i1 exclusively for masks rather than i32. The C compiler should expose masks as plain old integers, using truncate/bitcast and bitcast/zext combinations to convert back and forth. This means it doesn't need LLVM intrinsics for the mix and compare functions. So, to implement the clang builtin that compares two vectors and returns an integer mask--int __builtin_nyuzi_mask_cmpi_ugt(veci16_t a, veci16_t b):

```cpp
Value *MaskResult = Builder.CreateICmpUGT(Ops[0], Ops[1]);
Value *Scalarized = Builder.CreateBitCast(MaskResult, Builder.getInt16Ty());
return Builder.CreateZExt(Scalarized, Builder.getInt32Ty());
```

Likewise, the clang predication builtin (__builtin_nyuzi_vector_mixi(veci16_t v1, veci16_t v2, int mask)) converts in the opposite direction:

```cpp
Value *Truncated = Builder.CreateTrunc(Ops[2], Builder.getInt16Ty());
Value *Mask = Builder.CreateBitCast(Truncated, llvm::VectorType::get(Builder.getInt1Ty(), 16));
return Builder.CreateSelect(Mask, Ops[0], Ops[1]);
``` |
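At the IR level, the compare builtin would then boil down to something like this (an illustrative sketch; the function name and types are made up to match the discussion, not actual generated output):

```llvm
define i32 @mask_cmpi_ugt(<16 x i32> %a, <16 x i32> %b) {
  %cmp = icmp ugt <16 x i32> %a, %b       ; native vector compare -> <16 x i1>
  %bits = bitcast <16 x i1> %cmp to i16   ; pack the mask into 16 bits
  %mask = zext i16 %bits to i32           ; widen to the C-visible int
  ret i32 %mask
}
```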
As I play with this more, I'm remembering why this was troublesome. :) The problem starts when lowering the INSERT_VECTOR_ELT (insert vector element) SDNode in NyuziTargetLowering. There is no native instruction to do this, so I want to lower it to a sequence that shifts a mask bit into position by the lane index and does a masked transfer of the scalar value into the register. First, I compute the mask with a shift of a constant:

```cpp
SDValue Mask = DAG.getNode(ISD::SRL, DL, MVT::i32, DAG.getConstant(0x8000, DL, MVT::i32),
                           Op.getOperand(2));
```

This is currently of type i32. I need to convert it into a v16i1 in order to use it with a vselect. Normally, I would use the same trick I used in the C++ compiler: first TRUNCATE it to an i16, then BITCAST it to a v16i1. However, this won't work because this is running in the legalize ops pass, which is after the legalize types pass (http://llvm.org/docs/CodeGenerator.html#selectiondag-instruction-selection-process). i16 is not a valid type, so it asserts here. I can't do a shift directly on the v16i1 type, since that would assume it is shifting individual elements independently. Making the constant or the shift be i16 won't work either (they are also illegal types). My next thought is to lower that into a new Nyuzi-specific pseudo instruction and use a custom instruction inserter to create the machine instructions directly. In NyuziISelLowering.cpp:

```cpp
SDValue NyuziTargetLowering::LowerINSERT_VECTOR_ELT(SDValue Op,
                                                    SelectionDAG &DAG) const {
  SDLoc DL(Op);
  return DAG.getNode(NyuziISD::INSERT_VECTOR, DL, Op.getValueType(),
                     Op.getOperand(0), Op.getOperand(1), Op.getOperand(2));
}
```

In NyuziInstrFormats.td:
In NyuziInstrInfo.td:
|
Here is an experimental patch that partially implements this. It's not the right fix, but demonstrates some of the considerations for this change. It passes most of the llvm-lit tests, but runs into problems generating code for programs in the NyuziProcessor tree. I've created a custom SDNode type. A few observations:
|
Sorry for the radio silence, I only got to start working on this a couple days ago. I got comparisons, bitwise operations, loads/stores, rudimentary selects, and the calling convention to work, and then I got sidetracked by trying to remove the gather and scatter intrinsics (which you wisely just kept). Looking over the patch, it roughly matches what I did, but is much more comprehensive (clang, tests).
What did you do precisely? As i16 isn't legal, we can't just bitcast v16i1 to it, can we? But as for not being able to select bitcasts, this looks like a job for these patterns (at the very end of NyuziInstrInfo.td)?
To clarify, does this always occur, or only when feeding the comparison result into a
I did |
I marked i16 as a valid type for the GPR32 register, which really doesn't seem right and probably breaks other stuff.
Heh... oops, right. It's been a few years and I had forgotten about that.
I believe any time it converts to i32 (the case I tried wasn't with vselect32).
Okay, I didn't think to try expand in this case, but it often does something useful. I guess the alignment it is looking at comes from the vector type? It sounds like you're making good progress! |
One thought I just had (but haven't tested) is that there could be a custom SDNode that takes an i32 as an input and returns a v16i1 as a result. Then there could be an empty rule to eliminate it during instruction selection:

```
def i32_to_mask : SDNode<"NyuziISD::INT_TO_MASK", SDTypeProfile<1, 1, ...
def : Pat<(v16i1 (i32_to_mask (i32 GPR32:$src))), (v16i1 GPR32:$src)>;
```

That could replace all the custom inserter gunk and be much simpler. |
Nope :)

```
Assertion failed: ((NodeToMatch->getValueType(i) == Res.getValueType() || NodeToMatch->getValueType(i) == MVT::iPTR || Res.getValueType() == MVT::iPTR || NodeToMatch->getValueType(i).getSizeInBits() == Res.getValueSizeInBits()) && "invalid replacement"), function SelectCodeCommon, file /Users/jeffbush/src/NyuziToolchain/lib/CodeGen/SelectionDAG/SelectionDAGISel.cpp, line 3629.
```
|
Nevermind, the DAGs looked good but the actual code is dead wrong. The constant pool entry for a v16i1 is 16 bytes. I guess I'll have to make do with a custom lowering. The constant pool entry could be 16 bits though, as v16i1 is a legal type and can be stored and loaded normally (with |
The last one might be fixable (it might even work out fine on its own, I'm not sure), but it's super icky. I wonder if the constant pool's bad behavior with i1 vectors could be fixed without breaking other targets? |
I considered making mask registers be sub registers of the scalar registers (this feature is used to support packing AH and AL registers into AX on x86), but that seems pretty ugly and hacky and would make the assembly look weird
It could do a 32-bit load and use the same trick that INSERT_VECTOR_ELT uses to convert the computed mask into a v16i1 (in my example, the VSELECT_32 custom SDNode).
I was wondering that too. I'm guessing if it broke anything it would be the AVX stuff in the X86 backend. |
I considered this as well, but since the instruction set doesn't really have separate registers for them, this would be yet another "lie", which is bound to complicate everything else.
In my (admittedly limited) tests with AVX-512, I couldn't get it to generate a constant pool entry for any masks. This makes sense, as it has moves with 16-bit immediates, so it can generate sequences like this:
I do have good news though! As a nop SDNode for v16i1<->i32 casts did not work, I added pseudo instructions for those operations as well and replaced them with moves after instruction selection. (Perhaps the pseudos aren't even needed and the SDNodes can just be matched to moves during isel.) This seems to work so far, and would limit the excessive number of special SDNodes for various operations (just lower them all normally, using these two nodes for converting between v16i1 and i32 where needed). It generates superfluous move instructions, but this is relatively small overhead, and can hopefully be optimized by MIR-level peephole optimizations. |
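One plausible way to wire up the "replace the cast pseudos with moves after instruction selection" step is TargetInstrInfo::expandPostRAPseudo. This is only a sketch of the idea; the pseudo and move opcode names below are assumptions for illustration, not the actual Nyuzi definitions:

```cpp
bool NyuziInstrInfo::expandPostRAPseudo(MachineInstr &MI) const {
  switch (MI.getOpcode()) {
  default:
    return false;
  case Nyuzi::INT_TO_MASK_PSEUDO:   // hypothetical v16i1 <- i32 cast pseudo
  case Nyuzi::MASK_TO_INT_PSEUDO: { // hypothetical i32 <- v16i1 cast pseudo
    // Both types live in the same scalar register class, so the cast is just
    // a register-to-register move (often removable by later peepholes).
    BuildMI(*MI.getParent(), MI, MI.getDebugLoc(),
            get(Nyuzi::MOVE_SS))            // assumed scalar move opcode
        .add(MI.getOperand(0))
        .add(MI.getOperand(1));
    MI.eraseFromParent();
    return true;
  }
  }
}
```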
Replacing the moves seems like a valid approach. I was also thinking this would probably be easier if I changed the ISA to have vectors with 32 lanes. :) I've been trying to put my finger on what fundamental assumption of LLVM this architecture is violating that is causing so many problems. |
Quick update: I've now applied this approach to the lowering of BUILD_VECTOR, INSERT_VECTOR_ELT (including support for v16i1), and VECTOR_SHUFFLE (replacing the vselecti32 kludge and consequently the SYNTH_MOVE_* pseudos). It works wonderfully so far. I even convinced LLVM to optimize out some moves by emitting generic |
Oh, by the way, this test confuses me. It expects a 7 as the mask, but shouldn't it be either 2^7 or 2^(15-7)? (The lowering generates the latter, I'll assume that's correct.) Indeed if I add the missing |
Yes, you're right, good catch. Nice work! That sounds like a pretty straightforward solution. |
I've started to work through failing test cases and I was reminded of this part of the code:

```cpp
// Native vector compare instructions return a bitmask. This function
// returns v16i32 from a comparison by doing a predicated transfer.
// clang seems to assume a vector lane should have 0xffffffff when the
// result is true when folding constants, so we use that value here to be
// consistent, even though that is not what a scalar compare would do.
SDValue expandVectorComparison(SDValue Op, SelectionDAG &DAG) {
```

Of course, the whole issue stops making sense when comparison results are v16i1 instead of v16i32. Do you know what breaks if that assumption is violated? Or alternatively, in what ways does clang assume this? Ideally this could be removed, but I don't want to inadvertently break even more things. Edit: Is this really clang or is it LLVM passes? In the latter case, maybe |
I don't think this behavior is affected by the change. Clang seems to follow the behavior of GCC vector extensions (https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html), which states:
So, in the following code...

```c
typedef int veci16_t __attribute__((ext_vector_type(16)));

veci16_t is_greater(veci16_t a, veci16_t b)
{
    return a > b;
}
```

The type of the expression "a > b" is veci16_t (v16i32). It emits LLVM IR:
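(The IR listing isn't shown here; a hedged reconstruction of what clang presumably emits for the body, for illustration only:)

```llvm
%cmp = icmp sgt <16 x i32> %a, %b
%sext = sext <16 x i1> %cmp to <16 x i32>
ret <16 x i32> %sext
```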
At some point this gets turned into a SETCC node with the type v16i32. This is the case I'm trying to handle. |
I see. I tried a few examples, but I couldn't manage to produce a setcc node with type v16i32; I always got a normal (v16i1) setcc with a sign_extend. This is exactly what I would expect given the IR generated by clang. You don't happen to have an example at hand that produces it? I'm asking both to understand, and because I'm awfully tempted to move the code from |
Yes, I think you're right that it would make sense to move that from LowerSETCC to LowerSIGN_EXTEND.
If I compile the following with the old compiler:
...I see the following transformation:
The promotion of t6 occurs during type legalization. This no longer happens as mentioned above. When the lowering for SETCC occurs, I generate a predicated move of -1 or 0 (the vector comparison is supposed to put 0xffffffff into each lane if the comparison is true):
Then it also lowers the sign extension:
The sign extension is unnecessary, since I've already put the extended value into the registers in expandVectorComparison. The final assembly is:
The last two instructions are the sign extension. If you did the check in LowerSIGN_EXTEND, it would avoid that. |
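A minimal sketch of what doing the check in LowerSIGN_EXTEND might look like, assuming a hook wired up in the usual LowerOperation style (the function name and fallback behavior are assumptions, not the actual Nyuzi code):

```cpp
SDValue NyuziTargetLowering::LowerSIGN_EXTEND(SDValue Op,
                                              SelectionDAG &DAG) const {
  SDLoc DL(Op);
  SDValue Src = Op.getOperand(0);
  // Fold sign_extend(setcc) so the predicated -1/0 transfer is emitted once,
  // instead of expanding the comparison and the extension separately.
  if (Op.getValueType() == MVT::v16i32 && Src.getOpcode() == ISD::SETCC)
    return DAG.getSelect(DL, MVT::v16i32, Src,
                         DAG.getConstant(-1, DL, MVT::v16i32),
                         DAG.getConstant(0, DL, MVT::v16i32));
  return SDValue(); // anything else: fall back to the default handling
}
```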
I came across an unrelated (edit: well, kinda) issue in setcc lowering; perhaps you have an idea how to do it better. Unordered float comparisons are lowered to the complementary comparison, for example SETUGT -> NOT SETOLE. The code previously did the negation by xor'ing with an i32 0xFFFF constant. Strangely, despite looking like an all-ones mask, this only happened for scalar comparisons (maybe precisely because of the issue I've encountered below?). Even more strangely, vector comparisons were mapped straight to the intrinsics without accounting for unordered comparisons. Consider this scalar test case:

```llvm
define i32 @cmpfgtu(float %a, float %b) { ; CHECK-LABEL: cmpfgtu:
  %cmp = fcmp ugt float %a, %b
  %ret = zext i1 %cmp to i32
  ; CHECK: cmple_f [[CMPRES:s[0-9]+]], s0, s1
  ; CHECK: xor s{{[0-9]+}}, [[CMPRES]]
  ret i32 %ret
}
```

Turns out that the DAG combiner can't see through constant pool loads, but will happily "simplify" |
The scalar comparison instruction returns 0xffff when the value is true. This is an artifact of the hardware implementation (scalar values are duplicated to all lanes). At some point, the backend ands this with 1 so it returns what the compiler expects, but I don't remember off the top of my head where or how that happens:
Yes. I didn't bother to implement the unordered comparison builtins/intrinsics because they didn't seem that useful. I had only implemented the scalar versions for completeness (the compiler always seems to emit ordered versions).
Yeah, definitely. Is that transform happening here?
Even if not, I think I have an idea of why this may be happening. Generally, as I understand, the DAG combining code should call into TargetLowering to determine if an operation is legal and skip the transform if it is not. The backend sets unordered operations as
I would have assumed that a custom lowering would not be legal, but this is not the case:
So I may be violating an assumption/unwritten rule in the LLVM backend that one shouldn't use a custom lowering to convert to other legal node types. If that is the case, I see two possible approaches. The first is to use the Expand action for these types. It appears the default action may do the right thing:
If the action were 'Expand', it would not be treated as legal, so it wouldn't be converted back to an unordered operation. The other approach--and I think I've seen other backends do this--is to create a custom target-specific SETCC SDNode type, which the DAG combiner won't recognize. This precludes further optimizations, but that's basically the point (since the custom action is deliberately creating a less optimal form, and the compiler is reasonably trying to fix that). |
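For concreteness, a minimal sketch of the first option, assuming it goes in the NyuziTargetLowering constructor alongside the other action settings (the exact set of condition codes and types here is an assumption):

```cpp
// Mark the unordered FP condition codes Expand instead of Custom, so the
// DAG combiner no longer treats the custom-lowered form as legal and
// re-folds it back into an unordered comparison.
setCondCodeAction(ISD::SETUGT, MVT::f32, Expand);
setCondCodeAction(ISD::SETUGE, MVT::f32, Expand);
setCondCodeAction(ISD::SETULT, MVT::f32, Expand);
setCondCodeAction(ISD::SETULE, MVT::f32, Expand);
setCondCodeAction(ISD::SETUEQ, MVT::f32, Expand);
setCondCodeAction(ISD::SETUNE, MVT::f32, Expand);
// ...and likewise for MVT::v16f32.
```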
I've tried expanding now. It mostly works (seto, setuo, and the don't-care CCs still need custom lowering, but those aren't as problematic) but the legalization creates two comparisons: an ordered one of the same type, and a NaN check. This would be a pretty serious regression, so I'm leaning towards the custom SDNodes (I've checked ARM and it does basically that, though I also noticed it is far more aggressive with introducing target-specific SDNodes in general). |
I do wonder why it's expanded like that, though. Is it just to be conservative, or is there some edge case where the lowering you implemented does not work? |
Good question. Floating point is subtle, and I'm certainly not an expert on it. It seems like the explicit NaN check should be unnecessary in this case. The only thing I can think of is that it could make a difference on architectures that support exceptions...mumble... It looks like the code for that expansion is here (EDIT: tested and verified this code is called when expanding an unordered comparison): I don't see any interesting explanatory comments. Here's the check-in that created that code. This might be a good question for the LLVM mailing list. |
Regardless, I've now implemented the custom SDNode approach. It passes all tests (after adjusting …). After that, the only remaining failure in the test suite was …. There are still some things to be done, e.g., more general support for v16i1 constants, support for v16i1 immediates on arithmetic instructions, and many more tests. But I think I can see the light at the end of the tunnel 😄 |
That's great! Have you run any of the tests in the NyuziProcessor tree yet? |
I haven't had the chance to set that up yet. Besides, I'd like to ensure the basic parts of code generation work before throwing real software at it. Speaking of which, while writing codegen tests I discovered another wrinkle. Or re-discovered: I think I encountered this while dealing with the v16i1 constant pool madness and just didn't understand it. When a vector is stored in memory, LLVM of course assumes that each vector element is separately addressable. For example, when inserting a few elements into a v16i1, the vector is stored on the stack and then the elements to be inserted are stored there (of course, the offsets are all wrong) before reading back the whole vector. So while masks can be stored compactly in scalar registers, it doesn't seem possible to avoid a 16 x i8 representation in memory. I doubt there's a nice and efficient way to implement such loads and stores, but hopefully they should be rare anyway. I'm more troubled by the prospect of having to emulate those loads and stores at all. I took a cursory stab at both (1) doing it during legalization, and (2) selecting a pseudo-instruction and emitting custom code for it. Both are very ugly: (1) because loads and stores are rather complicated in the SelectionDAG, and (2) because it can't piggyback on the legalization code that lowers vector insertions or constants larger than 13 bits. Unless you have a sudden insight that renders this moot, I might go with (2) and generate code roughly like this (for the case of a load, and surely containing some mistakes):

```asm
; s0 = address of <16 x i8>
    move s1, 0          ; result acc
    move s2, 1          ; mask for inserting into result
    shl s2, s2, 15      ; avoid constant pool for 0x8000
loop:
    load_u8 s3, (s0)    ; get ext. i1
    and s3, s3, 1       ; upper bits are undef
    xor s4, s2, -1
    and s1, s1, s4      ; zero out current element of mask
    bnz s3, elem_zero   ; if inserting false, leave at zero
    or s1, s1, s2       ; otherwise, insert a one
elem_zero:
    add s0, s0, 1       ; advance element ptr
    shr s2, s2, 1       ; adjust mask
    bnz s2, loop        ; 16 iterations
```

(Edit: Typos in the last three instructions.) |
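For reference, here is the intended semantics of that loop expressed in C (an editorial sketch following the comments in the assembly, with element 0 mapping to the 0x8000 bit as above; not code from the backend or the toolchain):

```c
/* Pack 16 byte-sized i1 elements into a 16-bit mask, element 0 going to
 * the most significant bit, matching the loop's initial 0x8000 mask. */
static unsigned pack_mask_from_memory(const unsigned char *elems) {
  unsigned mask = 0;
  for (int i = 0; i < 16; i++)
    if (elems[i] & 1)          /* upper bits of each byte are undefined */
      mask |= 0x8000u >> i;
  return mask;
}
```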
It seems like X86 AVX code would run into a similar problem. They do seem to have a pattern to store a packed mask register directly to memory:
Can you elaborate on which instruction expands to this? I assume this is an INSERT_VECTOR_ELT node. If so, would it be possible to lower that directly to a set of bit manipulations? |
This pattern matches an i16 store (where the i16 is bitcast'd from a v16i1), not an i1 vector store. So it's just an optimization that avoids a K register -> GPR move on certain kinds of integer stores. This somewhat matches my observation that the
No, BUILD_VECTOR (edit: my testcase was inserting into a |
How do you get getelementptr to index into vectors? llc freaks out when I try it (with the x86 backend):
I found this bit: http://llvm.org/docs/GetElementPtr.html#can-gep-index-into-vector-elements If we can remove getelementptr from the mix, then it seems like this gets easier. It looks like the default lowerings for the vector-building instructions assume memory-addressable elements. Since the hardware doesn't support that, the backend should turn them into operations that are supported (bit manipulations in registers). Maybe I'm oversimplifying this or missing something. |
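As a concrete illustration of the "bit manipulations in registers" idea, here is a hedged sketch of extracting one element from a v16i1 mask held in a scalar register, using the same 0x8000-based lane-to-bit mapping as the INSERT_VECTOR_ELT lowering above (NyuziISD::MASK_TO_INT and the helper name are hypothetical, not the actual backend code):

```cpp
// result = (mask >> (15 - lane)) & 1
SDValue lowerMaskExtractElt(SDValue Mask, SDValue Lane, const SDLoc &DL,
                            SelectionDAG &DAG) {
  // Reinterpret the mask as an i32; it lives in a scalar register anyway.
  SDValue Bits = DAG.getNode(NyuziISD::MASK_TO_INT, DL, MVT::i32, Mask);
  SDValue Amt = DAG.getNode(ISD::SUB, DL, MVT::i32,
                            DAG.getConstant(15, DL, MVT::i32), Lane);
  SDValue Shifted = DAG.getNode(ISD::SRL, DL, MVT::i32, Bits, Amt);
  return DAG.getNode(ISD::AND, DL, MVT::i32, Shifted,
                     DAG.getConstant(1, DL, MVT::i32));
}
```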
This GEP works the same with and without AVX512:

```llvm
define i1* @getptr(<16 x i1>*) #0 {
entry:
  %b = getelementptr <16 x i1>, <16 x i1>* %0, i32 0, i32 5
  ret i1* %b
}
```

It emits …

```llvm
define i1 @extractval(<16 x i1>*) #0 {
  %mask = load <16 x i1>, <16 x i1>* %0
  %b = extractelement <16 x i1> %mask, i32 5
  ret i1 %b
}
```

compiles to this:
If I understood the semantics of … (It's probably a complication for what I am working towards — vectorizing OpenCL-style SPMD programs — but I'll burn that bridge when I come to it.) |
So between the AVX-512 backend punting on it, the warning from the docs you quoted, and the fact that all other operations that were or are problematic are easy to lower, I am more than happy to ignore v16i1 GEP and finish up the current approach :) |
Okay, that's simpler. :) The other thing is that I wouldn't expect any reasonable front-end to emit a GEP on a v16i1. |
I have one more question about the diff you posted earlier: What was the intent behind these patterns?
Bitwise operations, sure, but neither comparisons nor arithmetic seem to make sense to me (and neither do bit shifts with those operand types). |
Those don't make sense to me either :) I'll explain the thought process that I believe ended up with me producing that code. Given the following C code:
The compiler should generate IR something like this with the new implementation (the builtin functions create bitcast/zext combinations to convert the results into ints, as discussed earlier)
During some optimization stage, it figures out that it can optimize away the redundant bitcast/zext combos:
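A hedged illustration of the kind of folding meant here (not the original example; the function and value names are made up):

```llvm
; Before: each mask round-trips through i32, and the AND is done on i32.
define i32 @both_gt(<16 x i32> %a, <16 x i32> %b, <16 x i32> %c) {
  %m1 = icmp ugt <16 x i32> %a, %b
  %m1.bits = bitcast <16 x i1> %m1 to i16
  %m1.i32 = zext i16 %m1.bits to i32
  %m2 = icmp ugt <16 x i32> %a, %c
  %m2.bits = bitcast <16 x i1> %m2 to i16
  %m2.i32 = zext i16 %m2.bits to i32
  %both = and i32 %m1.i32, %m2.i32
  ret i32 %both
}

; After the optimizer hoists the AND above the casts, the backend sees an
; and on <16 x i1>:
define i32 @both_gt_opt(<16 x i32> %a, <16 x i32> %b, <16 x i32> %c) {
  %m1 = icmp ugt <16 x i32> %a, %b
  %m2 = icmp ugt <16 x i32> %a, %c
  %both = and <16 x i1> %m1, %m2
  %bits = bitcast <16 x i1> %both to i16
  %r = zext i16 %bits to i32
  ret i32 %r
}
```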
When I was originally implementing the backend with v16i1 masks, I first assumed I only needed to implement arithmetic on i32 types, since the front end always converted them. However, it asserted when a transform like the one above occurred because I hadn't implemented AND on v16i1. "But," I reasoned, "I can do arbitrary arithmetic on masks." For example, during rasterization, I have something like the following:
I didn't think very hard about this; I just assumed I needed to support all arithmetic operations on v16i1. As I look at it now, that doesn't make much sense. It seems unlikely the compiler would shuffle the bitcast/zext on operations that are not bitwise. At least I hope not. That would break a lot of stuff. :) So it's probably sufficient to just implement the bitwise logical operations on v16i1: and, or, not. |
Good, so I removed those patterns. Status update: I've added all the llvm-lit tests I wanted, and they all pass. I've moved on to running the tests in the NyuziProcessor repo, and I'm encountering failures on various tests in …

```llvm
define i1 @bar(float %x) {
entry:
  %x.is.pos = fcmp ogt float %x, 0.000000e+00
  br i1 %x.is.pos, label %return-true, label %return-false
return-true:
  ret i1 true
return-false:
  ret i1 false
}
```

This generates:
The |
I can't think of an obvious reason why that would happen off the top of my head, unfortunately. I don't think I ever explicitly compared to 0xffffffff in the lowering code. |
Okay, it's my own damn fault. The comparison with -1 is what the DAG combiner "optimizes" the negation of the float comparison into (a previous DAG combiner pass turns the ogt into an ult). It compares with -1 instead of 0xffff because I used …. Why there even is a negated comparison when |
|
At least one of the remaining test failures is related to how masks are built. I can't pinpoint anymore why exactly I thought that (I believe I saw something to that effect in the existing lowering code), but I was working under the assumption that lane index 0 is the most significant bit and lane index 15 the least significant bit. But in the face of … Edit: I think I found the place that led me to this belief: NyuziToolchain/lib/Target/Nyuzi/NyuziISelLowering.cpp, lines 971 to 988 in bad1750
|
Yes, this is confusing and I usually end up reversing it a few times every time I work on this. I'll try to hopefully not muddle things further. :) EDIT: Ugh... I totally muddled things, should be correct now. LLVM uses …. For a block vector store, lane 0 is the lowest address stored and lane 15 is the highest address stored. This corresponds to the lane numbering convention used by INSERT_VECTOR_ELT/EXTRACT_VECTOR_ELT. However, Nyuzi maps its lane 15 to the least significant bit in the mask and lane 0 to the most significant. |
For example, assume the following code:

```c
#include <stdio.h>

typedef int veci16_t __attribute__((ext_vector_type(16)));

veci16_t value;

int main(void)
{
    value[0] = 99;
    printf("%d\n", ((volatile int*) &value)[0]);
}
```

This generates the following LLVM IR (using the stock x86 clang compiler on MacOS):
Note that insertelement uses lane 0. It prints the following output when I run on my x86 desktop:
So lane 0 corresponds to the lowest address. If I compile using the Nyuzi compiler, the write to the vector lane generates:
In this case, the 15th bit (MSB) corresponds to the lowest address. |
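To restate that mapping compactly (an editorial sketch of the convention described above, under the pre-change instruction set; not code from the toolchain):

```c
/* LLVM/C lane i corresponds to Nyuzi mask bit (15 - i), so lane 0 -- the
 * lowest address in a block store -- is controlled by the MSB of the mask. */
static inline unsigned nyuzi_mask_bit_for_lane(unsigned lane) {
  return 1u << (15 - lane);   /* lane 0 -> 0x8000, lane 15 -> 0x0001 */
}
```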
Likewise, the getlane instruction uses the same lane index as LLVM/C:

```c
int foo(veci16_t value) { return value[0]; }
```

Compiles to:
|
I've been tearing my hair out tracking down the last test failure (…):

```cpp
#include <stdint.h>
#include <stdio.h>

// Assumed definition of vecf16_t; the original test gets this from its harness.
typedef float vecf16_t __attribute__((ext_vector_type(16)));

int main()
{
    vecf16_t items = 1000000000.0;
    float value = 42.0;
    items = __builtin_nyuzi_vector_mixf(0x8000, vecf16_t(value), items);
    printf("%g\n", items[0]); // CHECK: 42.0
    return 0;
}
```

This passes with the old toolchain, and also in the new toolchain with -O0, but with -O1 and higher it gets optimized to …. I really don't see how to work around this. Perhaps doing a bit-reverse when converting between mask-as-int and mask-as-i1-vector? That would be a serious performance regression, though — not to mention that it would break C code that deals with masks numerically, including the very test case from which the above is derived. |
Okay, after my last comment it crossed my mind briefly that the order of mask bits might be reversed, but I convinced myself (incorrectly, apparently), that it was the same. So the right fix is probably to change the instruction set to reverse the order of the mask bits (that's the beauty of a custom instruction set. :) This would basically be a follow-on to the change I made previously reversing the lane numbering to match LLVM's convention. Does that sound right? I can take a whack at implementing that on the hardware/emulator side. |
That does feel like the Right Way.
Please do. I'll tackle the aforementioned crappy code generation for branch-on- |
This is related to issue #44. When bitcasting an integer to a <16 x i1> mask, the order was reversed from what Nyuzi was doing. Change Nyuzi instruction set to match LLVM.
This is part of issue: jbush001/NyuziToolchain#44 LLVM uses the opposite bit order when converting an integer mask to a vector mask. Adjust instruction set to match.
I still have no idea why branching on an fcmp generates such terrible code, but I did notice that it was basically the same before my changes. The only difference is that it used to |
… instructions. Fixes jbush001#44 Co-Authored-By: Jeff Bush <jeffbush001@gmail.com>
For example, this C function (chosen for simplicity, not because it's code I would want to write) simply returns its first argument times two, if we ignore NaN for a second:

When compiled with `clang -O -ffast-math`, it results in IR like this, which is correct but not very optimized:

So, the comparison did not get optimized into a constant, and subsequently the mix did not get optimized out either. By the way, hard-coding the mask (i.e., replacing the first argument with `i32 0` or something) does not optimize the mix either.

I conclude that both families of intrinsics are almost entirely opaque to the optimizer. (The TableGen files do indicate that they don't access memory. This does allow optimization of surrounding code, but doesn't help with optimizing the intrinsics themselves.)

Okay, what if I am not writing C code, but instead generating LLVM IR from an alternative frontend? I hoped that LLVM's first-class vector operations could help, so I wrote this IR by hand:

The above indeed gets optimized to a simple `return a + a;`, but that's only because the `select` is optimized out entirely before the backend is involved. Code that actually needs a vector-typed `select` (e.g., changing the comparison to `fcmp ogt <16 x float> %a, %b`) crashes in ISel with messages like `LLVM ERROR: Cannot select: t12: v16f32 = vselect t26, t11, t4` (a minimal reproducer is sketched below). As a point of comparison, the x86 backend happily accepts such code, even if it has to emulate 16-wide vectors and blends within the limits of SSE 2 (and it does quite a good job!).

The question, then, is how to fix this issue. To be clear, I am not asking you, @jbush001, to fix this. I hopefully can do that on my own. But I do have some questions, which mostly boil down to: Why do these intrinsics exist at all? Is it just that they predate the addition of vector-typed comparisons and `select` instructions to LLVM? Or is there a deeper reason, a semantic difference perhaps?

If it's just a matter of history, I would try to eliminate these intrinsics from the Nyuzi backend, change Clang to lower these `__builtin_*` functions to first-class IR instructions, and change the backend to match on these instructions (resp. their SelectionDAG equivalent) rather than the intrinsics. Does that sound like a plan?
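Below is a minimal stand-alone example of the kind of IR that needs a real vector-typed select and hits the ISel crash described above. This is an editorial reconstruction for illustration, not the exact IR from the report:

```llvm
define <16 x float> @vmax(<16 x float> %a, <16 x float> %b) {
  %cmp = fcmp ogt <16 x float> %a, %b
  %r = select <16 x i1> %cmp, <16 x float> %a, <16 x float> %b
  ret <16 x float> %r
}
```

With the intrinsic-based backend this fails with "Cannot select" on the v16f32 vselect; the goal of the change discussed in this thread is for the backend to select it as a native compare plus a masked move instead.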