Improving on std:count_if()'s auto-vectorization

134 points by nicula 4 months ago

grandempire 4 months ago

It’s a good example to illustrate how to get more simd from the compiler

But the overly specific constraint means this is not a general count_if algorithm.

For this to be useful I have to: - know there are only 255 true values - but have a large dataset so it’s worth optimizing - not want to stop early when some threshold is met

This is so specialized it’s not even worth having a generic predicate argument for.

aqrit 4 months ago

A optimized version would use 64-bit accumulators (`psadbw` on SSE2, or some sort of horizontal adds on NEON). The `255` max constraint is pointless.
Many programming languages/frameworks expose this operation as `reduce()`.
- dzaima 4 months ago
  
  It's not that trivial:
  The wrapping version uses vpandn + vpaddb (i.e. `acc += 1 &~ elt`). On Intel since Haswell (2013) on ymm inputs that can manage 1.5 iterations per cycle, if unroll 2x to reduce the dependency chain.
  Whereas vpsadbw would limit it to 1 iteration per cycle on Intel.
  On AMD Zen≤2, vpsadbw is still worse, but Zen≥3 manages to have the two approaches be equal.
  On AVX-512 the two approaches are equivalent everywhere as far as uops.info data goes.
- grandempire 4 months ago
  
  Reduce does not accept a predicate.
  - Sharlin 4 months ago
    
    It has no need for that. count_if is a fold/reduce operation where the accumulator is simply incremented by `(int)some_condition(x)` for all x. In Rust:
    let arr = [ 1, 3, 4, 6,7, 0, 9, -4]; let n_evens = arr.iter().fold(0, |acc, i| acc + (i & 1 == 0) as usize); assert_eq!(n_evens, 4);
    Or more generally,
    fn count_if<T>(it: impl Iterator<Item=T>, pred: impl Fn(&T) -> bool) -> usize { it.fold(0, |acc, t| acc + pred(&t) as usize) }
    
    grandempire 4 months ago
    
    I know that. But that’s still a different interface. If you have a predicate you now have to wrap that in a different closure that conforms it to a new pattern.
    This is the same argument as why have count_if if I can write a for loop.
    
    Sharlin 4 months ago
    
    Sure. But at least I interpreted the GP as just saying that the "count-if" operation can be implemented in terms of `reduce` if the latter is available.

meisel 4 months ago

Seems like you can do this sort of speed up even without the 256 constraint. Just run this sped up version, but after each set of 256 iterations, dump all the 8-bit counters to the 64-bit final result.

nicula 4 months ago

Some people already mentioned this in the r/cpp discussion. Small correction: 256 is not the correct number of iterations, since if all elements in that slice are even, then your 8-bit counter will wrap-around to zero, which can lead to a wrong answer. What you want is 255 iterations.
I've looked at the generated assembly for such a solution and it doesn't look great. I'm expecting a significant speed penalty, but I haven't had the time to test it today. Will probably do so tomorrow.
- ack_complete 4 months ago
  
  255 is not ideal either because it results in a partial vector at the end of either 15 or 31 elements. 256-V where V is the vector size is better, so 240 for SSE2/NEON or 224 for AVX2.
  This is still lower than optimal because the compiler will reduce to a uint8. Both SSE2 and NEON support reducing to a wider value by _mm_sad_epu8 and vpadal_u8, respectively. This allows for 255 iterations in the inner loop instead of 15 or 7.
  - nicula 4 months ago
    
    Great observations, thanks!
    I wrote the code that you suggested (LMK if I understood your points): https://godbolt.org/z/jW4o3cnh3
    And here's the benchmark output, on my machine: https://0x0.st/8SsG.txt (v1 is std::count_if(), v2 is the optimization from my blog post, and v3 is what you suggested).
    v2 is faster, but v3 is still quite fast.
    
    ack_complete 4 months ago
    
    Yeah, the difference you're seeing is likely due to the inner loop doing so few iterations, in addition to not being unrolled. A hand-rolled version doing 63 iterations of an x4 unrolled loop should be able to saturate the execution core (it'll be bottlenecked by load throughput). But it'll be tough to get the compiler to generate that without intrinsics.
    
    meisel 4 months ago
    
    The fastest thing I would think is to process 255 vectors at a time, using an accumulation vector with the k-th byte in the vector representing the # of evens seen for the k-th byte of the vectors in that 255-vector chunk. But, I don’t know how to make autovectorization do that. I could only do that with intrinsics.

tomn 4 months ago

another solution is to just cast the result to an uint8_t; with this, clang 19.1.0 gives the same assembly:

https://gcc.godbolt.org/z/E5oTW5eKe

nicula 4 months ago

Like @wffurr mentioned, this is indeed discussed in a footnote. I just added another remark to the same footnote:
"It's also debatable whether or not Clang's 'optimization' results in better codegen in most cases that you care about. The same optimization pass can backfire pretty easily, because it can go the other way around too. For example, if you assigned the `std::count_if()` result to a local `uint8_t` value, but then returned that value as a `uint64_t` from the function, then Clang will assume that you wanted a `uint64_t` accumulator all along, and thus generates the poor vectorization, not the efficient one."
- tomn 4 months ago
  
  I'm not sure how "it can go the other way around too" -- in that case (assigning to a uint8_t local variable), it seems like that particular optimisation is just not being applied.
  Interestingly, if the local variable is "volatile uint8_t", the optimisation is applied. Perhaps with an uint8_t local variable and size_t return value, an earlier optimisation removes the cast to uint8_t, because it only has an effect when undefined behaviour has been triggered? It would certainly be interesting to investigate further.
  In general I agree that being more explicit is better if you really care about performance. It would be great if languages provided more ways to specify this kind of thing. I tried using __builtin_expect to trigger this optimisation too, but no dice.
  Anyway, thanks for the interesting article.
  - nicula 4 months ago
    
    > I'm not sure how "it can go the other way around too" -- in that case (assigning to a uint8_t local variable), it seems like that particular optimisation is just not being applied.
    So the case that you described has 2 layers. The internal std::count_if() layer, which has a 64-bit counter, and the 'return' layer of the count_even_values_v1() function, which has an 8-bit type. In this case, Clang propagates the 8-bit type from the 'return' layer all the way to the inner std::count_if() layer, which effectively means that you're requesting an 8-bit counter, and thus Clang generates the efficient vectorization.
    However, say that you have the following 3 layers: (1) internal std::count_if() layer with a 64-bit counter; (2) local 8-bit variable layer, to which the std::count_if() result gets assigned; (3) 'return' layer with a 64-bit type. In this case the 64-bit type from layer 3 gets propagated to the inner std::count_if() layer, which will lead to a poor vectorization. Demo: https://godbolt.org/z/Eo13WKrK4 . So this downwards type-propagation from the outmost layer into the innermost layer doesn't guarantee optimality. In this case, the optimal propagation would've been from layer 2 down to layer 1 and up to layer 3.
    Note: I'm not familiar with how the LLVM optimization pass does this exactly, so take this with a huge grain of salt. Perhaps it does indeed 'propagate' the outmost type to the innermost layer. Or perhaps the mere fact that there are more than 2 layers makes the optimization pass not happen at all. Either way, the end result is that the vectorization is poor.
    
    tomn 4 months ago
    
    I've had a look at what's going on in LLVM, and we're both a bit wrong :)
    This optimisation is applied by AggressiveInstCombinePass, after the function has been completely inlined. In cases where it is applied, the i64 result of the count is truncated to i8, and this gets propagated to the counter.
    In the case where the result is assigned to a local variable, an earlier pass (before inlining) turns a truncate (for the cast) followed by a zero extend (for the return) into an and with 0xff. This persists, and AggressiveInstCombinePass then doesn't propagate this to the counter.
    I've posted some selected bits of LLVM IR here:
    https://gist.github.com/tomjnixon/d205a56ffc18af499418965ab7...
    These come from running clang with "-mllvm=-print-after-all" and grepping for "^define.*_Z20count_even_values_v1RKSt6vectorIhSaIhEE"
    This is why i don't see this as an optimisation pass "backfiring" or "go[ing] the other way around" (well, except for the "trunc,extend->and" one which we weren't talking about). Rather, it's just an optimisation not being applied. That might just be a language thing.
    
    nicula 4 months ago
    
    Thanks for looking into it!
    I modified the footnote to get rid of the misleading statements regarding the 'backfiring' of the optimization. :)
    
    fc417fc802 4 months ago
    
    Wouldn't that violate the as-if rule? If you assign to a u8 in layer 2 then the compiler must truncate regardless of the widening of the value upon return. It can't just ignore the narrowing assignment.
    
    dzaima 4 months ago
    
    At the very end there's a "movzx eax, dl", i.e. zero-extend the low 8 bits of the accumulated value.
  - tomn 4 months ago
    
    Just to correct my own comment:
    > with an uint8_t local variable and size_t return value, an earlier optimisation removes the cast to uint8_t, because it only has an effect when undefined behaviour has been triggered
    In this case, there is no undefined behaviour, because a narrowing cast to an unsigned type is well-defined. So, this could never have been a good explanation.
gblargg 4 months ago

I was hoping you could just provide an iterator_traits with a uint8_t difference type, but this is tied to the iterator type rather than specified separately, so you'd need some kind of iterator wrapper to do this.
- tomn 4 months ago
  
  Yeah, I thought about that too, but if you want to process more than 255 values this might not be valid, depending on the implementation of count_if.
wffurr 4 months ago

Which is discussed in the post and doesn’t work in GCC.
- tomn 4 months ago
  
  Oh right, I didn't see it in a couple of passes (and searching for cast); for anyone else looking it's in the 3rd footnote. Thanks.

newgre 4 months ago

Why did the compiler even chose to fetch DWORDs only in the first place? It's unclear to me why the accumulator (apparently) determines the vectorization width?

TinkersW 4 months ago

The accumulator is a vector type, with 64 bit sum you can only fit 4 into a 256 bit register.
After the loop it will do a horizontal add across the vector register to produce the final scalar result.

zombot 4 months ago

> it++

Should be ++it. Post-increment is generally more expensive, especially when you don't know the exact type you're applying it to.

listeria 4 months ago

I'd normally agree with you, but in this case this function is meant to be used for vectorizable input so it doesn't really matter since it's using a random-access iterator, otherwise you should go with the usual std::count_if().
Then again, it doesn't hurt to be pedantic.
- zombot 4 months ago
  
  It's not about being pedantic, it's about building the right habits. Building a habit of using x++ instead of ++x is suboptimal.

tdhz77 4 months ago

Wow what a title.

TheCoreh 4 months ago

Meta-question: Given how common :: is in programming languages, and how rare of a typo it is, is it really worth it for HN's title filtering logic to include a rule for automatically replacing it with a single :?

dcsommer 4 months ago

If the goal is brevity, the rules could first replace 'std::' with nothing.
- tialaramex 4 months ago
  
  This only makes sense for C++ where they weirdly namespaced their standard library after popularising the language. But as your parent points out, other languages use this naming style.
  - dcsommer 4 months ago
    
    For what languages would removing 'std::' realistically cause ambiguity for practicioners?
    
    tialaramex 4 months ago
    
    It's not exactly ambiguity, but I know I find it frustrating to read examples for language X and see oh, everybody (often even in presentations, blog pages, documentation) says foo.bar + baz.quux but you can't actually write that, what they meant was
    std::thing::foo.bar + (std::baz.::std::other::quux)
    And if you're stupid enough to just write foo.bar + baz.quux well that's nonsense and the compiler diagnostics won't have any suggestions for how to fix it, what a buffoon you are.
    I really don't enjoy this, like the trend for "narrative" cookery recipe style documentation where we're shown how to Quux a Baz [often with fragments that don't compile] in the library but given no indication whether we can Quux a Doodad or even whether that's a feature of the library's Baz or Quux or really what's going on, but hey, the author got to tell us about their trip to Tuscany so that's nice. Javadoc isn't perfect, but it's so much better than this.
nicula 4 months ago

haha I didn't even notice that

eurekabot123 4 months ago

[dead]

fsafdsaewr 4 months ago

[flagged]

jasonthorsness 4 months ago

Soon I am wondering if rather than rely on finicky auto-vectorization we’ll just have LLMs help “hand-optimize” more routines. Just like how memcmp and memcpy are optimized by hand today maybe like 20% of the program could just be LLM-assisted assembly. @ffmpeg on X thinks maybe they are starting to get it [1] and I had some success having an LLM generate working WebAssembly [2] https://x.com/ffmpeg/status/1898408922769223994?s=46

https://www.jasonthorsness.com/24

jsheard 4 months ago

Instead of replacing one finicky and temperamental approach (auto-vectorizers) with another (LLM codegen) I'd much rather see more exploration of explicit SIMD abstractions like Intel's ISPC language. Shader languages for GPUs had this figured out forever ago, there is a sensible middle-ground to be had between brittle compiler magic and no compiler at all.
- jasonthorsness 4 months ago
  
  It does seem odd that languages or standard libraries haven’t embraced some of the most common SIMD instructions more than they have, after so many decades of the instructions being available in most processors. I’ve used dotnet’s Vector libraries a bit that tries to auto-adapt to register length and falls back to software on chips that don’t support hardware instructions; it can still be pretty unwieldy and sometimes you have to use the fixed-size ones anyway. Will take a look at ISPC.
  - jsheard 4 months ago
    
    ISPC is a step above the .NET Vector stuff, it's more or less a GPU shader language except it compiles down to SSE/AVX/NEON code instead. In fact I think it was originally envisioned to be a shader language for Intel's ill-fated Larrabee GPU, since that was just meant to be an extremely wide x86 chip.
    
    garaetjjte 4 months ago
    
    Larrabee influenced ispc but was dead by that time. https://pharr.org/matt/blog/2018/04/18/ispc-origins
    
    neonsunset 4 months ago
    
    Indeed. .NET's SIMD primitives and platform intrinsics are more like the building blocks on top of which ISPC-like framework could've been built in .NET (more likely in F# since it's more flexible for libraries that want DSL-like customization).
- dzaima 4 months ago
  
  As far as the question in the OP is concerned, ISPC wouldn't help at all, as it'd necessarily go to basically equivalent LLVM IR.