Let’s build something Outrageous – Part 13: Finding Things Faster

In the previous blog post we started at 32 requests per second and managed to reach 724. I ended it by saying I was going to investigate a way to vectorize our search so we could check multiple node properties in batches instead of doing things one at a time. Well, I found a way, and the results are nothing short of outrageous.

In our Find method we were looking at a list of property values and comparing them to a single value one at a time. Computers since the early 90s have been able to perform SIMD (Single Instruction, Multiple Data) operations, and we’re going to take advantage of that.
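
To make the contrast concrete, here is roughly what that one-at-a-time search looks like. This is just a simplified sketch, not the actual RageDB code, and the name collect_indexes_scalar is made up for illustration:

#include <cstdint>
#include <vector>

// Simplified scalar sketch: one comparison per element, collecting
// the indexes of the values that match the target.
std::vector<std::uint32_t> collect_indexes_scalar(const std::vector<std::int16_t>& values,
                                                  std::int16_t target)
{
  std::vector<std::uint32_t> idxs;
  for (std::uint32_t i = 0; i < values.size(); ++i) {
    if (values[i] == target) {
      idxs.push_back(i);
    }
  }
  return idxs;
}

With SIMD the idea is to replace that single comparison per loop iteration with one that handles a whole register’s worth of values per step.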

There are a bunch of SIMD libraries and no real standard yet, but I understand a std::simd is in the works. Having never done anything with SIMD before, I went to YouTube and ran into a presentation called “My first SIMD” by Denis Yaroshevskiy:

In it, he mentions that he looked at nsimd, tsimd, xsimd, and Vc, but finally settled on the eve SIMD library by Joel Falcou. I decided this was a bit over my head, so I asked Denis:

“Let’s say you have a vector with 9 values [1,2,3,1,2,3,1,2,3] and you want to retrieve the positions of the values that were equal to 2 as a vector, so in our case we want to end up with 1, 4, 7 as our answer. How can you do that with Eve?”

I was hoping to get a hint on how to solve this; instead, Denis sent me a 100-line working solution on godbolt! Let’s go over it together:

  std::vector<std::int16_t>  v{ 1,2,3,1,2,3,1,2,3 };
  
  std::vector<std::uint32_t> idxs = collect_indexes(v.begin(), v.end(), [](auto x) { return x == 2; });

  for (std::uint32_t i : idxs) {
    std::cout << i << ' ';
  }
  std::cout << std::endl;

We start off with a vector of nine numbers: 1, 2, 3 repeated three times, just like in our question. Then we declare a vector of unsigned 32-bit integers created from a call to collect_indexes, which takes as parameters the beginning and end of our vector and a predicate, in this case one asking whether the number is equal to 2. Let’s see what collect_indexes does:

template <typename I, typename P>
std::vector<std::uint32_t> collect_indexes(I f, I l, P p)
{
  if (f == l) return {}; 

  using T = typename std::iterator_traits<I>::value_type;

It starts off by checking whether the range is empty (f == l) and, if so, just returns an empty vector. Otherwise it grabs the value type of the iterator passed in as T. Now it gets very specific to eve: it decides how many elements we are going to process per register (the cardinal) by taking the larger of the expected cardinals of our value type T and of our index type std::uint32_t:

using N = eve::fixed<sizeof(T) < sizeof(std::uint32_t) ? 
      eve::expected_cardinal_v<T> : 
      eve::expected_cardinal_v<std::uint32_t> >;
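
To put some (assumed) numbers on that, on a typical 256-bit AVX2 machine I would expect something like the following, though the exact values depend on your CPU:

// Rough illustration only - the exact cardinals depend on the target CPU.
// Assuming 256-bit (AVX2) registers:
//   eve::expected_cardinal_v<std::int16_t>  -> 16  (256 bits / 16-bit elements)
//   eve::expected_cardinal_v<std::uint32_t> ->  8  (256 bits / 32-bit elements)
// Our T is std::int16_t, which is smaller than std::uint32_t, so the ternary
// above picks the larger of the two and N{}() would be 16: sixteen elements
// per iteration step.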

We are going to iterate over a list of property values, and eve has many ways of doing that. We have to specify which traits we want our iteration to have. In our case we use “no aligning”, because if we let eve align the loads, the element indexes won’t start at 0. We set unroll to 1, meaning no unrolling, because it won’t help us here, and the last trait forces the cardinal. According to the documentation, the cardinal of a value is the number of elements inside the corresponding SIMD register: for any scalar type the cardinal is equal to 1, and for SIMD types it’s equal to the number of lanes in the register.

eve::algo::traits iteration_traits{
  eve::algo::no_aligning,
  eve::algo::unroll<1>,
  eve::algo::force_cardinal<N{}()>
};

Next we’ll create a vector to store the resulting indexes. It will be as large as the vector we are passing in, plus a little extra padding the size of our cardinal, with all values set to zero. We’ll also create a delegate, the callback object our iteration will run. We’ll get to what it does later.

std::vector<std::uint32_t> res((l - f) + N{}(), 0);

detail::collect_indexes_delegate<N, P> delegate{p, res.data()};

Next we convert our vector into something eve can work with, create the iteration from our traits, and run that iteration with our delegate:

auto processed = eve::algo::preprocess_range(iteration_traits, f, l);
auto iteration = eve::algo::for_each_iteration(
    processed.traits(), processed.begin(), processed.end()
);
iteration(delegate);

To wrap up collect_indexes, we take the pointer to the end of the written output, resize the result vector to the number of indexes actually written, shrink it, and return it:

  std::uint32_t* new_end = delegate.out.ptr;
  res.resize(static_cast<std::size_t>(new_end - res.data()));

  res.shrink_to_fit();  // optional - remove extra allocation

  return res;

Go over that part again; it may look complicated, but it’s really all boilerplate code for running through one vector and returning another. We haven’t gotten to the real magic yet, which lives in collect_indexes_delegate. So let’s go there:

template <typename N, typename P>
struct collect_indexes_delegate
{

  collect_indexes_delegate(P p, std::uint32_t* out) :
    p(p),
    idx([](int i, int) { return i; }), // 0, 1, 2, 3, 4 ...
    out{out}
  {}

  P p;

  eve::wide<std::uint32_t, N> idx;  
  // This would be i in for (i = 0; i != size; ++i)

  eve::algo::unaligned_ptr_iterator<std::uint32_t, N> out;  
  // A wrapper around a pointer

P is our predicate (x == 2). idx is a wide of uint32_t values, one per register lane, acting as the current indexes into our vector, the i of a regular for loop. out is a wrapper around the pointer into our results. And finally, here is the magic that happens at every step of the iteration:

  EVE_FORCEINLINE bool step(auto it, eve::relative_conditional_expr auto ignore, auto /*idx*/)
  {
    // Test the elements
    auto test = p(eve::load[ignore](it));

    // We need to convert it to the logical of index type.
    // This should be fixed after https://github.com/jfalcou/eve/issues/868
    auto idx_test = eve::convert(test, eve::as<eve::logical<std::uint32_t>>{});

    // Because we don't want to deal with ignored elements in compress store,
    // We'll do it here.
    // .mask returns `ignore` as logical
    idx_test = idx_test && ignore.mask(eve::as(idx_test));

    // Now we store the selected indexes
    out = eve::unsafe(eve::compress_store)(idx, idx_test, out);

    // Add to each index the step
    idx += N{}();

    return false;  // We don't want to break
  }

So what’s going on here? It loads a set of values from our vector and tests them against the predicate we passed in. Due to a current issue we have to convert the result of our test to a logical set of uint32_t, which is our return type. Since we’re testing 2, 4, 8, etc. values at a time, we need to ignore anything past our size (we had 9 values in our vector). We then write the indexes of the matching elements to “out” using compress_store and increment each of our indexes by the step.
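
To make that concrete, here is a rough trace of our 9-element example. For illustration, let’s pretend the cardinal worked out to 8 so that we get two steps; with a larger cardinal everything would fit in one step, but the mechanics are the same:

// v = {1,2,3,1,2,3,1,2,3}, predicate: x == 2, cardinal assumed to be 8
//
// Step 1: load v[0..7]          -> {1,2,3,1,2,3,1,2}
//         test                  -> {F,T,F,F,T,F,F,T}
//         idx                   -> {0,1,2,3,4,5,6,7}
//         compress_store writes 1, 4, 7 and advances out by 3
//         idx += 8              -> {8,9,10,11,12,13,14,15}
//
// Step 2: load v[8..15], ignore masks off the lanes past the last element
//         only v[8] == 3 is real, it doesn't match, so nothing is written
//
// Final result: res = {1, 4, 7}, and delegate.out.ptr - res.data() == 3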

It’s going to take some time to really understand what eve is doing. So now we get to the real question: is it worth it?

I integrated this solution into RageDB with a few minor tweaks (we want 64-bit indexes so we can have more than 4 billion nodes and relationships of a type) and reran the “find” benchmark from the previous blog post. Take a look at the results:

================================================================================
---- Global Information --------------------------------------------------------
> request count                                     253958 (OK=253958 KO=0     )
> min response time                                      0 (OK=0      KO=-     )
> max response time                                     53 (OK=53     KO=-     )
> mean response time                                     8 (OK=8      KO=-     )
> std deviation                                          2 (OK=2      KO=-     )
> response time 50th percentile                          7 (OK=7      KO=-     )
> response time 75th percentile                          9 (OK=9      KO=-     )
> response time 95th percentile                         11 (OK=11     KO=-     )
> response time 99th percentile                         12 (OK=12     KO=-     )
> mean requests/sec                                4163.246 (OK=4163.246 KO=-     )
---- Response Time Distribution ------------------------------------------------
> t < 800 ms                                        253958 (100%)
> 800 ms < t < 1200 ms                                   0 (  0%)
> t > 1200 ms                                            0 (  0%)
> failed                                                 0 (  0%)
================================================================================

Our speed jumped 5.75x: we went from 724 to 4,163 requests per second! That’s a massive leap in performance and absolutely worth it. I think we can reuse this collect_indexes function when filtering node and relationship properties while traversing later on. Be sure to scroll back up and watch the video if you haven’t yet, and please check out the eve library. Until next time.
