Counting Triangles

Our last post was about a Triangle Count query. It referenced another blog post from Kuzu where they explained their use of a Multi Way Join Algorithm to count 41 million triangles in 1.62 seconds. Using only binary joins it would take them 51.17 seconds to achieve the same result. My attempts to run the query using Lua on Rage landed at 9.5 seconds one node at a time and 5.6 seconds using the batch queries. So that got me thinking, how about Neo4j?

I loaded the dataset into Neo4j using LOAD CSV:

LOAD CSV FROM 'https://raw.githubusercontent.com/kuzudb/kuzudb.github.io/main/data/web-berkstan/web-node.csv' AS row
CREATE (n:Page {id: row[0]});


LOAD CSV FROM 'https://raw.githubusercontent.com/kuzudb/kuzudb.github.io/main/data/web-berkstan/web-edge.csv' AS row
MATCH (n), (n2)
WHERE id(n) = toInteger(row[0]) and id(n2) = toInteger(row[1])
CREATE (n)-[:Links]->(n2);

Then I tried the long query. It took 31 seconds to come up with the answer:

MATCH (a:Page)-[e1:Links]->(b:Page)-[e2:Links]->(c:Page)-[e3:Links]->(a)
RETURN count(*);

The binary join style query came in at 23 seconds:

MATCH (a:Page)-[e1:Links]->(b:Page)
WITH *
MATCH (b:Page)-[e2:Links]->(c:Page)
WITH *
MATCH (c)-[e3:Links]->(a)
RETURN count(*)

The two queries are semantically the same, so it may be a bit surprising to see the time off by almost 10 seconds. This is of course due to the two different query plans generated by the Neo4j query optimizer. It is also a bit surprising that they both beat Kuzu’s binary join query plan of 51 seconds. I decided to take another shot at it. Let’s see the query I had. I get all the node ids, then for all of them get the outgoing and incoming neighbors much like Kuzu does to create factorized hash tables. Then for each node “a” I create a map of the “c” nodes it connects to. I loop through ab, then bc, and check if the c node exists in the map I created earlier. It looks like this:

local count = 0 
local a = AllNodeIds(0, 1000000)
local outs = NodeIdsGetNeighborIds(a, Direction.OUT, "Links") --ab and bc
local ins = NodeIdsGetNeighborIds(a, Direction.IN, "Links") --ac
 
for _, node_a in pairs(a) do
    local bs = outs[node_a]
    local cs = ins[node_a]
     
    local set = {}
    for _, node_c in pairs(cs) do
       set[node_c] = true   
    end   
     
    for _, node_b in pairs(bs) do
         for _, node_c in pairs(outs[node_b]) do
            if set[node_c] then
                count = count + 1
            end
         end    
    end    
 
end
 
count

That got me thinking… having to create this map of the “c” nodes of each “a” node is slowing me down. Wouldn’t it be better if I could store the relationship links as a map already instead and avoid the building step? So I tried it. But I can’t store relationships using a regular map of node to relationship since a node can have multiple relationships to the same node. So it had to be a multimap. I tried the standard library, I tried the new Boost unordered multi map and I tried the flat multi map. All 3 of these options came with storage overhead and varying degrees of memory jumping. Unfortunately I failed to produce anything faster than what I had before. Lua doesn’t have the concept of a multimap so the Sol library was converting the multimap into a regular map and ignoring the potential of multiple entries. Also because Rage has a Shard per Core design, it had to copy multimaps from multiple cores into the core executing the query and all that copying was harming performance. I gave up on the idea, but maybe somebody better at this can run with it.

Still, I could not stop thinking about how to count triangles faster…Now, nobody and I do mean nobody loves counting more than Count von Count from Sesame Street. So I asked him to help me understand how Kuzu was so fast. “Yoda, I am not,” he said… “I count in order”. I count in order…yeah, I have to sort the ids and compare them every time. I tried that already and it was slower. But why am I sorting ids every time, they should be sorted just once on creation and stay sorted. So that’s what we will do. Every time we add a link to our relationship chain, we will insert it into the correct spot. Is this going to cost us a bit of insert performance… yes, but I think it’s worth it.

void Shard::insert_sorted(uint64_t id1, uint64_t external_id, std::vector<Link> &links) const {
      auto link = Link(id1, external_id);
      links.insert(std::ranges::upper_bound(links, link), link);
    }

This is possible because Links can be ordered first by the node id and if those are equal then by the relationship id:

        std::strong_ordering operator<=>(const Link& that) const {
            if (auto cmp = node_id <=> that.node_id; cmp != 0) {
                return cmp;
            }

            return rel_id <=> that.rel_id;
        }

That all works well but we still don’t have a good way to get the intersection of two tables in Lua that doesn’t involve creating a map. So instead we will perform the triangle count method inside C++ and call it from Lua. Thanks to the “auto” keyword, it looks very similar to the Lua version. The “get0()” methods are performing work on all the shards and combining the results.

    uint64_t Shard::TriangleCount(const std::string& rel_type) {
        uint64_t count = 0;
        auto max = AllNodesCountPeered().get0();
        auto a = AllNodeIdsPeered(uint64_t(0), max).get0();
        auto outs = NodeIdsGetNeighborIdsPeered(a, Direction::OUT, rel_type).get0();
        auto ins = NodeIdsGetNeighborIdsPeered(a, Direction::IN, rel_type).get0();

        for (const auto &node_a : a) {
            auto bs = outs[node_a];
            auto cs = ins[node_a];
            for (const auto &node_b : bs) {
                auto cs2 = outs[node_b];
                count += IntersectIds(cs, cs2).size();
            }
        }

        return count;
    }

So was it worth it? Did the Count’s advice actually help us achieve our goal of beating Kuzu at 1.62 seconds? Oh yes! Rage is able to count these 41 million triangles in 1.357 seconds. Still, I wish I could do the IntersectIds (which is just std::set_intersection ) method in Lua directly. Maybe in assembly? Hit me up if you know how. UPDATE: We can do better. Read below…

Back up in the code you may have noticed the IntersectIds section. It was just calling std::set_intersection and getting the size:

...
count += IntersectIds(cs, cs2).size();
...

    std::vector<uint64_t> Shard::IntersectIds(const std::vector<uint64_t> &sorted_ids, const std::vector<uint64_t> &sorted_ids2) const {
        std::vector<uint64_t> intersection;
        intersection.reserve(std::min(sorted_ids.size(), sorted_ids2.size()));
        std::set_intersection(sorted_ids.begin(), sorted_ids.end(), sorted_ids2.begin(), sorted_ids2.end(),
          std::back_inserter(intersection));
        return intersection;
    }

But we’re just counting triangles, we don’t actually need to keep track of the matching node ids at all. All we need to do is count the matches. So I stole this code from Nathan Kurz and tweaked it a little bit.

    size_t Shard::IntersectIdsCount(const uint64_t *A, const size_t lenA, const uint64_t *B, const size_t lenB) const {
        size_t count = 0;
        if (lenA == 0 || lenB == 0) {
            return 0;
        }

        const uint64_t *endA = A + lenA;
        const uint64_t *endB = B + lenB;

        while (1) {
            while (*A < *B) {
            SKIP_FIRST_COMPARE:
                if (++A == endA) {
                    return count;
                }
            }
            while (*A > *B) {
                if (++B == endB) {
                    return count;
                }
            }
            if (*A == *B) {
                count++;
                if (++A == endA || ++B == endB)
                    return count;
            } else {
                goto SKIP_FIRST_COMPARE;
            }
        }
    }

Now when we run our Triangle Count algorithm we can find the 41 million triangles in 1.278 seconds:

Tagged , , , , ,

Leave a comment