Let’s build something Outrageous – Part 15: Connected

Checking if two nodes are directly connected is something you often have to do in a graph. There are a few different ways to implement this feature depending on how the database keeps track of relationships. In Neo4j a double linked list of relationships is kept per node grouped by the relationship type in both the incoming and outgoing directions. To check if two nodes are directly connected, one has to traverse one of the lists (preferably the shortest one) and checking to see if the other node id is included in that list. If we don’t know the relationship type, we have to check all the groups (for dense nodes, or light nodes there are no groups and we check them all anyway).

In Amazon Neptune the SPOG index can be queried twice. Once with the first node in the S position and the second node in the O position, then again with the positions reversed (with the P position being the relationship type). If we don’t know the relationship type we can query the indexes twice per relationship type.

Checking if two nodes are directly connected is similar to checking for set membership, and one trick we could use is a bloom filter and variant data structures. Long time readers will remember this blog post outlining exactly how to do that and achieve 100x faster checks including a “double check” to get around the probabilistic nature of these data structures.

To keep a 1% false positive rate on 1 Billion relationships assuming each relationship entry takes 10 bits would cost us 1.25 GB in RAM. Today I would say that’s a high price to pay. If we project RAM out into the future it won’t be, so we’ll save implementing this idea for then. Instead today we’ll do it the old fashioned way: we’ll traverse the graph and check for the other node.

Starting backwards, let’s first work on how the http api for this will look:

:GET /db/{graph}/node/{type}/{key}/connected/{type2}/{key2}
:GET /db/{graph}/node/{type}/{key}/connected/{type2}/{key2}/{direction [all, in, out]}
:GET /db/{graph}/node/{type}/{key}/connected/{type2}/{key2}/{direction [all, in, out]}/{type TYPE_ONE}
:GET /db/{graph}/node/{type}/{key}/connected/{type2}/{key2}/{direction [all, in, out]}/{type(s) TYPE_ONE&TYPE_TWO}

We’ll do the same for node ids instead of node type/keys. We’re borrowing the same http url style from “Degree“, “Neighbors” and “Relationships” but we’ll return only a list of matching Relationships. We could have returned true/false but I think seeing the actual relationship(s) is better, the user can decide to use that information or discard it by converting an empty set into false. Let’s start off by creating our Connected http handler declaration. Standard boilerplate stuff here:

class Connected {
  class GetConnectedHandler : public httpd::handler_base {
  public:
    explicit GetConnectedHandler(Connected& connected) : parent(connected) {};
  private:
    Connected& parent;
    future<std::unique_ptr<reply>> handle(const sstring& path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) override;
  };
  ...
  public:
    explicit Connected(Graph &_graph) : graph(_graph), getConnectedHandler(*this), getConnectedByIdHandler(*this) {}
    void set_routes(routes& routes);
};

Next we will wire in the routes. I’ll show you the code for node type/key, since the code for node ids is basically more of the same:

void Connected::set_routes(routes &routes) {
  auto getConnected = new match_rule(&getConnectedHandler);
  getConnected->add_str("/db/" + graph.GetName() + "/node");
  getConnected->add_param("type");
  getConnected->add_param("key");
  getConnected->add_str("/connected");
  getConnected->add_param("type2");
  getConnected->add_param("key2");
  getConnected->add_param("options", true);
  routes.add(getConnected, operation_type::GET);

The real magic happens in “options”. We’ll get there soon. Let’s start by creating our handler function. First we validate our parameters to make sure we aren’t getting blank node types or keys:

future<std::unique_ptr<reply>> Connected::GetConnectedHandler::handle([[maybe_unused]] const sstring &path, std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
  bool valid_type = Utilities::validate_parameter(Utilities::TYPE, req, rep, "Invalid type");
  bool valid_key = Utilities::validate_parameter(Utilities::KEY, req, rep, "Invalid key");
  bool valid_type2 = Utilities::validate_parameter(Utilities::TYPE2, req, rep, "Invalid type 2");
  bool valid_key2 = Utilities::validate_parameter(Utilities::KEY2, req, rep, "Invalid key 2");

  if(valid_type && valid_key && valid_type2 && valid_key2) {

Let’s capture the URL options and handle the default empty case:

    std::string options_string;
    Direction direction = BOTH;
    options_string = req->param.at(Utilities::OPTIONS).c_str();

    if(options_string.empty()) {

What we want to do here is just get any relationships between these two nodes, regardless of type or direction. So we’ll send our local shard off to run NodeGetConnectedPeered but we’ve overloaded that function to take a second node type and key as parameters which it will use during filtering. It will return a list of relationships, which we turn to JSON and send back to our user:

      return parent.graph.shard.local().NodeGetConnectedPeered(req->param[Utilities::TYPE], req->param[Utilities::KEY], req->param[Utilities::TYPE2], req->param[Utilities::KEY2])
        .then([rep = std::move(rep)] (const std::vector<Relationship>& relationships) mutable {
          std::vector<relationship_json> json_array;
          json_array.reserve(relationships.size());
          for(Relationship r : relationships) {
            json_array.emplace_back(r);
          }
          rep->write_body("json", json::stream_object(json_array));
          return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
        });

So what exactly happens in NodeGetConnectedPeered? First we calculate the shard ids of the two nodes in our request. If they happen to belong to the same shard, then we can get the node id of the second node and use it as a filter in NodeGetRelationshipsIDs and pass those results to RelationshipsGetPeered.

  seastar::future<std::vector<Relationship>> Shard::NodeGetConnectedPeered(const std::string& type1, const std::string& key1, const std::string& type2, const std::string& key2) {
    uint16_t shard_id1 = CalculateShardId(type1, key1);
    uint16_t shard_id2 = CalculateShardId(type2, key2);

    // Shortcut if the shards are the same
    if (shard_id1 == shard_id2) {
      return container().invoke_on(shard_id1, [type1, key1, type2, key2, this](Shard &local_shard) {
        uint64_t node_id2 = local_shard.NodeGetID(type2, key2);
        std::vector<Link> links = local_shard.NodeGetRelationshipsIDs(type1, key1, node_id2);
        return RelationshipsGetPeered(links);
      });

    }

However, if the shards are not the same, we need to go to the shard of node 2 and get its id, then do as above.

    return container().invoke_on(shard_id2, [type2, key2](Shard &local_shard) { return local_shard.NodeGetID(type2, key2); }).then([this, shard_id1, type1, key1](uint64_t node_id2) {
      return container().invoke_on(shard_id1, [type1, key1, node_id2, this](Shard &local_shard) {
        std::vector<Link> links = local_shard.NodeGetRelationshipsIDs(type1, key1, node_id2);
        return RelationshipsGetPeered(links);
      });
    });

Ok, so what exactly happens in NodeGetRelationshipsIDs when we pass a node to filter? We validate the first node id, then get the outgoing relationships of that node and copy only those which contain the second node id. Then we do the same for the incoming relationships and return the combination.

    std::vector<Link> Shard::NodeGetRelationshipsIDs(uint64_t id, uint64_t id2) {
      if (ValidNodeId(id)) {
        uint64_t internal_id = externalToInternal(id);
        uint16_t type_id = externalToTypeId(id);
        std::vector<Link> ids;
        for (const auto &[type, list] : node_types.getOutgoingRelationships(type_id).at(internal_id)) {
          std::copy_if(std::begin(list), std::end(list), std::back_inserter(ids), [id2](Link link) { return link.node_id == id2; });
        }
        for (const auto &[type, list] : node_types.getIncomingRelationships(type_id).at(internal_id)) {
          std::copy_if(std::begin(list), std::end(list), std::back_inserter(ids), [id2](Link link) { return link.node_id == id2; });
        }
        return ids;
      }
      return std::vector<Link>();
    }

Nodes can have zero to millions of relationships. So this is another good place we can use SIMD instructions to help us. More on that in a future installment.

So what about the options captured in the URL? What ever happened to those? Let’s to back to the handler and deal with them. First we want to figure out if they have a direction.

    std::vector<std::string> options;
    boost::split(options, options_string, [](char c){return c == '/';});

    // Erase empty first element from leading slash
    options.erase(options.begin());

    // Parse Direction
    boost::algorithm::to_lower(options[0]);
    if (options[0] == "in") {
      direction = IN;
    } else if (options[0] == "out") {
      direction = OUT;
    }

If the options vector is just 1 element, then we know it’s just a direction so deal with it.

    switch(options.size()) {
    case 1:
      // Get Node Degree with Direction
      return parent.graph.shard.local().NodeGetConnectedPeered(req->param[Utilities::TYPE], req->param[Utilities::KEY], req->param[Utilities::TYPE2], req->param[Utilities::KEY2], direction)
        .then([rep = std::move(rep)] (const std::vector<Relationship>& relationships) mutable {
          std::vector<relationship_json> json_array;
          json_array.reserve(relationships.size());
          for(Relationship r : relationships) {
            json_array.emplace_back(r);
          }
          rep->write_body("json", json::stream_object(json_array));
          return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
        });

If not we need to figure out if we have a single relationship type:

    case 2: {
      // Get Node Degree with Direction and Type(s)
      std::vector<std::string> rel_types;
      // Deal with both escaped and unescaped "&"
      boost::split(rel_types, options[1], boost::is_any_of("&,%26"), boost::token_compress_on);
      // Single Relationship Type
      if (rel_types.size() == 1) {
        return parent.graph.shard.local().NodeGetConnectedPeered(req->param[Utilities::TYPE], req->param[Utilities::KEY], req->param[Utilities::TYPE2], req->param[Utilities::KEY2], direction, rel_types[0])
          .then([rep = std::move(rep), rel_type = rel_types[0]] (const std::vector<Relationship>& relationships) mutable {
            std::vector<relationship_json> json_array;
            json_array.reserve(relationships.size());
            for(Relationship r : relationships) {
              json_array.emplace_back(r);
            }
            rep->write_body("json", json::stream_object(json_array));
            return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
          });
      }

Or multiple relationship types:

      // Multiple Relationship Types
      return parent.graph.shard.local().NodeGetConnectedPeered(req->param[Utilities::TYPE], req->param[Utilities::KEY], req->param[Utilities::TYPE2], req->param[Utilities::KEY2], direction, rel_types)
        .then([rep = std::move(rep)] (const std::vector<Relationship>& relationships) mutable {
          std::vector<relationship_json> json_array;
          json_array.reserve(relationships.size());
          for(Relationship r : relationships) {
            json_array.emplace_back(r);
          }
          rep->write_body("json", json::stream_object(json_array));
          return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
        });
    }

I’ll spare you the code for the different options and node ids vs types and keys, but you can see them in all their glory on the github repository. Alright, that’s it until the next installment. If you want to help out please reach out to me or join the Slack.

Oh hey, before you go. Did you know the movie “The Mitchells vs The Machines” was renamed “Connected” and renamed back because there are like a dozen movies called “Connected” already?

See what did I tell you:

Max De Marzi

Graphs, Graphs, and nothing but the Graphs