I think that some of the most interesting analytic problems are graph problems. I'm always interested in new ways to store and access graphs. As such, I really like the work being done by Tinkerpop to create Open Source Software to make property graphs more accessible over a wide variety of datastores. Since key-value stores like Oracle NoSQL Database are well-suited to storing property graphs, I decided to extend the Blueprints API to work with it. Below I'll discuss some of the implementation details, but you can check out the finished product here: http://github.com/dwmclary/blueprints-oracle-nosqldb.
What's in a Property Graph?
In the most general sense, a graph is just a collection of vertices and edges. Vertices and edges can have properties: weights, names, or any number of other traits. In an undirected graph, edges connect vertices without direction. A directed graph specifies that all edges have a head and a tail --- a direction. A multi-graph allows multiple edges to connect two vertices. A "property graph" encompasses all of these traits.
Key-Value Stores for Property Graphs
Key-Value stores like Oracle NoSQL Database tend to be ideal for implementing property graphs. First, if any vertex or edge can have any number of traits, we can treat it as a hash map. For example:
Vertex["name"] = "Mary"
Vertex["age"] = 28
Vertex["ID"] = 12345
and so on. This is a natural key-value relationship: the key "name" maps to the value "Mary." Moreover if we maintain two hash maps, one for vertex objects and one for edge objects, we've essentially captured the graph. As such, any scalable key-value store is fertile ground for planting graphs.
Oracle NoSQL Database as a Scalable Graph Database
While Oracle NoSQL Database offers useful features like tunable consistency, what lends it to storing property graphs is the storage guarantees around its key structure. Keys in Oracle NoSQL Database are divided into two parts: a major key and a minor key. The storage guarantee is simple. Major keys will be distributed across storage nodes, which could encompass a large number of servers. However, all minor keys which are children of a given major key are guaranteed to be stored on the same storage node. For example, the vertices:
/Personnel/Vertex/1
and
/Personnel/Vertex/2
May be stored on different servers, but
/Personnel/Vertex/1-/name
and
/Personnel/Vertex/1-/age
will always be on the same server. This means that we can structure our graph database such that retrieving all the properties for a vertex or edge requires I/O from only a single storage node. Moreover, Oracle NoSQL Database provides a storeIterator which allows us to store a huge number of vertices and edges in a scalable fashion. By storing the vertices and edges as major keys, we guarantee that they are distributed evenly across all storage nodes. At the same time we can use a partial major key to iterate over all the vertices or edges (e.g. we search over /Personnel/Vertex to iterate over all vertices).
Fork It!
The Blueprints API and Oracle NoSQL Database present a great way to get started using a scalable key-value database to store and access graph data. However, a graph store isn't useful without a good graph to work on. I encourage you to fork or pull the repository, store some data, and try using Gremlin or any other language to explore.