In the last few blogs we examined a variety of NoSQL databases and considered some of their appropriate domains and use cases. In this edition we conclude our multi-part database blog series by comparing graph database offerings from different vendors, in an attempt to highlight some of the key differences between otherwise similar technologies. We also consider the differences in performance and the costs associated with running each database on premises or in the cloud, so that it becomes easier to recognize the database that best suits your needs.
5. NoSQL - Graph Databases
|Name||ArangoDB||Neo4j||OrientDB|
|Description||A multi-model DBMS||Open source graph database||Multi-model DBMS|
|Primary DB Model||Graph DBMS||Graph DBMS||Graph DBMS|
|Additional DB Models||Document Store
|Popularity Ranking (DBs Overall)||#63||#22||#49|
|Popularity Ranking (in Graph Stores)||#5||#1||#4|
|Developer||Arango GmbH||Neo4j, Inc.||OrientDB Ltd.|
|Current Release||3.2.4, September 2017||3.3.3, February 2018||2.2.22, June 2017|
|License||Open Source||Open Source||Open Source|
|Server Operating Systems||Linux
|All OS with Java JDK|
|Data Scheme||Schema-free||Schema-free and schema-optional||Schema-free|
|XML Support||Not Standard, need JSON translator||Not Standard, need JSON translator||No|
|SQL||No||No||SQL-like Query Language, No joins|
|APIs / Access Methods||HTTP API
|Cypher Query Language
RESTful HTTP API
Spring Data Neo4j
RESTful HTTP/JSON API
TinkerPop Stack (with Blueprints, Gremlin, Pipes)
|Supported Programming Languages||C#
|Replication Methods||Master-Slave Replication||Causal Clustering using Raft Protocol||Master-Master Replication|
|Consistency Concepts||Eventual Consistency
|Causal and Eventual Consistency (configurable in causal cluster setup)
Immediate consistency (in stand-alone mode)
|Foreign Keys||No||Yes (relationships in graphs)||Yes (relationships in graphs)|
|User Concepts||Access Control Lists (ACL) per each Arango Server||Users, roles and permissions, Pluggable authentication with supported standards (LDAP, Active Directory, Kerberos)||Access rights for users and roles, record level security configurable|
Graph databases have gained popularity in recent years due to their ability to model complex relationships between objects. At first glance the three open source databases compared here look quite similar, but a closer look makes it clear why Neo4j ranks first as the most popular graph database. All three databases support a variety of operating systems including Linux, OS X and Windows, but only Neo4j offers both schema-free and schema-optional data structuring.
However, unlike the other two databases, Neo4j does not offer sharding or any other partitioning method, and it replicates data across a causal cluster using its own implementation of the Raft consensus protocol rather than master-master or master-slave replication. As a result, the database exhibits causal and eventual consistency across the cluster when configured to replicate, and immediate consistency only in stand-alone mode.
All three databases have ACID transactions, but only Neo4j and OrientDB support foreign keys, in the form of the relationships in the graph. Finally, only ArangoDB and OrientDB have in-memory capabilities for fast reads, while Neo4j offers pluggable authentication with supported standards such as LDAP, Active Directory or Kerberos, making it easier for corporations to manage access controls with configurations they already have.
OrientDB and ArangoDB are both native multi-model databases, whereas Neo4j is strictly a graph database. Given this, one would expect Neo4j to be optimized for graph-specific operations such as shortest path and neighbors of neighbors, and thus to outperform the other two databases. What we found, however, painted a completely different picture.
For single reads ArangoDB clearly had higher throughput than Neo4j or OrientDB, and the gap between them was consistent: Neo4j came in second with 50% less throughput than ArangoDB, and OrientDB had 50% less than Neo4j. For single writes the difference was minimal, with ArangoDB performing ever so slightly better than OrientDB. For synchronized single writes ArangoDB also wins over Neo4j, with 50% higher throughput.
In terms of aggregation ArangoDB clearly takes the lead, outperforming Neo4j with nearly twice as much throughput. Unfortunately the figures for OrientDB were disappointing: the database took 25x as long as ArangoDB to aggregate over a single collection (for example, computing the age distribution for everyone in the sample network from SNAP’s Pokec).
For specific graph operations such as shortest path analysis, Neo4j scored a win over ArangoDB and OrientDB when ArangoDB used its standard storage engine configuration “MMFiles”, with ArangoDB taking 3x and OrientDB 12x as long to complete the operation. However, when ArangoDB was used with the newer “RocksDB” storage engine, it outperformed all other databases by a wide margin. When searching for distinct, direct neighbors (highly related data profiles), ArangoDB comes in first, with Neo4j twice as slow and OrientDB approximately 6x as slow. An attempt to locate neighbors of neighbors together with their contained data produced a very similar picture, with Neo4j having half as much throughput as ArangoDB and OrientDB half that again.
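For readers less familiar with the graph operations named above, the following is a minimal Python sketch of what "direct neighbors", "neighbors of neighbors" and "shortest path" mean on a tiny, made-up graph. It is not the benchmark code and does not use any of the three databases; the graph, the vertex names and the function names are all hypothetical, and "neighbors of neighbors" is assumed here to mean every distinct vertex within two hops.

```python
from collections import deque

# Hypothetical toy graph as adjacency sets; NOT the Pokec dataset.
GRAPH = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat", "eve"},
    "eve": {"dan"},
}


def neighbors(graph, start):
    """Distinct direct neighbors of `start`."""
    return set(graph.get(start, ()))


def neighbors_of_neighbors(graph, start):
    """Distinct vertices within one or two hops of `start` (assumed definition)."""
    first = neighbors(graph, start)
    second = set()
    for vertex in first:
        second |= set(graph.get(vertex, ()))
    return (first | second) - {start}


def shortest_path(graph, source, target):
    """Breadth-first search on an unweighted graph.

    Returns one shortest path as a list of vertices, or None if unreachable.
    """
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for nxt in graph.get(current, ()):
            if nxt in parents:
                continue
            parents[nxt] = current
            if nxt == target:
                # Walk the parent links back to the source.
                path = [nxt]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(nxt)
    return None
```

A native graph database performs these traversals over stored relationships rather than an in-memory dictionary, which is why such operations are a natural benchmark for comparing graph engines.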
Finally, in terms of memory usage Neo4j used 2.5x as much memory as ArangoDB or OrientDB. Although the difference is minimal, ArangoDB used 7% less memory on average than OrientDB. Overall, the performance results clearly favor ArangoDB, with Neo4j in second and OrientDB in a distant third place.
The cost aspect of this comparison is rather simple. ArangoDB Community is free, and the Basic and Enterprise versions are paid. ArangoDB Basic offers a little more functionality than Community and a basic SLA with 9x5 support, and comes in at around $15,000 per year. The full-fledged multi-model functionality with all the interesting features like S2S replication, satellite collections, smartgraphs and additional security comes with the Enterprise edition, starting at $36,000 per year in licensing fees alone. There are different subscriptions for support and varying tiers of SLA, some of which are included and some of which come at a surcharge. It's a big price tag, but from what we have seen it delivers what it promises.
Neo4j also offers a community version for free under a GPL v3 license, and has 4 different versions under its paid license tier: Commercial, Developer, Evaluation and Educational. The commercial licenses range from $299 to $599 per month depending on the level of SLA chosen (standard, premium, or enterprise). This pricing reflects hosting a 4GB DB on 2 cores with a 32GB SSD on AWS, and naturally prices increase as more memory and cores are added. Some minor discounts can be achieved by hosting on Azure (approx. -10%) or GCP (approx. -20%).
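To make the Neo4j figures easier to compare with annual license prices, here is a rough Python sketch that applies the approximate provider discounts to the quoted monthly range and annualizes the result. It assumes the discounts apply directly to the AWS monthly price quoted above; the constant and function names are hypothetical, and amounts are held in cents to avoid floating-point rounding.

```python
# Quoted commercial price range on AWS, in cents per month ($299 to $599).
AWS_MONTHLY_CENTS = (29_900, 59_900)

# Approximate discounts from the text, in percent (assumed to apply to the AWS price).
DISCOUNT_PERCENT = {"AWS": 0, "Azure": 10, "GCP": 20}


def monthly_price_cents(base_cents, provider):
    """Monthly price in cents after the provider's approximate discount."""
    return base_cents * (100 - DISCOUNT_PERCENT[provider]) // 100


def annual_range_dollars(provider):
    """(low, high) annual cost in dollars for the quoted configuration."""
    low, high = (monthly_price_cents(c, provider) * 12 for c in AWS_MONTHLY_CENTS)
    return low / 100, high / 100


if __name__ == "__main__":
    for provider in DISCOUNT_PERCENT:
        low, high = annual_range_dollars(provider)
        print(f"{provider}: ${low:,.2f} to ${high:,.2f} per year")
```

These are estimates for the small configuration quoted above only; larger instances would need their own price points.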
OrientDB has a free community edition and a paid Enterprise edition, which comes in at $3,125 per core per year for production/live environments and $1,600 per core per year for test/dev environments. There is an initial minimum purchase of 6 cores, and support is charged at a flat rate of $15,000 per year (for an unlimited number of cores). Assuming 4 production cores and 2 test/dev cores, plus support, the first-year cost of starting with OrientDB at an enterprise level comes to a total of $30,700 (4 x $3,125 + 2 x $1,600 + $15,000), which is approximately $2,558 per month, and quite a lot for the performance it offers.
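The OrientDB arithmetic above can be reproduced with a short Python sketch, which also makes it easy to try other core counts. The prices are the ones quoted in this post; the function and constant names are hypothetical, and the 6-core minimum is assumed to apply to production and test/dev cores combined.

```python
PROD_PRICE_PER_CORE = 3_125   # USD per production core per year
TEST_PRICE_PER_CORE = 1_600   # USD per test/dev core per year
SUPPORT_FLAT_FEE = 15_000     # USD per year, unlimited cores
MIN_CORES = 6                 # minimum initial purchase (assumed: prod + test/dev combined)


def orientdb_annual_cost(prod_cores, test_cores, with_support=True):
    """Annual Enterprise cost in USD for the given core counts."""
    if prod_cores + test_cores < MIN_CORES:
        raise ValueError(f"initial purchase must cover at least {MIN_CORES} cores")
    licenses = prod_cores * PROD_PRICE_PER_CORE + test_cores * TEST_PRICE_PER_CORE
    return licenses + (SUPPORT_FLAT_FEE if with_support else 0)


if __name__ == "__main__":
    total = orientdb_annual_cost(prod_cores=4, test_cores=2)
    print(f"Annual: ${total:,}  Monthly: ${total / 12:,.2f}")
```

With 4 production cores and 2 test/dev cores this yields $30,700 per year, matching the figure above.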
There are many different reasons to choose a certain type of database solution. What we have done here is look at some of the key features of some of the most common offerings on the market and examine the differences between them, in an attempt to give a better picture of what is available and what is appropriate for which use case. While this blog series highlights some of the aspects we consider when examining the needs of our clients before recommending a Big Data solution, we acknowledge that many other factors can be important for each use case.
One high-level approach we have found useful when weighing the tradeoffs of different solutions is the application of the CAP theorem. Based on the availability and consistency needs of the client, it can become clear whether a big data or a relational database solution is more appropriate. Furthermore, the volume of writes and the type of queries should be considered, to determine whether range-based queries or fast writes are needed. Answering these questions can help navigate the many options out there and arrive at a solution that is right for your specific needs.
To conclude this multi-part blog series, we would like to note that the views expressed in these blogs represent solely our opinions and experiences over the years, and we recommend that developers and architects always test their own use cases before committing to any technology.
Whilst we are avid technology geeks ourselves and love the nitty-gritty nuts and bolts, kernel profiling and digging through stack traces, we also recognize the need for a higher-level, more digestible approach to understanding the cloud computing landscape. Born of this perceived need, the AVM Consulting Business Blog series has a slightly different tone, aimed at business or management professionals and decision makers. We hope that this series of cloud business blogs will provide valuable information and new insights into the otherwise highly technical and rapidly changing cloud environment. Lastly, it is important to note that the views expressed in these blogs merely represent the opinions, perspectives, and point of view of AVM Consulting, and although some of the findings are based on facts, the meat of the content is purely subjective and open to interpretation. This is what we think; do what you will with this information.