In previous blogs we examined a variety of database types and considered some of their appropriate domains and use cases. In this blog we look at actual NoSQL wide column store offerings from different vendors, in an attempt to highlight some of the key differences between otherwise similar technologies. We also consider the differences in performance and the costs associated with running each database on premises or in the cloud, so that it becomes easier to recognize the database that best suits your needs.
3. NoSQL - Wide Column Store
|Name||Cassandra||Google Cloud Bigtable||HBase|
|Description||Wide-column store based on Bigtable and DynamoDB concepts||Google’s NoSQL Big Data database service||Wide-column store based on Apache Hadoop and Bigtable concepts|
|Primary DB Model||Wide Column Store||Wide Column Store||Wide Column Store|
|Additional DB Models||None||None||None|
|Popularity Ranking (DBs Overall)||#10||#135||#17|
|Popularity Ranking (in Wide-Column Stores)||#1||#7||#2|
|Developer||Apache Software Foundation||Google||Apache Software Foundation|
|Current Release||3.11.2, February 2018||January 2018||1.4.3, April 2018|
|License||Open Source||Commercial||Open Source|
|Implementation Language||Java||C++, Java, Python||Java|
|Server Operating Systems||BSD
|SQL||SQL-like DML and DDL statements (CQL)||No||No|
|APIs / Access Methods||Proprietary Protocol (Thrift)||gRPC (using protocol buffers) API
HBase compatible API (Java)
RESTful HTTP API
|Supported Programming Languages||C#
|Replication Methods||Selectable Replication factor||Yes, replication between instance clusters||Selectable Replication factor|
|Consistency Concepts||Eventual Consistency, Immediate Consistency||Immediate Consistency||Immediate Consistency|
|Transaction Concepts||No||Atomic single-row operations||No|
|User Concepts||Access rights for users can be defined per object||Access rights for users, groups and roles defined by Google Cloud IAM||Access Control Lists (ACLs)|
Cassandra and HBase are both open source, which is part of the reason they place #1 and #2 in the Popularity ranking for wide-column stores, compared to Google’s commercial cloud-based Bigtable, which places at #7. All three databases support both Unix and Windows Operating Systems and are schema-free, but only Cassandra offers data typing.
None of the databases has native XML support. Only Cassandra allows secondary indexes, and those only in restricted form; Bigtable and HBase offer none at all. Cassandra is also the only database in this comparison that supports SQL-like DML and DDL statements, through its own query language called CQL (Cassandra Query Language). Whilst Bigtable and HBase both support Java APIs, HBase also supports Thrift, a cross-language RPC protocol that Cassandra exposes as well. Furthermore, HBase and Cassandra support a long list of programming languages, while Bigtable restricts code to Go and Java. This makes HBase and Cassandra more versatile and easier to access than Google’s Bigtable.
All three databases offer sharding as a data partitioning method, and can operate with immediate consistency, but Cassandra is the only database offering an optional configuration for eventual consistency as well. This means that Cassandra can offer low latency responses to read requests for highly time-sensitive applications, although at the risk of returning stale data because there is only eventual consistency between the nodes. While all three databases have concurrency and durability, only Bigtable offers atomic single-row operations as a transaction concept.
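The trade-off between eventual and immediate consistency can be made concrete with the usual quorum rule of thumb for Dynamo-style stores: with N replicas of each row, a read that contacts R replicas is guaranteed to overlap the latest acknowledged write on W replicas only when R + W > N. The sketch below is purely illustrative; the function names and example values are our own assumptions and not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int,
                           write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every acknowledged write set."""
    for count in (write_replicas, read_replicas):
        if not 1 <= count <= replication_factor:
            raise ValueError("replica counts must be between 1 and the replication factor")
    return read_replicas + write_replicas > replication_factor


if __name__ == "__main__":
    rf = 3
    # Quorum writes + quorum reads: 2 + 2 > 3, so reads are consistent.
    print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))  # True
    # Single-replica writes and reads: 1 + 1 <= 3, fast but possibly stale.
    print(read_sees_latest_write(rf, 1, 1))  # False
```

Choosing small R and W buys the low latency mentioned above; choosing quorum values buys consistency at the cost of waiting for more replicas.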
Both DynamoDB and Cassandra map the partition key onto a token ring using consistent hashing to determine where to store the data. By hashing the partition key, every node can work out which token range the key belongs to and, from there, which node is in charge of that range. Availability and replication strategy depend on the implementation of the database, but it means that Cassandra is a true peer-to-peer system, with no master nodes (and no single point of failure). It also means that you can send your queries to any node in the cluster (or, even better, have your driver send the request to the most appropriate node). This makes Cassandra extremely quick at answering queries that address data by partition key.
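A minimal sketch of the token ring idea follows. It is a toy model under stated assumptions: the `TokenRing` class, the node names, and the use of MD5 as the hash function are ours for illustration (Cassandra's default partitioner uses a different hash), and real clusters add rack and data-centre awareness that is omitted here.

```python
import bisect
import hashlib


class TokenRing:
    """Toy consistent-hashing ring: each node owns several tokens on the ring."""

    def __init__(self, nodes, tokens_per_node=8):
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._tokens = [token for token, _ in self._ring]

    @staticmethod
    def _hash(value: str) -> int:
        # MD5 is only a stand-in hash for this sketch.
        return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")

    def owner(self, partition_key: str) -> str:
        """Node owning the first token at or after the key's hash (wrapping around)."""
        index = bisect.bisect_left(self._tokens, self._hash(partition_key))
        return self._ring[index % len(self._ring)][1]

    def replicas(self, partition_key: str, replication_factor: int):
        """Walk clockwise from the owner, collecting distinct nodes."""
        start = bisect.bisect_left(self._tokens, self._hash(partition_key))
        found = []
        for step in range(len(self._ring)):
            node = self._ring[(start + step) % len(self._ring)][1]
            if node not in found:
                found.append(node)
            if len(found) == replication_factor:
                break
        return found


if __name__ == "__main__":
    ring = TokenRing(["node-a", "node-b", "node-c"])
    print(ring.owner("user:42"))
    print(ring.replicas("user:42", replication_factor=2))
```

Because any node (or a token-aware driver) can run the same hash, every participant arrives at the same owner without consulting a master.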
This reflects what we have seen in practice: HBase is slower than Cassandra, a finding also supported by most of the benchmarks out there. Cassandra’s architecture is based on the design of Amazon’s Dynamo (the basis of DynamoDB on AWS) and of Bigtable. It is very fast, specifically for the workloads it was designed for (there are many benchmarks showing 1 million writes per second). However, Bigtable can handle pretty much everything you throw at it, with some benchmarks showing writes of up to 2 million records per second, although this comes at a price.
Unlike some other NoSQL databases, HBase runs operations in real time directly against its database rather than as MapReduce jobs. HBase data is partitioned into tables, and tables are further split into column families. Versioning is available so that previous values of the data can be fetched (the history can be deleted every now and then to clear space via HBase compactions). This makes HBase well suited to real-time querying of Big Data. For example, Facebook has used it for messaging and real-time analytics.
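To illustrate the versioning idea, here is a deliberately simplified in-memory model. The `VersionedTable` class and its method names are hypothetical and are not the HBase client API; the only point is that each cell keeps several timestamped values and that a compaction step trims the older ones.

```python
from collections import defaultdict


class VersionedTable:
    """Toy model of versioned cells keyed by (row, 'family:qualifier')."""

    def __init__(self, max_versions=3):
        self.max_versions = max_versions
        # (row, column) -> list of (timestamp, value), newest first
        self._cells = defaultdict(list)

    def put(self, row, column, value, timestamp):
        versions = self._cells[(row, column)]
        versions.append((timestamp, value))
        versions.sort(key=lambda entry: entry[0], reverse=True)

    def get(self, row, column, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return self._cells.get((row, column), [])[:versions]

    def compact(self):
        """Drop everything beyond the newest `max_versions` values per cell."""
        for versions in self._cells.values():
            del versions[self.max_versions:]


if __name__ == "__main__":
    table = VersionedTable(max_versions=2)
    for ts, status in enumerate(["new", "paid", "shipped"], start=1):
        table.put("order-1", "info:status", status, timestamp=ts)
    print(table.get("order-1", "info:status"))               # latest value only
    print(table.get("order-1", "info:status", versions=5))   # full history so far
    table.compact()
    print(table.get("order-1", "info:status", versions=5))   # history after trimming
```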
HBase is optimized for reads. It is supported by a single-writer master, which results in a strict consistency model, and it uses ordered partitioning, which supports row scans. HBase is therefore well suited to range-based scans. However, HBase isn’t fully ACID compliant, although it does support certain ACID properties. Last but not least, running HBase requires ZooKeeper, a server for distributed coordination tasks such as configuration, maintenance, and naming.
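Why ordered partitioning helps range scans can be shown with a small sketch. The `OrderedStore` class below is a hypothetical stand-in, not HBase code: it keeps row keys sorted, so a scan from a start key (inclusive) to a stop key (exclusive) is a contiguous slice rather than a search across every row. The zero-padded price prefix in the row keys is our own assumption about how such keys might be designed.

```python
import bisect


class OrderedStore:
    """Toy key-ordered store: rows are kept sorted by row key."""

    def __init__(self):
        self._keys = []   # sorted row keys
        self._rows = {}   # row key -> value

    def put(self, row_key, value):
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
        self._rows[row_key] = value

    def scan(self, start_row, stop_row):
        """Yield (key, value) for start_row <= key < stop_row, in key order."""
        low = bisect.bisect_left(self._keys, start_row)
        high = bisect.bisect_left(self._keys, stop_row)
        for key in self._keys[low:high]:
            yield key, self._rows[key]


if __name__ == "__main__":
    store = OrderedStore()
    # Zero-padding keeps lexicographic order equal to numeric price order.
    for price, sku in [(1999, "lamp"), (450, "mug"), (12000, "desk"), (3000, "chair")]:
        store.put(f"{price:07d}|{sku}", {"price_cents": price})
    for key, row in store.scan("0001000", "0005000"):
        print(key, row)
```

A hash-partitioned store would scatter these rows across nodes, which is why the same query is cheaper on a key-ordered layout.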
Google Cloud Bigtable is accessible via the HBase API. Its performance is comparable to, but somewhat faster than, HBase running on an off-the-shelf server. Because Cloud Bigtable is accessed through the HBase API, it is natively integrated with much of the existing big data and Hadoop ecosystem and supports Google’s big data products. Additionally, data can be imported from or exported to existing HBase clusters through simple bulk ingestion tools using industry-standard formats. As such, Cloud Bigtable excels at large ingestion, analytics, and data-heavy serving workloads. It’s ideal for enterprises and data-driven organizations that need to handle huge volumes of data. In conclusion, while all three database systems excel at handling large and complex data-heavy workloads, Cassandra has demonstrated the best performance for view-layer workloads.
In short, the total cost of ownership for these databases is quite similar, but only when applied to the appropriate use case. Each solution shines and performs well in the areas it was designed for and quickly runs into performance problems and thus added costs in areas it serves only secondarily.
HBase is primarily recommended as the storage for batch operations while Cassandra is better suited to the view layer of big data. Often applying this architecture principle will provide the best performance but the overall use case and business needs should always be considered to determine which solution is best.
That being said, Cassandra is the fastest (and therefore cheapest) database with regard to writes, because of the close attention paid to how the data is stored when the database has been properly designed. Cassandra is therefore the right choice for applications where a high volume of writes is expected. One common use case for Cassandra is activity or usage logs. Logs involve a high volume of writes, so better write performance is ideal.
On the other hand, HBase allows data to be queried by ranges rather than only by matching column values. If the business case involves querying information based on ranges, then HBase will perform better and be cheaper to run than Cassandra. One such business case would be finding all items that fall within a particular price range.
Cloud Bigtable can handle either of these use cases, but both its biggest strength and its biggest drawback is probably that it is hosted on Google. Bigtable writes every single operation to the persistent log as it comes in, not in batches. In other words, it’s synchronous rather than asynchronous: by the time the server responds to the client, the data has already been written to a log (which is durable and replicated), not just to memory. This makes it highly consistent, and the same approach would make any other database relatively slow and costly.
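The synchronous logging behaviour described above can be sketched in a few lines. This is a generic write-ahead-log illustration on a local file, not Bigtable's implementation; the `DurableStore` class, the `replay` helper, and the tab-separated record format are all assumptions made for the example (keys and values must not contain tabs or newlines).

```python
import os
import tempfile


class DurableStore:
    """Toy store that acknowledges a write only after it is in the log on disk."""

    def __init__(self, log_path):
        self._log = open(log_path, "ab")
        self._memtable = {}

    def put(self, key, value):
        record = f"{key}\t{value}\n".encode("utf-8")
        self._log.write(record)
        self._log.flush()
        os.fsync(self._log.fileno())   # force the record to stable storage first
        self._memtable[key] = value    # only then update the in-memory view
        return True                    # acknowledgement to the caller

    def get(self, key):
        return self._memtable.get(key)

    def close(self):
        self._log.close()


def replay(log_path):
    """Rebuild the in-memory state from the log, e.g. after a crash."""
    recovered = {}
    with open(log_path, "rb") as log:
        for line in log:
            key, value = line.decode("utf-8").rstrip("\n").split("\t", 1)
            recovered[key] = value
    return recovered


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    store = DurableStore(path)
    store.put("row-1", "hello")
    store.put("row-1", "world")
    store.close()
    print(replay(path))
```

Forcing every write to disk before acknowledging it is what makes the approach expensive on ordinary storage, which is the cost the paragraph above refers to.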
However, the distributed file system behind Bigtable (formerly Google File System, now Colossus) is much faster than typical file systems, even though it’s distributed and each write is replicated. On benchmarks using YCSB, Google Cloud Bigtable demonstrates single-digit millisecond latency on both reads and writes even at the end, and therefore represents cost-effective performance at low latency.
In conclusion, we find that each database offering has its own unique strengths and weaknesses. HBase is better for range queries, while Cassandra is the only one enabling SQL-like queries and is the fastest database in terms of write speed. Google’s Cloud Bigtable promises to do it all and has a lot of options for integration with other big data tools, but it lacks consistency guarantees for multi-row or cross-table updates. Furthermore, Bigtable is only available via Google’s PaaS offering, which makes it the most expensive option relative to its open source counterparts HBase and Cassandra.
In short, which database is best for you will depend on your use case, so you should always test different technologies side by side and find the one that suits you best before committing to any one technology.
Whilst we are avid technology geeks ourselves and love the nitty-gritty nuts and bolts, kernel profiling and digging through stack traces, we also recognize the need for a higher-level, more digestible approach to understanding the cloud computing landscape. For that reason, the AVM Consulting Business Blog series has a slightly different tone, aimed at business or management professionals and decision makers. We hope that this series of cloud business blogs will provide valuable information and new insights into the otherwise highly technical and rapidly changing cloud environment. Lastly, it is important to note that the views expressed in these blogs merely represent the opinions, perspectives, and point of view of AVM Consulting, and although some of the findings are based on facts, the meat of the content is purely subjective and open to interpretation. This is what we think; do what you will with this information.