Apollo Storage 2017.10.30 Monday

In this post, we introduce Apollo Storage, which is part of Orb DLT (Orb Distributed Ledger Technology), and one of the performance improvements we have made for it.

Orb DLT

Orb DLT is a platform software product. It consists of Apollo, which executes decentralized transactions; Core, which controls the behavior of coins via Apollo; and Toolbox, which is a suite of tools. You can check our post for more information about Orb DLT. Apollo in turn consists of Apollo Storage, which provides simple data management; Apollo Transaction, which executes decentralized transactions on the specified data; and Apollo Analytics, which executes analytics queries in a distributed manner. In this post, we mainly introduce Apollo Storage, the data management component.

Apollo Storage

Apollo Storage is a storage layer with a multi-dimensional map structure. It provides simple CRUD (Create, Read, Update, Delete) interfaces.

Apollo Storage manages records that are identified by a primary key. You access a value by a row key (partition key) and a column name. If the underlying storage implementation can order the records within a partition by another key (clustering key), the primary key is (row key, clustering key, column name) and you can take advantage of the sorted order of the records.
Apollo Storage can work with multiple storage implementations because it abstracts the storage layer behind simple and flexible interfaces. You can switch the storage implementation without changing any other component. For example, one of the major implementations for Apollo Storage is Cassandra. Other components access their data through the Apollo Storage interfaces and do not depend on Cassandra directly.
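
To make the data model and the abstraction concrete, here is a minimal sketch in Python. It is not the actual Apollo Storage API: the names Storage, InMemoryStorage, put, get, delete and scan are hypothetical and chosen only for illustration. It shows a (row key, clustering key, column name) primary key and a backend-agnostic CRUD interface that a Cassandra-backed class could also implement.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional, Tuple

# Hypothetical primary key layout: (row key, clustering key, column name).
Key = Tuple[str, str, str]


class Storage(ABC):
    """Hypothetical backend-agnostic CRUD interface (illustration only)."""

    @abstractmethod
    def put(self, key: Key, value: Any) -> None: ...

    @abstractmethod
    def get(self, key: Key) -> Optional[Any]: ...

    @abstractmethod
    def delete(self, key: Key) -> None: ...

    @abstractmethod
    def scan(self, row_key: str) -> List[Tuple[Key, Any]]: ...


class InMemoryStorage(Storage):
    """Toy backend; a Cassandra-backed class would expose the same methods."""

    def __init__(self) -> None:
        self._data: Dict[Key, Any] = {}

    def put(self, key: Key, value: Any) -> None:
        # Create and update are the same operation in this toy model.
        self._data[key] = value

    def get(self, key: Key) -> Optional[Any]:
        return self._data.get(key)

    def delete(self, key: Key) -> None:
        self._data.pop(key, None)

    def scan(self, row_key: str) -> List[Tuple[Key, Any]]:
        # Return one partition, ordered by clustering key and then column name.
        rows = [(k, v) for k, v in self._data.items() if k[0] == row_key]
        return sorted(rows, key=lambda kv: kv[0])
```

Callers that depend only on Storage would keep working if InMemoryStorage were replaced by another implementation, which is the property the abstraction is meant to provide.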

Of course, the underlying implementation greatly affects the performance of Apollo Storage. If you use Cassandra as the implementation, the performance of Apollo Storage, and of Apollo Transaction built on top of it, depends on Cassandra's performance. At Orb, we are working on our own performance improvements to Cassandra to speed up Apollo Storage and Apollo Transaction. Below, we explain a performance issue we encountered and one of the improvements that resolves it.

Performance improvement of Apollo Storage

Issue

Apollo Transaction writes data through the conditional write interface of Apollo Storage, which our own distributed transaction protocol relies on. In many storage implementations, a conditional write is heavier than the other operations. In Cassandra in particular, it greatly affects transaction performance because it is executed by the CAS (Compare And Set) function, which is based on Paxos.
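
The semantics of a conditional write can be sketched in a few lines of Python. This is a single-process toy, not Apollo Storage or Cassandra code: the class ConditionalStore, the method put_if, and the state names in the usage note are hypothetical. A lock stands in for the Paxos-based consensus that Cassandra uses to make the compare and the set atomic across nodes.

```python
import threading
from typing import Any, Dict, Hashable, Optional


class ConditionalStore:
    """Toy key-value store with a compare-and-set style write (illustration only)."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}
        # In a distributed store this atomicity comes from consensus, not a local lock.
        self._lock = threading.Lock()

    def get(self, key: Hashable) -> Optional[Any]:
        with self._lock:
            return self._data.get(key)

    def put_if(self, key: Hashable, expected: Optional[Any], new_value: Any) -> bool:
        """Write new_value only if the current value equals expected.

        expected=None means "the key must not have a value yet".
        Returns True if the write was applied, False otherwise.
        """
        with self._lock:
            if self._data.get(key) != expected:
                return False
            self._data[key] = new_value
            return True
```

For example, with hypothetical state names, put_if("tx-1", None, "PREPARED") succeeds only for the first caller, and a later put_if("tx-1", "PREPARED", "COMMITTED") succeeds only if nobody changed the state in between.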

For example, a single CAS in Cassandra issues four writes to the storage disk in total: three to update the state of the consensus among multiple nodes and one for the written data itself. Because a transaction needs multiple CAS executions, it requires many accesses to the disk. When many transactions are requested, these accesses can degrade transaction performance. To resolve this issue, our engineering team added Group Commit Log to reduce the number of storage accesses in Cassandra.

Group Commit Log

Group Commit Log is a function that improves the throughput of write operations by reducing the number of storage accesses in Cassandra. It groups small storage accesses and persists multiple Commit Logs to the disk at once. We are proposing this function to the Cassandra community.

In a Cassandra write operation, the written data is first stored in the Memtable in memory. Usually, the data in the Memtable is flushed to a disk such as an HDD or SSD when it grows large. However, data in memory is lost if the power goes down unexpectedly or the process is killed. To avoid data loss, Cassandra persists a Commit Log, which is a log of write operations, to the disk. When a node of a Cassandra cluster restarts after an unexpected shutdown, Cassandra restores the Memtable contents from the persisted Commit Logs. You can choose either Periodic or Batch as the policy that determines when Commit Logs are persisted to the disk. Orb DLT chooses Batch to avoid data loss, for the reasons below.
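
The relationship between the Memtable and the Commit Log can be illustrated with a small Python sketch. It is a deliberately simplified model and not how Cassandra is implemented: the class TinyNode and the JSON-lines log format are assumptions made for this example. Every write is appended to a log file and synced to disk before the call returns, and a restarted node rebuilds its in-memory table by replaying that log.

```python
import json
import os


class TinyNode:
    """Toy node with an in-memory table and an append-only log (illustration only)."""

    def __init__(self, log_path: str) -> None:
        self.log_path = log_path
        self.memtable = {}
        self._replay()  # restore the in-memory state left by a previous run
        self._log = open(log_path, "a", encoding="utf-8")

    def _replay(self) -> None:
        # Rebuild the in-memory table from the persisted log, oldest entry first.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                entry = json.loads(line)
                self.memtable[entry["key"]] = entry["value"]

    def write(self, key: str, value: str) -> None:
        # Persist the log entry before acknowledging the write.
        self._log.write(json.dumps({"key": key, "value": value}) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self.memtable[key] = value

    def close(self) -> None:
        self._log.close()
```

Creating a TinyNode, writing a few keys, discarding the object, and creating a new TinyNode on the same path yields the same memtable contents, which mirrors the recovery after an unexpected shutdown. The fsync on every write is also the source of the many small storage accesses discussed in this post.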

The Periodic policy persists Commit Logs to the disk periodically. The interval is configurable and the default is 10 seconds. Under this policy, a write operation returns its response as soon as the data has been stored in the Memtable, without waiting for the Commit Log to be persisted. The data of a completed write operation can therefore be lost if all nodes that hold a replica of the data go down unexpectedly.

The Batch policy returns the response of a write operation only after its Commit Log has been persisted. The Commit Logs of all completed write operations are therefore always persisted, so no data is lost when an unexpected shutdown happens.

Any software for which data loss in Cassandra causes fatal errors, Apollo included, should choose the Batch policy. However, the Batch policy has to persist a Commit Log for every write operation, which causes many small storage accesses. As mentioned above, a single CAS persists four Commit Logs. Our investigation showed that multiple Commit Logs are persisted at once only when they happen to be ready at the moment another Commit Log is being persisted, and the requester of an operation cannot control that timing. When many CAS operations are requested and many Commit Logs are persisted, transaction performance is bounded by the IOPS of the disk or by the increased CPU load of the persisting process.

We proposed Group Commit Log to group multiple Commit Logs and persist them to the disk at once. Group Commit Log persists Commit Logs periodically, like the Periodic policy, but keeps each write operation waiting until its Commit Log has been persisted, like the Batch policy. In Batch, a write operation immediately releases the semaphore that the persisting thread waits on, so the thread starts persisting right away. Group Commit Log can be implemented by modifying only one line so that the semaphore is not released. The interval at which Commit Logs are persisted is also configurable, and we expect it to be set somewhere between a few milliseconds and about 10 milliseconds.
The diagram below shows the behavior of Group and Batch. It focuses on two threads, drawn as blue and orange lines. The blue line is a thread that keeps a write operation waiting until its Commit Log is persisted. The orange line is the thread that persists Commit Logs, and the dark orange blocks are the periods in which Commit Logs are being stored to the disk. On the right (Batch), a write request releases the semaphore immediately and the persisting thread starts to persist its Commit Log. Only the Commit Logs requested since the previous persist are persisted at that time. On the left (Group), a write request does not touch the semaphore and simply waits for the signal that its Commit Log has been persisted. Because the persisting thread runs periodically, all Commit Logs requested by write operations within one interval (group_window) are grouped and persisted to the disk at once.
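
The difference in grouping can also be seen in a toy discrete-time model, sketched here in Python. It is not Cassandra code and it does not reproduce our measurements: the function names, the arrival times, and the 2 ms persist duration are all assumptions made for illustration. In the Batch model the persisting thread starts as soon as it is idle and a write is pending; in the Group model it only starts at multiples of group_window_ms, and the model assumes one persist finishes within one window.

```python
from typing import List


def batch_sync_sizes(arrivals_ms: List[int], sync_ms: int) -> List[int]:
    """Batch model: a write wakes the persisting thread immediately.

    Each persist covers the writes that have arrived by the time it starts.
    Returns the number of Commit Logs covered by each persist.
    """
    pending = sorted(arrivals_ms)
    sizes = []
    i, free_at = 0, 0
    while i < len(pending):
        start = max(free_at, pending[i])  # wait for the thread and for a write
        j = i
        while j < len(pending) and pending[j] <= start:
            j += 1
        sizes.append(j - i)
        i, free_at = j, start + sync_ms
    return sizes


def group_sync_sizes(arrivals_ms: List[int], group_window_ms: int) -> List[int]:
    """Group model: the persisting thread runs once per window.

    Each persist covers every write that arrived since the previous tick.
    Assumes a persist completes within one window.
    """
    if group_window_ms <= 0:
        raise ValueError("group_window_ms must be positive")
    pending = sorted(arrivals_ms)
    sizes = []
    i, tick = 0, group_window_ms
    while i < len(pending):
        j = i
        while j < len(pending) and pending[j] <= tick:
            j += 1
        if j > i:  # nothing is persisted for an empty window
            sizes.append(j - i)
        i, tick = j, tick + group_window_ms
    return sizes
```

With one write arriving every millisecond for 10 ms, a 2 ms persist, and a 10 ms window, batch_sync_sizes(list(range(10)), 2) gives six persists of one or two Commit Logs each, while group_sync_sizes(list(range(10)), 10) gives a single persist of ten. The model only counts disk accesses; it says nothing about latency, which depends on how saturated the disk is.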

Evaluation

In our evaluation, we used a 3-node Cassandra cluster. Each node was an AWS EC2 m4.large instance with two EBS io1 volumes at 200 IOPS (standing in for HDDs) for Cassandra data and the Commit Log. We measured the latency and the number of Commit Logs persisted at once under a fixed throughput. Compared to Batch, Group persisted up to 1.5 times as many Commit Logs at once, and its average latency was 47% lower.

Summary

This post introduced Apollo Storage and Group Commit Log. Apollo Storage is part of Orb DLT and provides a simple and flexible interface that abstracts the storage layer so that multiple storage implementations can be used. Group Commit Log is one of the performance improvements made at Orb. We are trying to improve the performance of Cassandra, which is one of the storage implementations for Apollo Storage, and Group Commit Log can improve the latency of write operations. We are working with the Cassandra community to add Group Commit Log. As in other OSS communities, many people collaborate to improve Cassandra through active discussion and development. The community's great work helps us, and we will keep contributing to the community to improve Cassandra, as with the Group Commit Log proposal.
Recent posts by Yuji Ito