Orb DLT
Orb DLT is a platform software product. It consists of Apollo, which executes decentralized transactions; Core, which controls the behavior of coins via Apollo; and Toolbox, a suite of tools. You can check our earlier post for more information about Orb DLT. Apollo itself consists of Apollo Storage, which provides simple data management; Apollo Transaction, which executes decentralized transactions on the specified data; and Apollo Analytics, which executes analytics queries in a distributed manner. In this post, we mainly introduce Apollo Storage and its data management.
Apollo Storage
Apollo Storage is a storage engine with a multi-dimensional map structure. It provides simple CRUD (Create, Read, Update, Delete) interfaces.
Apollo Storage manages records identified by a primary key. You access a value by a row key (partition key) and a column name. If the underlying storage implementation can order records within a partition by another key (clustering key), the primary key becomes (row key, clustering key, column name) and you can take advantage of the sorted order of the records.
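To make the structure concrete, here is a minimal in-memory sketch of such a multi-dimensional map. The class and method names are illustrative only, not the real Apollo Storage API; it simply shows a value addressed by (row key, clustering key, column name), with iteration over a partition in clustering-key order.

```python
# Illustrative in-memory sketch of a multi-dimensional map keyed by
# (row key, clustering key, column name); not the real Apollo Storage API.
class MultiDimensionalMap:
    def __init__(self):
        # row key -> { clustering key -> { column name -> value } }
        self._rows = {}

    def put(self, row_key, clustering_key, column, value):
        self._rows.setdefault(row_key, {}) \
                  .setdefault(clustering_key, {})[column] = value

    def get(self, row_key, clustering_key, column):
        return self._rows.get(row_key, {}).get(clustering_key, {}).get(column)

    def scan(self, row_key):
        # Iterate the partition in clustering-key order, mimicking the
        # sorted order a clustering key provides.
        partition = self._rows.get(row_key, {})
        for ck in sorted(partition):
            yield ck, partition[ck]

    def delete(self, row_key, clustering_key, column):
        self._rows.get(row_key, {}).get(clustering_key, {}).pop(column, None)
```

With a real storage backend, `scan` is where the clustering key pays off: records inside a partition come back already sorted.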
Of course, the underlying implementation greatly affects the performance of Apollo Storage. If you use Cassandra as the implementation, the performance of both Apollo Storage and Apollo Transaction, which builds on it, depends on Cassandra. At Orb, we make our own performance improvements to Cassandra to speed up Apollo Storage and Apollo Transaction. Below, we explain one such performance issue and the improvement that resolves it.
Performance improvement of Apollo Storage
Issue
Apollo Transaction writes data through the conditional write interface of Apollo Storage as part of our distributed transaction protocol. This interface is heavier than the other interfaces in many storage implementations. In Cassandra in particular, it greatly affects transaction performance because it is executed by the CAS (Compare And Set) function, which is based on Paxos.
For example, one CAS operation in Cassandra requires 4 writes to the storage disk in total: three to update the state of consensus between nodes, and one for the write data itself. Because a transaction needs multiple CAS executions, many disk accesses are required. When many transactions are requested, these accesses can degrade transaction performance. To resolve this issue, our engineering team added Group Commit Log to reduce the number of Cassandra's storage accesses.
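The arithmetic behind this is simple. Assuming 4 Commit Log writes per CAS as described above (the per-transaction CAS count below is a hypothetical workload figure, chosen only for illustration):

```python
# Rough disk-write count under the assumption stated above:
# each CAS persists 4 Commit Logs (3 consensus-state writes + 1 data write).
COMMIT_LOGS_PER_CAS = 4

def commit_log_writes(transactions, cas_per_transaction):
    """Total Commit Log disk writes for a workload."""
    return transactions * cas_per_transaction * COMMIT_LOGS_PER_CAS

# Hypothetical workload: 1,000 transactions, each needing 3 CAS executions.
print(commit_log_writes(1000, 3))  # 1000 * 3 * 4 = 12000 small disk writes
```

Each of those writes is a small synchronous disk access, which is exactly what Group Commit Log sets out to batch.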
Group Commit Log
Group Commit Log is a function that improves write throughput by reducing the number of Cassandra's storage accesses. It groups small storage accesses and persists multiple Commit Logs to disk at once. We are proposing this function to the Cassandra community.
In a Cassandra write operation, the data is first stored in the Memtable in memory. Usually, the data in the Memtable is flushed to disk (HDD or SSD) when the Memtable grows large. However, data in memory is lost if the power fails unexpectedly or the process is killed. To avoid data loss, Cassandra persists a Commit Log, a log of each write operation, to disk. When a node of a Cassandra cluster restarts after an unexpected shutdown, Cassandra restores the Memtable data from these persisted Commit Logs. You can choose either Periodic or Batch as the policy that determines when Commit Logs are persisted. Orb DLT chooses Batch to avoid data loss, as described below.
The Periodic policy persists Commit Logs to disk periodically. The interval is configurable and defaults to 10 seconds. This policy returns a response to a write operation as soon as the data is stored in the Memtable, without waiting for the Commit Log to be persisted. Data from a completed write operation might therefore be lost if all nodes holding a replica of the data go down unexpectedly.
The Batch policy returns a response to a write operation only after its Commit Log is persisted. The Commit Logs of all completed write operations are always persisted, so no data is lost when an unexpected shutdown happens.
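The two policies correspond to settings in cassandra.yaml along these lines (the key names below are from the Cassandra versions we are familiar with; check the cassandra.yaml of your own version):

```yaml
# Periodic: ack writes once they reach the Memtable; fsync Commit Logs
# on a fixed interval (default 10 seconds).
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000

# Batch: ack writes only after their Commit Log is fsynced to disk.
# commitlog_sync: batch
```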
Any software, including Apollo, for which Cassandra data loss causes fatal errors should choose the Batch policy. But the Batch policy must persist a Commit Log per write operation, and this causes many small storage accesses. As noted above, 4 Commit Logs are persisted per CAS. Our investigation showed that multiple Commit Logs are persisted at once only when some Commit Logs happen to become ready while another Commit Log is being persisted; a requester cannot control the timing of Commit Log persistence. When many CASs are requested and many Commit Logs are persisted, transaction performance is bounded by the IOPS of the disk or by the increased CPU load of the persisting process.
We proposed Group Commit Log, which groups multiple Commit Logs and persists them to disk at once. Group Commit Log persists Commit Logs periodically, like the Periodic policy, but keeps write operations waiting until their Commit Log has been persisted. Batch releases a semaphore that wakes the Commit Log persisting thread immediately; we can implement Group Commit Log by modifying only one line so that the semaphore is kept held. The periodic persistence interval is also configurable, and we suggest setting it somewhere between a few milliseconds and about 10 milliseconds.
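The idea can be sketched outside Cassandra as follows. This is a toy model of the grouping behavior, not the actual Cassandra patch: writer threads enqueue their Commit Log record and block, and a flusher (which in the real proposal runs every few milliseconds) persists everything pending with a single disk sync and then wakes all the waiting writers.

```python
import threading

class GroupCommitLog:
    """Toy model of group commit: many writers, one disk sync per flush.
    An illustration of the idea, not Cassandra's implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = []        # (record, done-event) pairs awaiting a sync
        self.sync_count = 0       # how many times we "hit the disk"

    def append(self, record):
        """Called by a writer thread; blocks until the record is persisted."""
        done = threading.Event()
        with self._lock:
            self._pending.append((record, done))
        done.wait()               # wait for the group flush, not a per-write sync

    def flush(self):
        """Called periodically: persist all pending records with a single
        disk sync, then release every writer waiting on this batch."""
        with self._lock:
            batch, self._pending = self._pending, []
        if batch:
            self.sync_count += 1  # one fsync covers the whole batch
            for _, done in batch:
                done.set()
```

Under the Batch policy each of N concurrent writers would cost roughly one sync; here, all writers that arrive within the same flush window share a single sync, which is where the IOPS saving comes from.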