Hash trees (Merkle Trees)
Ever wondered how GitHub easily and accurately determines what changes have been made to each file, or how distributed database systems like Cassandra ensure proper replication of data across replicas?
The short answer to the above question is — they use Merkle trees. The long answer is well — that’s what this article is for!
Until about a decade ago, sharing content over the web wasn't as easy as it is now. To put things in perspective, if I had to share my resume publicly, I basically had to spin up a server, get a domain, host my content there, and then share the address with the people I wanted to reach. At present, this task is as easy as making a post on Medium (like this one!).
However, Medium still needs to store my content on its servers, and every time someone requests it, Medium fetches it from its store and presents it to the requester. Sounds simple enough! But internally, there are challenges.
If the server my content lives on is located in New York, I am based out of Hyderabad, and the reader is based out of Melbourne, then to update my content the request has to travel halfway around the globe to reach the servers in New York, and for every read request the content again has to be sent to a reader who is thousands of miles away. This also places a lot of responsibility on the servers in New York: if they fail, we don't have an alternative.
A reasonable solution to most of the challenges posed by these highly concentrated server architectures is "decentralization", which literally means distributing your load across multiple resources. In simple terms, instead of having just one server in New York, Medium could have one server each in Hyderabad, Melbourne and New York. When I update my content, it first gets updated on the server in Hyderabad, and that update is then synchronized internally to the servers in Melbourne and New York. Now if my content is requested from, say, Sydney, which is much closer to Melbourne than to the other two server locations, the read request would be served from Melbourne itself.
Another approach is to split my content into multiple parts and store those parts on multiple nearby servers. This way, when the content is requested, instead of fetching all of it at once from a single server, it is retrieved in pieces from multiple servers and stitched back together to recreate the original view. This is especially helpful when the content files are very large, and it actually forms the basis of how Hadoop works.
When we split our files into multiple pieces, another problem inherently comes up: how do we know that the content we are reading is not corrupt? To ensure this, we can use hashes of the file content. Each chunk of the file is assigned a hash computed from the content within that chunk, and while reading we verify the chunk by recalculating its content-based hash. These content-based hashes can also serve as content-based addresses, which can then be sent to the client. But we must understand that even for this verification to happen, we need a "trusted" server that stores the hashes of all the individual chunks of data. Any change in the content of a chunk also alters the hash of that chunk, which is then communicated to the trusted server, and the trusted server in turn communicates this hash to the peer that requests the content.

This sounds like a reasonably good solution, until it fails! Think of a scenario where your "trusted" server itself gets compromised or corrupted. In that case, not only do we lose the ability to verify any of the chunks, we also realize that even though we solved some of the major problems with centralized servers, there is still an element of centralized processing involved, and that comes with its own drawbacks.
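To make the chunk-hashing idea concrete, here is a minimal Python sketch; the chunk size, function names and the choice of SHA-256 are illustrative assumptions, not any particular system's API:

```python
import hashlib

CHUNK_SIZE = 1024  # hypothetical chunk size in bytes

def chunk_hashes(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Split the content into fixed-size chunks and return a SHA-256 hash per chunk."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [hashlib.sha256(chunk).hexdigest() for chunk in chunks]

def verify_chunk(chunk: bytes, expected_hash: str) -> bool:
    """Recompute a chunk's hash and compare it with the hash the trusted server handed out."""
    return hashlib.sha256(chunk).hexdigest() == expected_hash
```

Because each hash is derived purely from the chunk's content, the same string can double as a content-based address: a peer that fetches a chunk by its hash can immediately tell whether what it received matches what it asked for.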
The solution to the problem we just saw is chained hashing: the root hash of the file is a function of the content-based hashes of all the individual chunks. Now when a client requests content, the trusted server returns the full tree of hashes (the hash tree) with the individual chunk hashes and the root hash. The client can then verify whether the content at a given address has been tampered with by recalculating the hash of each chunk and the root hash, and comparing them against the tree it received.
The below diagram illustrates the structure of a hash tree:
As is evident from the diagram above, every parent node stores a hash computed from the data in its child nodes, and the leaf nodes store the hashes of the atomic chunks of data.
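Here is a short sketch of that idea in Python, assuming SHA-256 and the common convention of duplicating the last hash when a level has an odd number of nodes (real systems vary on both counts):

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(chunk_hashes: list[str]) -> str:
    """Combine chunk hashes pairwise, level by level, until a single root hash remains."""
    if not chunk_hashes:
        return sha256_hex(b"")
    level = chunk_hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]          # duplicate the last hash on odd levels
        level = [sha256_hex((left + right).encode())
                 for left, right in zip(level[0::2], level[1::2])]
    return level[0]
```

Changing a single byte in any chunk changes that chunk's hash, which changes every hash on the path up to the root, so comparing root hashes alone is enough to tell two versions of the content apart.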
This solution forms the basis of how GitHub identifies changes made to files. Let's imagine we have 4 files in a directory and 1 of them is edited in a commit. To identify the changes, GitHub first checks the data in the root node to see if there are any changes at all. If the root hash calculated for our commit differs from the root hash of the state before the commit, GitHub knows for sure that something changed in that commit. It then drills down, comparing the hash values at each level, until it reaches the atomic data chunks (files in this case, though they could also be chunks of a file) and identifies what change has been made to which file.
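Below is a hedged sketch of that drill-down, reusing the same bottom-up tree construction; the `Node` class and helper names are made up for illustration and are not Git's or GitHub's actual internals:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class Node:
    def __init__(self, hash_: str, children=None, index=None):
        self.hash = hash_
        self.children = children or []   # empty for leaf nodes
        self.index = index               # position of the chunk/file, None for internal nodes

def build_tree(chunks: list[bytes]) -> Node:
    """Build the hash tree bottom-up: leaves hash the chunks, parents hash their children."""
    level = [Node(sha256_hex(c), index=i) for i, c in enumerate(chunks)]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])      # duplicate the last node on odd levels
        level = [Node(sha256_hex((left.hash + right.hash).encode()), children=[left, right])
                 for left, right in zip(level[0::2], level[1::2])]
    return level[0]

def changed_leaves(old: Node, new: Node) -> list[int]:
    """Descend only into subtrees whose hashes differ and collect the indices of changed leaves."""
    if old.hash == new.hash:
        return []                        # identical subtree: skip it entirely
    if not old.children and not new.children:
        return [new.index]               # a differing leaf: this chunk/file changed
    changed = []
    for o, n in zip(old.children, new.children):
        changed += changed_leaves(o, n)
    return changed

# Four files, one of which is edited in the "commit"
old_tree = build_tree([b"file1", b"file2", b"file3", b"file4"])
new_tree = build_tree([b"file1", b"file2 (edited)", b"file3", b"file4"])
print(sorted(set(changed_leaves(old_tree, new_tree))))   # -> [1]
```

The same comparison is also what makes the replica synchronization described next cheap: two parties only need to exchange the subtrees whose hashes disagree.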
Hash trees are also used to ensure consistency between replicas of a distributed database system like Apache Cassandra. If a write happens on one replica, Cassandra uses hash trees to identify which chunk of data got updated and then transmits only that chunk to the other replicas to keep them consistent. This is far more efficient than transmitting the entire dataset every time an update happens.
There is another very popular use case that we have been hearing a lot about for the last 3–4 years. Heard of blockchain? I'll leave that one for readers to explore themselves!
Originally published at https://www.linkedin.com.