Data Sharding

Table of Contents

Intro
What is Database Sharding?
Pros and Cons
Sharding Strategies
When Should I Shard?

Intro

Any application or website that sees significant growth will eventually need to scale to accommodate increases in traffic. Imagine you created a product that has finally started getting some traction from your active customers. It keeps gaining features and generates more data every day. For data-driven applications and websites, it is critical that scaling is done in a way that preserves the security and integrity of the data. When your database becomes so large that reading from it takes noticeably longer, the database is becoming a bottleneck for your application. That is where database sharding comes into play.

What is Database Sharding?

Database sharding is the process of partitioning the data in a database or search engine into distinct chunks, called logical shards or just shards. Sharding is a common scalability strategy when designing server-side systems. Server-side architectures use techniques like sharding to make a system more scalable, reliable, and performant.

Using this technique, we split a large dataset into smaller shards and store these chunks in different databases or nodes, referred to as physical shards. Each shard has the same database schema as the original database. The shards are autonomous: they don’t share any of the same data or computing resources, which makes sharding an example of a shared-nothing architecture. In some cases, though, it may make sense to replicate certain tables into each shard to serve as reference tables. For example, let’s say there’s a database for an application that depends on fixed conversion rates for weight measurements. Replicating a table containing the necessary conversion rate data into each shard helps ensure that all of the data required for queries is held in every shard.
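
The reference-table idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Shard class, the CONVERSION_TO_KG table, and the sample rows are all hypothetical, and a real database would replicate the table with its own tooling.

```python
# Hypothetical reference table: fixed conversion rates from a unit to kilograms.
CONVERSION_TO_KG = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}


class Shard:
    """One physical shard: its own rows plus a private copy of the reference table."""

    def __init__(self, name, reference):
        self.name = name
        self.rows = {}                     # item_id -> (weight, unit)
        self.reference = dict(reference)   # replicated copy, not shared between shards

    def weight_in_kg(self, item_id):
        weight, unit = self.rows[item_id]
        # The conversion is resolved locally, without contacting another shard.
        return weight * self.reference[unit]


shards = [Shard("shard_a", CONVERSION_TO_KG), Shard("shard_b", CONVERSION_TO_KG)]
shards[0].rows[1] = (500, "g")
shards[1].rows[2] = (2, "kg")

print(shards[0].weight_in_kg(1))
print(shards[1].weight_in_kg(2))
```

Each shard stores different rows, yet both can answer a conversion query on their own because the small, rarely changing table is copied into each of them.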

The data is distributed in such a way that each row appears in exactly one shard. Sharding lets each query touch less data. For example, when the application receives a request, it knows which shard to route the request to, so it only has to look through that shard's data rather than the whole database. This makes your application more performant and lets you worry less about scalability.

Let's understand this concept with an example.

Here we have partitioned the data across different servers, each of which takes a share of the load from the requests our customers send to the database. To do that, we assign each server a range of request IDs beforehand, so that each server handles only the IDs within its assigned range. The data is partitioned and indexed along a key ID. This kind of partitioning, which uses an index or key to break the data into pieces and allocate them to different servers, is known as horizontal partitioning.

Horizontal sharding: When each new table has the same schema but unique rows, it is known as horizontal sharding. In this type of sharding, more machines are added to an existing stack to spread out the load, increase processing speed, and support more traffic. This method is most effective when queries return a subset of rows that are often grouped together. Horizontal partitioning depends on one key, an attribute of the data you're storing, to partition the data. In contrast, vertical partitioning splits the data by columns.

Vertical sharding: When each new table has a schema that is a proper subset of the original table's schema, it is known as vertical sharding. It is effective when queries usually return only a subset of the columns.
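
To make the difference concrete, here is a minimal Python sketch that splits the same small table both ways. The function names, the users table, and the modulo rule for choosing a shard are assumptions made for illustration, not part of any particular database.

```python
def horizontal_split(rows, num_shards, key="id"):
    """Same schema in every shard; each row lands in exactly one shard."""
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[row[key] % num_shards].append(row)
    return shards


def vertical_split(rows, column_groups, key="id"):
    """Each part keeps every row, but only a subset of columns plus the key."""
    return [
        [{col: row[col] for col in (key, *cols)} for row in rows]
        for cols in column_groups
    ]


# Hypothetical sample table.
users = [
    {"id": 1, "name": "Ana", "email": "ana@example.com", "bio": "Runner"},
    {"id": 2, "name": "Ben", "email": "ben@example.com", "bio": "Chef"},
    {"id": 3, "name": "Cy", "email": "cy@example.com", "bio": "Pilot"},
    {"id": 4, "name": "Di", "email": "di@example.com", "bio": "Poet"},
]

by_rows = horizontal_split(users, num_shards=2)
by_cols = vertical_split(users, column_groups=[("name", "email"), ("bio",)])

print([row["id"] for row in by_rows[0]])   # ids stored in the first horizontal shard
print(sorted(by_cols[1][0]))               # columns kept by the second vertical part
```

The horizontal shards together contain every row exactly once, while each vertical part contains every row but only some of its columns, with the key repeated so the parts can be joined back together.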

Warning: After a database is sharded, the data in the new tables is spread across multiple systems, but with partitioning, that is not the case. Partitioning groups data subsets within a single database instance.

It can be helpful to think of horizontal partitioning in terms of how it relates to vertical partitioning. In a vertically-partitioned table, entire columns are separated out and put into new, distinct tables. The data held within one vertical partition is independent from the data in all the others, and each holds both distinct rows and columns. The following diagram illustrates how a table could be partitioned both horizontally and vertically:

Note: Please keep in mind that the servers we're talking about here are database servers.

You can contrast them with application servers and platform servers, which do deal with data but try to be as stateless as possible. For a database, consistency is really important: it is a key attribute of any database that the data you persist in it is what you read out of it later. There should be some form of synchronization so that once a user makes an update, subsequent requests read that update.

We should also consider the availability of the database, meaning that the database should not crash and stay down. As an enterprise, we want our database to be up and running. In most cases, though, consistency trumps availability.

There are a few things to think about, such as what you should shard your data on. We could shard on userID. For applications that work with geolocation data, like Tinder or Uber, or for companies solving travelling-salesman-style routing problems, like the grocery and food delivery business Zepto, we can shard on location instead. Then, if a query asks for all the users or warehouses near city_X, city_X falls in a specific shard and the query is routed directly to it. Each shard is smaller and easier to maintain, which makes this a good fit for the problem.

Pros and Cons

Sharding is a horizontal scaling strategy. This approach is generally more robust and effective than vertical scaling. Vertical scaling is much easier to execute, since it consists only of hardware upgrades, and it might be the right approach for a mid-sized company's database that is slowly reaching its limit. However, a single machine can only be upgraded so far. Sharding has both upsides and downsides. The upsides include:

It improves performance and speeds up information retrieval. Based on the sharding key, the database system immediately knows which shard contains the data and can quickly route the query to the right server.
It can be more cost-effective to run multiple servers rather than one big monolith.
Sharding can simplify upgrades, allowing one server to be upgraded at a time.
A sharded approach is more resilient. If one of the servers is offline, the remaining shards are still accessible. Sharding can be combined with high availability techniques for even higher reliability.

Unfortunately, sharding also has drawbacks. Some of the downsides include:

Sharding greatly increases the complexity of a software development project. Additional logic is required to shard the database and properly direct queries to the correct shard. This increases development time and cost. A more elaborate network mesh is often necessary, which leads to an increase in lab and infrastructure costs.
Sharding requires a lot of tuning and tweaking as the database grows. This sometimes requires a reconsideration of the entire sharding strategy and database design. Uneven shard distribution can happen even with proper planning, causing the distribution to unexpectedly become lopsided.
It is not always obvious how many shards and servers to use, or how to choose the sharding key. Poor sharding keys can adversely affect performance or data distribution. This causes some shards to be overloaded while others are almost empty, leading to hotspots and inefficiencies.
It is more challenging to change the database schema after sharding is implemented. It is also difficult to convert the database back to its pre-sharded state.
Sharding increases the complexity of a database and increases the difficulty of joins and schema changes.
Backups are more difficult with a sharded database.
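
The effect of a poor sharding key on data distribution can be seen with a small, synthetic experiment. The sketch below is an assumption-laden illustration: the user records, the 70/20/10 country split, and the COUNTRY_TO_SHARD mapping are invented for the example.

```python
from collections import Counter

NUM_SHARDS = 4


def country_for(i):
    """Synthetic skew: 70% of users in US, 20% in IN, 10% in BR."""
    if i % 10 < 7:
        return "US"
    if i % 10 < 9:
        return "IN"
    return "BR"


users = [{"user_id": i, "country": country_for(i)} for i in range(1000)]

# Low-cardinality, skewed key: only three values, so the fourth shard stays empty.
COUNTRY_TO_SHARD = {"US": 0, "IN": 1, "BR": 2}
by_country = Counter(COUNTRY_TO_SHARD[u["country"]] for u in users)

# High-cardinality key: sequential ids spread evenly with a simple modulo.
by_user_id = Counter(u["user_id"] % NUM_SHARDS for u in users)

for shard in range(NUM_SHARDS):
    print(shard, by_country[shard], by_user_id[shard])
```

Sharding on country puts most rows on one shard and leaves another unused, while sharding on the user ID spreads the same rows evenly. Real data is rarely this tidy, but the pattern is the one described above: a hotspot on some shards and idle capacity on others.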

Sharding Strategies

When creating a database, one should keep in mind how many shards to use and how to distribute the data across the various servers. The designer decides which queries to optimize and how to handle joins and bulk data retrieval. A system in which the data changes frequently requires a different architecture than one that mainly handles read requests. Reliability, replication, and maintenance strategy are also important. The choice of sharding architecture is a critical decision, since it affects many other considerations. Some common architectures are:

Range Sharding
Hash Sharding
Directory-Based Sharding
Geographic-Based Sharding

Range Sharding

Range sharding examines the value of the sharding key and determines which range it falls into. Each range maps directly to a different shard. The sharding key should ideally be immutable. If the key changes, the shard must be recalculated and the record copied to the new shard. Otherwise, the mapping is destroyed and the record's location can be lost. Range sharding is also known as dynamic sharding.

As an example, if the userID field is the sharding key, then records having IDs between 1 and 10000 could be stored in one shard. IDs between 10001 and 20000 map to a second shard, and those between 20001 and 30000 to a third.

This approach is easy to implement and less programming time is required. The database application only has to compare the value of the sharding key to the predefined ranges using a lookup table. Range sharding is a good choice if records with similar keys are frequently viewed together.
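
The userID ranges from the example can be turned into a small lookup routine. The following Python sketch assumes the three ranges given above; the shard names and the shard_for_user function are hypothetical, and the standard-library bisect module stands in for the lookup table.

```python
import bisect

# Inclusive upper bound of each range, in ascending order, and the shard it maps to.
UPPER_BOUNDS = [10000, 20000, 30000]
SHARD_NAMES = ["shard_1", "shard_2", "shard_3"]


def shard_for_user(user_id):
    """Return the shard whose range contains user_id."""
    if user_id < 1 or user_id > UPPER_BOUNDS[-1]:
        raise ValueError(f"userID {user_id} is outside every configured range")
    # bisect_left finds the first upper bound that is >= user_id.
    return SHARD_NAMES[bisect.bisect_left(UPPER_BOUNDS, user_id)]


for user_id in (1, 10000, 10001, 25000):
    print(user_id, shard_for_user(user_id))
```

Because the bounds are sorted, neighbouring IDs resolve to the same shard, which is why range sharding suits queries that read records with similar keys together.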

Hash Sharding

This type of sharding is also known as algorithmic sharding or key-based sharding. It maps the shard key directly to a shard by applying a hash function to the shard key. The hash function transforms one or more data points into a value that lies in a fixed range. Here, the size of the range and the number of shards are equal. The database uses the output of the hash function to allocate the record to a shard. This, in turn, results in a more even distribution of records across the shards.

More complex hashing algorithms apply mathematically advanced equations to multiple inputs. However, it is important to use the same hash function on the same keys for each hashing operation. As with range sharding, the key value should be immutable. If it changes, the hash value must be recalculated and the database entry remapped.
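
A minimal sketch of hash-based routing in Python might look like the following. The choice of SHA-256, the shard_for_key name, and the sample keys are assumptions for illustration. A cryptographic digest is used here because Python's built-in hash() is salted per process for strings, so it would not return the same value for the same key across runs.

```python
import hashlib
from collections import Counter


def shard_for_key(key, num_shards):
    """Map a shard key to a shard number in the range 0..num_shards-1."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce to the shard range.
    return int.from_bytes(digest[:8], "big") % num_shards


# The same key always maps to the same shard.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))

# Many keys spread roughly evenly across the shards.
counts = Counter(shard_for_key(f"user-{i}", 4) for i in range(10_000))
print(sorted(counts.items()))
```

The exact counts depend on the keys, but with a well-mixed hash each shard should receive close to a quarter of them.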

Hash sharding is more efficient than range sharding because a lookup table isn't required. The hash is computed in real time for each query. Hash sharding is most advantageous for applications that read or write one record at a time. The downside is that it complicates rebalancing and rebuilding the shards: if you want to add more shards, it becomes necessary to re-merge all the data, recalculate the hashes, and reassign all the records.
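
The rebalancing cost can be estimated with a short experiment. This sketch assumes a simple hash-modulo scheme (SHA-256 reduced by the number of shards); the function names and keys are hypothetical, and other schemes may behave differently.

```python
import hashlib


def shard_for_key(key, num_shards):
    """Hash-modulo placement: the shard depends on the total number of shards."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def fraction_moved(keys, old_shards, new_shards):
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for key in keys
        if shard_for_key(key, old_shards) != shard_for_key(key, new_shards)
    )
    return moved / len(keys)


keys = [f"user-{i}" for i in range(10_000)]
print(f"{fraction_moved(keys, 4, 5):.0%} of keys change shard when going from 4 to 5 shards")
```

With a uniform hash, a key keeps its shard only when its hash has the same remainder modulo 4 and modulo 5, which happens for about one key in five. Adding a single shard therefore relocates most of the data under this scheme.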

Directory-Based Sharding

Directory-based sharding groups related items together on the same shard. This is also known as entity-based sharding. In this method, we create and maintain a lookup service or lookup table for the original database. The lookup table is keyed by the shard key and maps each entity in the database to a shard. This way we keep track of which database shards hold which data.

The lookup table holds a static set of information about where specific data can be found. In the above image, you can see that we have used the delivery zone as a shard key. First, the client application queries the lookup service to find out the shard (database partition) on which the data is placed. Once the lookup service returns the shard, the client queries or updates that shard.

Directory-based sharding is much more flexible than range-based and key-based sharding. In range-based sharding, you’re bound to specify the ranges of values. In key-based sharding, you are bound to use a fixed hash function, which is difficult to change later. In this approach, you’re free to use any algorithm you want to assign data entries to shards. It’s also easy to add shards dynamically.

The major drawback of this approach is that the lookup table is a single point of failure. If it becomes corrupted or fails, it affects both writing new data and accessing existing data.
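
A lookup service of this kind can be sketched as a small in-memory directory. Everything in the example is hypothetical: the ShardDirectory class, the zone and shard names, and the save_order helper. A production lookup service would itself need to be persisted and made highly available, for the reason given above.

```python
class ShardDirectory:
    """Lookup service: maps a shard key (here, a delivery zone) to a shard name."""

    def __init__(self):
        self._zone_to_shard = {}

    def assign(self, zone, shard_name):
        """Register or change the shard that holds a zone's data."""
        self._zone_to_shard[zone] = shard_name

    def lookup(self, zone):
        try:
            return self._zone_to_shard[zone]
        except KeyError:
            raise LookupError(f"no shard registered for zone {zone!r}") from None


directory = ShardDirectory()
directory.assign("north", "shard_a")
directory.assign("south", "shard_b")

# Each shard is modelled as a plain dict of order_id -> order.
shards = {"shard_a": {}, "shard_b": {}, "shard_c": {}}


def save_order(zone, order_id, order):
    """Ask the directory where the zone lives, then write to that shard only."""
    shard_name = directory.lookup(zone)
    shards[shard_name][order_id] = order
    return shard_name


print(save_order("north", 101, {"item": "milk"}))

# Adding a shard for a new zone only requires a new directory entry.
directory.assign("east", "shard_c")
print(save_order("east", 102, {"item": "bread"}))
```

Every read and write goes through directory.lookup first, which shows both the flexibility of the approach and why the directory becomes the critical dependency.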

This architecture is very helpful if the shard key can only be assigned a small number of possible values. Unfortunately, it is highly prone to clustering and imbalanced tables, and the overhead of accessing the lookup table degrades performance.

Geographic-Based Sharding

Geographic-based sharding, or Geo-sharding, is a specific type of directory-based sharding. Data is divided amongst the shards based on the location of the entry, which relates to the location of the server hosting the shard. The sharding key is typically a city, state, region, country, or continent. This groups geographically similar data on the same shard. It works the same way directory-based sharding does.

A good example of geo-sharding relates to geographically dispersed customer data. The customer’s home state is used as a sharding key. The lookup table maps customers living in states in the same sales region to the same shard. Each shard is located on a server located in the same region as the customer data it contains. This makes it very quick and efficient for a regional sales team to access customer data.
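
The customer example can be expressed as a two-step lookup: home state to sales region, then region to shard. The states, regions, and shard names in this Python sketch are invented for illustration.

```python
# Hypothetical directory: home state -> sales region -> shard hosted in that region.
STATE_TO_REGION = {"CA": "west", "OR": "west", "NY": "east", "MA": "east"}
REGION_TO_SHARD = {"west": "db-us-west", "east": "db-us-east"}


def shard_for_customer(home_state):
    """Customers in the same sales region share a shard located in that region."""
    region = STATE_TO_REGION[home_state]
    return REGION_TO_SHARD[region]


customers = [("Ana", "CA"), ("Ben", "NY"), ("Cy", "OR"), ("Di", "MA")]

placement = {}
for name, state in customers:
    placement.setdefault(shard_for_customer(state), []).append(name)

print(placement)
```

A regional sales team then only needs to query the shard for its own region.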

When Should I Shard?

Whether or not one should implement a sharded database is always a matter of debate. Sharding is usually only performed when dealing with very large amounts of data. It is beneficial to shard a database when:

The amount of data exceeds the storage capacity of a single database node.
The volume of writes and reads to a database surpasses what a single node or its read replicas can handle, resulting in slow response times.
The network bandwidth required by the application outpaces the bandwidth available to a single database node, resulting in slow response times.

Before sharding, one should exhaust all other options for optimizing the database. Here are some tips:

Move the database to its own machine. If you are working with a monolith in which all of its components reside on the same server, moving the database over to its own machine allows you to scale it vertically, independently of the rest of your infrastructure.
Implement caching. If your application's read performance is causing the trouble, caching can help remove this roadblock.
Create one or more read replicas. Another strategy that can help improve read performance, this involves copying the data from one database server (the primary server) over to one or more secondary servers. Following this, every new write goes to the primary before being copied over to the secondaries, while reads are made exclusively to the secondary servers. Distributing reads and writes like this keeps any one machine from taking on too much of the load, helping to prevent slowdowns and crashes. Note that creating read replicas involves more computing resources and thus costs more money, which could be a significant constraint for some.
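
The read replica setup described in the last tip can be sketched as a tiny router. This is a simplified model under stated assumptions: the ReplicatedDatabase class is hypothetical, the stores are plain dicts, and writes are copied to the secondaries synchronously, whereas real replication is often asynchronous.

```python
import itertools


class ReplicatedDatabase:
    """Writes go to the primary and are copied out; reads rotate over the secondaries."""

    def __init__(self, num_replicas):
        self.primary = {}
        self.replicas = [{} for _ in range(num_replicas)]
        self.reads_served = [0] * num_replicas
        self._next_replica = itertools.cycle(range(num_replicas))

    def write(self, key, value):
        self.primary[key] = value
        # Simplification: copy immediately so replicas never lag behind.
        for replica in self.replicas:
            replica[key] = value

    def read(self, key):
        index = next(self._next_replica)   # round-robin over the secondaries
        self.reads_served[index] += 1
        return self.replicas[index].get(key)


db = ReplicatedDatabase(num_replicas=2)
db.write("user:1", "Ana")

for _ in range(4):
    print(db.read("user:1"))

print(db.reads_served)
```

The primary handles every write, while the read load is shared between the secondaries, which is the load distribution the tip describes.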

Please keep in mind that if your application or website grows past a certain point, none of these strategies will be enough to improve performance on their own. In such cases, sharding may indeed be the best option for you.