Shard (database architecture)

A database shard is a horizontal partition of data in a database or search engine. Each individual partition is referred to as a shard or database shard. Each shard is held on a separate database server instance, to spread load.
Some data within a database remains present in all shards, but some appears only in a single shard. Each shard acts as the single source for this subset of data.

Database architecture

Horizontal partitioning is a database design principle whereby rows of a database table are held separately, rather than being split into columns. Each partition forms part of a shard, which may in turn be located on a separate database server or physical location.
There are numerous advantages to the horizontal partitioning approach. Since the tables are divided and distributed into multiple servers, the total number of rows in each table in each database is reduced. This reduces index size, which generally improves search performance. A database shard can be placed on separate hardware, and multiple shards can be placed on multiple machines. This enables a distribution of the database over a large number of machines, greatly improving performance. In addition, if the database shard is based on some real-world segmentation of the data then it may be possible to infer the appropriate shard membership easily and automatically, and query only the relevant shard.
Disadvantages include:
Main section: Disadvantages

A heavier reliance on the interconnection between servers
Increased latency when querying, especially where more than one shard must be searched.
Data or indexes are often only sharded one way, so that some searches are optimal, and others are slow or impossible.
Issues of consistency and durability due to the more complex failure modes of a set of servers, which often result in systems making no guarantees about cross-shard consistency or durability.

In practice, sharding is complex. Although it has been done for a long time by hand-coding, this is often inflexible. There is a desire to support sharding automatically, both in terms of adding code support for it, and for identifying candidates to be sharded separately. Consistent hashing is a technique used in sharding to spread large loads across multiple smaller services and servers.
Where distributed computing is used to separate load between multiple servers, a shard approach may also be useful.

Shards compared to horizontal partitioning

splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size provided that there is some obvious, robust, implicit way to identify in which partition a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers, not just multiple indexes on the same logical server.
Splitting shards across multiple isolated instances requires more than simple horizontal partitioning. The hoped-for gains in efficiency would be lost, if querying the database required multiple instances to be queried, just to retrieve a simple dimension table. Beyond partitioning, sharding thus splits large partitionable tables across the servers, while smaller tables are replicated as complete units.
This is also why sharding is related to a shared nothing architecture—once sharded, each shard can live in a totally separate logical schema instance / physical database server / data center / continent. There is no ongoing need to retain shared access to the other unpartitioned tables in other shards.
This makes replication across multiple servers easy. It is also useful for worldwide distribution of applications, where communications links between data centers would otherwise be a bottleneck.
There is also a requirement for some notification and replication mechanism between schema instances, so that the unpartitioned tables remain as closely synchronized as the application demands. This is a complex choice in the architecture of sharded systems: approaches range from making these effectively read-only, to dynamically replicated tables and many options in between.

Notable implementations

Altibase: provides combined sharding architecture transparent to client applications.
Apache HBase: HBase provides automatic sharding.
Azure SQL Database Elastic Database tools: enables the data-tier of an application to scale out and in via industry-standard sharding practices.
ClickHouse: ClickHouse is a fast open-source OLAP database management system and includes sharding.
Couchbase: provides automatic transparent sharding as well as extreme performance.
CUBRID: allows sharding from version 9.0
DRDS of Alibaba Cloud: enables database/table sharding, and supports Singles' Day.
Elasticsearch: enterprise search server provides sharding capabilities.
eXtreme Scale: a cross-process in-memory key/value datastore. It uses sharding to achieve scalability across processes for both data and MapReduce-style parallel processing.
Hibernate Shards: provides for shards, although there has been little activity since 2007.
IBM Informix: IBM has allowed sharding in Informix since version 12.1 xC1 as part of the MACH11 technology. Informix 12.10 xC2 added full compatibility with MongoDB drivers, allowing the mix of regular relational tables with NoSQL collections, while still allowing sharding, failover and ACID properties.
Kdb+: allows sharding from version 2.0.
MonetDB: the open-source column-store MonetDB allows read-only sharding as its July 2015 release.
MongoDB: allows sharding from version 1.6.
MySQL Cluster: Auto-Sharding: Database is automatically and transparently partitioned across low-cost commodity nodes, allowing scale-out of read and write queries, without requiring changes to the application.
MySQL Fabric includes sharding capability.
Oracle Sharding: introduced as new feature in Oracle Database 12c Release 2 and in one liner: Combination of sharding advantages with well-known capabilities of enterprise ready multi-model Oracle Database.
Oracle NoSQL Database: has automatic sharding and elastic, online expansion of the cluster.
OrientDB: allows sharding from version 1.7
Solr enterprise search server: provides sharding capabilities.
Spanner: Google's global-scale distributed database, shards data across multiple Paxos state machines to scale to "millions of machines across hundreds of datacenters and trillions of database rows".
SQLAlchemy ORM: a data-mapper for the Python programming language that provides sharding capabilities.
The DWH of Teradata: a massive parallel database.
Vault: A cryptocurrency that drastically reduces the data that users need to join the network and verify transactions. This translates to a much more scalable network. Designed by MIT researchers. Sharding is central to its functioning.
Vitess: An open-source database clustering system that can horizontally scale MySQL. Vitess is a Cloud Native Computing Foundation project.
Disadvantages

Sharding a database table before it has been optimized locally causes premature complexity. Sharding should be used only when all other options for optimization are inadequate. The introduced complexity of database sharding causes the following potential problems:

Increased complexity of SQL - Increased bugs because the developers have to write more complicated SQL to handle sharding logic.
Sharding introduces complexity - The sharding software that partitions, balances, coordinates, and ensures integrity can fail.
Single point of failure - Corruption of one shard due to network/hardware/systems problems causes failure of the entire table.
Failover servers more complex - Failover servers must themselves have copies of the fleets of database shards.
Backups more complex - Database backups of the individual shards must be coordinated with the backups of the other shards.
Operational complexity added - Adding/removing indexes, adding/deleting columns, modifying the schema becomes much more difficult.

These historical complications of do-it-yourself sharding were addressed by independent software vendors who provided automatic sharding.

Etymology

In a database context, most recognize the term "shard" is most likely derived from either one of two sources: Computer Corporation of America's "A System for Highly Available Replicated Data", which utilized redundant hardware to facilitate data replication ; or the critically acclaimed 1997 MMORPG video game Ultima Online which set 8 Guinness World Records and was designated by Time as one of the 100 greatest video games produced of all time.
Richard Garriott, creator of Ultima Online, recollects the term being coined during production phase when they attempted to create a self-regulating virtual ecology system, whereby players may leverage new internet access to interact and harvest in-game resources. Although the virtual ecology functioned as intended during in-house testing, its natural balance failed "almost instantaneously" due to players killing off every living wildlife across the playable area faster than the spawning system could operate. Garriott's production team attempted to mitigate this issue by separating the global player base into separate sessions, and rewriting part of Ultima Online fictional connection to the end of , where the defeat of its antagonist Mondain also led to the creation of multiverse "shards". This modification provided Garriott's team with the fictional basis needed to justify creating copies of the virtual environment. However, the game's sharp rise to critical acclaim also meant that the new multiverse virtual ecology system was quickly overwhelmed as well. After several months of testing, Garriott's team decided to abandon the feature altogether, and stripped the game of its functionality.
Today, the term "shard" refers to the deployment and use of redundant hardware across database systems.

Popular movies

The Hunger Games (film) - 2012 American dystopian action thriller science fiction-adventure film directed by Gary Ross and based on Suzanne Collins’s 2008 novel of the same name. It is the first insta...
untitled Captain Marvel sequel - part of Marvel Cinematic Universe....
Killers of the Flower Moon (film project) - Killers of the Flower Moon - film project in United States of America. It was presented as drama, detective fiction, thriller. The film project starred Leonardo Dicaprio, Robert De Niro. Director of...
Five Nights at Freddy's (film) - Five Nights at Freddy's - film published in 2017 in United States of America. Scenarist of the film - Scott Cawthon....

Popular books

Book of Revelation - The Book of Revelation is the final book of the New Testament, and consequently is also the final book of the Christian Bible. Its title is derived from the first word of the Koine Greek text: apok...
Book of Genesis - account of the creation of the world, the early history of humanity, Israel's ancestors and the origins...
Gospel of Matthew - The Gospel According to Matthew is the first book of the New Testament and one of the three synoptic gospels. It tells how Israel's Messiah, rejected and executed in Israel, pronounces judgement on ...
Michelin Guide - Michelin Guides are a series of guide books published by the French tyre company Michelin for more than a century. The term normally refers to the annually published Michelin Red Guide , the oldest...
Psalms - The Book of Psalms , commonly referred to simply as Psalms , the Psalter or "the Psalms", is the first book of the Ketuvim , the third section of the Hebrew Bible, and thus a book of th...
Ecclesiastes - Ecclesiastes is one of 24 books of the Tanakh , where it is classified as one of the Ketuvim . Originally written c. 450–200 BCE, it is also among the canonical Wisdom literature of the Old Tes...
The 48 Laws of Power - non-fiction book by American author Robert Greene. The book...

Popular television series

The Crown (TV series) - historical drama web television series about the reign of Queen Elizabeth II, created and principally written by Peter Morgan, and produced by Left Bank Pictures and Sony Pictures Tel...
Friends - American sitcom television series, created by David Crane and Marta Kauffman, which aired on NBC from September 22, 1994, to May 6, 2004, lasting ten seasons. With an ensemble cast sta...
Young Sheldon - spin-off prequel to The Big Bang Theory and begins with the character Sheldon...
Modern Family - American television mockumentary family sitcom created by Christopher Lloyd and Steven Levitan for the American Broadcasting Company. It ran for eleven seasons, from September 23...
Loki (TV series) - upcoming American web television miniseries created for Disney+ by Michael Waldron, based on the Marvel Comics character of the same name. It is set in the Marvel Cinematic Universe, shar...
Game of Thrones - American fantasy drama television series created by David Benioff and D. B. Weiss for HBO. It...
Shameless (American TV series) - American comedy-drama television series developed by John Wells which debuted on Showtime on January 9, 2011. It...