Entity–attribute–value model
The entity–attribute–value (EAV) model is a data model for encoding, in a space-efficient manner, entities where the number of attributes that can be used to describe them is potentially vast, but the number that will actually apply to a given entity is relatively modest. Such entities correspond to the mathematical notion of a sparse matrix.
EAV is also known as object–attribute–value model, vertical database model, and open schema.
Data structure
This data representation is analogous to space-efficient methods of storing a sparse matrix, where only non-empty values are stored. In an EAV data model, each attribute-value pair is a fact describing an entity, and a row in an EAV table stores a single fact. EAV tables are often described as "long and skinny": "long" refers to the number of rows, "skinny" to the few columns.
Data is recorded as three columns:
- The entity: the item being described.
- The attribute or parameter: typically implemented as a foreign key into a table of attribute definitions. The attribute definitions table might contain the following columns: an attribute ID, attribute name, description, data type, and columns assisting input validation, e.g., maximum string length and regular expression, set of permissible values, etc.
- The value of the attribute.
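The three-column layout described above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a reference design: the table names (attribute_def, eav), the column names, and the sample rows are all hypothetical, and every value is stored as text for simplicity.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
-- Attribute definitions: one row per attribute that may be recorded.
CREATE TABLE attribute_def (
    attribute_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL UNIQUE,
    description  TEXT,
    data_type    TEXT NOT NULL,
    max_length   INTEGER
);
-- The EAV table itself: one row per fact about an entity.
-- The primary key below allows one value per attribute per entity,
-- which is an assumption of this sketch, not a rule of EAV.
CREATE TABLE eav (
    entity_id    INTEGER NOT NULL,
    attribute_id INTEGER NOT NULL REFERENCES attribute_def(attribute_id),
    value        TEXT,
    PRIMARY KEY (entity_id, attribute_id)
);
""")

# Hypothetical attribute definitions and facts, invented for illustration.
conn.executemany(
    "INSERT INTO attribute_def VALUES (?, ?, ?, ?, ?)",
    [
        (1, "temperature_f", "Body temperature in degrees Fahrenheit", "real", None),
        (2, "cough_present", "Whether the patient has a cough", "boolean", None),
        (3, "chief_complaint", "Free-text reason for the visit", "text", 200),
    ],
)
conn.executemany(
    "INSERT INTO eav VALUES (?, ?, ?)",
    [(100, 1, "102.5"), (100, 2, "true"), (101, 3, "headache")],
)

# Only facts that were actually recorded occupy rows; nothing is stored
# for attributes that do not apply to an entity.
rows = conn.execute("""
    SELECT e.entity_id, a.name, e.value
    FROM eav AS e JOIN attribute_def AS a USING (attribute_id)
    ORDER BY e.entity_id, a.attribute_id
""").fetchall()
```

Entity 101 has a single row and entity 100 has two, although three attributes are defined; this is the sparseness that the model is meant to exploit.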
Example
The following shows a snapshot of an EAV table for clinical findings from a visit to a doctor for a fever on the morning of 1/5/98. The entries shown within angle brackets are references to entries in other tables, shown here as text rather than as encoded foreign key values for ease of understanding. In this example, the values are all literal values, but they could also be pre-defined value lists. The latter are particularly useful when the possible values are known to be limited.
- The entity. For clinical findings, the entity is the patient event: a foreign key into a table that contains at a minimum a patient ID and one or more time-stamps that record when the event being described happened.
- The attribute or parameter: a foreign key into a table of attribute definitions. At the very least, the attribute definitions table would contain the following columns: an attribute ID, attribute name, description, data type, units of measurement, and columns assisting input validation, e.g., maximum string length and regular expression, maximum and minimum permissible values, set of permissible values, etc.
- The value of the attribute. This would depend on the data type, and we discuss how values are stored shortly.
The EAV data described above is comparable to the contents of a supermarket sales receipt. The receipt lists only details of the items actually purchased, instead of listing every product in the shop that the customer might have purchased but didn't. Like the clinical findings for a given patient, the sales receipt is sparse.
...
- The "entity" is the sale/transaction id — a foreign key into a sales transactions table. This is used to tag each line item internally, though on the receipt the information about the Sale appears at the top and at the bottom.
- The "attribute" is a foreign key into a products table, from where one looks up description, unit price, discounts and promotions, etc.
- The "values" are the quantity purchased and total line item price.
- A row-modeled table is homogeneous in the facts that it describes: a Line Items table describes only products sold. By contrast, an EAV table contains almost any type of fact.
- The data type of the value column/s in a row-modeled table is pre-determined by the nature of the facts it records. By contrast, in an EAV table, the conceptual data type of a value in a particular row depends on the attribute in that row. It follows that in production systems, allowing direct data entry into an EAV table would be a recipe for disaster, because the database engine itself would not be able to perform robust input validation. We shall see later how it is possible to build generic frameworks that perform most of the tasks of input validation, without endless coding on an attribute-by-attribute basis.
The circumstances where you would need to go beyond standard row-modeling to EAV are listed below:
- The data type of individual attributes varies.
- The categories of data are numerous, growing or fluctuating, but the number of instances within each category is very small. Here, with conventional modeling, the database’s entity–relationship diagram might have hundreds of tables: the tables that contain thousands/ millions of rows/instances are emphasized visually to the same extent as those with very few rows. The latter are candidates for conversion to an EAV representation.
Certain classes have some attributes that are non-sparse, while other attributes are highly variable and sparse. The latter are suitable for EAV modeling. For example, descriptions of products made by a conglomerate corporation depend on the product category, e.g., the attributes necessary to describe a brand of light bulb are quite different from those required to describe a medical imaging device, but both have common attributes such as packaging unit and per-item cost.
Description of concepts
The entity
In clinical data, the entity is typically a clinical event, as described above. In more general-purpose settings, the entity is a foreign key into an "objects" table that records common information about every "object" in the database – at the minimum, a preferred name and brief description, as well as the category/class of entity to which it belongs. Every record in this table is assigned a machine-generated object ID.
The "objects table" approach was pioneered by Tom Slezak and colleagues at Lawrence Livermore Laboratories for the Chromosome 19 database, and is now standard in most large bioinformatics databases. The use of an objects table does not mandate the concurrent use of an EAV design: conventional tables can be used to store the category-specific details of each object.
The major benefit to a central objects table is that, by having a supporting table of object synonyms and keywords, one can provide a standard Google-like search mechanism across the entire system where the user can find information about any object of interest without having to first specify the category that it belongs to. (This is important in bioscience systems where a keyword like "acetylcholine" could refer either to the molecule itself, which is a neurotransmitter, or the biological receptor to which it binds.)
The attribute
In the EAV table itself, this is just an attribute ID, a foreign key into an Attribute Definitions table, as stated above. However, there are usually multiple metadata tables that contain attribute-related information, and these are discussed shortly.
The value
Coercing all values into strings, as in the EAV data example above, results in a simple, but non-scalable, structure: constant data type inter-conversions are required if one wants to do anything with the values, and an index on the value column of an EAV table is essentially useless. Also, it is not convenient to store large binary data, such as images, in Base64 encoded form in the same table as small integers or strings. Therefore, larger systems use separate EAV tables for each data type, with the metadata for a given attribute identifying the EAV table in which its data will be stored. This approach is actually quite efficient because the modest amount of attribute metadata for a given class or form that a user chooses to work with can be cached readily in memory. However, it requires moving of data from one table to another if an attribute’s data type is changed.
History
EAV, as a general-purpose means of knowledge representation, originated with the concept of "association lists". Commonly used today, these were first introduced in the language LISP.
Attribute-value pairs are widely used for diverse applications, such as configuration files. An example of non-database use of EAV is in UIMA, a standard now managed by the Apache Foundation and employed in areas such as natural language processing. Software that analyzes text typically marks up a segment: the example provided in the UIMA tutorial is a program that performs named-entity recognition on a document, annotating the text segment "President Bush" with the annotation-attribute-value triple . Such annotations may be stored in a database table.
While EAV does not have a direct connection to AV-pairs, Stead and Hammond appear to be the first to have conceived of their use for persistent storage of arbitrarily complex data.
The first medical record systems to employ EAV were the Regenstrief electronic medical record, William Stead and Ed Hammond's TMR system and the HELP Clinical Data Repository created by Homer Warner's group at LDS Hospital, Salt Lake City, Utah. All these systems, developed in the 1970s, were released before commercial systems based on E.F. Codd's relational database model were available, though HELP was much later ported to a relational architecture and commercialized by the 3M corporation.
A group at the Columbia-Presbyterian Medical Center were the first to use a relational database engine as the foundation of an EAV system.
The open-source clinical study data management system of Nadkarni et al. was the first to use multiple EAV tables, one for each DBMS data type.
The EAV/CR framework, designed primarily by Luis Marenco and Prakash Nadkarni, overlaid the principles of object orientation onto EAV; it built on Tom Slezak's object table approach. A publicly accessible neuroscience database is built with the EAV/CR framework. Additionally, there are numerous commercial applications that use aspects of EAV internally, including Oracle Designer, Kalido, and Lazysoft Sentences.
An EAV system that does not sit on top of a tabular structure but instead directly on a B Tree is InfinityDB, which eliminates the need for one table per value data type.
Use in databases
The term "EAV database" refers to a database design where a significant proportion of the data is modeled as EAV. However, even in a database described as "EAV-based", some tables in the system are traditional relational tables.
As noted above, EAV modeling makes sense for categories of data, such as clinical findings, where attributes are numerous and sparse. Where these conditions do not hold, standard relational modeling is preferable; using EAV does not mean abandoning common sense or principles of good relational design. In clinical record systems, the subschemas dealing with patient demographics and billing are typically modeled conventionally.
As discussed shortly, an EAV database is essentially unmaintainable without numerous supporting tables that contain supporting metadata. The metadata tables, which typically outnumber the EAV tables by a factor of at least three, are typically standard relational tables. An example of a metadata table is the Attribute Definitions table mentioned above.
EAV/CR: representing substructure with classes and relationships
In a simple EAV design, the values of an attribute are simple or primitive data types as far as the database engine is concerned. However, in EAV systems used for representation of highly diverse data, it is possible that a given object may have substructure: that is, some of its attributes may represent other kinds of objects, which in turn may have substructure, to an arbitrary level of complexity. A car, for example, has an engine, a transmission, etc., and the engine has components such as cylinders.
To represent substructure, one incorporates a special EAV table where the value column contains references to other entities in the system. To get all the information on a given object requires a recursive traversal of the metadata, followed by a recursive traversal of the data that stops when every attribute retrieved is simple. Recursive traversal is necessary whether details of an individual class are represented in conventional or EAV form; such traversal is performed in standard object–relational systems, for example. In practice, the number of levels of recursion tends to be relatively modest for most classes, so the performance penalties due to recursion are modest, especially with indexing of object IDs.
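The recursive retrieval described above can be sketched in a few lines of Python. This is a simplified illustration that traverses only the data, not the metadata: the two in-memory lists stand in for an EAV table of simple values and the special EAV table of object references, and all entity IDs, attribute names, and values are hypothetical.

```python
# Stand-in for an EAV table of primitive values: (entity_id, attribute, value).
simple_values = [
    (1, "model", "sedan"),
    (2, "displacement_l", 2.0),
    (3, "bore_mm", 86),
    (4, "bore_mm", 86),
]

# Stand-in for the special EAV table whose value column references
# other entities: (entity_id, attribute, referenced_entity_id).
object_refs = [
    (1, "engine", 2),
    (2, "cylinder_1", 3),
    (2, "cylinder_2", 4),
]


def load_object(entity_id, ancestors=frozenset()):
    """Assemble an entity and all of its sub-objects as a nested dict.

    Recursion stops along a branch once an entity has only simple
    attributes. The ancestors set guards against reference cycles.
    """
    if entity_id in ancestors:
        raise ValueError(f"reference cycle detected at entity {entity_id}")
    ancestors = ancestors | {entity_id}

    # Simple attributes are copied directly.
    result = {attr: value for ent, attr, value in simple_values if ent == entity_id}

    # Attributes that reference other entities are expanded recursively.
    for ent, attr, child_id in object_refs:
        if ent == entity_id:
            result[attr] = load_object(child_id, ancestors)
    return result


car = load_object(1)
```

Here the car (entity 1) is returned with its engine nested inside it, and the engine in turn contains its two cylinders. A database-backed version would replace the list scans with indexed lookups on the object ID.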
EAV/CR refers to a framework that supports complex substructure. Its name is somewhat of a misnomer: while it was an offshoot of work on EAV systems, in practice, many or even most of the classes in such a system may be represented in standard relational form, based on whether the attributes are sparse or dense. EAV/CR is really characterized by its very detailed metadata, which is rich enough to support the automatic generation of browsing interfaces to individual classes without having to write class-by-class user-interface code. The basis of such browser interfaces is that it is possible to generate a batch of dynamic SQL queries that is independent of the class of the object, by first consulting its metadata and using metadata information to generate a sequence of queries against the data tables; some of these queries may be arbitrarily recursive. This approach works well for object-at-a-time queries, as in Web-based browsing interfaces where clicking on the name of an object brings up all details of the object in a separate page: the metadata associated with that object's class also facilitates presentation of the object's details, because it includes captions of individual attributes, the order in which they are to be presented, as well as how they are to be grouped.
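The idea of class-independent, metadata-driven browsing can be illustrated with a small sketch using Python's sqlite3 module. It is not the EAV/CR implementation: the tables (objects, attribute_meta, eav), their columns, and the sample rows are assumptions made for this example, and substructure and grouping are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE objects (
    object_id  INTEGER PRIMARY KEY,
    class_name TEXT NOT NULL,
    name       TEXT NOT NULL
);
-- Per-class attribute metadata: caption and presentation order.
CREATE TABLE attribute_meta (
    attribute_id  INTEGER PRIMARY KEY,
    class_name    TEXT NOT NULL,
    caption       TEXT NOT NULL,
    display_order INTEGER NOT NULL
);
CREATE TABLE eav (object_id INTEGER, attribute_id INTEGER, value TEXT);

-- Hypothetical sample data.
INSERT INTO objects VALUES (1, 'neuron', 'Purkinje cell');
INSERT INTO attribute_meta VALUES
    (10, 'neuron', 'Brain region', 1),
    (11, 'neuron', 'Neurotransmitter', 2);
INSERT INTO eav VALUES (1, 11, 'GABA'), (1, 10, 'cerebellum');
""")


def object_details(conn, object_id):
    """Return (caption, value) pairs for any object, in display order.

    The function contains no class-specific code: it looks up the
    object's class, reads that class's attribute metadata, and then
    queries the data table one attribute at a time.
    """
    row = conn.execute(
        "SELECT class_name FROM objects WHERE object_id = ?", (object_id,)
    ).fetchone()
    if row is None:
        raise KeyError(object_id)

    meta = conn.execute(
        "SELECT attribute_id, caption FROM attribute_meta "
        "WHERE class_name = ? ORDER BY display_order",
        (row[0],),
    ).fetchall()

    details = []
    for attribute_id, caption in meta:
        value_row = conn.execute(
            "SELECT value FROM eav WHERE object_id = ? AND attribute_id = ?",
            (object_id, attribute_id),
        ).fetchone()
        details.append((caption, value_row[0] if value_row else None))
    return details


details = object_details(conn, 1)
```

Adding a new class in this sketch means inserting rows into attribute_meta; object_details itself does not change, which is the property the paragraph above attributes to metadata-driven interfaces.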
One approach to EAV/CR is to allow columns to hold JSON structures, which thus provide the needed class structure. For example, PostgreSQL, as of version 9.4, offers JSON binary column support, allowing JSON attributes to be queried, indexed and joined.
Metadata
In the words of Prof. Dr. Daniel Masys, the challenges of working with EAV stem from the fact that in an EAV database, the "physical schema" is radically different from the "logical schema" – the way users, and many software applications such as statistics packages, regard it, i.e., as conventional rows and columns for individual classes.
Metadata helps perform the sleight of hand that lets users interact with the system in terms of the logical schema rather than the physical: the software continually consults the metadata for various operations such as data presentation, interactive validation, bulk data extraction and ad hoc query. The metadata can actually be used to customize the behavior of the system.
EAV systems trade off simplicity in the physical and logical structure of the data for complexity in their metadata, which, among other things, plays the role that database constraints and referential integrity do in standard database designs. Such a tradeoff is generally worthwhile, because in the typical mixed schema of production systems, the data in conventional relational tables can also benefit from functionality such as automatic interface generation. The structure of the metadata is complex enough that it comprises its own subschema within the database: various foreign keys in the data tables refer to tables within this subschema. This subschema is standard-relational, with features such as constraints and referential integrity being used to the hilt.
The correctness of the metadata contents, in terms of the intended system behavior, is critical and the task of ensuring correctness means that, when creating an EAV system, considerable design efforts must go into building user interfaces for metadata editing that can be used by people on the team who know the problem domain but are not necessarily programmers.
Where an EAV system is implemented through RDF, the RDF Schema language may conveniently be used to express such metadata. This Schema information may then be used by the EAV database engine to dynamically re-organize its internal table structure for best efficiency.
Some final caveats regarding metadata:
- Because the business logic is in the metadata rather than explicit in the database schema, it is less apparent to one who is unfamiliar with the system. Metadata-browsing and metadata-reporting tools are therefore important in ensuring the maintainability of an EAV system. In the common scenario where metadata is implemented as a relational sub-schema, these tools are nothing more than applications built using off-the-shelf reporting or querying tools that operate on the metadata tables.
- It is easy for an insufficiently knowledgeable user to corrupt metadata. Therefore, access to metadata must be restricted, and an audit trail of accesses and changes put into place to deal with situations where multiple individuals have metadata access. Using an RDBMS for metadata will simplify the process of maintaining consistency during metadata creation and editing, by leveraging RDBMS features such as support for transactions. Also, if the metadata is part of the same database as the data itself, this ensures that it will be backed up at least as frequently as the data itself, so that it can be recovered to a point in time.
- The quality of the annotation and documentation within the metadata must be much higher, in order to facilitate understanding by various members of the development team. Ensuring metadata quality takes very high priority in the long-term management and maintenance of any design that uses an EAV component. Poorly-documented or out-of-date metadata can compromise the system's long-term viability.
Information captured in metadata
Attribute metadata
- Validation metadata include data type, range of permissible values or membership in a set of values, regular expression match, default value, and whether the value is permitted to be null. In EAV systems representing classes with substructure, the validation metadata will also record what class, if any, a given attribute belongs to.
- Presentation metadata: how the attribute is to be displayed to the user. When a compound object is composed of multiple attributes, as in the EAV/CR design, there is additional metadata on the order in which the attributes should be presented, and how these attributes should optionally be grouped.
- For attributes which happen to be laboratory parameters, ranges of normal values, which may vary by age, sex, physiological state and assay method, are recorded.
- Grouping metadata: Attributes are typically presented as part of a higher-order group, e.g., a specialty-specific form. Grouping metadata includes information such as the order in which attributes are presented. Certain presentation metadata, such as fonts/colors and the number of attributes displayed per row, apply to the group as a whole.
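As an illustration of how validation metadata can drive generic input checking, the following Python sketch validates a raw input string against a single attribute definition. The AttributeMeta class, its field names, and the example attributes are hypothetical; a real system would load this information from its metadata tables.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeMeta:
    """Validation metadata for one attribute (hypothetical structure)."""
    name: str
    data_type: type                       # e.g. int, float, str
    min_value: Optional[float] = None     # only meaningful for numeric types
    max_value: Optional[float] = None
    pattern: Optional[str] = None         # regular expression the input must match
    allowed: Optional[frozenset] = None   # set of permissible values
    nullable: bool = True


def validate(meta, raw):
    """Return a list of error messages; an empty list means the input is valid."""
    if raw is None or raw == "":
        return [] if meta.nullable else [f"{meta.name}: a value is required"]
    try:
        value = meta.data_type(raw)
    except (TypeError, ValueError):
        return [f"{meta.name}: {raw!r} is not a valid {meta.data_type.__name__}"]

    errors = []
    if meta.min_value is not None and value < meta.min_value:
        errors.append(f"{meta.name}: {value} is below the minimum {meta.min_value}")
    if meta.max_value is not None and value > meta.max_value:
        errors.append(f"{meta.name}: {value} is above the maximum {meta.max_value}")
    if meta.pattern is not None and re.fullmatch(meta.pattern, str(raw)) is None:
        errors.append(f"{meta.name}: {raw!r} does not match the required pattern")
    if meta.allowed is not None and value not in meta.allowed:
        errors.append(f"{meta.name}: {value!r} is not a permitted value")
    return errors


# Hypothetical attribute definitions.
pulse = AttributeMeta("pulse_bpm", int, min_value=20, max_value=300, nullable=False)
blood_group = AttributeMeta("blood_group", str, allowed=frozenset({"A", "B", "AB", "O"}))

ok = validate(pulse, "72")
not_a_number = validate(pulse, "abc")
out_of_range = validate(pulse, "500")
missing = validate(pulse, None)
bad_choice = validate(blood_group, "X")
```

The same validate function serves every attribute, so adding an attribute requires only a new metadata record rather than new validation code.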
Advanced validation metadata
- Dependency metadata: in many user interfaces, entry of specific values into certain fields/attributes is required to either disable/hide certain other fields or enable/show other fields. To effect this in a generic framework involves storing of dependencies between the controlling attributes and the controlled attributes.
- Computations and complex validation: As in a spreadsheet, the value of certain attributes can be computed, and displayed, based on values entered into fields that are presented earlier in sequence. Similarly, there may be "constraints" that must be true for the data to be valid: for example, in a differential white cell count, the sum of the counts of the individual white cell types must always equal 100, because the individual counts represent percentages. Computed formulas and complex validation are generally effected by storing expressions in the metadata that are macro-substituted with the values that the user enters and can be evaluated. In Web browsers, both JavaScript and VBScript have an Eval function that can be leveraged for this purpose.
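The dependency, computation, and constraint mechanisms above can be sketched in Python, using the built-in eval function in the role the text assigns to a browser's Eval. The expressions, attribute names, and the three metadata structures below are hypothetical, and eval with an emptied builtins table is not a security sandbox: this sketch assumes the stored expressions come from trusted metadata editors.

```python
def evaluate(expression, values):
    """Evaluate a metadata expression with entered values bound by name.

    Caution: eval is used for brevity. Removing the builtins reduces,
    but does not eliminate, what a malicious expression could do.
    """
    return eval(expression, {"__builtins__": {}}, dict(values))


# Computed attributes: name -> formula over other attributes (hypothetical).
COMPUTED = {
    "granulocytes": "neutrophils + eosinophils + basophils",
}

# Constraints: message -> expression that must be true for valid data.
CONSTRAINTS = {
    "differential counts must total 100":
        "neutrophils + lymphocytes + monocytes + eosinophils + basophils == 100",
}

# Dependencies: (condition on controlling attribute, controlled attributes
# that are enabled when the condition holds). Hypothetical example.
DEPENDENCIES = [
    ("smoker == 'yes'", ["packs_per_day"]),
]


def process(values):
    """Return (computed values, violated constraints, enabled fields)."""
    computed = {name: evaluate(expr, values) for name, expr in COMPUTED.items()}
    violations = [msg for msg, expr in CONSTRAINTS.items()
                  if not evaluate(expr, values)]
    enabled = [field
               for condition, fields in DEPENDENCIES
               if evaluate(condition, values)
               for field in fields]
    return computed, violations, enabled


valid_entry = {"neutrophils": 60, "lymphocytes": 30, "monocytes": 6,
               "eosinophils": 3, "basophils": 1, "smoker": "yes"}
invalid_entry = dict(valid_entry, basophils=2, smoker="no")

valid_result = process(valid_entry)
invalid_result = process(invalid_entry)
```

For the first entry the counts total 100, so no constraint is violated and the packs_per_day field is enabled; in the second entry the counts total 101, so the constraint message is returned and the dependent field stays disabled.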
Usage scenarios
EAV modeling, under the alternative terms "generic data modeling" or "open schema", has long been a standard tool for advanced data modelers. Like any advanced technique, it can be double-edged, and should be used judiciously.
Also, the employment of EAV does not preclude the employment of traditional relational database modeling approaches within the same database schema. In EMRs that rely on an RDBMS, such as Cerner, which use an EAV approach for their clinical-data subschema, the vast majority of tables in the schema are in fact traditionally modeled, with attributes represented as individual columns rather than as rows.
The modeling of the metadata subschema of an EAV system, in fact, is a very good fit for traditional modeling, because of the inter-relationships between the various components of the metadata. In the TrialDB system, for example, the metadata tables in the schema outnumber the data tables by about ten to one. Because the correctness and consistency of metadata is critical to the correct operation of an EAV system, the system designer wants to take full advantage of all of the features that RDBMSs provide, such as referential integrity and programmable constraints, rather than having to reinvent the RDBMS-engine wheel. Consequently, the numerous metadata tables that support EAV designs are typically in third-normal relational form.
Commercial electronic health record systems use row-modeling for classes of data such as diagnoses, surgical procedures performed, and laboratory test results, which are segregated into separate tables. In each table, the "entity" is a composite of the patient ID and the date/time the diagnosis was made; the attribute is a foreign key into a specially designated lookup table that contains a controlled vocabulary, e.g., ICD-10 for diagnoses or Current Procedural Terminology for surgical procedures, with a set of value attributes. As stated earlier, this is not a full-fledged EAV approach because the domain of attributes for a given table is restricted, just as the domain of product IDs in a supermarket's Sales table would be restricted to the domain of Products in a Products table.
However, to capture data on parameters that are not always defined in standard vocabularies, EHRs also provide a "pure" EAV mechanism, where specially designated power-users can define new attributes, their data type, maximum and minimal permissible values, and then allow others to capture data based on these attributes. In the Epic EHR, this mechanism is termed "Flowsheets", and is commonly used to capture inpatient nursing observation data.
Modeling sparse attributes
The typical case for using the EAV model is for highly sparse, heterogeneous attributes, such as clinical parameters in the electronic medical record, as stated above. Even here, however, it is accurate to state that the EAV modeling principle is applied to a sub-schema of the database rather than for all of its contents.
Consequently, the arguments about EAV vs. "relational" design reflect incomplete understanding of the problem: an EAV design should be employed only for that sub-schema of a database where sparse attributes need to be modeled, and even here, the EAV tables need to be supported by third normal form metadata tables. There are relatively few database-design problems where sparse attributes are encountered: this is why the circumstances where EAV design is applicable are relatively rare. Even where they are encountered, a set of EAV tables is not the only way to address sparse data: an XML-based solution is applicable when the maximum number of attributes per entity is relatively modest, and the total volume of sparse data is also similarly modest. An example of this situation is the problem of capturing variable attributes for different product types.
Sparse attributes may also occur in E-commerce situations where an organization is purchasing or selling a vast and highly diverse set of commodities, with the details of individual categories of commodities being highly variable. The Magento E-commerce software employs an EAV approach to address this issue.
Modeling numerous classes with very few instances per class: highly dynamic schemas
Another application of EAV is in modeling classes and attributes that, while not sparse, are dynamic, but where the number of data rows per class will be relatively modest – a couple of hundred rows at most, but typically a few dozen – and the system developer is also required to provide a Web-based end-user interface within a very short turnaround time. "Dynamic" means that new classes and attributes need to be continually defined and altered to represent an evolving data model. This scenario can occur in rapidly evolving scientific fields as well as in ontology development, especially during the prototyping and iterative refinement phases.
While creation of new tables and columns to represent a new category of data is not especially labor-intensive, the programming of Web-based interfaces that support browsing or basic editing with type- and range-based validation is. In such a case, a more maintainable long-term solution is to create a framework where the class and attribute definitions are stored in metadata, and the software generates a basic user interface from this metadata dynamically.
The EAV/CR framework, mentioned earlier, was created to address this very situation. Note that an EAV data model is not essential here, but the system designer may consider it an acceptable alternative to creating, say, sixty or more tables containing a total of not more than two thousand rows. Here, because the number of rows per class is so few, efficiency considerations are less important; with the standard indexing by class ID/attribute ID, DBMS optimizers can easily cache the data for a small class in memory when running a query involving that class or attribute.
In the dynamic-attribute scenario, it is worth noting that Resource Description Framework is being employed as the underpinning of Semantic-Web-related ontology work. RDF, intended to be a general method of representing information, is a form of EAV: an RDF triple comprises an object, a property, and a value.
At the end of Jon Bentley's book "Writing Efficient Programs", the author warns that making code more efficient generally also makes it harder to understand and maintain, and so one does not rush in and tweak code unless one has first determined that there is a performance problem, and measures such as code profiling have pinpointed the exact location of the bottleneck. Once you have done so, you modify only the specific code that needs to run faster. Similar considerations apply to EAV modeling: you apply it only to the sub-system where traditional relational modeling is known a priori to be unwieldy, or is discovered, during system evolution, to pose significant maintenance challenges. Database Guru Tom Kyte, for example, correctly points out drawbacks of employing EAV in traditional business scenarios, and makes the point that mere "flexibility" is not a sufficient criterion for employing EAV.
Working with EAV data
The Achilles heel of EAV is the difficulty of working with large volumes of EAV data. It is often necessary to transiently or permanently inter-convert between columnar and row- or EAV-modeled representations of the same data; this can be error-prone if done manually, as well as CPU-intensive. Generic frameworks that utilize attribute and attribute-grouping metadata address the former but not the latter limitation; their use is more or less mandated in the case of mixed schemas that contain a mixture of conventional-relational and EAV data, where the error quotient can be very significant.
The conversion operation is called pivoting. Pivoting is required not only for EAV data but also for any form of row-modeled data. Many database engines have proprietary SQL extensions to facilitate pivoting, and packages such as Microsoft Excel also support it. The circumstances where pivoting is necessary are considered below.
- Browsing of modest amounts of data for an individual entity, optionally followed by data editing based on inter-attribute dependencies. This operation is facilitated by caching the modest amounts of the requisite supporting metadata. Some programs, such as TrialDB, access the metadata to generate semi-static Web pages that contain embedded programming code as well as data structures holding metadata.
- Bulk extraction transforms large amounts of data into a set of relational tables. While CPU-intensive, this task is infrequent and does not need to be done in real-time; i.e., the user can wait for a batched process to complete. The importance of bulk extraction cannot be overestimated, especially when the data is to be processed or analyzed with standard third-party tools that are completely unaware of EAV structure. Here, it is not advisable to try to reinvent entire sets of wheels through a generic framework, and it is best just to bulk-extract EAV data into relational tables and then work with it using standard tools.
- Ad hoc query interfaces to row- or EAV-modeled data, when queried from the perspective of individual attributes, must typically show the results of the query with individual attributes as separate columns. For most EAV database scenarios ad hoc query performance must be tolerable, but sub-second responses are not necessary, since the queries tend to be exploratory in nature.
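Where no proprietary pivot extension is available, one portable way to pivot EAV rows into columns is conditional aggregation in plain SQL. The sketch below uses Python's sqlite3 module; the eav table, its columns, the attribute names, and the sample rows are assumptions made for illustration, and the approach is only one of several possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE eav (entity_id INTEGER, attribute TEXT, value TEXT);
-- Hypothetical sample facts.
INSERT INTO eav VALUES
    (1, 'pulse', '72'), (1, 'temperature', '38.9'),
    (2, 'pulse', '88'),
    (3, 'temperature', '37.0'), (3, 'weight', '70');
""")


def pivot_query(attributes):
    """Build a query returning one row per entity and one column per attribute.

    Attribute names are compared through bound parameters. Only the
    column aliases are interpolated into the SQL text, so each name is
    first checked to be a plain identifier.
    """
    for name in attributes:
        if not name.isidentifier():
            raise ValueError(f"unsafe attribute name: {name!r}")
    columns = ",\n    ".join(
        f'MAX(CASE WHEN attribute = ? THEN value END) AS "{name}"'
        for name in attributes
    )
    return (
        f"SELECT entity_id,\n    {columns}\n"
        "FROM eav GROUP BY entity_id ORDER BY entity_id"
    )


attributes = ["pulse", "temperature", "weight"]
rows = conn.execute(pivot_query(attributes), attributes).fetchall()
```

Each resulting row holds an entity ID followed by the values of pulse, temperature, and weight, with None wherever no fact was recorded. The query is generated from the attribute list, so the same function serves any group of attributes.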
Relational division
Optimizing pivoting performance
- One possible optimization is the use of a separate "warehouse" or queryable schema whose contents are refreshed in batch mode from the production schema. See data warehousing. The tables in the warehouse are heavily indexed and optimized using denormalization, which combines multiple tables into one to minimize performance penalty due to table joins. This is the approach that Kalido uses to convert highly normalized EAV tables to standard reporting schemas.
- Certain EAV data in a warehouse may be converted into standard tables using "materialized views", but this is generally a last resort that must be used carefully, because the number of views of this kind tends to grow non-linearly with the number of attributes in a system.
- In-memory data structures: One can use hash tables and two-dimensional arrays in memory in conjunction with attribute-grouping metadata to pivot data, one group at a time. This data is written to disk as a flat delimited file, with the internal names for each attribute in the first row: this format can be readily bulk-imported into a relational table. This "in-memory" technique significantly outperforms alternative approaches by keeping the queries on EAV tables as simple as possible and minimizing the number of I/O operations. Each statement retrieves a large amount of data, and the hash tables help carry out the pivoting operation, which involves placing a value for a given attribute instance into the appropriate row and column. Random Access Memory is sufficiently abundant and affordable in modern hardware that the complete data set for a single attribute group in even large data sets will usually fit completely into memory, though the algorithm can be made smarter by working on slices of the data if this turns out not to be the case.
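The in-memory technique in the last item can be sketched as follows. This is a minimal illustration under stated assumptions: the triples would normally come from one simple query against an EAV table, the list of attributes in the group would come from attribute-grouping metadata, and the function and column names are hypothetical.

```python
import csv
import io


def pivot_group(triples, group_attributes):
    """Pivot (entity, attribute, value) triples for one attribute group.

    Two hash tables map attributes to column positions and entities to
    row positions; the grid is a two-dimensional array of values.
    """
    column_of = {name: i for i, name in enumerate(group_attributes)}
    row_of = {}
    entities = []
    grid = []
    for entity, attribute, value in triples:
        col = column_of.get(attribute)
        if col is None:
            continue  # attribute belongs to a different group
        row = row_of.get(entity)
        if row is None:
            row = row_of[entity] = len(grid)
            entities.append(entity)
            grid.append([None] * len(group_attributes))
        grid[row][col] = value
    return entities, grid


def write_delimited(entities, grid, group_attributes, out):
    """Write the pivoted data as a tab-delimited file with a header row."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["entity_id", *group_attributes])
    for entity, row in zip(entities, grid):
        writer.writerow([entity, *("" if v is None else v for v in row)])


# Hypothetical sample data; "mood" is not part of the group being pivoted.
triples = [(1, "pulse", 72), (1, "temperature", 38.9),
           (2, "pulse", 88), (1, "mood", "calm")]
group = ["pulse", "temperature"]

entities, grid = pivot_group(triples, group)
buffer = io.StringIO()
write_delimited(entities, grid, group, buffer)
```

The buffer then holds a header line followed by one line per entity, in a form that can be bulk-imported into a relational table. Writing to a real file instead of an in-memory buffer only requires passing an open file object as the last argument.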
Alternatives
EAV vs. the Universal Data Model
Originally postulated by Maier, Ullman and Vardi, the "Universal Data Model" seeks to simplify the query of a complex relational schema by naive users, by creating the illusion that everything is stored in a single giant "universal table". It does this by utilizing inter-table relationships, so that the user does not need to be concerned about what table contains what attribute. C.J. Date, however, pointed out that in circumstances where a table is multiply related to another, there is insufficient metadata within the database schema to specify unambiguous joins. When UDM has been commercialized, as in SAP BusinessObjects, this limitation is worked around through the creation of "Universes", which are relational views with predefined joins between sets of tables: the "Universe" developer disambiguates ambiguous joins by including the multiply-related table in a view multiple times using different aliases.
Apart from the way in which data is explicitly modeled, EAV differs from Universal Data Models in that it also applies to transactional systems, not only query-oriented systems as in UDM. Also, when used as the basis for clinical-data query systems, EAV implementations do not necessarily shield the user from having to specify the class of an object of interest. In the EAV-based i2b2 clinical data mart, for example, when the user searches for a term, the user has the option of specifying the category of data of interest. For example, the phrase "lithium" can refer either to the medication, or a laboratory assay for lithium level in the patient's blood.
XML and JSON
An Open Schema implementation can use an XML column in a table to capture the variable/sparse information. Similar ideas can be applied to databases that support JSON-valued columns: sparse, hierarchical data can be represented as JSON. If the database has JSON support, as PostgreSQL and SQL Server 2016 and later do, then attributes can be queried, indexed and joined. This can offer performance improvements of over 1000x over naive EAV implementations, but does not necessarily make the overall database application more robust.
Note that there are two ways in which XML or JSON data can be stored: one way is to store it as a plain string, opaque to the database server; the other way is to use a database server that can "see into" the structure. There are severe drawbacks to storing opaque strings: they cannot be queried directly, one cannot form an index based on their contents, and it is impossible to perform joins based on the content.
Building an application that has to manage data gets extremely complicated when using EAV models, because of the extent of infrastructure that has to be developed in terms of metadata tables and application-framework code. Using XML solves the problem of server-based data validation, but has the following drawbacks:
- It is programmer-intensive. XML schemas are notoriously tricky to write by hand; a recommended approach is to create them by defining relational tables, generating XML-schema code, and then dropping these tables. This is problematic in many production operations involving dynamic schemas, where new attributes must be defined by power-users who understand a specific application domain but are not necessarily programmers. By contrast, in production systems that use EAV, such users define new attributes through a GUI application. Because the validation-associated metadata must be stored in multiple relational tables in a normalized design, a GUI application that ties these tables together and enforces the appropriate metadata-consistency checks is the only practical way to allow entry of attribute information, even for advanced developers, and even if the end result uses XML or JSON instead of separate relational tables.
- The server-based diagnostics produced by an XML/JSON solution when an attempt is made to insert incorrect data are cryptic to the end-user: to convey the error accurately, one would, at the least, need to associate a detailed and user-friendly error diagnostic with each attribute.
- The solution does not address the user-interface-generation problem.
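The idea of associating a user-friendly diagnostic with each attribute can be sketched as follows. This is a hypothetical, minimal metadata record in Python: the class name, its fields, and the `blood_group` attribute are assumptions for illustration, not part of any specific EAV framework.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttributeDefinition:
    """One row of a (hypothetical) attribute-definitions metadata table."""
    name: str
    data_type: type
    error_message: str            # user-friendly diagnostic for this attribute
    pattern: Optional[str] = None  # regular expression for string values
    max_length: Optional[int] = None


def validate(definition, value):
    """Return None if the value is acceptable, else the attribute's own message."""
    if not isinstance(value, definition.data_type):
        return definition.error_message
    if isinstance(value, str):
        if definition.max_length is not None and len(value) > definition.max_length:
            return definition.error_message
        if definition.pattern is not None and not re.fullmatch(definition.pattern, value):
            return definition.error_message
    return None


blood_group = AttributeDefinition(
    name="blood_group",
    data_type=str,
    error_message="Blood group must be A, B, AB or O, followed by + or -.",
    pattern=r"(A|B|AB|O)[+-]",
    max_length=3,
)

ok = validate(blood_group, "AB+")
bad_text = validate(blood_group, "X")
bad_type = validate(blood_group, 7)
```

Because the message travels with the attribute definition, the middle tier can report it directly instead of surfacing a raw server error.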
Tree structures and relational databases
There exist several other approaches for the representation of tree-structured data, be it XML, JSON or other formats, in a relational database, such as the nested set model. On the other hand, database vendors have begun to include JSON and XML support in their data structures and query features, as in IBM DB2, where XML data is stored as XML separate from the tables and queried with XPath as part of SQL statements, or in PostgreSQL, with a JSON data type that can be indexed and queried. These developments can accomplish, improve on, or substitute for the EAV model approach.
The uses of JSON and XML are not necessarily the same as the use of an EAV model, though they can overlap. XML is preferable to EAV for arbitrarily hierarchical data that is relatively modest in volume for a single entity: it is not intended to scale up to the multi-gigabyte level with respect to data-manipulation performance. XML is not concerned per se with the sparse-attribute problem, and when the data model underlying the information to be represented can be decomposed straightforwardly into a relational structure, XML is better suited as a means of data interchange than as a primary storage mechanism. EAV, as stated earlier, is specifically applicable to the sparse-attribute scenario. When such a scenario holds, the use of datatype-specific attribute-value tables that can be indexed by entity, by attribute, and by value, and manipulated through simple SQL statements, is vastly more scalable than the use of an XML tree structure. The Google App Engine, mentioned above, uses strongly-typed-value tables for a good reason.
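The datatype-specific attribute-value tables mentioned above can be sketched with Python's built-in `sqlite3` module. The table names, index names, attribute IDs and sample values are assumptions chosen for the example; a production design would also carry metadata tables and foreign keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One value table per data type, each indexed by entity and by attribute/value.
CREATE TABLE eav_real (entity_id INTEGER, attribute_id INTEGER, value REAL);
CREATE TABLE eav_text (entity_id INTEGER, attribute_id INTEGER, value TEXT);
CREATE INDEX eav_real_by_entity ON eav_real (entity_id, attribute_id);
CREATE INDEX eav_real_by_value  ON eav_real (attribute_id, value);
CREATE INDEX eav_text_by_entity ON eav_text (entity_id, attribute_id);
CREATE INDEX eav_text_by_value  ON eav_text (attribute_id, value);
""")

TEMPERATURE, DIAGNOSIS = 1, 2  # hypothetical attribute IDs

conn.executemany(
    "INSERT INTO eav_real VALUES (?, ?, ?)",
    [(101, TEMPERATURE, 38.9), (102, TEMPERATURE, 36.8), (103, TEMPERATURE, 39.4)],
)
conn.executemany(
    "INSERT INTO eav_text VALUES (?, ?, ?)",
    [(101, DIAGNOSIS, "influenza")],
)

# A simple, typed range query: the comparison is numeric, not string-based.
febrile = [
    row[0]
    for row in conn.execute(
        "SELECT entity_id FROM eav_real "
        "WHERE attribute_id = ? AND value > ? ORDER BY entity_id",
        (TEMPERATURE, 38.0),
    )
]
```

Keeping numeric values in their own table lets the database compare and index them as numbers, which a single string-typed value column cannot do.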
Graph databases
An alternative approach to managing the various problems encountered with EAV-structured data is to employ a graph database. These represent entities as the nodes of a graph or hypergraph, and attributes as links or edges of that graph. The issue of table joins is addressed by providing graph-specific query languages, such as Apache TinkerPop, or the OpenCog atomspace pattern matcher.
Considerations for server software
PostgreSQL: JSONB columns
PostgreSQL version 9.4 and later include support for JSON binary (JSONB) columns, which can be queried, indexed and joined. This allows performance improvements by factors of a thousand or more over traditional EAV table designs.
A database schema based on JSONB typically needs fewer tables than the equivalent EAV design: attribute-value pairs can be nested in JSONB-typed fields of the entity table. That makes the schema easier to comprehend and the SQL queries more concise.
The application code that manipulates the database objects in the abstraction layer also turns out much shorter.
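The nesting of attribute-value pairs into a single field of the entity row can be illustrated in plain Python. This sketch does not talk to PostgreSQL: it only shows how EAV rows might be folded into one JSON document per entity, which is the kind of text one could then store in a JSONB column. The function name and the sample attributes are assumptions.

```python
import json
from collections import defaultdict


def fold_into_documents(eav_rows):
    """Collapse (entity, attribute, value) rows into one JSON document per entity."""
    documents = defaultdict(dict)
    for entity, attribute, value in eav_rows:
        documents[entity][attribute] = value
    # sort_keys gives a stable, comparable text form for each document.
    return {
        entity: json.dumps(attrs, sort_keys=True)
        for entity, attrs in documents.items()
    }


# Hypothetical product attributes stored in EAV form: three rows, two entities.
eav_rows = [
    (1, "color", "red"),
    (1, "weight_kg", 2.5),
    (2, "color", "blue"),
]

docs = fold_into_documents(eav_rows)
```

Three EAV rows become two entity rows, each carrying only the attributes that actually apply to it.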
SQL Server 2008 and later: Sparse columns
Microsoft SQL Server 2008 offers an alternative to EAV. Columns with an atomic data type can be designated as sparse simply by including the word SPARSE in the column definition of the CREATE TABLE statement. Sparse columns optimize the storage of NULL values and are useful when the majority of records in a table will have NULL values for that column. Indexes on sparse columns are also optimized: only those rows with values are indexed. In addition, the contents of all sparse columns in a particular row of a table can be collectively aggregated into a single XML column (a column set), whose contents are of the form <column-name>column contents</column-name>, repeated for each sparse column that holds data.
In fact, if a column set is defined for a table as part of a CREATE TABLE statement, all sparse columns subsequently defined are typically added to it. This has the interesting consequence that the SQL statement SELECT * FROM table_name will not return the individual sparse columns, but will concatenate all of them into a single XML column whose name is that of the column set. Sparse columns are convenient for business applications such as product information, where the applicable attributes can be highly variable depending on the product type, but where the total number of variable attributes per product type is relatively modest.
Limitations of Sparse Attributes
However, this approach to modeling sparse attributes has several limitations, and rival DBMSs have notably chosen not to borrow the idea for their own engines. Limitations include:
- The maximum number of sparse columns in a table is 10,000, which may fall short for some implementations, such as for storing clinical data, where the possible number of attributes is one order of magnitude larger. Therefore, this is not a solution for modeling all possible clinical attributes for a patient.
- Addition of new attributes, one of the primary reasons an EAV model might be sought, still requires a DBA. Further, the problem of building a user interface to sparse attribute data is not addressed: only the storage mechanism is streamlined.
- Applications can be written to dynamically add and remove sparse columns from a table at run-time: in contrast, an attempt to perform such an action in a multi-user scenario where other users/processes are still using the table would be prevented for tables without sparse columns. However, while this capability offers power and flexibility, it invites abuse, and should be used judiciously and infrequently.
  - It can result in significant performance penalties, in part because any compiled query plans that use this table are automatically invalidated.
  - Dynamic column addition or removal is an operation that should be audited, because column removal can cause data loss: allowing an application to modify a table without maintaining some kind of a trail, including a justification for the action, is not good software practice.
- SQL constraints cannot be applied to sparse columns. The only check that is applied is for correct data type. Constraints would have to be implemented in metadata tables and middle-tier code, as is done in production EAV systems.
- SQL Server has limitations on row size if attempting to change the storage format of a column: the total contents of all atomic-datatype columns, sparse and non-sparse, in a row that contain data cannot exceed 8016 bytes if that table contains a sparse column for the data to be automatically copied over.
- Sparse columns that happen to contain data have a storage overhead of 4 bytes per column in addition to storage for the data type itself. This reduces the amount of sparse-column data that can be associated with a given row. This size restriction is relaxed for the varchar data type, which means that, if one hits row-size limits in a production system, one has to work around it by designating sparse columns as varchar even though they may have a different intrinsic data type. Unfortunately, this approach subverts server-side data-type checking.
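The row-size figures quoted in the list above lend themselves to a back-of-the-envelope calculation. The sketch below uses only those two numbers (8016 bytes per row, 4 bytes of overhead per populated sparse column) and ignores any other per-row overhead, so it is a rough upper bound rather than an exact account of the server's storage format; the function name and parameters are assumptions.

```python
ROW_LIMIT_BYTES = 8016      # row-size limit quoted in the text
SPARSE_OVERHEAD_BYTES = 4   # per populated sparse column, quoted in the text


def max_populated_sparse_columns(data_bytes_per_column, other_bytes=0):
    """Rough upper bound on populated sparse columns of one size in a single row.

    other_bytes is the space already taken by non-sparse columns in the row.
    """
    per_column = data_bytes_per_column + SPARSE_OVERHEAD_BYTES
    return (ROW_LIMIT_BYTES - other_bytes) // per_column


four_byte_columns = max_populated_sparse_columns(4)    # e.g. 4-byte integers
eight_byte_columns = max_populated_sparse_columns(8)   # e.g. 8-byte values
```

Under these assumptions, a row could hold at most 1002 populated 4-byte sparse columns or 668 populated 8-byte ones, far fewer than the number of sparse columns a table may declare.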
Cloud computing offerings
Google lets you operate on the data using a subset of SQL; Microsoft offers a URL-based querying syntax that is abstracted via a LINQ provider; Amazon offers a more limited syntax. Of concern, built-in support for combining different entities through joins is currently non-existent in all three engines. Such operations have to be performed by application code. This may not be a concern if the application servers are co-located with the data servers at the vendor's data center, but a lot of network traffic would be generated if the two were geographically separated.
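A join performed in application code, as described above, might look like the following hash join in Python. The entity shapes (`patients`, `visits`) and the function name are hypothetical; in practice each list would be fetched from the vendor's data store with a separate query, which is where the network traffic arises.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Inner-join two lists of dict-shaped entities on a shared key."""
    # Build a hash index over the right-hand entities.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    # Probe the index once per left-hand entity.
    joined = []
    for left_row in left:
        for right_row in index.get(left_row[key], []):
            joined.append({**left_row, **right_row})
    return joined


# Hypothetical entities, each set retrieved by its own query.
patients = [
    {"patient_id": 1, "name": "Ann"},
    {"patient_id": 2, "name": "Bob"},
]
visits = [
    {"patient_id": 1, "visit": "2015-01-05"},
    {"patient_id": 1, "visit": "2015-02-10"},
    {"patient_id": 3, "visit": "2015-03-01"},
]

joined = hash_join(patients, visits, "patient_id")
```

Both entity sets must be transferred to the application before they can be combined, so the cost grows with the size of the data rather than the size of the result.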
An EAV approach is justified only when the attributes that are being modeled are numerous and sparse: if the data being captured does not meet this requirement, the cloud vendors' default EAV approach is often a mismatch for applications that require a true back-end database. Retrofitting the vast majority of existing database applications, which use a traditional data-modeling approach, to an EAV-type cloud architecture, would require major surgery. Microsoft discovered, for example, that its database-application-developer base was largely reluctant to invest such effort. More recently, therefore, Microsoft has provided a premium offering – a cloud-accessible full-fledged relational engine, SQL Server Azure, which allows porting of existing database applications with modest changes.
One limitation of SQL Azure is that physical databases are limited to 500 GB in size as of 2015. Microsoft recommends that data sets larger than this be split into multiple physical databases and accessed with parallel queries.
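The pattern of splitting a data set across several databases and querying them in parallel can be sketched as follows. In this illustration the shards are plain in-memory lists standing in for physical databases, and the function names and sample rows are assumptions; it shows only the fan-out and merge steps, not any vendor-specific API.

```python
from concurrent.futures import ThreadPoolExecutor


def query_shard(shard, predicate):
    """Stand-in for running one query against one physical database."""
    return [row for row in shard if predicate(row)]


def query_all_shards(shards, predicate):
    """Run the same query on every shard in parallel and merge the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(shards))) as pool:
        partial_results = pool.map(lambda shard: query_shard(shard, predicate), shards)
        # pool.map yields results in shard order, so the merge is deterministic.
        return [row for part in partial_results for row in part]


# Hypothetical data set split across two shards by entity ID.
shards = [
    [{"id": 1, "value": 10}, {"id": 2, "value": 55}],
    [{"id": 3, "value": 70}, {"id": 4, "value": 5}],
]

large_values = query_all_shards(shards, lambda row: row["value"] > 50)
```

The application, not the database engine, is responsible for routing the query to every shard and combining the partial results.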