Data transformation

In computing, data transformation is the process of converting data from one format or structure into another. It is a fundamental aspect of most data integration and data management tasks, such as data wrangling, data warehousing and application integration.
Data transformation can be simple or complex based on the required changes to the data between the source data and the target data. Data transformation is typically performed via a mixture of manual and automated steps. Tools and technologies used for data transformation can vary widely based on the format, structure, complexity, and volume of the data being transformed.
A master data recast is another form of data transformation, in which the entire database of data values is transformed or recast without extracting the data from the database. All data in a well-designed database is directly or indirectly related to a limited set of master database tables by a network of foreign key constraints. Each foreign key constraint depends on a unique database index on the parent database table. Therefore, when the proper master database table is recast with a different unique index, the directly and indirectly related data are also recast or restated. The related data can still be viewed in its original form, since the original unique index still exists with the master data. The recast must also be done in a way that does not affect the applications' architecture or software.
When the data mapping is indirect via a mediating data model, the process is also called data mediation.

Data Transformation Process

Data transformation can be divided into the following steps, each applicable as needed based on the complexity of the transformation required.

Data discovery
Data mapping
Code generation
Code execution
Data review

These steps are often the focus of developers or technical data analysts who may use multiple specialized tools to perform their tasks.
The steps can be described as follows:
Data discovery is the first step in the data transformation process. Typically, the data is profiled with profiling tools, or sometimes with manually written profiling scripts, to better understand its structure and characteristics and to decide how it needs to be transformed.
Data mapping is the process of defining how individual fields are mapped, modified, joined, filtered, aggregated, etc. to produce the final desired output. Developers or technical data analysts traditionally perform data mapping, since they work in the specific technologies used to define the transformation rules.
Code generation is the process of generating executable code that will transform the data based on the desired and defined data mapping rules. Typically, the data transformation technologies generate this code based on the definitions or metadata defined by the developers.
Code execution is the step whereby the generated code is executed against the data to create the desired output. The executed code may be tightly integrated into the transformation tool, or it may require separate steps by the developer to manually execute the generated code.
Data review is the final step in the process, which focuses on ensuring the output data meets the transformation requirements. It is typically the business user or final end-user of the data who performs this step. Any anomalies or errors found in the data are communicated back to the developer or data analyst as new requirements to be implemented in the transformation process.
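The five steps can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample records, the field names, the `mapping` rule format and the helper functions `profile`, `generate` and `review` are hypothetical and do not correspond to any particular tool.

```python
from collections import Counter

# Hypothetical source records; field names and values are illustrative only.
source_rows = [
    {"id": "1", "name": " Ada ", "amount": "10.50"},
    {"id": "2", "name": "Grace", "amount": "7"},
    {"id": "3", "name": "", "amount": "n/a"},
]

def profile(rows):
    """Data discovery: count total, empty and numeric values per field."""
    stats = {}
    for row in rows:
        for field, value in row.items():
            counts = stats.setdefault(field, Counter())
            counts["total"] += 1
            if not value.strip():
                counts["empty"] += 1
            try:
                float(value)
                counts["numeric"] += 1
            except ValueError:
                pass
    return stats

def to_amount(text):
    """Convert text to a float, or None when it is not a number."""
    try:
        return float(text)
    except ValueError:
        return None

# Data mapping: target field -> (source field, converter).
mapping = {
    "customer_id": ("id", int),
    "customer_name": ("name", str.strip),
    "amount_usd": ("amount", to_amount),
}

def generate(rules):
    """Code generation: turn the mapping rules into an executable function."""
    def transform(row):
        return {target: convert(row[source])
                for target, (source, convert) in rules.items()}
    return transform

def review(rows):
    """Data review: collect output rows that violate the requirements."""
    return [r for r in rows
            if not r["customer_name"] or r["amount_usd"] is None]

stats = profile(source_rows)                       # data discovery
transform = generate(mapping)                      # code generation
output_rows = [transform(r) for r in source_rows]  # code execution
rejected = review(output_rows)                     # data review
```

In this sketch, the rows collected in `rejected` play the role of the anomalies that are reported back as new requirements.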

Types of Data Transformation

Batch Data Transformation

Traditionally, data transformation has been a bulk or batch process, whereby developers write code or implement transformation rules in a data integration tool, and then execute that code or those rules on large volumes of data. This process can follow the linear set of steps as described in the data transformation process above.
Batch data transformation is the cornerstone of virtually all data integration technologies such as data warehousing, data migration and application integration.
When data must be transformed and delivered with low latency, the term “microbatch” is often used. This refers to small batches of data that can be processed very quickly and delivered to the target system when needed.
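A rough sketch of the microbatch idea, assuming an in-memory list stands in for the incoming data and that `micro_batches` and `transform` are hypothetical helper names:

```python
from itertools import islice

def micro_batches(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def transform(batch):
    # Illustrative transformation: normalize names to upper case.
    return [name.upper() for name in batch]

incoming = ["ada", "grace", "alan", "edsger", "barbara"]

# Each small batch is transformed and could be delivered as soon as it is ready.
delivered = [transform(batch) for batch in micro_batches(incoming, 2)]
```

Smaller values of `batch_size` reduce the delay before each record is delivered, at the cost of more transformation runs.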

Benefits of Batch Data Transformation

Traditional data transformation processes have served companies well for decades. The various tools and technologies have matured and most enterprises transform enormous volumes of data that feed internal and external applications, data warehouses and other data stores.

Limitations of Traditional Data Transformation

This traditional process also has limitations that hamper its overall efficiency and effectiveness.
The people who need to use the data do not play a direct role in the data transformation process. Typically, users hand over the data transformation task to developers who have the necessary coding or technical skills to define the transformations and execute them on the data.
This process leaves the bulk of the work of defining the required transformations to the developer. The developer interprets the business user requirements and implements the related code/logic. This has the potential of introducing errors into the process, and also increases the time to arrive at a solution.
This problem has given rise to the need for agility and self-service in data integration.
There are companies that provide self-service data transformation tools. They are aiming to efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists. While these companies use traditional batch transformation, their tools enable more interactivity for users through visual platforms and easily repeated scripts.

Interactive Data Transformation

Interactive data transformation (IDT) is an emerging capability that allows business analysts and business users to interact directly with large datasets through a visual interface, understand the characteristics of the data, and change or correct the data through simple interactions such as clicking or selecting certain elements of the data.
Although IDT follows the same data integration process steps as batch data integration, the key difference is that the steps are not necessarily followed in a linear fashion and typically don't require significant technical skills for completion.
A number of companies, primarily start-ups such as Trifacta, Alteryx and Paxata, provide interactive data transformation tools. They are aiming to efficiently analyze, map and transform large volumes of data without the technical and process complexity that currently exists.
IDT solutions provide an integrated visual interface that combines the previously disparate steps of data analysis, data mapping, code generation and execution, and data inspection. IDT interfaces incorporate visualization to show the user patterns and anomalies in the data so they can identify erroneous or outlying values.
Once users have finished transforming the data, the system can generate executable code or logic, which can be executed or applied to subsequent similar data sets.
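One way to picture this reuse is as a recorded list of steps that is replayed on later data. The sketch below is only an assumption about how such a recording might look; `drop_empty`, `rename`, `recipe` and `apply_recipe` are hypothetical names, not the output of any real IDT product.

```python
def drop_empty(field):
    """Step factory: remove rows whose value for `field` is missing or empty."""
    return lambda rows: [r for r in rows if r.get(field)]

def rename(old, new):
    """Step factory: rename the field `old` to `new` in every row."""
    def step(rows):
        return [{(new if key == old else key): value for key, value in r.items()}
                for r in rows]
    return step

# Steps captured from the user's interactions, stored in order.
recipe = [drop_empty("email"), rename("email", "contact_email")]

def apply_recipe(steps, rows):
    """Replay the recorded steps on a data set."""
    for step in steps:
        rows = step(rows)
    return rows

january = [{"email": "a@example.com"}, {"email": ""}]
february = [{"email": "b@example.com"}]

january_clean = apply_recipe(recipe, january)
february_clean = apply_recipe(recipe, february)  # same logic, later data set
```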
By removing the developer from the process, IDT systems shorten the time needed to prepare and transform the data, eliminate costly errors in interpretation of user requirements and empower business users and analysts to control their data and interact with it as needed.

Transformational languages

There are numerous languages available for performing data transformation, varying in their accessibility and general usefulness. Many transformation languages require a grammar to be provided. In many cases, the grammar is structured using something closely resembling Backus–Naur form. Examples of such languages include:

AWK - one of the oldest and most popular textual data transformation languages;
Perl - a high-level language with both procedural and object-oriented syntax, capable of powerful operations on binary or text data;
Template languages - specialized to transform data into documents;
TXL - a language for prototyping language-based descriptions, used for source code or data transformation;
XSLT - the standard XML data transformation language.

Additionally, companies such as Trifacta and Paxata have developed domain-specific languages (DSLs) for servicing and transforming datasets. The development of domain-specific languages has been linked to increased productivity and accessibility for non-technical users. Trifacta's “Wrangle” is an example of such a domain-specific language.
Another advantage of the recent DSL trend is that a DSL abstracts the underlying execution of the logic it defines, so the same logic can be run on various processing engines, such as Spark, MapReduce, and Dataflow. With a DSL, the transformation language is not tied to the engine.
Although transformational languages are typically best suited for transformation, something as simple as regular expressions can be used to achieve useful transformation. Text editors such as vim, emacs or TextPad support the use of regular expressions with arguments. This allows all instances of a particular pattern to be replaced with another pattern using parts of the original pattern. For example:
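The separation between language and engine can be sketched with a toy example. The step format and both back ends below are hypothetical: one interprets the steps on in-memory rows, the other compiles the same steps to SQL text. They are stand-ins for real engines and do not use any Spark, MapReduce or Dataflow API.

```python
# A hypothetical mini-DSL: each step is plain data, not engine-specific code.
steps = [
    ("filter_gt", "amount", 5),
    ("select", ["id", "amount"]),
]

def run_in_memory(dsl_steps, rows):
    """Back end 1: interpret the steps directly on a list of dictionaries."""
    for step in dsl_steps:
        if step[0] == "filter_gt":
            _, field, limit = step
            rows = [r for r in rows if r[field] > limit]
        elif step[0] == "select":
            _, fields = step
            rows = [{f: r[f] for f in fields} for r in rows]
        else:
            raise ValueError(f"unknown step: {step[0]}")
    return rows

def compile_to_sql(dsl_steps, table):
    """Back end 2: translate the same steps into a SQL query string."""
    columns, conditions = "*", []
    for step in dsl_steps:
        if step[0] == "filter_gt":
            conditions.append(f"{step[1]} > {step[2]}")
        elif step[0] == "select":
            columns = ", ".join(step[1])
        else:
            raise ValueError(f"unknown step: {step[0]}")
    sql = f"SELECT {columns} FROM {table}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql

orders = [
    {"id": 1, "amount": 3, "note": "small"},
    {"id": 2, "amount": 9, "note": "large"},
]
in_memory_result = run_in_memory(steps, orders)
sql_text = compile_to_sql(steps, "orders")
```

The sketch assumes every filter refers to a column that is still present when the filter runs; a fuller DSL would need to validate that before handing the logic to either back end.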


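foo("some string", 42, gCommon);
bar(someObj, anotherObj);
foo("another string", 24, gCommon);
bar(myObj, myOtherObj);

could both be transformed into a more compact form like:
foobar("some string", 42, someObj, anotherObj);
foobar("another string", 24, myObj, myOtherObj);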
In other words, every invocation of foo with three arguments that is followed by an invocation of bar with two arguments would be replaced with a single function invocation using some or all of the original set of arguments.
Another advantage of using regular expressions is that they will not fail the null transform test. The test is to take a sample program and, using your transformational language of choice, run it through a transformation that performs no changes; the output should be identical to the input. Many transformational languages will fail this test.
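The same replacement can be sketched with Python's standard `re` module. The argument values in `source` are illustrative, and the pattern is a simplification: it assumes that no argument contains a comma or a parenthesis.

```python
import re

# Illustrative input: foo takes three arguments, bar takes two.
source = (
    'foo("some string", 42, gCommon);\n'
    'bar(someObj, anotherObj);\n'
    'foo("another string", 24, gCommon);\n'
    'bar(myObj, myOtherObj);\n'
)

# Capture the first two arguments of foo and both arguments of bar;
# the third argument of foo is matched but not kept.
pattern = re.compile(
    r'foo\(([^,()]+),\s*([^,()]+),\s*[^,()]+\);\s*'
    r'bar\(([^,()]+),\s*([^,()]+)\);'
)

# \1 .. \4 refer back to the captured arguments.
result = pattern.sub(r'foobar(\1, \2, \3, \4);', source)
```

Each foo/bar pair is collapsed into a single `foobar` call built from the captured groups, while text that does not match the pattern is left untouched.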
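A small sketch of the null transform test follows. As an assumption made for illustration, a JSON parse-and-reserialize round trip stands in for a transformation tool that rebuilds its output from a parsed model; it is not one of the languages listed above.

```python
import json
import re

# Input with irregular spacing and a line break.
original = '{ "b":1,\n  "a" : [1,2] }\n'

# Null transform with a regular expression: the pattern matches nothing,
# so the text passes through unchanged.
regex_output = re.sub(r"pattern_that_never_matches", "", original)

# Null transform by parsing and re-serializing: no value changes,
# but the original layout is not preserved.
round_trip_output = json.dumps(json.loads(original))

regex_passes = regex_output == original
round_trip_passes = round_trip_output == original
```

Here `regex_passes` is true and `round_trip_passes` is false: the data is equivalent after the round trip, but the text is no longer identical to the input.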
