Data storage has come a long way. Here's a quick run through how we got to where we are today.
Computers have used various forms of external data storage since they were invented, from punched cards and tapes, through magnetic tapes, diskettes and optical disks, to today’s solid-state disks and modern hard drives. As the capacity and performance of storage have increased, so have the data processing requirements. Systems now need to process much more than the data created by a single application, and the challenge has become how to store data as efficiently and effectively as possible.
Evolutionary changes have defined the path data storage has taken, but what might next-generation data architecture look like, and how should we prepare to make the most of future developments?
Let’s make a quick excursion through the data storage evolution.
Data storage originally focused only on the needs of a single isolated application. Since the storage capacity was small, and storage was expensive, the method of storing data needed to be efficient and effective.
Each individual application stored just the necessary input and output data in the format and the structure most suited to that particular application’s needs.
For example, a business’s complete monthly wage processing was handled centrally by a dedicated mainframe program. Neither the input nor the output data was shared with any other programs or tasks.
As the performance and storage capacity of computers increased, modern computers began to be involved in more applications and business processes. Common data started to be reused among multiple applications, and natural groups of data and applications began to be formed within organizations.
These groups centered either on common functional agendas or business units in the organization. Even though the data could be shared among multiple applications within the same group, each group still remained isolated from the others. These are data silos.
Data silos had a negative impact on data integrity, making it virtually impossible to get a ‘single point of truth’. They also hurt overall corporate productivity, causing lots of repetitive tasks and wasted effort.
There were multiple approaches for how to integrate and share application data beyond the silo limits. The natural solution lay in the creation of a centralized repository combining, integrating and consolidating the data from the multiple silos. Such a central repository can mediate all data for combined views and provide a single version of the truth.
The straightforward approach is to build an overall enterprise-wide repository with strong normalization in order to avoid data duplication and wasted storage space. This approach is the basis for the top-down design of the data warehouse (DWH) as introduced by Bill Inmon.
Building such a huge data store is a challenging task, and in large corporations it can be doomed by the high cost involved. Another issue with establishing a data warehouse comes from the need to validate the data so that it fits all business constraints. Discarding invalid data that doesn’t fit the model means losing information that could be useful in the future.
Dan Linstedt introduced a major improvement for structured central repositories, known as data vault modelling. A single version of a fact is stored rather than the single version of the truth. This means that even facts not conforming to the single version of the truth are preserved, unlike in a data warehouse.
The storage model is based on a hub and spoke architecture. The business keys and the associations within the source data are separated into hub tables (holding entity business keys) and link tables (the spokes, holding associations among business keys).
These tables contain no descriptive attributes, facts or temporal attributes, only the business keys, the load data source and audit information. The structure of the relationships among entities is thus separated from their details, which are stored in satellite tables linked to a particular hub or link table.
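As a rough illustration of the separation described above, the following Python sketch models the three kinds of tables as simple record types. All class, table and field names (HubRow, LinkRow, SatelliteRow, hub_customer, the "crm" and "billing" sources and so on) are hypothetical and chosen only for this example; a real data vault would live in a database and carry more audit columns. The point it shows is that two conflicting versions of a fact from different sources are both kept.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HubRow:
    business_key: str      # e.g. a customer number
    load_date: date        # audit information
    record_source: str     # which system the key came from


@dataclass(frozen=True)
class LinkRow:
    # Association ("spoke") table row; Data Vault literature calls it a link.
    hub_keys: tuple        # business keys of the associated hub rows
    load_date: date
    record_source: str


@dataclass(frozen=True)
class SatelliteRow:
    parent_key: object     # business key of a hub row, or hub_keys of a link row
    load_date: date
    record_source: str
    attributes: tuple      # descriptive attributes as (name, value) pairs


# Structure only: keys and associations, no descriptive details.
hub_customer = [HubRow("C-1001", date(2020, 1, 5), "crm")]
hub_order = [HubRow("O-77", date(2020, 1, 6), "shop")]
link_customer_order = [LinkRow(("C-1001", "O-77"), date(2020, 1, 6), "shop")]

# Details: two sources disagree about the customer's city; both facts are stored.
sat_customer = [
    SatelliteRow("C-1001", date(2020, 1, 5), "crm", (("city", "Prague"),)),
    SatelliteRow("C-1001", date(2020, 1, 7), "billing", (("city", "Brno"),)),
]


def facts_for(key, satellite):
    """Return every recorded version of the attributes for one business key."""
    return [(row.record_source, dict(row.attributes))
            for row in satellite if row.parent_key == key]


versions = facts_for("C-1001", sat_customer)
```

Deciding which of the stored versions counts as "the truth" is left to downstream processing, which is what lets the vault keep non-conforming facts.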
This is a very strong design but even more difficult and expensive to build.
Normalized data warehouses and data vault storages are not always a good fit for presenting data for analysis, reporting and BI.
To overcome this problem, the concept of the data mart was introduced. The data mart is focused on a single business process or functional area, combining data taken either from the data warehouse or data vault storage and presenting it in a dimensional model that is efficient for reporting and analytical query purposes (OLAP processing).
Typically, there are multiple data marts within an organization.
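To make the dimensional model more concrete, here is a minimal sketch of a single data mart laid out as a star schema, using Python's built-in sqlite3 module with an in-memory database. The table and column names (fact_sales, dim_product, dim_date) and the sample rows are assumptions made up for illustration; they do not come from any particular product or project.

```python
import sqlite3

# In-memory database standing in for a small sales data mart.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'books'), (2, 'games');
INSERT INTO dim_date VALUES (20200101, 2020, 1), (20200201, 2020, 2);
INSERT INTO fact_sales VALUES
    (1, 20200101, 10.0), (1, 20200201, 15.0), (2, 20200101, 40.0);
""")

# A typical analytical query: aggregate the facts along one dimension.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

The fact table holds only measures and foreign keys, so reporting queries reduce to a join to the relevant dimensions followed by a GROUP BY.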
As already mentioned, both normalized data warehouses and data vault storages are hard to build.
As the IT world moved towards more agile environments, business users began calling for the immediate availability of integrated data. Ralph Kimball offered a simpler answer called bottom-up design.
The idea behind this approach was: ‘let’s not bother with the overall normalized enterprise model first, and go directly to designing data marts based on the source data!’ This brilliant idea decreased the cost of building centralized data storage, as the data can be provided to the business incrementally.
There are some drawbacks, as the incrementally deployed data marts must somehow be conformed.
The solution is based on conformed dimensions and facts and on the data warehouse bus matrix approach. The major disadvantage compared to the older approaches is that historical data from areas not covered by the currently deployed data marts might be lost because of retention limits in the source systems.
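The role of conformed dimensions can be sketched in a few lines of Python. The example assumes two hypothetical data marts, one for sales and one for returns, deployed at different times but sharing the same product dimension; all names and numbers are invented for illustration.

```python
# Shared (conformed) dimension: both marts use the same keys and attributes.
dim_product = {1: {"category": "books"}, 2: {"category": "games"}}

# Two independently deployed data marts, each with its own fact rows.
sales_facts = [(1, 10.0), (1, 15.0), (2, 40.0)]   # (product_id, sold amount)
returns_facts = [(1, 5.0), (2, 8.0)]              # (product_id, refunded amount)


def total_by_category(facts, dimension):
    """Aggregate fact amounts by the category attribute of the shared dimension."""
    totals = {}
    for product_id, amount in facts:
        category = dimension[product_id]["category"]
        totals[category] = totals.get(category, 0.0) + amount
    return totals


sales = total_by_category(sales_facts, dim_product)
returns = total_by_category(returns_facts, dim_product)

# Because both marts agree on the dimension, their results can be combined.
net_by_category = {
    category: sales.get(category, 0.0) - returns.get(category, 0.0)
    for category in sorted(set(sales) | set(returns))
}
```

If each mart had defined its own product keys or category names, the final step would need a mapping between them, which is the conformance problem the bus matrix is meant to prevent.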
The dawn of big data technologies and virtually unlimited storage enabled the birth of an alternate approach called the data lake.
A data lake stores data in its natural format in a single repository, no matter whether the data is structured (tables, indexed files), semi-structured (XML, JSON, text files) or even unstructured (documents, emails, audio and video clips, photos or document scans).
No source data is lost to validation and cleansing, even though only a small part of the data is usually needed for analytical and reporting purposes. The data can later be analysed directly using various tools (e.g. Apache Hadoop MapReduce, Pig, Hive, Spark or Flink), or transformed into data marts. Nevertheless, adequate data management is necessary to govern a data lake. Even though the data lake approach preserves all data, it suffers from the lack of structure and of a single point of truth.
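The following Python sketch illustrates the idea of keeping data in its natural format and applying structure only when it is read. A plain dictionary stands in for the lake's storage, and the paths, file contents and function names (ingest, read_customers) are hypothetical; a real data lake would use a distributed file system or object store and the tools listed above.

```python
import csv
import io
import json

# A dict stands in for the lake's storage; keys are hypothetical paths.
lake = {}


def ingest(path, raw_bytes):
    """Store the payload exactly as received: no validation, no cleansing."""
    lake[path] = raw_bytes


# Structured, semi-structured and unstructured data side by side.
ingest("crm/customers.csv", b"id,city\n1,Prague\n2,\n")
ingest("shop/orders.json", b'[{"id": 7, "customer": 1, "total": "n/a"}]')
ingest("scans/invoice-001.png", b"\x89PNG placeholder bytes")


def read_customers():
    """Apply structure only at read time, when the data is actually needed."""
    text = lake["crm/customers.csv"].decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


customers = read_customers()
orders = json.loads(lake["shop/orders.json"])
```

Note that the incomplete customer row and the non-numeric order total are preserved rather than rejected, which is both the strength of the approach and the reason governance is still needed.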
A step further along the data storage evolution brings us to an approach that can combine the strongest features of all historical approaches while eliminating the downsides. It can be named the central discovery storage and employs a modern document NoSQL database.
Data from source systems is collected and, after some optional transformations, loaded into document database collections for permanent storage. This approach ensures that no historical data is lost, while some structure can be introduced by the ETL (extract-transform-load) process.
Additional manual and automated ETL jobs can be used for further data transformations within the permanent storage to extend the structure in the existing document collection, introducing new attributes and creating new document collections. This can be done incrementally, as all original and derived data is permanently stored.
Finally, the data from the permanent storage collections can be transformed by additional ETL jobs into document collections that serve as data marts.
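The three stages described above (load, incremental enrichment, final transformation) can be sketched as follows. Plain Python lists of dictionaries stand in for document database collections, and the collection names, field names and sample orders are assumptions made for illustration only. Treating the final collection as a reporting data mart is also an assumption of this sketch.

```python
import copy

# Plain lists of dicts stand in for document database collections.
storage = {"orders_raw": [], "orders_enriched": [], "mart_revenue_by_city": []}


def load(source_rows):
    """Initial ETL job: store source records permanently, unmodified."""
    storage["orders_raw"].extend(copy.deepcopy(source_rows))


def enrich():
    """Later ETL job: derive a new collection with an added attribute."""
    storage["orders_enriched"] = [
        {**doc, "total": doc["quantity"] * doc["unit_price"]}
        for doc in storage["orders_raw"]
    ]


def build_mart():
    """Final ETL job: aggregate the enriched documents for reporting."""
    totals = {}
    for doc in storage["orders_enriched"]:
        totals[doc["city"]] = totals.get(doc["city"], 0) + doc["total"]
    storage["mart_revenue_by_city"] = [
        {"city": city, "revenue": revenue}
        for city, revenue in sorted(totals.items())
    ]


load([
    {"order": 1, "city": "Prague", "quantity": 2, "unit_price": 10},
    {"order": 2, "city": "Brno", "quantity": 1, "unit_price": 30},
    {"order": 3, "city": "Prague", "quantity": 3, "unit_price": 5},
])
enrich()
build_mart()
```

Because the raw collection is never modified, the enrichment and mart-building jobs can be changed and rerun later without going back to the source systems.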
Compared to the historical approaches, the central discovery storage method retains all historical data without losing any records, lets the structure of the stored data be built incrementally as needed, and allows data marts for business users to be deployed in an agile way. Finally, because ETL transformations are performed in multiple places, the overall storage can be as effective and efficient as possible.