Modern organisations, whether for-profit businesses or non-profit entities, acquire huge amounts of data from their users. This data may be created automatically from online interactions or inputted manually by employees based on face-to-face interactions with customers and clients.
Indeed, big data is now big business. For some companies, buying and selling data has become the primary goal and source of revenue. Whether the aim is to sell stored data to the highest bidder or to gain insight into a user’s journey across every channel and touchpoint in order to optimise an omnichannel strategy, it’s clear that storing, processing and using data effectively is the new frontier for tech companies.
Data needs to be stored somewhere safe, but it also needs to be accessible to authorised personnel for extraction and analysis. Data storage solutions come in many forms. Two of the most widely used data architectures alongside Customer Data Platforms are data lakes and data warehouses.
A data lake is a place to store enormous amounts of data in its raw, unprocessed and unorganised form. A data lake can run on an organisation’s own physical hardware or, as is more often the case, on cloud storage. If we imagine every data point as a drop of water, this gives a good idea of how big a data lake is, and how unstructured it is at the same time.
A data lake is a good data management system for large organisations that regularly receive many data streams in different structural forms, and that need a place to store the data until it can be sorted and the useful parts siphoned into a separate section. Data lakes also suit data scientists and data engineers who use huge, deep datasets for data mining.
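To make the idea of storing data "as it arrives" more concrete, here is a minimal sketch in Python. It assumes a local folder stands in for the lake, and the folder layout, the function names and the sample events are all hypothetical. Real data lakes usually sit on cloud object storage, but the principle of writing records unchanged, whatever their shape, is the same.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def land_raw_event(lake_root: Path, source: str, event: dict, day: date) -> Path:
    """Append one event, unmodified, to a per-source, per-day file in the lake."""
    # Hypothetical layout: <lake_root>/<source>/<YYYY-MM-DD>.jsonl
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{day.isoformat()}.jsonl"
    with target.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(event) + "\n")
    return target


def read_raw_events(lake_root: Path, source: str) -> list:
    """Read back every raw event stored for one source."""
    events = []
    for path in sorted((lake_root / source).glob("*.jsonl")):
        with path.open(encoding="utf-8") as fh:
            events.extend(json.loads(line) for line in fh if line.strip())
    return events


def demo() -> dict:
    """Land two differently shaped events and read them back untouched."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        day = date(2024, 1, 15)
        # No schema is enforced: each source keeps its own fields and types.
        land_raw_event(lake, "web", {"user": "u1", "page": "/pricing"}, day)
        land_raw_event(
            lake,
            "store",
            {"customer_email": "a@example.com", "till": 4, "total": "19.99"},
            day,
        )
        return {
            "web": read_raw_events(lake, "web"),
            "store": read_raw_events(lake, "store"),
        }
```

Note that nothing is cleaned or validated on the way in: the store total stays a string. That is the trade-off of a lake. Writing is cheap and flexible, while the work of making sense of the data is deferred to whoever reads it.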
The most obvious disadvantage of using a data lake is that it can easily lead organisations to accumulate and store massive amounts of information that is not necessary, will never be used and only takes up valuable storage space. Another drawback of data lakes is the difficulty of finding and extracting the dataset you need at any moment, like trying to recover one specific H2O molecule from a lake.
For this reason, data lakes are rarely used in isolation. They are normally coupled with a more organised data management system…
A data warehouse stores filtered data that has already been processed for a specific purpose, and it keeps that data in a structured form. Data warehouses are mostly used by medium- to large-sized enterprises, and can be used with or without a data lake.
The advantage of using a data warehouse is that it makes historical data easily accessible. Each dataset or separate database can be siloed in its own private compartment and shared with the people who are granted access to it, thereby increasing security.
The downside of data warehouses is that it takes longer to set up the system and load the data into the correct place initially. However, most organisations find that the speed of retrieving the data when it’s needed more than offsets this. A warehouse is also more expensive, because a more complicated system with more parts requires more expert staff to maintain it.
The difference between a data lake and a data warehouse is that a data lake is used to store large amounts of unrefined data, whereas a data warehouse stores processed data in a more organised way. They are not mutually exclusive, and can often work together for different use cases.
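As a rough illustration of how the two can work together, the sketch below takes a few raw records of the kind a lake might hold and loads only the usable ones into a structured table. It uses Python's built-in sqlite3 module as a stand-in for a warehouse. The table layout, field names and sample records are assumptions made for the example, not a description of any particular product.

```python
import sqlite3

# Hypothetical raw records as they might sit in a data lake:
# inconsistent formatting, numbers stored as text, and one malformed entry.
RAW_ORDERS = [
    {"order_id": "1001", "email": "A@Example.com ", "total": "19.99", "channel": "web"},
    {"order_id": "1002", "email": "b@example.com", "total": "5.00", "channel": "store"},
    {"order_id": None, "note": "malformed record, stays in the lake only"},
]


def load_orders(conn: sqlite3.Connection, raw_records: list) -> int:
    """Clean raw records and insert them into a typed warehouse table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS orders (
               order_id    INTEGER PRIMARY KEY,
               email       TEXT    NOT NULL,
               total_cents INTEGER NOT NULL,
               channel     TEXT    NOT NULL
           )"""
    )
    loaded = 0
    for rec in raw_records:
        # Skip anything that cannot be turned into a valid order row.
        if not rec.get("order_id") or "total" not in rec or "email" not in rec:
            continue
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?, ?)",
            (
                int(rec["order_id"]),
                rec["email"].strip().lower(),        # normalise the email
                round(float(rec["total"]) * 100),    # store money as whole cents
                rec["channel"],
            ),
        )
        loaded += 1
    conn.commit()
    return loaded


def revenue_by_channel(conn: sqlite3.Connection) -> dict:
    """A typical warehouse question: total revenue (in cents) per channel."""
    rows = conn.execute(
        "SELECT channel, SUM(total_cents) FROM orders GROUP BY channel ORDER BY channel"
    ).fetchall()
    return dict(rows)
```

Calling load_orders on an in-memory connection loads two of the three records, after which revenue_by_channel can answer the question with a single query. The cleaning effort is paid once at load time, which is why retrieval from a warehouse is fast.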
While we’re on the subject, let’s quickly clear up some of the other terms associated with data storage systems (many of them aquatically themed).
Database: is used by almost all organisations. It’s a system for storing and querying structured data, usually on a smaller scale and for a narrower purpose than a data warehouse or a data lake, either of which can take in data from many databases. Think of an Excel spreadsheet as a simple analogy.
Data Ocean: is bigger than a data lake. While a data lake may only be used as a repository for data generated by, for example, your IT department, a data ocean collects and stores raw, unprocessed data from all parts of the organisation.
Data Pond: is smaller than a data lake. A data pond is a siloed subsection of a data lake that has been isolated for the sake of privacy, technological constraints, cost saving, or to aid in the processing and management of the data at a later date.
Data Swamp: is a problem that can arise from misusing a data lake. If the storage process or the data itself is not governed by a proper security policy, or if there is a lack of documentation about how the data is used and shared, the lake will turn into a stagnant marsh of questionable legality and insecure data.
Data Hub: is somewhere between a warehouse and a lake. Like a data lake, a data hub also stores raw data, but it’s not quite so unorganised.
Data Fabric: is a web of connections that joins up the data points in a data lake to make it more organised. In a similar way to the data hub, a data fabric tries to improve the data lake architecture so as to keep its benefits and mitigate its disadvantages.
Data Mart: is a part of a data warehouse. It may contain information and insights that relate only to a subsection of the company. For example, you might have a data mart for your sales team as a kind of silo to hold data specific only to them.
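The data mart idea can be sketched in a few lines. The example below is a simplification: it models the mart as a SQL view over a single warehouse table, using Python's built-in sqlite3 module, whereas in practice a mart is often a separate physical database. The table, view and department names are hypothetical.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory warehouse with a sales-only mart on top of it."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE warehouse_events (
            id         INTEGER PRIMARY KEY,
            department TEXT NOT NULL,
            metric     TEXT NOT NULL,
            value      REAL NOT NULL
        );
        INSERT INTO warehouse_events (department, metric, value) VALUES
            ('sales',   'deals_closed',     12),
            ('sales',   'pipeline_value',   48000),
            ('support', 'tickets_resolved', 230),
            ('finance', 'invoices_sent',    75);

        -- The "mart": exposes only the rows the sales team needs.
        CREATE VIEW sales_mart AS
            SELECT metric, value
            FROM warehouse_events
            WHERE department = 'sales';
        """
    )
    return conn


def sales_mart_rows(conn: sqlite3.Connection) -> list:
    """Query the mart without touching other departments' data."""
    return conn.execute(
        "SELECT metric, value FROM sales_mart ORDER BY metric"
    ).fetchall()
```

The sales team queries sales_mart and never sees the support or finance rows, which is the kind of silo described above.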
There are three main methods for designing data marts and data warehouses:
Star Schema: is the simplest data modelling method for constructing data marts and data warehouses. It has one central fact table with a number of dimension tables coming directly off it, which gives the layout its star shape. The star schema design is better suited to data marts than data warehouses.
Snowflake Schema: is slightly more complex than a star schema because each dimension table can be normalised into further sub-dimension tables. This makes it a better method for designing larger data warehouses than data marts.
Data Vault Modelling: is a modern, agile way of designing and building efficient, effective data warehouses. A data vault is a more scalable and intelligent design for a data warehouse than a snowflake schema because data auditing, tracing and adaptation to change are all much easier.
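To make the contrast between the first two methods concrete, here is a minimal sketch using Python's built-in sqlite3 module. The same small set of sales is modelled twice: once as a star schema, where the product dimension carries its category as a plain column, and once as a snowflake schema, where the category is split out into its own sub-dimension table. All table names, column names and figures are invented for illustration.

```python
import sqlite3

STAR_SCHEMA = """
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    amount      REAL
);
INSERT INTO dim_customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO dim_product  VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO fact_sales   VALUES (1, 1, 1, 900.0), (2, 2, 1, 950.0), (3, 1, 2, 200.0);
"""

SNOWFLAKE_SCHEMA = """
CREATE TABLE dim_category (category_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product (
    product_id  INTEGER PRIMARY KEY,
    name        TEXT,
    category_id INTEGER REFERENCES dim_category(category_id)
);
CREATE TABLE fact_sales (
    sale_id     INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    amount      REAL
);
INSERT INTO dim_category VALUES (1, 'Electronics'), (2, 'Furniture');
INSERT INTO dim_customer VALUES (1, 'Ana'), (2, 'Ben');
INSERT INTO dim_product  VALUES (1, 'Laptop', 1), (2, 'Desk', 2);
INSERT INTO fact_sales   VALUES (1, 1, 1, 900.0), (2, 2, 1, 950.0), (3, 1, 2, 200.0);
"""


def star_sales_by_category() -> list:
    """Star schema: one join from the fact table to a dimension table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(STAR_SCHEMA)
    return conn.execute(
        """SELECT p.category, SUM(f.amount)
           FROM fact_sales f
           JOIN dim_product p ON p.product_id = f.product_id
           GROUP BY p.category ORDER BY p.category"""
    ).fetchall()


def snowflake_sales_by_category() -> list:
    """Snowflake schema: an extra join through the category sub-dimension."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SNOWFLAKE_SCHEMA)
    return conn.execute(
        """SELECT c.name, SUM(f.amount)
           FROM fact_sales f
           JOIN dim_product  p ON p.product_id  = f.product_id
           JOIN dim_category c ON c.category_id = p.category_id
           GROUP BY c.name ORDER BY c.name"""
    ).fetchall()
```

Both functions return the same totals per category. The star version needs fewer joins and is easier to read, while the snowflake version avoids repeating the category name on every product row, which tends to matter more as the warehouse grows.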
A Customer Data Platform (CDP) does not replace the need for a data lake or a data warehouse. In fact, all CDPs need a form of data storage, and data lakes and data warehouses are the most common.
Where a CDP differs from data lakes, data warehouses and other data storage platforms is in its ability to create a shareable, unified profile for each customer, giving rise to a cross-channel, 360-degree customer view across all datasets.
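The core of that idea can be illustrated with a short Python sketch. It assumes, purely for the example, that records from different channels can be matched on a normalised email address. Real CDPs typically use more elaborate identity resolution, and the function name, field names and sample records here are hypothetical.

```python
from collections import defaultdict

# Hypothetical records arriving from three different channels.
RECORDS = [
    {"email": "Ana@example.com", "channel": "web", "event": "viewed pricing page"},
    {"email": "ana@example.com ", "channel": "store", "event": "bought laptop"},
    {"email": "ben@example.com", "channel": "email", "event": "opened newsletter"},
]


def build_profiles(records: list) -> dict:
    """Group channel records into one profile per customer."""
    profiles = defaultdict(lambda: {"channels": set(), "events": []})
    for rec in records:
        # Assumption: a trimmed, lower-cased email identifies the customer.
        key = rec["email"].strip().lower()
        profiles[key]["channels"].add(rec["channel"])
        profiles[key]["events"].append(rec["event"])
    return dict(profiles)
```

Running build_profiles(RECORDS) folds the two differently formatted records for Ana into a single profile that spans the web and store channels, while Ben keeps a profile of his own. That single cross-channel view is what a plain lake or warehouse does not provide by itself.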
Some CDPs can also analyse the data in this personalised customer profile and history. Depending on how advanced the type of CDP is, it can even suggest and implement marketing and retail actions to engage the customer in the way that is most suited to them.
The most sophisticated and powerful kind of CDP, like the Antsomi CDP 365, has been termed a “Delivery CDP” by the CDP Institute. It can help you not only store data in the cloud and analyse it, but also extract insights and act upon them. Data lakes and data warehouses, as mere repositories of information, cannot do this, but without them it would not be possible to realise all the complex functions of a CDP.
For a discussion of the differences between a DMP, a CRM and a CDP, see our post on The Evolution of the Customer Data Platform.
SmartOSC offers advice, information and software services for CDPs and digital transformation projects. Contact us now to get a CDP for your organisation to start getting more user insights.