Big data, or data so large that traditional data processing methods can’t handle it, is big business. Approximately 3 quintillion bytes of data are created every day, and all those bytes need to be stored, analyzed, and kept secure so companies across every industry can use them to collect customer information, manage human resources, track inventory, and more. Industry estimates put the global Big Data and Analytics market at roughly $70 billion today, with growth to $103 billion predicted by 2027.
The combined growth of machine learning, mobile data, cloud computing, and IoT has significantly pushed up Big Data spending, while investment driven by Big Data cost savings and regulatory compliance has increased fourfold since 2019. Among the leaders in Big Data spending are:
- The US ($110 billion on Big Data in 2021);
- Japan ($12.4 billion);
- China ($11.9 billion).
As for Big Data usage by industry, the leaders are:
- Telecommunications (87%);
- Financial Services (76%);
- Healthcare (60%).
The healthcare industry is a big market for Big Data due to the growing demand for business intelligence solutions. Having generated $20.12 billion in revenue in 2021, this market is expected to reach $79.23 billion, growing by an average of 28.9% annually.
> “Big Data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it.” — Dan Ariely
Today, cloud computing generates more than $400 billion in revenue, with SaaS (Software-as-a-Service) as the largest segment at $206 billion.
What Is a Database? Database Characteristics, Examples, and Reasons to Use
A database is a collection of information or data, typically accessed electronically and used to support Online Transaction Processing (OLTP). Database management systems (DBMS) store data in the database and enable users and apps to interact with it. The term “database” is often used to refer to both the database itself and the DBMS.
Over the last several decades, many database types have emerged. Relational databases organize data in tables with fixed rows and columns, while non-relational databases (or NoSQL databases) store data in a variety of models, such as:
- Documents (e.g., JSON or BSON, i.e. Binary JSON);
- Key-value pairs;
- Tables with rows and dynamic columns;
- Nodes and edges.
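As a hypothetical illustration, not tied to any particular database product, the four NoSQL models above can be sketched with plain Python structures:

```python
# Document model (JSON/BSON): nested, self-describing records.
document = {"_id": 1, "name": "Ada", "orders": [{"sku": "A-7", "qty": 2}]}

# Key-value model: opaque values addressed by a unique key.
kv_store = {"session:42": "eyJ1c2VyIjoiQWRhIn0="}

# Wide-column model: rows whose column sets can differ from one another.
wide_column = {
    "row-1": {"name": "Ada", "city": "London"},
    "row-2": {"name": "Alan", "email": "alan@example.com"},  # different columns
}

# Graph model: nodes and edges, each carrying properties.
nodes = {1: {"label": "Person", "name": "Ada"}, 2: {"label": "Person", "name": "Alan"}}
edges = [(1, 2, {"type": "KNOWS"})]
```

Real systems add indexing, persistence, and query languages on top, but the underlying shapes of the data are essentially these.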
Also, depending on the type, databases store structured or semi-structured data. Other database characteristics may also include:
- Optimizations for mobile devices;
- Security features ensuring only authorized users can access data;
- Full-text search;
- ACID (Atomicity, Consistency, Isolation, Durability) transactions to ensure data integrity;
- Indexes to optimize query performance;
- Query languages and APIs to interact with the data in the database;
- On-premises, private cloud, public cloud, hybrid cloud, or multi-cloud hosting options;
- Flexible deployment topologies to isolate workloads to a specific set of resources.
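To make the ACID-transaction and query-language characteristics concrete, here is a minimal sketch using Python’s built-in sqlite3 module (the table and column names are invented for the example):

```python
import sqlite3

# An in-memory relational database for demonstration purposes.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (1, 100.0), (2, 50.0)")
conn.commit()

# An atomic transfer: either both updates apply, or neither does.
try:
    with conn:  # the connection acts as a transaction context manager
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure, the transaction is rolled back automatically

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 70.0, 2: 80.0}
```

If either UPDATE failed, the `with conn:` block would roll both back, leaving the balances untouched — the atomicity and consistency that the “A” and “C” in ACID guarantee.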
Why use a database? Any application that needs to store data requires a database. Apps across various industries and use cases are built on databases that allow them to store many types of data, such as:
- Patient medical records;
- Financial records;
- Online gaming information;
- Items in an online store;
- Sports scores and statistics;
- IoT device readings;
- Student grades and scores;
- Articles and blog entries;
- Mobile application information.
To date, hundreds of different databases exist. Some examples include:
|Database type|Examples|
|---|---|
|Document databases|Amazon DynamoDB; MongoDB; Azure Cosmos DB; CouchDB|
|Relational databases|Oracle Database; Microsoft SQL Server; MySQL; PostgreSQL|
|Wide-column stores|Apache Cassandra; Apache HBase; ScyllaDB; Azure Cosmos DB; Google Bigtable|
|Key-value databases|Amazon DynamoDB; Riak; Redis; Oracle NoSQL Database|
|Graph databases|MarkLogic; Amazon Neptune; NebulaGraph; Neo4j|
What are Data Lakes And Data Warehouses? Characteristics And Use Cases
Both data lakes and data warehouses are created to support Online Analytical Processing (OLAP). OLAP is a system for performing multi-dimensional analysis at high speeds on large volumes of data from various sources. OLAP is ideal for data mining, complex analytical calculations, and business intelligence. These systems are also often used for business reporting functions like financial analysis, sales forecasting, and budgeting. There are two different approaches to OLAP: data lakes and data warehouses.
A data warehouse is a blend of technologies and components for the strategic use of data. This system stores highly structured information from various sources to provide meaningful business insights. The primary goal of using a data warehouse is to combine disparate data sources to analyze the data, create business intelligence (BI), and look for insights in the form of reports and dashboards.
A data warehouse is a giant database optimized for analytics. It stores large amounts of current and historical data from various sources, from raw ingested data to highly curated, filtered, cleansed, and aggregated data. There are two approaches to constructing data warehouses: Top-down and Bottom-up. Below, we’ve highlighted the difference between them.
The essential components of the Top-down approach include:
- External Sources. An external source is the origin from which data is collected. The data can be structured, semi-structured, or unstructured.
- Staging Area. Since data extracted from external sources doesn’t follow a single format, it must be validated and standardized before being loaded into the data warehouse. For this purpose, an ETL tool is used:
- E (Extract): Data is extracted from an external source.
- T (Transform): Data is transformed into the standard format.
- L (Load): Data is loaded into the warehouse.
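The three ETL steps can be sketched end to end in a few lines of Python. This is a toy pipeline under invented data, with sqlite3 standing in for the warehouse:

```python
import csv
import io
import sqlite3

# E (Extract): read raw rows from an external source (a CSV string here).
raw = "name,amount\nAda, 120 \nAlan,80\n"
rows = list(csv.DictReader(io.StringIO(raw)))

# T (Transform): normalize the records into the warehouse's standard format.
clean = [{"name": r["name"].strip(), "amount": float(r["amount"])} for r in rows]

# L (Load): insert the transformed records into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (name TEXT, amount REAL)")
warehouse.executemany("INSERT INTO sales VALUES (:name, :amount)", clean)
warehouse.commit()

total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 200.0
```

Production ETL tools add scheduling, error handling, and scale-out, but the extract → transform → load shape is the same.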
- Data Warehouse. After the ETL process, data is stored in the data warehouse, which serves as the central repository. In the Top-down approach, the warehouse keeps the data in its purest form and also stores the metadata, while function-specific subsets of the data are served to the data marts.
- Data Marts. A data mart is a part of the storage component that holds information about a specific business function handled by a single authority. There can be as many data marts as the organization’s operations require.
- Data Mining. Data mining is the practice of analyzing the big data in the warehouse to uncover hidden patterns.
Creating data marts from a data warehouse is easy and provides a consistent dimensional view of data marts. The Top-down approach is considered the strongest model for business changes, although it is comparably costly and takes time to design and maintain.
The Bottom-up approach includes three main steps:
- The data is extracted from external sources.
- The data goes through the staging area (ETL) and is loaded into data marts that are created first and provide reporting capability, addressing a single business area.
- The data marts are integrated into the data warehouse.
Since the data marts are created first, reports can be generated quickly. This approach also makes it easy to add more data marts later, extending the data warehouse. Compared with the Top-down approach, the Bottom-up model costs less and takes less time to design. However, it is not as robust, as the dimensional view across data marts is not consistent.
Data warehouses are a good option when storing large amounts of historical data or performing in-depth data analysis to generate BI. Due to the highly structured nature of data warehouses, data analysis is relatively easy and can be performed by data scientists and business analysts. However, data warehouses aren’t designed to meet an application's transaction and concurrency needs, so if an organization chooses a data warehouse, it will need a separate database to run its day-to-day operations.
Examples of data warehouses include:
- Google BigQuery;
- Amazon Redshift;
- IBM Db2 Warehouse;
- Microsoft Azure Synapse;
- Teradata Vantage.
A data lake represents a centralized storage repository designed to store a large amount of structured, semi-structured, and unstructured data from disparate sources. It is a place to store data in various formats, including JSON, BSON, TSV, CSV, ORC, Avro, and Parquet, with no fixed limits on account or file size. As with data warehouses, the primary purpose of data lakes is to analyze the data to gain insights. Unlike data warehouses, which are giant databases, data lakes are repositories for data stored in various ways, including databases. Tools like Presto, Starburst, Atlas Data Lake, and Dremio can perform the same analytic workloads as a data warehouse and provide a database-like view of the data stored in the data lake.
A data lake allows storing data in its raw format until it is needed and can be used for data exploration, data analytics, and machine learning. Each data element in a data lake is assigned a unique ID and tagged with a set of extended metadata tags that offer various analytic capabilities.
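The ID-plus-metadata-tags idea can be sketched as a tiny catalog entry. The field names here are invented for illustration; real lakes use dedicated catalog services with their own schemas:

```python
import datetime
import uuid

def catalog_entry(path, fmt, source, tags):
    """Describe one raw object dropped into the lake so it can be found later."""
    return {
        "id": str(uuid.uuid4()),   # unique ID assigned to the data element
        "path": path,              # where the raw object lives in storage
        "format": fmt,             # e.g. JSON, CSV, Parquet
        "source": source,          # the system that produced the data
        "ingested_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "tags": tags,              # extended metadata tags for discovery
    }

entry = catalog_entry(
    "raw/events/2024/01/click.json", "JSON",
    "web-frontend", {"pii": False, "domain": "clickstream"},
)
print(entry["format"], entry["tags"]["domain"])  # JSON clickstream
```

Because the raw object itself is untouched, the same data can later be tagged, filtered, and prepared differently for exploration, analytics, or machine learning.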
The main components of data lake architecture that make it functional are:
- Data Ingestion and Storage. Data lakes get information from databases, apps, and data warehouses. The key difference between data lakes and data warehouses is that a data lake doesn’t need to convert or refine data before storage. The ingested data is stored in a central storage unit until it is converted into open data formats and optimized for consumption.
- Data Processing. After the data conversion into open data formats, it is prepared for consumption via engineering pipelines. Depending on the scale, the open data might need large scalable pipelines to be ready for data consumption. Data processing engines like Apache Hive and Apache Spark can be used to achieve the desired performance and results.
- Data Governance. One surefire way to control data security in a data lake is to encrypt the information. As the data is stored in the cloud, cloud providers distribute encryption keys to their clients, granting access to the data stored in the data lake while restricting consumption by unauthorized personnel.
- Data Analysis. Before the data in a data lake is analyzed, the information is organized and described via data tagging techniques. SQL, a domain-specific query language, is then commonly used to analyze the data, especially for business purposes.
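The processing step above is typically handled by engines like Apache Hive or Apache Spark; as a shape-of-the-work sketch under invented data, a plain-Python stand-in for such a job might filter malformed records and aggregate the rest:

```python
from collections import defaultdict

# Raw, semi-structured records as they might land in the lake.
raw_events = [
    {"user": "ada", "type": "click", "ms": 120},
    {"user": "alan", "type": "click", "ms": 90},
    {"user": "ada", "type": "error"},           # malformed: no latency field
    {"user": "ada", "type": "click", "ms": 60},
]

# Processing: keep only well-formed click events, then aggregate per user.
totals = defaultdict(int)
for event in raw_events:
    if event["type"] == "click" and "ms" in event:
        totals[event["user"]] += event["ms"]

print(dict(totals))  # {'ada': 180, 'alan': 90}
```

A real engine distributes exactly this kind of filter-and-aggregate logic across a cluster, which is what makes lakes workable at terabyte scale.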
Examples of data lakes include:
- Google Cloud Storage;
- Amazon S3;
- Azure Data Lake Storage Gen2;
- Amazon Athena;
- MongoDB Atlas Data Lake.
Data Lakes vs. Data Warehouses: What Is the Difference?
Databases, data lakes, and data warehouses are all used to store data. The key differences between them are:
- A database stores the current data that is required to power an app.
- A data warehouse stores current or historical data from different sources in a pre-defined and fixed schema, making it easy for data scientists and business analysts to analyze the data.
- A data lake stores current or historical data from disparate sources in its raw form, giving data scientists the flexibility to run advanced analytics and machine learning on it.
The table below summarizes the differences between data warehouses and data lakes.
|Parameters|Data Warehouse|Data Lake|
|---|---|---|
|History|The concept has been used for decades|Big Data technologies are relatively new|
|Storage|Data is cleaned and transformed|Data is kept in its raw form|
|Data Type|Structured and/or semi-structured|Structured, semi-structured, and/or unstructured|
|Schema Flexibility|Pre-defined, fixed schema required for ingestion|No schema definition required for ingestion|
|Users|Ideal for operational users, as the data is well structured and easy to understand|Data scientists who need advanced analytical tools for deep analysis, such as predictive modeling and statistical analysis|
|Task|Provides insights into pre-defined questions for pre-defined data types|Allows users to access data before it is transformed, cleansed, and structured|
|Storage Costs|Storing data is costly and time-consuming|Relatively inexpensive|
|Processing Time|Changes to a data warehouse take more time than changes to a data lake|Lets users get to results more quickly|
|Pros|Fixed schema makes analysis easy for business analysts; speedy data retrieval; error identification and correction|Easy data storage; separate storage and compute; integrates different data types; versatility; scalability; schema flexibility; supports languages beyond SQL|
|Cons|Schema is difficult to design; maintenance costs; time-consuming preparation; compute scaling may require storage scaling; compatibility difficulties|Data stays raw until it is transformed for use; requires effort to organize and prepare data for analysis|
When to Use a Data Lake vs. a Data Warehouse? The Future of Both Data Platforms
When choosing between data lakes and data warehouses, organizations often find they need both. Data lakes emerged from the need to harness big data (raw, structured, and unstructured alike) for machine learning, while data warehouses are still needed for business analytics.
> “Without big data analytics, companies are blind and deaf, wandering out onto the web like deer on a freeway.” — Geoffrey Moore
Data warehouses have long been used in the healthcare industry, although due to the unstructured nature of much data in this industry (physician notes, clinical data, etc.) and the need for real-time analysis, this type of database is generally not an ideal model. Data lakes are a better fit for healthcare companies due to their ability to combine structured and unstructured data.
As in many other business settings, data warehouses are considered the best storage model in finance, as they can be structured for access by the whole company rather than only data scientists. Big data technology has helped the financial services industry achieve great success, and data warehouses have played a significant role in it.
Today, big data plays a crucial role in education. Data on student attendance, grades, and other education metrics can help students get back on track when problems arise and can also help predict potential issues before they occur. Moreover, flexible big data solutions help simplify billing and improve fundraising. This data is usually vast and raw, so institutions in the education sphere most often benefit from the flexibility offered by data lakes.
Much of the benefit of data lake analysis lies in the ability to make predictions, which makes data lakes an excellent choice for the transportation industry. In supply chain management, predictive capability can deliver huge benefits, such as cost savings achieved by examining data collected along the transport pipeline.
The differences in structure, processes, users, and general flexibility make each model unique. Technologies in both models continue to evolve, and developments in both software and hardware are currently aimed at making data warehouses faster, more scalable, and more reliable. Whether to use a data lake or a data warehouse depends on the company’s needs, and choosing the right model will drive its growth and development.