The Ultimate Manual For Data Lakes

The digital universe is doubling with every passing year.

The total amount of data created, captured, copied, and consumed globally was expected to reach over 59 zettabytes last year. Considering one zettabyte equals one trillion gigabytes, you can imagine the colossal amount of data we’re talking about here.
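
To make that scale concrete, here is a minimal sketch of the unit conversion in Python. It assumes decimal (SI) units, and the function name is ours, chosen for illustration.

```python
# Decimal (SI) storage units, expressed in bytes.
GIGABYTE = 10**9
ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes: int) -> int:
    """Convert zettabytes to gigabytes using decimal (SI) units."""
    return zettabytes * ZETTABYTE // GIGABYTE


print(zettabytes_to_gigabytes(1))   # 1000000000000 (one trillion)
print(zettabytes_to_gigabytes(59))  # 59000000000000 (59 trillion)
```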

Moreover, if you’re even slightly involved with big data, you’ll know how crucial finding an adequate and secure storage solution for large data volumes is. Plus, isn’t it better to have data that is neatly presented, structured, and simple to understand?

Data lakes make it possible.

Data lakes are storage repositories that solve a crucial two-fold challenge: they let you store all your data securely, and they maintain the capacity to process the data quickly.

To understand whether data lakes are the best choice for your organization, let’s explore what they are and how they work.

What Is a Data Lake Anyway?

“If you think of a datamart as a store of bottled water–cleansed and packaged and structured for easy consumption–the data lake is a large body of water in a more natural state.” – James Dixon

Imagine you have a large lake with several tributaries feeding into it.

A data lake is that lake: a large container holding structured data, unstructured data, machine-to-machine data, and logs that arrive in real time. As you may have guessed, the different types of data are the tributaries feeding the lake.

A data lake is a centralized storage repository that can store massive amounts of structured, semi-structured, and unstructured data. All the data is stored in its native format, with no fixed limits on account or file size.

The main aim of a data lake is to democratize data and make it more cost-effective to store for later processing. It also keeps large volumes of data in one place, which supports analytics and integration with other tools.

According to an Aberdeen survey, organizations that implemented a data lake outperformed similar companies by 9% in organic revenue growth. These organizations could perform new types of analytics like machine learning over new sources like data from click streams, log files, social media, and internet-connected devices stored in the data lake.

As a result, they were able to identify and act upon business growth opportunities faster. Not only did the companies successfully attract and retain more customers, but they also improved their productivity levels, made better decisions, and maintained their devices proactively.

How Do Data Lakes Work?

As mentioned before, data lakes are huge repositories for collecting and storing data in its original format until other downstream processes require it.

Their primary utility is to provide a single source for all data in a company: raw data, prepared data, and third-party data assets. You can use each of these data types to fuel various operations, such as data transformation, interactive analytics, reporting, and machine learning.

Let’s not forget how important organization, governance, and servicing of the data are for managing an effective production data lake.

How Are Data Lakes Implemented?

You can implement a data lake on-premises, but you need the budget and capacity to handle any complications that surface as you scale.

That’s why many enterprises opt for cloud-based data lakes on a platform such as Amazon Web Services (AWS), Microsoft Azure, or Google Cloud. They’re generally more convenient and cost-effective, after all.

Data lakes operate on three core principles:

1. Openness
A data lake should be structured to take in as much data as possible without any adverse effect on its functioning. Lake designers need to have a long-term perspective and think ahead to accommodate future requirements effectively.

2. Adaptability
A data lake should be able to support any possible business usage, ranging from analytics to supporting applications. Again, lake designers have their work cut out for them: they need to build a structure that can serve as a solid foundation for future development.

3. Collaboration
Organizations that store all their data in a single repository can expect an organization-wide return on investment. This can take the form of more accurate insights, improved collaboration, or a better understanding of customers.

Additionally, data lakes work on a principle called schema-on-read. This means there’s no predefined schema that data has to fit before storage.

Instead, a schema is applied only when the data is read and processed, so different consumers can interpret the same raw data in different ways.
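
Here is a minimal sketch of schema-on-read using only Python’s standard library. The event fields, the PURCHASE_SCHEMA mapping, and the read_with_schema helper are all hypothetical names invented for this illustration; a production lake would typically rely on a query engine to do this work.

```python
import io
import json

# Raw events land in the lake exactly as produced; fields vary by source.
RAW_EVENTS = io.StringIO(
    '{"user": "ana", "action": "click", "ts": "2021-03-01T10:00:00"}\n'
    '{"user": "ben", "action": "purchase", "amount": "19.99"}\n'
    '{"device_id": 7, "temperature": 21.5}\n'
)

# A schema chosen by one particular reader: field name -> (type, default).
PURCHASE_SCHEMA = {
    "user": (str, "unknown"),
    "action": (str, "unknown"),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Apply a schema while reading; the stored records are never modified."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else default
            for field, (cast, default) in schema.items()
        }


rows = list(read_with_schema(RAW_EVENTS, PURCHASE_SCHEMA))
for row in rows:
    print(row)
```

Nothing was rejected at write time, even the device reading that has no user at all. A different reader could apply a different schema to the same three lines.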

Example 1: Sisense Builds a Data Lake

Sisense is a leading global provider of business intelligence software that facilitates data-driven decision-making.

Its product usage logs are one of its most prominent data sources; they record how users interact with Sisense’s server, browser, and cloud-based applications. Over time, thanks to rapid growth in the company’s customer base, Sisense accumulated over 70 billion records. The company needed a way to manage and analyze this data effectively, and it needed one quickly.

Sisense decided to use a data lake architecture and started to build one using the AWS ecosystem. The team set up a system in which Amazon Kinesis streamed the product logs and Upsolver processed them, writing CSV and columnar Parquet files to S3.

Finally, Sisense used its own software for visualization and business intelligence. The structured tables were also sent to Athena for ad hoc analysis and data science use cases.
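
To see why a columnar layout suits analytics, consider this rough sketch in plain Python. It is not Parquet and not Sisense’s actual pipeline; the log fields and the to_columns helper are assumptions made up for illustration.

```python
import csv
import io

# Hypothetical product-usage log in row-oriented CSV form.
CSV_LOG = """user_id,feature,duration_ms
u1,dashboard,120
u2,export,340
u1,export,95
"""


def to_columns(csv_text):
    """Pivot row-oriented CSV text into a dict of column name -> list of values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            columns[name].append(value)
    return columns


columns = to_columns(CSV_LOG)

# An aggregate over one field touches a single list rather than every row.
total_duration_ms = sum(int(value) for value in columns["duration_ms"])
print(total_duration_ms)
```

Real columnar formats add compression and typed encodings on top of this idea, but the basic benefit is the same: a query that needs one field can skip the rest.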

Example 2: Depop Switches to a Data Lake

Depop is a peer-to-peer social shopping application where users take different actions, like following, messaging, purchasing, and selling. These actions create a continuous stream of events.

To manage these events more effectively, the developers decided to create replicas of the data on Redshift. But they soon realized that doing so would make performance tuning and schema maintenance resource-intensive and unmanageable, the exact opposite of why they had created the replicas in the first place.

Keeping this in mind, Depop decided to adopt a data lake approach using Amazon S3. There they created three different pipelines: Ingest, Fanout, and Transform. Here’s what the three pipelines looked like:

Ingest. The messages were written by RabbitMQ and then dispatched through a fanout Lambda function.
Fanout. The Lambda function set up the required AWS infrastructure based on the event type and then created an Amazon Kinesis stream.
Transform. The last pipeline created columnar Parquet files from the raw JSON data using AWS Glue ETL and a Glue crawler. From there, the data was sent to Athena for further analysis.
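
The three stages can be sketched in a few lines of plain Python. This is only an in-memory illustration of the idea, not Depop’s code: the message contents and function names are invented, a dict of lists stands in for the per-event-type streams, and column lists stand in for Parquet files.

```python
import json
from collections import defaultdict

# Hypothetical raw messages, standing in for what a message broker delivers.
MESSAGES = [
    '{"type": "follow", "user": "ana", "target": "ben"}',
    '{"type": "purchase", "user": "ben", "item": 42}',
    '{"type": "follow", "user": "cy", "target": "ana"}',
]


def ingest(messages):
    """Ingest: parse each raw JSON message into an event dict."""
    return [json.loads(message) for message in messages]


def fanout(events):
    """Fanout: route events to one stream per event type, created on first use."""
    streams = defaultdict(list)  # in-memory stand-in for per-type streams
    for event in events:
        streams[event["type"]].append(event)
    return dict(streams)


def transform(stream):
    """Transform: pivot a stream of JSON-like events into column lists."""
    field_names = sorted({name for event in stream for name in event})
    return {name: [event.get(name) for event in stream] for name in field_names}


streams = fanout(ingest(MESSAGES))
tables = {event_type: transform(stream) for event_type, stream in streams.items()}
print(tables["follow"]["user"])
```

Splitting by event type before transforming means each table has a consistent set of fields, which is what makes a columnar layout practical.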

How to Get Started With a Data Lake

At this stage, you know what a data lake is, how to implement it, and how companies have leveraged it in their favor.

Now, let’s discuss how you can select a data lake technology and use relevant tools to set up an excellent data lake solution for your business.

Step 1: Choose and Set Up a Data Lake Solution

You can use platforms like AWS to set up a cloud-based data lake.

Aim for a solution built on serverless services, which lets you avoid huge upfront costs, and with a variable cost structure that scales with the amount of data you put in.

Step 2: Identify All Your Data Sources

You must identify all your data sources and how frequently each one adds data to the data lake.

You’ll also have to decide whether to add the data sets as-is or do the necessary data cleaning and transformation. And while you’re at it, make sure you also identify the metadata for the different types of data sets you have.
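
One lightweight way to capture these decisions is a small catalog of sources. The sketch below is an assumption about how you might model it, not a standard format; the DataSource fields and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataSource:
    """One entry in a hypothetical catalog of sources feeding the lake."""
    name: str
    owner: str
    frequency: str    # how often data is published, e.g. "monthly"
    load_as_is: bool  # True: store raw; False: clean or transform first
    metadata: dict = field(default_factory=dict)


CATALOG = [
    DataSource("payroll", "accounting", "monthly", load_as_is=False,
               metadata={"format": "csv", "contains_pii": True}),
    DataSource("clickstream", "product", "hourly", load_as_is=True,
               metadata={"format": "json"}),
    DataSource("survey_results", "hr", "annually", load_as_is=True),
]


def missing_metadata(catalog):
    """Return the names of sources that have no metadata recorded yet."""
    return [source.name for source in catalog if not source.metadata]


print(missing_metadata(CATALOG))
```

Running a check like this before onboarding a source makes gaps in metadata visible early, while they are still cheap to fix.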

Step 3: Devise Well-Thought-Out Processes and Establish Automation

Establishing processes is crucial for maintaining consistency, especially since the data sets may come from different systems and might even belong to different business departments.

For instance, the accounting department can publish payroll data to the data lake monthly, while the HR department publishes employee satisfaction data annually, after each survey.

We highly recommend automating the data sourcing process for time-consuming operations that require a higher frequency of data publishing. You can consider automating the extraction, transformation, and publishing of data to the data lake to save tons of time.

Alternatively, you can automate only individual steps of the process.
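
As a rough sketch of what such automation could look like, the Python below chains extraction, transformation, and publishing into one call that a scheduler could trigger. Everything here is an assumption for illustration: the payroll columns, the date-partitioned folder layout, and a temporary local directory standing in for the lake.

```python
import csv
import io
import json
import tempfile
from datetime import date
from pathlib import Path


def extract(csv_text):
    """Extract: read rows from a source system's CSV export."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: normalize names and convert amounts to numbers."""
    return [
        {"employee": row["employee"].strip().lower(), "amount": float(row["amount"])}
        for row in rows
    ]


def publish(rows, lake_root, dataset, run_date):
    """Publish: write JSON Lines under a date-partitioned path in the lake."""
    target_dir = Path(lake_root) / dataset / f"date={run_date.isoformat()}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / "part-0000.jsonl"
    with target.open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
    return target


def run_pipeline(csv_text, lake_root, dataset, run_date):
    """Chain all three steps so a scheduler can trigger them with one call."""
    return publish(transform(extract(csv_text)), lake_root, dataset, run_date)


PAYROLL_EXPORT = "employee,amount\n Ana ,4200.50\nBen,3900\n"
lake_root = tempfile.mkdtemp()  # temporary directory standing in for the lake
published = run_pipeline(PAYROLL_EXPORT, lake_root, "payroll", date(2021, 3, 31))
print(published.relative_to(lake_root))
```

Because each step is its own function, you can also automate just one of them and keep the others manual.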

Step 4: Ensure Proper Functioning

Setting up a data lake is only half the work. You must also ensure it functions properly.

It’s not only about feeding data into the data lake but also about facilitating data retrieval so other systems can support data-driven, better-informed business decisions. Otherwise, the data lake will end up as a data swamp with little to no use.
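
A simple health check can help you spot swamp candidates early. The sketch below is one possible heuristic, not an established rule: the dataset entries are hypothetical, and the 180-day idle threshold is an arbitrary assumption you would tune for your own organization.

```python
from datetime import date

# Hypothetical catalog entries: last read date and whether metadata exists.
DATASETS = {
    "payroll": {"has_metadata": True, "last_read": date(2021, 3, 30)},
    "old_logs": {"has_metadata": False, "last_read": date(2019, 1, 15)},
    "clickstream": {"has_metadata": True, "last_read": date(2020, 2, 1)},
}


def swamp_candidates(datasets, today, max_idle_days=180):
    """Flag datasets that lack metadata or have not been read recently."""
    flagged = []
    for name, info in sorted(datasets.items()):
        idle_days = (today - info["last_read"]).days
        if not info["has_metadata"] or idle_days > max_idle_days:
            flagged.append(name)
    return flagged


print(swamp_candidates(DATASETS, today=date(2021, 4, 1)))
```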

Step 5: Put the Stored Data into Use

Over time, you’ll collect data in your data lake along with the right amount of associated metadata. Once that happens, you’ll have to implement processes with ETL (extract, transform, and load) operations to drive business decisions.

You can either publish the data to a data warehouse, where it can be processed further alongside data sets from other systems, or feed it directly into data visualization and analytics tools.
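
To illustrate the first option, here is a minimal ETL sketch that reads raw events, aggregates them, and loads the result into a table. It makes several assumptions: the purchase events are invented, the table and column names are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import json
import sqlite3

# Hypothetical raw purchase events as they might sit in the lake (JSON Lines).
RAW_LINES = [
    '{"user": "ana", "amount": 20.0}',
    '{"user": "ben", "amount": 5.5}',
    '{"user": "ana", "amount": 4.5}',
]


def extract(lines):
    """Extract: parse raw JSON Lines records."""
    return [json.loads(line) for line in lines]


def transform(events):
    """Transform: total spend per user."""
    totals = {}
    for event in events:
        totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    return sorted(totals.items())


def load(rows, connection):
    """Load: write the aggregated rows into a warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS user_spend "
        "(user_name TEXT PRIMARY KEY, total REAL)"
    )
    connection.executemany("INSERT OR REPLACE INTO user_spend VALUES (?, ?)", rows)
    connection.commit()


warehouse = sqlite3.connect(":memory:")  # in-memory stand-in for a warehouse
load(transform(extract(RAW_LINES)), warehouse)
result = warehouse.execute(
    "SELECT user_name, total FROM user_spend ORDER BY user_name"
).fetchall()
print(result)
```

The raw events stay untouched in the lake, so you can always rerun the job with a different transformation later.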