The Many Data Problem: Is Your Company Struggling with Too Much Data?

Big Data. This was the hot topic that a lot of engineers had to solve early in my career. When I started in 2012, processing terabytes of data was a challenge: we had limited compute power in our on-premise setup. Nowadays, it wouldn't be surprising to see petabytes or even exabytes of data handled by a single pipeline in a cloud data warehouse, where you can scale almost infinitely as long as you keep throwing money at it.
This means the Big Data problem is, for the most part, solved.
However, as I spend more time talking to colleagues and friends working in data, I've noticed an emerging problem space that might be present in your organization as well. Let's call it the Many Data problem, and there are a couple of factors that I think contributed to it.
Data creation has become easier. Ten years ago, data creation was limited to ETL Developers (we weren't called Data Engineers back then), and you needed to partner with Data Architects and Data Modelers to design your table's schema before you started developing. That's no longer the case; DBT and similar frameworks have made it easy to create datasets, and this trend won't slow down anytime soon.
Additionally, Gen AI has created an appetite for gathering more data. Even people who weren't data users before are now interested in collecting it, and everyone's curiosity about what's possible keeps widening. On top of that, many companies across different businesses now rely on reports and forecasts created by Data Analysts and Data Scientists to make their most critical business decisions.
However, with the abundance of data comes new challenges. If your company is struggling with the Many Data problem, here are five things to check:
- Your datasets are not interoperable. Do different data teams in your company have multiple field names to represent the same information? Do they use userId, user_id, uid? Then yes, your data is not interoperable, and you need to start standardizing.
- You have more dashboards than anyone can use. Are there more dashboards than employees? Do half of your dashboards have zero views in the past 90 days? Do you have multiple dashboards showing your company's revenue or new customer sign-ups? Then yes, many of your dashboards don't provide value, and you need to clean them up.
- Your data consumers now want a data governance team. Did your team notice that there was too much freedom in creating data? Do teams now care about data contracts? Then yes, you probably need to start putting effort into data governance.
- Your cloud data warehouse cost is increasing, and you don't know why. Is it hard to justify the cost because there is no tangible value created? Do you have proper ways to attribute cost per user, per team, and per query? Are there heavy jobs running with no one consuming their output? Then yes, your cloud warehouse cost needs to be optimized, and you need cost observability to pinpoint the expensive jobs.
- DQ and SSOT have become the new abbreviations. Do multiple conversations and documents mention Data Quality and Single Source of Truth? Are backend and software engineers now pulled into the same conversations? Then yes, you might need to improve data quality and establish a single source of truth. Both are side effects of focusing on quantity over quality.
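As a quick illustration of the interoperability check above, here is a minimal Python sketch (with made-up column names) that flags column names likely referring to the same concept. Note this only catches spelling variants like userId vs. user_id; true abbreviations like uid would still need a manually maintained mapping.

```python
import re
from collections import defaultdict

def canonical(name: str) -> str:
    """Normalize a column name: split camelCase, lowercase, drop separators."""
    snake = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()
    return snake.replace("_", "").replace("-", "")

def find_variants(columns):
    """Group column names whose normalized forms collide.

    Returns only the canonical forms that have more than one spelling,
    i.e. the candidates for standardization.
    """
    groups = defaultdict(set)
    for col in columns:
        groups[canonical(col)].add(col)
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}

# Hypothetical column names collected from several teams' schemas
cols = ["userId", "user_id", "uid", "order_total", "orderTotal", "signup_date"]
print(find_variants(cols))
# Flags {userId, user_id} and {orderTotal, order_total}; "uid" slips through.
```

Running something like this across your warehouse's information schema is a cheap first pass before committing to a naming standard.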
To solve the Many Data problem, companies need to focus on improving data interoperability, removing dashboards that don't provide value, implementing data governance, optimizing cloud data warehouse costs, and prioritizing data quality over quantity. By focusing on these areas, companies can better manage their data overload and ensure that their data provides value to the business.
If this is something that you can relate to, I am happy to have a chat and exchange ideas on how to solve this. I might have an idea or two that might help.