Unlocking the power of DataOps
In today's fast-paced business world, organizations are constantly looking for ways to improve their operations and stay ahead of the competition. DataOps is an approach that has gained popularity in recent years as a way to manage and develop data solutions in an efficient and reliable manner. As a buzzword peaking on Gartner's data management hype curve, DataOps is something everyone wants their share of!
One of the main differences between DataOps and traditional software development is the complexity that comes with working with data. Data is constantly changing and can come from a wide variety of sources, making it a challenging task to develop, manage, and maintain data pipelines and products. Additionally, data development is often downstream and dependent on the underlying business applications, which can add an extra layer of complexity.
However, organizations can achieve greater efficiency and reliability in their data operations by understanding the unique challenges that come with data development and implementing the right tools, processes, and methodologies.
Examining and unlocking the power of DataOps through a series of blog posts
DataOps is still very much an emerging field. With new people joining our team over the years, we realized that everybody understands DataOps slightly differently. To get an overview of the DataOps methodology and its key principles, tools, and best practices, we set out to create a series of blog posts focusing on different aspects of DataOps.
We'll publish the blog series over the coming weeks. If you don't want to wait, you can also download the whole story as a whitepaper.
The evolution of data management
Before jumping into actual DataOps, let's spend a moment on why it is needed now, more than ever. This blog provides background on the changing landscape in the data management and data development industries that created the need for DataOps. We'll also briefly discuss the approaches different organizations have taken to managing complexity in their architectures.
In recent history, data management has evolved significantly with the shift from traditional on-premise data warehouses to data platforms built on the public cloud. This change is driven by the increasing volume and variety of data, and the need for advanced analytics and AI/ML.
The architecture of a modern data platform has become more complex due to the use of diverse SaaS/PaaS services, and the need to manage the entire data value chain. There are different approaches to facing this complexity; some are more managed than others.
Data Platform or Data Warehouse architecture, whatever you want to call it, has evolved quite drastically in the past ten years. We have seen a jump from traditional on-premise data warehouses with "one-size-fits-all" DBMSs and ETL software to data platforms built on the public cloud, leveraging SaaS/PaaS services with endlessly scalable object storage and database management systems designed for huge analytical workloads.
The volume and variety of data have increased exponentially. With it, the need and desire for data and analytics have shifted the landscape from traditional business intelligence to a diverse set of use cases like advanced analytics, AI/ML, and all other kinds of data products.
Today, data provides value throughout its whole life cycle independently and not just as a part of a specific business application. This, together with the ability to buy services rather than investing in infrastructure, has made the architecture of a modern data platform more complex.
The role of data governance and architecture in DataOps
DataOps does not really prescribe what kind of architecture you should have. By its nature, however, your architecture should support incremental development and have distinct environments for at least development and production.
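As a minimal illustration of what "distinct environments" can mean in practice (the environment names, variable, and settings here are hypothetical, not a prescribed setup), the same pipeline code can resolve its target database from the environment it runs in, so work is promoted from development to production without code changes:

```python
import os

# Hypothetical per-environment settings: the same pipeline code runs
# everywhere; only the target database and schema differ.
ENVIRONMENTS = {
    "dev": {"database": "analytics_dev", "schema": "staging"},
    "prod": {"database": "analytics_prod", "schema": "public"},
}

def resolve_target(env_name=None):
    """Resolve target settings from DATAOPS_ENV (defaults to dev)."""
    name = env_name or os.environ.get("DATAOPS_ENV", "dev")
    if name not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {name!r}")
    return ENVIRONMENTS[name]
```

Defaulting to the development environment is a deliberate choice here: a forgotten variable should never cause an accidental write to production.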
Because of the nature of data platforms, the probability of having a wide variety of data, including sensitive data, in your data platform is quite high. For this reason, especially when working with the public cloud, you should put high-level data governance and architecture principles in place before starting the actual development.
This includes ensuring the security of the data platform with identity and access management and a secure network architecture, ensuring the scalability of your architecture, and giving at least some thought to cost management. Your data platform should also be aligned with the company's compliance policies from the beginning.
From a data development perspective, after the big architectural decisions are made, the development phase is mostly about producing new content in the platform in the form of new data sets and applying different kinds of business rules and requirements to them.
When building new functionality, or when the requirements differ from earlier ones, the architecture work can and should be done incrementally, like all the rest of the development. When bringing in new data sets, you should always validate their compliance requirements against the existing ones to ensure everything goes as defined. This is, of course, easier if clear roles and responsibilities are defined for the data team, the data owners, and the data consumers.
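One lightweight way to make such validation routine is to keep a registry of classified columns and check every incoming data set against it. A simplified sketch (the registry, column names, and classifications are invented for illustration):

```python
# Hypothetical classification registry: every column admitted to the
# platform must carry a classification agreed with the data owner.
CLASSIFIED_COLUMNS = {
    "customer_id": "internal",
    "email": "personal",        # GDPR-relevant
    "order_total": "internal",
}

def unclassified_columns(columns):
    """Return the incoming columns that lack a classification (empty = OK)."""
    return [c for c in columns if c not in CLASSIFIED_COLUMNS]

# A new data set introduces 'phone_number', which has no classification yet,
# so it gets flagged for review before it is loaded into the platform.
flagged = unclassified_columns(["customer_id", "email", "phone_number"])
# → ["phone_number"]
```

The same check can run automatically in a deployment pipeline, turning a governance policy into a repeatable gate rather than a manual review step.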
One thing to note, especially when operating in the EU, is the General Data Protection Regulation (GDPR). Keep in mind that personal data covered by the GDPR is much more expensive to handle and must be treated with care.
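To make "handled with care" concrete: one common technique is pseudonymizing direct identifiers before data lands in broadly accessible zones of the platform. A minimal sketch of the idea (the field names and salt handling are illustrative; this alone is not a complete GDPR solution):

```python
import hashlib

def pseudonymize(value, salt):
    """Replace a direct identifier with a salted SHA-256 digest.

    Note: real GDPR compliance also involves key management, retention
    policies, and a documented legal basis -- this only shows the idea.
    """
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()

record = {"customer_id": "C-1001", "email": "jane.doe@example.com"}
record["email"] = pseudonymize(record["email"], salt="s3cret")
# The digest is stable, so the column stays joinable across tables that
# share the same salt, without exposing the raw email address.
```

Pseudonymized data is still personal data under the GDPR, but limiting who can see raw identifiers reduces both risk and the blast radius of a mistake.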
It is also a good idea to validate your data platform governance policies with an experienced cloud engineer if the team lacks experience working in the public cloud.
Approaches to managing complexity in the architecture
The variation and complexity of the services bought and built have increased the skill set required to develop and operate these platforms and to manage the whole value chain. To handle the situation, there are at least a couple of different approaches:
- There is no obligation to go to the cloud. If your data warehouse is purely running financial reporting, for example, and after careful assessment you can't see any other real requirements for a diverse set of data products now or in the near future, or for any reason you have to keep running the on-premise data warehouse anyway, you might not benefit from going to the public cloud.
- You can make an informed decision to run 'spaghetti architecture' on your data platform. This means you give your developers the freedom to create 'quick and dirty' solutions to fulfill the requirements. When refactoring or maintenance is needed, the solution is discarded and redeveloped rather than fixed. In this post, we won't go into the details of this option either.
- Adopt a disciplined and agile way of data development to ensure continuous and predictable value creation and delivery by utilizing DataOps practices and tools in the development and operations of your data platform. In later blog posts, we will go through some key aspects of DataOps in more detail.
On a side note, we want to emphasize that there is an endless number of combinations of these options, and you should always pay extra attention to privacy and security, whatever method or approach you choose, especially when going into the public cloud. The thing is, there is no one-size-fits-all solution, and you should always try to figure out the requirements and needs your organization has. If you are working with a multinational, multi-domain company where the only things the business units have in common are finance and HR, the problems and solutions are probably different than for a company with five data developers and a single domain.