Against all odds, FindHotel keeps growing steadily in pursuit of our mission: "to get every customer the best accommodation deal worldwide". A big part of this success is due to our ever-evolving data tracking and analytical capabilities. Part I of this series gives a high-level overview of the company's past and current data architecture, the Data Mesh paradigm and the first steps towards a new data architecture. Part II will dive a little deeper into the vision and design of what we are building, the next features, and the types of engineers that we are looking for to help us build and evolve our new data architecture. If you’re a Data P.O., Data Engineer, Analytics Engineer or BI Engineer, we're hiring!
Where were we?
Over the first 10 years, FindHotel went from virtually zero data to a fairly sophisticated architecture with multiple internal and external data sources, streaming and batch pipelines, consumers, sinks, machine learning models and more, and we're very proud of it. It has served us well so far. But the company's data needs have been growing faster than the centralized Data Team can manage, so it was time to regroup and reorganize.
Main pain points
- Since the Data Team was completely overloaded, client teams that had the expertise and resources to process their data naturally requested ownership over sections of their data pipelines, while other teams resorted to hacks to shortcut access to raw data, which led to inefficiencies, increased costs and less than optimal use of their time.
- Data validation tests are implemented manually, documentation is static, data quality is poor and standardization is lacking. These issues generate bugs and emergencies that steal a significant chunk of the team's time.
- Pretty much all the data flows through the same cloud account, set of pipelines, data warehouse and visualization tools. This makes the pipelines less flexible and harder to manage, and makes costs harder to attribute.
Not surprisingly, I’ve seen similar issues at a previous job:
- Live communication, emails, notes and messages get forgotten, buried and lost.
- Static documentation gets stale.
- Automated tests are usually restricted to a single system, and they only check whether builds are compatible with that system’s requirements. There’s no guarantee that those requirements are the same for all the other systems that handle that data.
Back then, my old team realized that we needed a central source of truth for event metadata, so we solved part of the problem by creating a repository of event definitions to be used by all stakeholders.
Based on that idea, FindHotel’s Data Team started to draft a more ambitious solution:
- Why not go one step further and add automation into the fold?
- Why not generate resources and documentation dynamically based on metadata?
- Why not add organizational metadata to reflect ownership and automate internal cost attribution?
And just like that, the Data Governance System (DGS) project was born.
DGS and the Data Mesh
The initial goal was to ensure that all relevant systems were always in sync with the current dataset schema definition, but new ideas kept coming:
- Centralized tool to define event schemas, business rules, validation rules and metadata.
- Ability to serve metadata to other systems.
- Ability to trigger deployment of infrastructure and update documentation.
- Ability to manage access policies reflecting data ownership, facilitating cost attribution.
Right around that time, we first heard about the concept of the Data Mesh, made popular by Zhamak Dehghani’s groundbreaking paper "How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh". We were surprised by how closely the Data Mesh paradigm aligned with the ideas we were discussing, so it had a great impact on the way our new project developed.
So… What is a Data Mesh?
Broadly speaking, a Data Mesh is the application of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking to Data. It summarizes the ongoing efforts of several enterprises to democratize data at scale by tackling what Dehghani calls the Data Lake "failure modes". I highly recommend reading the paper for more details, but these modes are:
- Centralized and monolithic
  - Data lakes evolved towards domain-agnostic rather than domain-oriented data ownership.
  - The proliferation of data sources and use cases eventually makes this architecture unsustainable.
- Coupled pipeline decomposition
  - New data sources frequently affect all stages of the data pipeline, which hurts velocity.
- Siloed and hyper-specialized ownership
  - Data is usually owned by a centralized team of Data Engineers who often have little domain knowledge. They depend on data source teams with little incentive to provide good-quality data, and they struggle to meet the needs of frustrated client teams that fight for priority in the Data Team’s backlog.
At a very high level, the proposed paradigm shift consists of:
- Data as a Product
- Break down the data lake monolith into Domain Data Products, that is, domain-specific data pipelines owned by the domain teams.
- Data Products should be
- Distributed ownership with centralized governance
- Data infrastructure as a platform
- Ideally, every domain-agnostic aspect of the data pipelines should be provided by the data platform, so domain teams can focus as much as possible on their products, not on the pipeline implementation.
Ok… But where do we start?
It’s probably clear by now that shifting to the Data Mesh architecture demands a huge company-wide effort which, for obvious reasons, is not an easy sell. We needed a place to start. A use case that wouldn’t disrupt our roadmap too much, but that could serve as a Proof of Concept (PoC). A small success to help push forward the rest of the project, or at least parts of it. At that time our vision for the DGS was something like this:
- Domain teams design and update dataset metadata using a graphical interface that produces a contract for that dataset in a centralized repository.
- As soon as the new or updated contract is pushed to the repository, the change is broadcast to all registered domain systems, which may then trigger their internal CI/CD routines to respond accordingly.
- Possible actions in the domain systems could be:
  - Deploy the necessary resources to incorporate the new or updated data source. (no human intervention)
  - Generate Pull Requests in the domain repos with the minimum necessary changes to incorporate the new or updated data source. (light human intervention)
  - Notify the domain stakeholders that their action might be required. (full human intervention)
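To make the broadcast idea above more concrete, here is a minimal Python sketch of a registry that fans a contract change out to registered domain systems, each of which responds at one of the three intervention levels. Every name in it (`ContractRegistry`, `DomainSystem`, `Intervention`, the example domain, dataset and system names) is hypothetical and chosen for illustration only; it is not how the DGS is implemented, and the real actions (deployments, Pull Requests, notifications) are replaced by plain strings.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Intervention(Enum):
    """How much human involvement a domain system wants (hypothetical levels)."""
    NONE = "none"    # deploy resources automatically
    LIGHT = "light"  # open a pull request for review
    FULL = "full"    # only notify the stakeholders


@dataclass(frozen=True)
class ContractChange:
    """A new or updated dataset contract pushed to the central repository."""
    domain: str
    dataset: str
    version: int


@dataclass
class DomainSystem:
    name: str
    intervention: Intervention

    def handle(self, change: ContractChange) -> str:
        # A real system would trigger its own CI/CD routine here;
        # this sketch only describes the action it would take.
        target = f"{change.domain}/{change.dataset} v{change.version}"
        if self.intervention is Intervention.NONE:
            return f"{self.name}: deployed resources for {target}"
        if self.intervention is Intervention.LIGHT:
            return f"{self.name}: opened pull request for {target}"
        return f"{self.name}: notified stakeholders about {target}"


class ContractRegistry:
    """Keeps the list of registered systems and broadcasts changes to them."""

    def __init__(self) -> None:
        self._systems: List[DomainSystem] = []

    def register(self, system: DomainSystem) -> None:
        self._systems.append(system)

    def publish(self, change: ContractChange) -> List[str]:
        return [system.handle(change) for system in self._systems]


registry = ContractRegistry()
registry.register(DomainSystem("warehouse", Intervention.NONE))
registry.register(DomainSystem("search-api", Intervention.LIGHT))
registry.register(DomainSystem("bi-dashboards", Intervention.FULL))

results = registry.publish(ContractChange("booking", "checkout_started", 2))
for line in results:
    print(line)
```

The point of the design is that the registry knows nothing about what each system does with a change; each domain decides for itself how much automation it is comfortable with.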
Some more details about the pain points we wanted to tackle (WARNING: Geeky language ahead!):
- The data team lacked the resources to keep up with the proliferation of data sources and consumption use cases.
  - Almost all raw datasets arrive in the JSON format.
  - They need to be flattened into tables for efficient consumption by the analysts.
  - With the increasingly long wait times, analysts simply started using the raw data, increasing compute costs.
- The quality of the incoming data was poor.
  - Event schema documentation was static, manually updated and prone to get stale.
  - The little data validation we had was manually implemented and prone to get stale.
  - Ownership of the incoming data was unclear.
  - We lacked standardization for many data points.
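For readers less familiar with what "flattening" means here, the following is a small Python sketch that turns a nested JSON event into a single flat row, with one column per leaf field. The event shape, the field names and the underscore separator are assumptions made for illustration, not FindHotel's actual schema or pipeline, and arrays are deliberately left untouched to keep the sketch short.

```python
import json
from typing import Any, Dict


def flatten(record: Dict[str, Any], prefix: str = "", sep: str = "_") -> Dict[str, Any]:
    """Recursively flatten nested objects into column-like keys.

    Nested dicts become prefixed keys (user.id -> user_id);
    lists and scalars are kept as they are.
    """
    flat: Dict[str, Any] = {}
    for key, value in record.items():
        column = f"{prefix}{sep}{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, column, sep))
        else:
            flat[column] = value
    return flat


# Hypothetical raw event, as it might arrive in JSON format.
raw_event = json.loads(
    '{"event": "search", "user": {"id": 42, "geo": {"country": "NL"}}, "tags": ["a", "b"]}'
)
row = flatten(raw_event)
print(row)
```

Doing this by hand for every new data source is exactly the kind of work that piles up in a central team's backlog, which is why generating it from the contract is attractive.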
Automatic flattening of events seemed like the most impactful choice, but would also require a larger effort that would be hard to justify with our already packed backlog. Automatic data validation would be a quicker PoC.
At FindHotel, once per month we have our "Ship-it Days", where everybody takes 1-2 days to work on something that they think could be useful or interesting. So on a rainy Thursday in July 2020, I decided to give it a try. To implement the automatic, out-of-the-box data validation I needed to define a few things:
- The contract would have to be in an easily consumable format, ideally an industry standard.
- Data validation can be applied at many stages of the data pipeline, so I had to pick one.
- What to do with data that doesn’t pass the validation?
- A single source of truth for the dataset specifications (contracts).
- The contract would have to reside in a central location accessible by the necessary entities.
- Schemas evolve, so the contract would have to be versioned.
- New contracts and updates would have to trigger the deployment of the validation resources.
I had one day to make this thing work, so I went for the shortest path:
- JSON Schema is widely used for JSON validation, has libraries in most modern programming languages and is quite simple to use, so it was an easy choice for contract definition.
- The most obvious stage to apply the validation would be in the AWS Lambda function that processes incoming events from the Kinesis Firehose and stores them in S3 in "valid" and "invalid" locations. But, for non-technical reasons, that project was not a good option for this PoC, so I decided to validate events inside Snowflake.
- I wanted to make it easy to get validation stats, so I decided to add two new columns to our raw table in Snowflake:
  - is_valid: A self-explanatory boolean flag.
  - errors: An array containing all validation rules that the record failed to pass.
- GitHub repositories provided most of the features and automation I needed for my PoC, so I created a repo for the contracts with a folder structure that reflected the data domain and version.
- Merges to the main branch would trigger a Travis CI build that executes a Node.js script and deploys the UDFs in Snowflake.
Sequence diagram of the PoC
I ran some performance tests and the results were better than expected. I was concerned that the validation could degrade the ingestion performance too much, but the difference was barely noticeable. I ❤ Snowflake.
It was pretty cool watching the whole process of pushing a JSON Schema contract to the repo and a few seconds later having a brand new validation function ready to use in Snowflake.
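As a rough illustration of what such a validation function computes, here is a Python sketch that checks an event against a contract and returns the two values described earlier, `is_valid` and `errors`. It is a stand-in, not the PoC's Snowflake UDF: it hand-rolls only a tiny subset of JSON Schema (`required` and per-property `type`) instead of using a real JSON Schema library, and the contract, the field names and the `rule:field` error format are all assumptions made for this example.

```python
from typing import Any, Dict, List

# Maps a few JSON Schema type names to Python types (subset only).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate(event: Dict[str, Any], contract: Dict[str, Any]) -> Dict[str, Any]:
    """Return the is_valid flag and the list of failed rules for one event."""
    errors: List[str] = []

    # Rule 1: every required field must be present.
    for field in contract.get("required", []):
        if field not in event:
            errors.append(f"required:{field}")

    # Rule 2: present fields must match their declared type.
    for field, rules in contract.get("properties", {}).items():
        if field not in event or "type" not in rules:
            continue
        value = event[field]
        expected = JSON_TYPES[rules["type"]]
        # bool is a subclass of int in Python, so reject it for numeric types.
        bool_as_number = isinstance(value, bool) and rules["type"] in ("integer", "number")
        if bool_as_number or not isinstance(value, expected):
            errors.append(f"type:{field}")

    return {"is_valid": not errors, "errors": errors}


# Hypothetical contract written in JSON Schema style.
contract = {
    "type": "object",
    "required": ["event_name", "hotel_id"],
    "properties": {
        "event_name": {"type": "string"},
        "hotel_id": {"type": "integer"},
        "price": {"type": "number"},
    },
}

good = validate({"event_name": "click", "hotel_id": 7, "price": 99.5}, contract)
bad = validate({"event_name": "click", "price": "cheap"}, contract)
print(good)
print(bad)
```

Keeping the failed rules in an array, rather than stopping at the first failure, is what makes it easy to aggregate validation stats per rule later on.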
In the next post…
In Part II we’ll dive a little deeper into the vision and design of what we are building, next steps, and the types of engineers that we are looking for to help us build and evolve our new data architecture. If you’re a Data P.O., Data Engineer, Analytics Engineer or BI Engineer, we're hiring!