In Part I of The Evolution of FindHotel's Data Architecture, we covered a high-level overview of the company's past and current data architecture, the Data Mesh paradigm and the first PoC we made towards a new data architecture. In this post we'll dive a little deeper into the vision and design of what we are building, the next steps, and the types of engineers we are looking for to help us build and evolve our new data architecture. If you're a Data P.O., Data Engineer, Analytics Engineer or BI Engineer, you'll find some information about recruiting at the end of this post.
DGS and the Data Mesh
The automatic validation PoC was cool, but far from production ready. We needed not only to improve it, but also to put it in the broader context of the DGS, document our vision and convince others that the project was worth pursuing. We started brainstorming and documenting our ideas. One of the main challenges during that phase was keeping the scope of the project from growing too large: we had to focus on what was achievable in the foreseeable future and prioritize accordingly.
As a first step we decided to improve the PoC and make it more robust to serve as a template for future automation features. Automatic flattening should be next. Then we should prioritize, discuss with stakeholders and design solutions for endpoints, streams, lambdas, buckets, alerts, monitoring, orchestration, GUI, dashboards, etc…
Wait a minute!
“Isn’t there a close enough solution to this problem available in the market?”
Engineers tend to get excited about new toys, and sometimes we need a little shake from "normal people". We looked around for "off-the-shelf" solutions for specific parts of the DGS, but they either lacked features or flexibility, were too expensive, or would require significant effort to integrate and maintain. At the same time, another FH team wanted to fetch some data from a third-party API, but lacked the skills to do it. They knew that waiting for the already overloaded Data Team to do the integration would take too long, so they found some commercial services that promised to do the job and requested the company's approval to use them. Other teams had similar problems, so we were asked to run trials on these integration services and evaluate whether they were suitable and within budget. We went for a two-week make vs. buy analysis of a few well-known data ingestion services. We had a good experience talking to these companies, and their products are very good for many use cases. Unfortunately, they weren't as flexible as we wanted and lacked some features we needed, like smarter, configurable handling of data quality issues and the ability to be integrated into a broader automation workflow.
Improving the PoC
We got the green light to proceed with our plans, so we went straight to making the automatic data validation PoC production ready. It had a few issues that we needed to fix, improve or validate:
- Decouple the Snowflake integration logic from the contract repository.
- Design a generic integration standard so that any target system could plug into the Metadata Registry.
- Evaluate alternatives for metadata storage other than GitHub.
- Replace Travis CI, which the company was abandoning.
- Refactor and clean up the whole implementation.
Decoupling the Logic
The contract repository and the Snowflake integration are completely independent domains that shouldn’t be together. The new contract repo was named the Metadata Registry (MR) and the initial idea was that it should be able to:
- Store dataset metadata.
- Store target systems metadata.
- Notify only relevant target systems of changes in the metadata.
- Be monitored.
- Send alerts to the configured channels in case of errors.
The Snowflake integration should be a standalone project able to:
- Receive notifications of changes in the metadata.
- Process only the relevant changes in metadata and deploy all the necessary resources to Snowflake.
- Be monitored.
- Send alerts to the configured channels in case of errors.
The idea was that any target system that wanted to "listen" to the MR should build an integration plugin.
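As a sketch of what "building an integration plugin" could mean in code terms, here is a minimal Python outline of the contract a target system might satisfy. All names here are hypothetical illustrations, not the actual implementation:

```python
from dataclasses import dataclass, field


@dataclass
class MetadataChange:
    """A single change broadcast by the Metadata Registry (hypothetical shape)."""
    dataset: str                 # e.g. "search.click_events"
    changed_files: list = field(default_factory=list)


class RegistryPlugin:
    """Base contract for target-system integrations listening to the MR."""

    name = "base"

    def is_relevant(self, change: MetadataChange) -> bool:
        # Each plugin processes only the changes its target system cares about.
        raise NotImplementedError

    def deploy(self, change: MetadataChange) -> None:
        # Deploy the necessary resources to the target system.
        raise NotImplementedError


class SnowflakePlugin(RegistryPlugin):
    """Illustrative Snowflake integration."""

    name = "snowflake"

    def is_relevant(self, change: MetadataChange) -> bool:
        # Invented convention: react only to metadata under a snowflake/ path.
        return any(f.startswith("snowflake/") for f in change.changed_files)

    def deploy(self, change: MetadataChange) -> None:
        print(f"deploying Snowflake resources for {change.dataset}")
```

The point of the base class is that monitoring and alerting can be shared across plugins, while relevance filtering and deployment stay plugin specific.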
Designing a Generic Plugin Integration
For the MR to be able to notify another system, it needs some information about it. We decided to start very simple and establish that registering a plugin with the MR would consist of adding a configuration file with the parameters needed to notify it of changes. Analyzing the target systems we had in mind, it became clear that we would need two types of configuration parameters: "universal" and "plugin specific". The universal parameters are mandatory and their sole purpose is to tell the MR which plugin it needs to notify and how. The plugin-specific parameters give a higher level of flexibility and sophistication to the plugin developers. For instance, we might want plugin A to always deploy some resources for every metadata configuration available, while plugin B might only deploy its resources conditionally, based on the event configuration.
We designed the plugin integration file with a future system in mind: a management console where the user would define a dataset's metadata following a "setup assistant" or "installation wizard". Target plugins would be available as setup steps for users with the right domain privileges, and that plugin's parameters would be requested during configuration.
example of a plugin registration file
example of plugin specific dataset configuration
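To make the split between universal and plugin-specific parameters concrete, here is a hedged sketch of what such a registration might look like, expressed as Python data. Every field name here is invented for illustration; only the universal/plugin-specific split comes from the design described above:

```python
# Hypothetical plugin registration. The "universal" parameters tell the MR
# which plugin to notify and how; everything under "config" is plugin specific.
plugin_registration = {
    "name": "snowflake",                     # universal: plugin identifier
    "repository": "acme/mr-snowflake",       # universal: repo to notify (invented)
    "workflow": "fetch-changes",             # universal: workflow to trigger
    "config": {                              # plugin specific, free-form
        "warehouse": "LOADING_WH",
        "deploy_validation_udfs": True,
    },
}

UNIVERSAL_KEYS = {"name", "repository", "workflow"}


def validate_registration(registration: dict) -> bool:
    """Universal parameters are mandatory; plugin-specific ones are optional."""
    return UNIVERSAL_KEYS <= registration.keys()
```

A registration missing any universal parameter would be rejected before the MR ever tries to notify the plugin.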
Alternatives to GitHub + Abandon Travis + Refactoring
We considered some alternatives to GitHub, but we quickly realized that, at that stage, it suited our needs very well. We experimented with GitHub Actions, which turned out to be a great replacement for Travis CI, and the plugin integration was refactored to rely heavily on them. So the new workflow became:
- Changes to the metadata are pushed to the `main` branch of the MR, which triggers the local workflow.
- It analyzes the changes that were made to determine which plugins should be notified, then it remotely triggers each plugin's default `fetch-changes` workflow.
- The plugin's `fetch-changes` workflow then uses a default action provided by the MR that analyzes the changes to determine which resources need to be redeployed.
- The target plugin then processes all the necessary metadata and deploys the resources in production.
- In case of any errors, an alert message is sent to the Data Team's emergency Slack channel.
Sequence diagram of the metadata broadcast
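The MR-side decision of which plugins to notify can be sketched in a few lines. This assumes a hypothetical repository layout where each dataset's metadata lives under `datasets/<dataset>/<plugin>/...`, so the plugin name can be read from the path of each changed file; the real convention may differ:

```python
# Given the files changed in a push to main, decide which registered
# plugins need a remote trigger. Layout assumption (invented):
#   datasets/<dataset>/<plugin>/<metadata files>
def plugins_to_notify(changed_files, registered_plugins):
    notify = set()
    for path in changed_files:
        parts = path.split("/")
        if len(parts) >= 3 and parts[0] == "datasets":
            plugin = parts[2]
            if plugin in registered_plugins:
                notify.add(plugin)
    return sorted(notify)
```

In practice, each plugin in the resulting set would then have its default `fetch-changes` workflow triggered remotely, for example through the GitHub API's workflow dispatch mechanism.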
Once the automatic validation was production ready, we started thinking about the next step. As mentioned in the beginning, automatic flattening could be a game changer, and now that we had the basis in place, we decided to go for it.
Until now, most of our ETL process has been done using dbt models. Creating dbt models and dbt Cloud jobs for flattening, normalizing, transforming, enriching and exporting event data is still one of the most frequent tasks under the responsibility of the Data Team. But our goal now is far simpler than a full ETL: quite often, just exposing the new data in flattened form in a Snowflake table is enough to let the domain teams move forward quickly. Automating data flattening will not only make the Data Team happier, but also free the domain teams to add new data sources and immediately analyze them, without depending on another team, hacking their way through raw data or increasing costs unnecessarily.
The automatic event flattening could be implemented in a multitude of ways, so we selected the most promising ones to create PoCs and compare them in terms of usability, level of necessary human intervention, required expertise, flexibility, complexity, cost, learning curve, performance, etc… I won’t go into too many details, but we analyzed a few variations of each of the following pipelines:
- S3 > Snowpipe > Flat Table
- S3 > Snowpipe > Raw Table > Snowflake Stream & Task > Flat Table
- S3 > External Table > Snowflake Stream & Task > Flat Table
- S3 > Snowpipe > Raw Table > dbt model in the domain repo > Flat Table
- S3 > Snowpipe > Raw Table > dbt model in a DGS repo > Flat Table
- S3 > Athena > Lambda > Flat Data in S3
- S3 > Glue Spark Job > Flat Data in S3
- Kinesis > Glue Spark Streaming Job > Kinesis Firehose > Flat Data in S3
This experiment resulted in SWOT matrices that helped us decide the way to go:
**S3 > External Table > Snowflake Stream & Task > Flat Table**

Strengths and opportunities:

- Easy deployment, very similar to the validation UDFs
- We know the technology well
- Orchestration within Snowflake
- Simple flattening
- By far the fastest to implement
- Little overhead
- Snowflake is working on many integrations with external APIs
- Low time to market
- Facilitates the adoption of the new platform by the domain teams

Weaknesses and threats:

- Limited monitoring
- Limited alerting capabilities
- We lose the out-of-the-box data lineage and documentation features of dbt
- High coupling
- The tool might limit customization/integration with other tools
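To give a feel for the winning pipeline, here is a sketch of the kind of Snowflake DDL it would deploy per dataset. All object names, stages and the sample flattening columns are invented; the real statements are generated from the Metadata Registry's configuration:

```python
# Generate illustrative Snowflake DDL for one dataset:
# external table over S3 -> insert-only stream -> flattening task.
def flattening_ddl(dataset: str) -> list:
    return [
        # 1. External table over the raw JSON files landing in S3.
        f"CREATE EXTERNAL TABLE {dataset}_ext "
        f"LOCATION = @raw_stage/{dataset}/ "
        f"FILE_FORMAT = (TYPE = JSON) AUTO_REFRESH = TRUE;",
        # 2. Insert-only stream tracking new files on the external table.
        f"CREATE STREAM {dataset}_stream "
        f"ON EXTERNAL TABLE {dataset}_ext INSERT_ONLY = TRUE;",
        # 3. Task that flattens new records into the flat table whenever
        #    the stream has data (columns here are placeholders).
        f"CREATE TASK {dataset}_flatten_task "
        f"WAREHOUSE = loading_wh SCHEDULE = '5 MINUTE' "
        f"WHEN SYSTEM$STREAM_HAS_DATA('{dataset.upper()}_STREAM') AS "
        f"INSERT INTO {dataset}_flat "
        f"SELECT value:user.id::NUMBER AS user_id, "
        f"value:event::STRING AS event "
        f"FROM {dataset}_stream;",
    ]
```

Because both the stream and the task live inside Snowflake, the whole chain is orchestrated by the warehouse itself, which is exactly the "orchestration within Snowflake" strength noted above.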
We’re on the verge of deploying the automatic flattening to production and, after stabilizing it, we have a long list of fun and exciting things ahead of us, like:
- Design and develop new plugins for automatic deployment of AWS infrastructure, documentation, data lineage, cataloguing, etc…
- Integrate cloud and data warehouse role and identity management.
- Support the integration of domain systems with the Metadata Registry.
- Support the domain teams in their journey towards data maturity and ownership.
- and much more…
Come join us!
Our company is growing fast and we’re looking for skilled people to join our very diverse and fun team to make this vision a reality. We’re actively hiring for many positions, but we’re especially eager to find a Technical Data P.O. and a Senior Analytics Engineer to join us remotely or by relocating to beautiful Amsterdam. If you fit the profile, don’t hesitate to apply for the job! Follow the links if you want to know more about our career growth model and recruitment process.