Best Practices: Your Data Architecture and Data Pipeline – a conversation with Allen Eubank, Head of Architecture
Allen & Allen Jr.
Apps have captured the world’s attention and attracted some very bright and inspired people to innovate and iterate in the space. It’s grown to a $693 billion a year industry, and the combination of growth, money, innovation and smart people has made this an incredibly competitive space.
App developers need to constantly increase their edge and the best in the world rely on data to help them make better and more informed decisions.
But data provides its own challenges. So today we’re talking about managing mobile app data and how app developers can architect and build a scalable, reliable data pipeline to stay informed and make the best decisions for their apps.
Allen Eubank is the Head of Architecture for AdLibertas. He’s seen a vast number of publisher data architectures during and before his time with the company. We’ve asked him to sit down and share some of his findings and tips on building your own data architecture.
Adam: “So Allen, give us some background and context on what you’ve seen and your experience with mobile data?”
Allen: “Well, early on in the technology consulting space I helped build out the Porsche and Toyota mobile apps but I’d have to say the first actual data architecture I built was probably when I built out a CRM collection tool for content creators on YouTube back in 2014.”
Adam: “CRM collection?”
Allen: “Well today you’d call it a data scraper but we had to spin up and manage huge clusters of servers to scrape and gather the content. Then I ran a D2C product company which required the arduous connection of online and offline data into a usable format for delivery.”
Adam: “What about with us?”
Allen: “With AdLibertas our first data pipeline was the aggregated ad network reporting, constantly pulling and normalizing huge amounts of reporting data. Then came the optimization workflows and, most recently and the largest to date, Audience Reporting, which can spin up 8,000 cores with 64TB of memory to process reports over massive amounts of data. I also get a chance to explore our customers’ data systems, so I see a large number of solutions and their pluses and minuses.”
Why should app developers care about data?
Adam: “Great, so you’ve been around mobile data for a while. You mentioned data architecture and data pipelines. Can you tell us what the difference is and why they are both important for app developers?”
Allen: “Essentially the mobile data architecture is the holistic design and setup of all rules and standards used to process and store data.
“The data pipeline refers to the actual information flow and the technologies used to reliably process, move, and transform your data.
“The purpose of both is to give your teams the data they need to make decisions, while ensuring the system works day in and day out and stays usable and useful.”
Adam: “So when should data become important to a developer?”
Allen: “Your data architecture is your plan for how best to collect all the information your app generates after deployment, once it’s in consumers’ hands. No app is static; you need to keep improving and iterating, and you need to decide what your next priority is.
“Data will help you set that priority.”
Adam: “So as a whole, what are some good examples of mobile app data architectures you’ve seen and why?”
Allen: “This is a tough one because everyone will have a different answer. It depends on the schema, and for most people it’ll depend on the goals for the data. Do you want it fast, cheap, easy, real-time, etc.? Those are usually mutually exclusive answers. A better question is what good examples of technology we’ve seen.”
Recommended data architecture and data pipeline technologies
Adam: “Okay, what are some examples of good technology you’ve seen?”
Allen: “Well, the core decision is the database. If you’re just getting started, we’ve found that for most datasets an RDBMS works fine and you don’t have to immediately jump into a big data solution. But as you scale you’ll quickly have to come back to fundamental goals: how fast and how cheap do you want it?
“As your dataset grows above hundreds of millions of rows per day, you may want to think about a big data solution. For hosted infrastructure we’ve used Athena, EMR running Presto and, most recently, Trino, a fork of Presto.”
Adam: “What is Trino? And why did you choose it?”
Allen: “Trino is a fork of Presto, which is Facebook’s answer to big data. It was developed internally, then open-sourced, as a scalable way to let internal teams answer questions they couldn’t otherwise answer because they were limited by existing technologies. They had to query petabytes of information that was otherwise hidden from business decisions. Presto enabled Facebook to answer these questions in minutes.
“Trino follows a coordinator-worker architecture where queries are submitted to a coordinator and that coordinator takes care of executing queries across a worker fleet. This way you can run distributed queries, vastly increasing the speed of queries over large amounts of data.
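The coordinator-worker idea Allen describes can be pictured with a toy scatter-gather sketch. This is not how Trino is implemented; it only assumes that data is already split into partitions, that each worker computes a partial result, and that a coordinator merges the partials. The function names and the sample rows are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def worker_count_rows(partition, predicate):
    """One worker: count the matching rows in a single partition."""
    return sum(1 for row in partition if predicate(row))


def coordinator_count(partitions, predicate, max_workers=4):
    """Coordinator: fan the partitions out to workers, then merge the partial counts."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = pool.map(lambda p: worker_count_rows(p, predicate), partitions)
        return sum(partials)


# Hypothetical data, already split into three partitions.
partitions = [
    [{"country": "US"}, {"country": "DE"}],
    [{"country": "US"}, {"country": "US"}],
    [{"country": "FR"}],
]

total = coordinator_count(partitions, lambda row: row["country"] == "US")
print(total)
```

In a real engine the workers are separate machines and the partitions are files or splits, but the shape is the same: many small partial answers merged into one.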
“A key strength is that Trino allows you to write SQL to pull data from many different data sources, using catalogs (locations of the data) and connectors (the method of access), so you can query across distributed formats. The same query could run across AWS S3 buckets, a PostgreSQL database and a MySQL database.
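As a rough sketch of what such a cross-source query might look like: Trino addresses tables as catalog.schema.table, so a single statement can join tables that live in different systems. The catalog, schema, table and column names below are hypothetical and depend entirely on how a given cluster is configured; the snippet only builds the SQL text and does not connect to anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    catalog: str  # a configured Trino catalog (hypothetical names below)
    schema: str
    name: str

    def fqn(self) -> str:
        """Fully qualified name in Trino's catalog.schema.table form."""
        return f"{self.catalog}.{self.schema}.{self.name}"


# Assumed setup: one catalog over files in S3, one over PostgreSQL, one over MySQL.
EVENTS = Table("hive", "analytics", "events")
USERS = Table("postgresql", "public", "users")
CAMPAIGNS = Table("mysql", "marketing", "campaigns")


def federated_query() -> str:
    """Build one SQL statement that joins tables from three catalogs."""
    return (
        "SELECT u.user_id, c.campaign_name, count(*) AS event_count\n"
        f"FROM {EVENTS.fqn()} AS e\n"
        f"JOIN {USERS.fqn()} AS u ON e.user_id = u.user_id\n"
        f"JOIN {CAMPAIGNS.fqn()} AS c ON u.campaign_id = c.campaign_id\n"
        "GROUP BY u.user_id, c.campaign_name"
    )


print(federated_query())
```

The resulting string would be submitted to the coordinator through whichever client you use; the point is only that the three sources appear side by side in one query.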
“This approach has not only allowed us to scale effectively but also vastly decreases the cost of data storage. We store compressed data in cold-storage paying 1/10th what you’d pay for Google’s BigQuery or AWS’s Athena.”
Read how AdLibertas uses Trino on their website.
Data architecture challenges and approaches to avoid
Adam: “What about data architecture and data pipeline approaches to avoid?”
Allen: “The single biggest problem we see is not having a centralized ID across your data providers. No shared ID means you can’t centralize a source of truth, and many different tables vastly increase complexity of access. Unfortunately, most app developers wait until the data ‘is needed’ before planning centralization, which leads to cobbled-together tools or reliance on legacy systems that are ‘good enough’ but don’t end up being trustworthy and are therefore useless to the business teams.
“The data you care about will guide all other conversations: schema, availability, budget and accuracy.”
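To make the shared-ID point concrete, here is a minimal sketch of why one user ID across providers matters. The two record lists stand in for exports from an ad revenue source and an install attribution source; the field names and values are made up for illustration. Because both carry the same user_id, revenue can be rolled up by install source in a few lines.

```python
from collections import defaultdict

# Hypothetical exports from two different providers, keyed by the same user_id.
ad_revenue = [
    {"user_id": "u1", "revenue_usd": 0.12},
    {"user_id": "u2", "revenue_usd": 0.05},
    {"user_id": "u1", "revenue_usd": 0.08},
]
installs = [
    {"user_id": "u1", "install_source": "campaign_a"},
    {"user_id": "u2", "install_source": "organic"},
]


def revenue_by_source(revenue_rows, install_rows):
    """Join the two datasets on user_id and total revenue per install source."""
    source_of = {row["user_id"]: row["install_source"] for row in install_rows}
    totals = defaultdict(float)
    for row in revenue_rows:
        # Users with no install record land in an explicit "unknown" bucket.
        source = source_of.get(row["user_id"], "unknown")
        totals[source] += row["revenue_usd"]
    return dict(totals)


print(revenue_by_source(ad_revenue, installs))
```

Without a shared ID, the lookup in the middle of that function has nothing to match on, and every downstream report inherits the gap.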
Adam: “What are some key challenges mobile app developers have when building out their data architecture?”
Allen: “From the beginning it’s difficult to outline the idea and how speed, cost and complexity of access will play together.
“From what we’ve learned, scale will always be an issue. Customers end up blowing budgets out of the water because their data pipelines are inefficient. As an example, a batched approach to gathering and processing data can be much more cost- and resource-effective than streaming data live. It can vastly change your architecture.
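One way to picture the batching trade-off is to count how many write operations the same events would trigger under each approach. The sketch below is an illustration under assumptions, not a measurement: the event shape, the 10,000-event volume and the batch size of 500 are all hypothetical, and real costs depend on the services involved.

```python
from itertools import islice


def batched(events, batch_size):
    """Yield lists of up to batch_size events from any iterable."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_writes(events, batch_size):
    """Number of write calls needed if each batch is sent as one write."""
    return sum(1 for _ in batched(events, batch_size))


# Hypothetical event stream.
events = [{"event": "level_complete", "user_id": f"u{i}"} for i in range(10_000)]

streaming_writes = count_writes(events, 1)    # one write per event
batched_writes = count_writes(events, 500)    # one write per 500 events

print(streaming_writes, batched_writes)
```

The price of the batched approach is latency: data only becomes visible when a batch is flushed, which is why the "how fast do you want it" question comes first.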
“Also, not outlining your schema at the outset can lead to problems – make sure you know what data you care about. Migrations can be painful and it can be difficult and costly to add scope if you miss fundamental metrics or information from the beginning. This means bringing in key stakeholders in product and marketing early on to outline which in-app events should be tracked.”
Hidden Costs: Learning the hard way – One of our customers was exploring Google Data Studio as an answer to visualization. Unfortunately, this customer built a dashboard that ran up a 1,800 EUR charge each time someone viewed it.
Where to start
Adam: “You’ve seen a lot of app developer data pipelines. What helpful information are app developers tracking?”
Allen: “I was talking to one customer who put it well: ‘it’s better to have the data and not use it, than to want the data and not have it.’
“An obvious and important in-app event is the install event. Most customers use an MMP to track install source, and combining user value with campaign data can give you access to valuable sources of new users and the ability to find look-alike audiences of profitable users. Next would be early-level engagement events. Are you tracking events that will uncover downstream value or engagement? We have a customer who’s tracking the number of games played on day one, and this showed them a pretty remarkable difference in value between users.
“However, you want to track as much as you can without negatively affecting your data pipeline. For instance, Firebase includes a default event, ‘screen_view’, which is similar to a pageview on the web, but in apps it fires on every app screen the customer visits. Depending on the app type this can be multiple screens in a second and can add up very quickly. Creating 1,000+ user events in an hour can balloon data size, and this level of granularity will come with a cost.”
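A quick back-of-the-envelope calculation shows how fast high-frequency events add up. Only the figure of 1,000 events per user per hour comes from the conversation; the 50,000 daily active users, one hour of use per day and 500 bytes per event are assumptions picked purely for illustration.

```python
def daily_event_bytes(events_per_user_hour, active_users, hours_per_user, bytes_per_event):
    """Rough daily raw data volume, in bytes, for one class of event."""
    return events_per_user_hour * active_users * hours_per_user * bytes_per_event


raw_bytes = daily_event_bytes(
    events_per_user_hour=1_000,  # from the interview
    active_users=50_000,         # assumption
    hours_per_user=1,            # assumption
    bytes_per_event=500,         # assumption
)

print(f"{raw_bytes / 1e9:.1f} GB per day")
```

Swapping in your own numbers before you turn an event on is a cheap way to see whether that level of granularity is worth paying for.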
Adam: “Final takeaways?”
Allen: “Start on your data pipeline early. Don’t push it off because it’s too hard, or you might miss out on key insights. It will only get harder with scale. Once you reach the point of wanting or needing access, you’ll likely want to view some historical performance. Measurement from day one is much less interesting than having historical data to compare against. Many architectures, such as Amazon’s Athena, let you store data cheaply and ‘pay for what you query’. This allows you to come back and get answers later, once you know which questions you want to ask.”
The AdLibertas management team. Left to right, Adam, Chris, Allen and Kirill on the way to a weekend retreat.
Interested in getting actionable information from your data?
We built Audience Reporting to be an easy way for app developers to combine and manage their data sources. With just a few clicks app developers can import, store and report on their user-level revenue data.
See more on the Audience Reporting site!