When I learned to code, I, like most people who start, found a textbook and began with exercises. There was little to no introduction to SaaS tools and integrations. Back then, there were fewer of the popular tools and libraries that teams use today. I assumed all technology solutions would be built entirely on custom code, and that the more complex the code, the more talented the engineer.
Recently, I’ve learned that as more software tools are built, being a better “engineer” often means picking the right tool for the job and using it well. That could be a programming language, library, or managed SaaS tool.
I’ve learned a lot at Electric about data ingestion. While I’ve written less complex code than expected, I’ve contributed more to our data ingestion efforts than anticipated!
In this post, I’ll share a quick primer on data ingestion, why it matters for any data analysis, and the tradeoffs and decisions behind Electric’s primary data ingestion tool: Stitch.
Background on Ingestion
A popular chart called “The AI Hierarchy of Needs” details the pyramid of needs for generating data insights at a company. The bottom level is the collection of data, and the next level is the movement (aka transformation) and storage of data.
To complete a high-quality data project that reaches an upper level in the pyramid, such as analytics, one must complete at least one task mentioned inside the pyramid for each level below analytics. There’s no skipping levels.
For example, let’s say I had website analytics data tracked and recorded by Google Analytics and enterprise sales data tracked by Salesforce. I may want to combine these datasets to see how company size correlates with web activity. Each service manages collection and storage in its own database. Next, the data must be ingested, i.e. moved and stored in a single destination, likely a spreadsheet or data warehouse (a type of database). Then come the necessary explore/transform steps to prep the data for analysis. Last is the aggregate/label level to find correlations in the data.
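To make the payoff of that move/store step concrete, here’s a minimal sketch using SQLite as a stand-in warehouse. The table names, columns, and numbers are all illustrative (not any real Stitch or Google Analytics/Salesforce schema); the point is that once both sources land in one database, a single SQL join answers the cross-source question.

```python
import sqlite3

# Toy "warehouse": both sources have already been ingested into one database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ga_sessions (account_domain TEXT, sessions INTEGER);
CREATE TABLE sf_accounts (domain TEXT, company_size INTEGER);
INSERT INTO ga_sessions VALUES ('acme.com', 420), ('globex.com', 37);
INSERT INTO sf_accounts VALUES ('acme.com', 5000), ('globex.com', 45);
""")

# One join relates company size to web activity across both sources.
rows = conn.execute("""
    SELECT sf.company_size, ga.sessions
    FROM ga_sessions AS ga
    JOIN sf_accounts AS sf ON sf.domain = ga.account_domain
    ORDER BY sf.company_size
""").fetchall()
print(rows)  # [(45, 37), (5000, 420)]
```

Without ingestion into a shared destination, this join would mean manually exporting and reconciling CSVs from each tool every time the question comes up.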
Many people misunderstand or forget about the foundational lower levels of the pyramid. Years back, that was me. I wanted to break into “data science”. I studied the hot topics in the data industry which were data analysis with Python, stats, and machine learning (ML). When asked to complete a data project that required expertise at the learn/optimize level, I floundered. I didn’t have knowledge of the foundational levels of the pyramid.
I think anyone who wants to analyze data, whether you’re on a data-driven team or not, should have a basic understanding of data ingestion. It’s likely this topic will come up at work. Data is typically siloed in several systems at a company.
HR tools such as Workday or Gusto
Payments such as Shopify or Stripe
CRMs such as Salesforce or Close.io
Help desk software such as Zendesk or Help Scout
Web analytics such as Google Analytics or Heap
Databases for web/mobile applications such as Postgres, MySQL, and NoSQL stores
The WSJ reported a couple years ago, “The number of software apps deployed by large firms across all industries worldwide has increased 68% over the past four years, reaching an average of 129 apps per company by the end of 2018, according to an analysis by Okta Inc.”
How do you analyze data from all of these SaaS apps and generate insights? First, a foundational question must be answered: how do you repeatedly and efficiently join these sources for analysis? That’s where data ingestion comes in. In industry, data pipelines are built to continuously replicate data from these sources to a data warehouse, a type of database ideal for analytics.
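The “continuously replicate” part of such a pipeline is usually incremental: each run copies only rows changed since the last run, tracked by a bookmark. Here’s a hedged sketch of that idea with hypothetical in-memory structures standing in for a source table, a warehouse, and saved state; managed tools like Stitch handle this bookkeeping (plus retries, schema changes, and scheduling) for you.

```python
def incremental_sync(source_rows, warehouse, state):
    """One batch of a minimal extract-load step: upsert only rows whose
    updated_at is newer than the saved bookmark, then advance the bookmark."""
    bookmark = state.get("last_updated_at", "")
    new_rows = [r for r in source_rows if r["updated_at"] > bookmark]
    for row in new_rows:
        warehouse[row["id"]] = row  # upsert by primary key
    if new_rows:
        state["last_updated_at"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)

source = [
    {"id": 1, "name": "Acme", "updated_at": "2021-05-01"},
    {"id": 2, "name": "Globex", "updated_at": "2021-05-03"},
]
warehouse, state = {}, {}
incremental_sync(source, warehouse, state)  # first run loads both rows
source[0]["updated_at"] = "2021-05-04"      # Acme changes after the first run
loaded = incremental_sync(source, warehouse, state)
print(loaded, state["last_updated_at"])     # 1 2021-05-04
```

Each scheduled run re-reads the bookmark, so the pipeline stays cheap no matter how large the source table grows.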
Electric’s Transition to Functional Data Ingestion
I joined Electric in May 2020. At the time, the company was four years old and had grown at an incredible pace to reach 150 employees, but had just one data team individual contributor (IC). A year later, that pace has continued; we have 328 employees and two data team ICs.
At Electric, we manage real-time IT operations for companies. Our employees need timely, reliable and actionable insights to help them make informed daily decisions.
We, the data team, are inundated with requests from employees. Our data team’s biggest challenge is to generate many quality insights for those teams given limited resources. In other words, how do we do more with less (resources)?
Electric’s engineering team uses a microservices architecture. Transactional data from multiple web apps is stored in many AWS OLTP databases and DynamoDB. We also collect and store critical data in Google Sheets, Salesforce, Pendo, Zuora (payments), and many more SaaS tools.
About a year ago, the data team was still at one IC. Other teams including operations, customer-facing IT-support resolution, product and engineering had grown drastically. They each had increased needs for analyses yet the data team had limited bandwidth to help.
Like most data teams, we wanted to focus our efforts on data-driven insights that could immediately benefit stakeholders. Those analyses would require significant effort to understand the business’s context, clean and transform data into metrics, and build visualizations custom to our business. But the logic for our data pipelines to extract data from sources and load it into a destination didn’t need to be customized. Tens of thousands of companies face the exact same challenge: ingesting data from popular sources into popular destinations, for example a transactional Postgres database into a Snowflake data warehouse, or Salesforce into Snowflake. At Electric, there was no need to reinvent the wheel for these data pipelines.
The team decided to implement Stitch, a SaaS tool that fully manages our data pipelines. It was the best solution given our resource constraints. Our expectation was that Stitch would require minimal effort to implement and maintain, be reasonably priced, and be a best-in-class solution.
Here’s a diagram that illustrates how Stitch works for us.
Pros and Cons of Stitch
I’ll cover a few pros and cons of Stitch that have been valuable to Electric.
The common theme across all of Stitch’s pros is that I can spend minimal time on ingestion. True to our data team’s mission, that frees up time for data platform improvements and analyses with immediate business value.
With Stitch, it’s quick to create a data pipeline: I can set up a new one in under an hour. A custom solution, by contrast, could take weeks to build and would likely be lower quality and buggier than Stitch.
Stitch has hundreds of pre-built connectors for source integrations and destinations, covering popular tools including Salesforce, AWS databases, Google Sheets, Stripe, and more. This reassures us that we can likely use Stitch for future data integration needs, such as ingesting data from Pendo and Asana.
Stitch handles a large amount of complex logic without us writing any code: connecting to sources, orchestration, scheduling, monitoring, schema migrations, and scaling compute appropriately. This has saved us significant time compared to the alternative of building a custom data pipeline, essentially an in-house Stitch.
In more than a year of heavy usage, Stitch has been reliable. We’ve spent minimal time on ongoing maintenance. Additionally, there’s best-in-class observability to view outcomes of replication jobs, detailed logs, and notifications to Slack or email on data replication job failures.
Stitch has an intuitive GUI to set up these data replications, schedule replication frequency, and select which tables and fields to replicate from sources. Also, since Stitch is a paid SaaS solution, it comes with a customer support team; they have strong technical chops and have helped solve a few minor problems. The GUI and responsive support make it easy for anyone on the data team to contribute to our Stitch data pipelines without extensive data engineering knowledge. This has made my life as a data engineer easier: I don’t need to spend as much time on documentation and training to educate others on our data pipelines.
One limitation of Stitch is that it’s strictly a batch processing tool. The fastest data can be replicated from source to destination is roughly every 15 minutes. So far, that delay has been sufficient for most of Electric’s analyses. In the future, however, the data team will likely invest in new pipelines to make closer-to-real-time data available for reporting.
Another downside of Stitch, as with any managed SaaS tool, is that it can become expensive. Companies pay Stitch based on the count of replicated rows per month. Electric collects minimal data, so Stitch is affordable for most of our needs. Our initial plan, covering replication of up to 5M rows per month, was $100/month. We didn’t bat an eye at that amount. However, if your company replicates massive amounts of data, on the order of hundreds of millions of records a month or more, I’d suggest considering self-hosted alternatives.
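A quick back-of-the-envelope helper makes the row-based pricing tradeoff concrete. Note the tier numbers below are illustrative assumptions for the sketch; only the $100/month for up to 5M rows reflects our actual starter plan, so check current vendor pricing before relying on any of these figures.

```python
def monthly_cost(rows_per_month, tiers):
    """Return the monthly price for the cheapest tier that fits the
    replicated-row volume, or None if volume exceeds every tier."""
    for max_rows, price in sorted(tiers):
        if rows_per_month <= max_rows:
            return price
    return None  # beyond published tiers: time to evaluate self-hosting

# Hypothetical tier table: (max replicated rows per month, USD per month).
tiers = [(5_000_000, 100), (25_000_000, 500), (100_000_000, 1250)]

print(monthly_cost(4_200_000, tiers))    # 100  -> comfortably in the starter tier
print(monthly_cost(300_000_000, tiers))  # None -> row-based pricing stops scaling
```

At Electric’s volumes the answer stays in the cheapest tier, which is exactly why a managed tool pencils out for us; a team replicating hundreds of millions of rows would hit `None` and should price out self-hosted options instead.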
Final Stitch Review
If I could go back in time and choose another ingestion tool instead of Stitch, would I? Absolutely not! It’s been a lifesaver and one of my favorite software tools.
I’d recommend Stitch to companies if most of the following conditions are met:
Minimal data engineering resources
Small-ish data, typically a need to replicate less than 100M records per month
Tolerance for insights that are ~15 minutes stale (not real-time)
Ingestion needs from multiple sources
Remember, Stitch can manage some data pipelines, but not necessarily all. There are alternative cloud solutions such as Fivetran, Tray, and Rivery, as well as open-source, self-hosted options like Meltano and Airbyte. These vary in connector availability, cost, and data freshness.
With selection of any tool, consider your resource constraints and business requirements to pick what’s best for you.
Thank you all for reading! Feel free to reach out to me at firstname.lastname@example.org if you have questions on anything data related. Stay tuned to the Electric engineering blog for further posts on our engineering systems and projects.