snake_doc 12 hours ago

These just seem like over-engineered solutions trying to guarantee their job security. When the dataflows are this straightforward, just replicate into the OLAP of your choice and transform there.

  • moltar 12 hours ago

    I came from traditional engineering into data engineering by accident and had a similar view. But every time I tried to build a pipeline from first principles, it eventually turned into something like this for a reason. This is especially true when trying to bridge many teams and skillsets - everyone wants their favourite tool.

  • jchandra 5 hours ago

    Our approach wasn't about over-engineering; we were trying to leverage our existing investments (like Confluent BYOC) while optimizing for flexibility, cost, and performance. We wanted to stay loosely coupled so we could adapt to cloud restrictions across multiple geographic deployments.

  • polskibus 11 hours ago

    What is the current state of the art (open source) for OLTP-to-OLAP pipelines these days? I don't mean a one-off ETL-style load at night, but a continuous process with relatively low latency.

    • williamdclt 11 hours ago

      Idk what the state of the art is, but I've used change data capture with Debezium and Kafka, sinked into Snowflake. Not sure Kafka is the right tool since you don't need persistence, and having replication slots makes a lot of operations (e.g. DB engine upgrades) a lot harder though.
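
      For reference, a minimal sketch of that kind of Debezium setup, registering a Postgres source connector against a Kafka Connect REST endpoint (hostnames, credentials and table names here are placeholders, not anyone's actual config):

          # Register a Debezium Postgres source connector with Kafka Connect.
          # Connection details and table list below are illustrative only.
          import requests

          connector = {
              "name": "orders-cdc",
              "config": {
                  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                  "plugin.name": "pgoutput",                # logical decoding plugin
                  "database.hostname": "oltp-db.internal",
                  "database.port": "5432",
                  "database.user": "debezium",
                  "database.password": "********",
                  "database.dbname": "app",
                  "topic.prefix": "app",                    # Kafka topic namespace
                  "table.include.list": "public.orders,public.payments",
              },
          }

          resp = requests.post("http://localhost:8083/connectors", json=connector, timeout=30)
          resp.raise_for_status()  # each captured table now streams into its own Kafka topic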

1a527dd5 12 hours ago

There is something here that doesn't sit right.

We use BQ and Metabase heavily at work. Our BQ analytics pipeline is several hundred TBs. In the beginning we had data (engineer|analyst|person)s run amok and run up a BQ bill of around 4,000 per month.

By far the biggest issues were:

- partition key was optional -> fix: make it required

- queries bypassed the BQ caching layer -> fix: make queries use deterministic inputs [2]

It took a few weeks to go through each query using the metadata tables [1], but it was worth it. In the end our BQ analysis pricing was down to something like 10 per day.

[1] https://cloud.google.com/bigquery/docs/information-schema-jo...

[2] https://cloud.google.com/bigquery/docs/cached-results#cache-...
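
For illustration, the two fixes above boil down to something like this with the BigQuery Python client (a hedged sketch; the dataset, table and column names are made up, and your numbers will vary):

    # Sketch: force partition pruning and keep query text deterministic so
    # BigQuery's 24h result cache can actually be hit. Names are illustrative.
    import datetime
    from google.cloud import bigquery

    client = bigquery.Client()

    # 1. Make the partition filter mandatory so nobody can full-scan the table.
    client.query(
        "ALTER TABLE analytics.events SET OPTIONS (require_partition_filter = TRUE)"
    ).result()

    # 2. Resolve "today" client-side instead of calling CURRENT_DATE() in SQL.
    #    The query text is now deterministic, so repeated dashboard refreshes
    #    within the day can be served from cache instead of re-scanning the partition.
    day = datetime.date.today().isoformat()
    sql = f"""
        SELECT country, COUNT(*) AS orders
        FROM analytics.events
        WHERE event_date = DATE '{day}'   -- also satisfies the partition filter
        GROUP BY country
    """
    rows = list(client.query(sql).result())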

SkyPuncher 12 hours ago

I know it's easy to be critical, but I'm having trouble seeing the ROI on this.

This is a $20k/year savings. Perhaps I'm not aware of the pricing in the Indian market (where this startup is), but that simply doesn't seem like a good use of time. There's an actual cost to doing these implementations, both in hard financial dollars (salaries of the people doing the work) and in the trade-off of deprioritizing other work.

  • paxys 12 hours ago

    The biggest issue IMO is that engineers who work on projects like these inevitably get bored and move on, and then the company is stuck trying to add features, fix bugs and generally untangle the mess, all taking away time and resources from their actual product.

  • mattbillenstein 11 hours ago

    Yeah, but you can always make this argument and build nothing - dealing with all the problems of every 3rd party SaaS / PaaS under the sun. Sometimes it's much easier to just build the thing; then you know where its limitations are and you can address them over time.

  • naijaboiler 7 hours ago

    Even with one engineer working on this for only one month a year, it's still cost-ineffective.

  • TZubiri 12 hours ago

    >Perhaps, I'm not aware of the pricing in the Indian market

    It's approximately 4 annual salaries (non-dev).

bob1029 11 hours ago

When working with ETL, it really helps to not conflate the letters or worry about them in the wrong order. A lot of the most insane complexity comes out of moving too quickly with data.

If you don't have good staging data after running extraction (i.e., a 1:1 view of the source system data available in your database), there is nothing you can do to help with this downstream. You should stop right there and keep digging.

Extracting the data should be the most challenging aspect of an ETL pipeline. It can make a lot of sense to write custom software to handle this part. It is worth the investment because if you do the extraction really well, the transform & load stages can happen as a combined afterthought [0,1,2,3] in many situations.

This also tends to be one of the fastest ways to deal with gigantic amounts of data. If you are doing things like pulling 2 different tables and joining them in code as part of your T/L stages, you are really missing out on the power of views, CTEs, TVFs, merge statements, etc.
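
As a rough illustration of "transform and load as an afterthought" (a hedged sketch, not anyone's actual pipeline; the schema and column names are invented, and it assumes extraction has already landed a 1:1 staging table):

    # Sketch: once extraction has landed a faithful copy in stg.customers,
    # the transform + load step can be a single set-based MERGE (Postgres 15+).
    import psycopg

    MERGE_SQL = """
    MERGE INTO dw.customers AS tgt
    USING (
        SELECT id,
               lower(email)           AS email,     -- light transforms inline
               nullif(trim(name), '') AS name,
               updated_at
        FROM stg.customers
    ) AS src
    ON tgt.id = src.id
    WHEN MATCHED AND src.updated_at > tgt.updated_at THEN
        UPDATE SET email = src.email, name = src.name, updated_at = src.updated_at
    WHEN NOT MATCHED THEN
        INSERT (id, email, name, updated_at)
        VALUES (src.id, src.email, src.name, src.updated_at);
    """

    with psycopg.connect("dbname=warehouse") as conn:
        conn.execute(MERGE_SQL)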

[0] https://learn.microsoft.com/en-us/sql/t-sql/statements/merge...

[1] https://www.postgresql.org/docs/current/sql-merge.html

[2] https://docs.oracle.com/database/121/SQLRF/statements_9017.h...

[3] https://www.ibm.com/docs/en/db2/12.1?topic=statements-merge

  • mulmen 10 hours ago

    > Extracting the data should be the most challenging aspect of an ETL pipeline.

    Why should this be difficult? It’s the easiest part. You run SELECT * and you’re done.

    The difficult part is transforming all the disparate upstream systems and their evolving schemas into a useful analytical model for decision support.

    • bob1029 9 hours ago

      Not all data lives in a SQL database. Much of the extraction code I write does things like loading flat files from unusual sources and querying APIs.

      If the source data is already in a SQL store, then the solution should be obvious. You don't need any other tools to produce the desired view of the business at that point. Transforming for an upstream schema is a select statement per target table. This doesn't need to be complicated.

      • mulmen 8 hours ago

        Yeah I extract a lot of data out of Dynamo. It’s still the easiest part. Change capture just isn’t complicated. You need some basic constructs and then you’re golden. The data mart design phase is orders of magnitude more effort.
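
        For context, the "basic constructs" for a DynamoDB stream can be as small as this (a hedged sketch; the stream ARN and the handoff step are placeholders, and real code would walk every shard and checkpoint its position):

            # Sketch: poll one DynamoDB stream shard and print each change event.
            import boto3

            streams = boto3.client("dynamodbstreams")
            stream_arn = "arn:aws:dynamodb:us-east-1:123456789012:table/orders/stream/2024-01-01T00:00:00.000"

            shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
            iterator = streams.get_shard_iterator(
                StreamArn=stream_arn,
                ShardId=shard["ShardId"],
                ShardIteratorType="TRIM_HORIZON",   # or LATEST / AFTER_SEQUENCE_NUMBER
            )["ShardIterator"]

            while iterator:
                page = streams.get_records(ShardIterator=iterator, Limit=1000)
                for record in page["Records"]:
                    # eventName is INSERT / MODIFY / REMOVE; NewImage/OldImage carry
                    # the item before and after the change, depending on view type.
                    print(record["eventName"], record["dynamodb"]["Keys"])  # hand off to the load step here
                iterator = page.get("NextShardIterator")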

ripped_britches 12 hours ago

So you saved just $20k per year? I don't know your company's context, but I'm not sure this turns out to be a net win given the cost of the engineering resources needed to produce this infra gain.

  • nijave 4 hours ago

    It can sometimes make sense, if you're trying to optimize cost per customer, for instance. The usual assumption is that you want the financials to look more attractive to raise funds, or you expect to scale customer count and want to keep costs flat-ish.

  • TZubiri 12 hours ago

    If it's only for cost savings it's a hard sell.

    But generally rolling your own has other benefits.

    • crazygringo 11 hours ago

      > But generally rolling your own has other benefits.

      Not for startups it doesn't. The only rolling-your-own they should be doing is their main product.

      Once you get bigger with many hundreds of employees, and existing software starts becoming a measurable blocker, then you can gradually build your own stuff once the tradeoffs actually make sense. But it generally takes a while for that to become the case. For startups, premature optimization isn't the root of all evil, but it's the root of a lot of it.

      • TZubiri 11 hours ago

        You might argue that it is never worth it to roll anything on your own (which is already an extreme position), but to argue that it has no benefits (other than cost?) suggests either bad reading comprehension or an overzealous jump to the keyboard to type out the dogma that you should never roll your own, and should instead download 1,000 dependencies plus a dependency manager to handle all the version conflicts.

        • crazygringo 10 hours ago

          It's not being overzealous.

          If you're a startup and you need something and it already exists, why would you write your own instead? Even if it's clunky, it saves you time.

          And for startups, speed and runway are everything. Rolling your own is not a luxury you have.

          Especially when you consider how much of the startup experience is changing requirements, pivoting, throwing out code, etc.

          And I've never seen 1,000 dependencies in my life. Sure maybe you have 50. That's entirely manageable. Think how much time you'd lose writing and testing all that stuff yourself...

    • vood 11 hours ago

      Rolling your own generally has mainly downsides in the context they are in. 1. This is clearly a small team with very little spend. 2. Tomorrow someone leaves and the next engineer will have to manage all of this. 3. I don't think they realize that they actually increased the cost of this service, not decreased it. Now they need to manage their own Kafka month after month. Engineering time is expensive.

      • TZubiri 11 hours ago

        Yes, rolling your own X has downsides and it has upsides.

        Welcome to tradeoff engineering.

jchandra 15 hours ago

We did have a discussion on self-managed vs. managed and the TCO associated with each.

1. We have a multi-regional setup, so data sovereignty requirements came up.

2. Vendor lock-in: a few of the services were not available in that geographic region.

3. With managed services, you often pay for capacity you might not always use. Our workloads were consistent and predictable, so self-managed solutions helped us fine-tune our resources.

4. One of the goals was to keep our storage and compute loosely coupled while staying Iceberg-compatible for flexibility. Whether it's Trino today or Snowflake/Databricks tomorrow, we aren't locked in.

tacker2000 12 hours ago

Is Debezium the only good CDC tool out there? I have a fairly simple data stack and am looking at integrating a CDC solution, but I really don't want to touch Kafka just for this. Are there any easier alternatives?

  • gunnarmorling 7 minutes ago

    You don't need Kafka in order to run Debezium. While that certainly is the most common deployment, you can also use Debezium with things like Kinesis, Google Cloud Pub/Sub, or even plain HTTP sinks, all via Debezium Server (https://debezium.io/documentation/reference/stable/operation...). You can even embed it as a library in your application for the highest degree of flexibility.

    That said, streaming events into Kafka (or something like it), rather than pushing change events directly to destinations like Snowflake has big advantages when it comes to reprocessing data, setting up additional consumers, etc.

  • nchmy 11 hours ago

    Conduit.io is where it's at. FAR more source and destination connectors, easier to deploy etc... Pair it with NATS to replace Kafka's mess

  • moltar 12 hours ago

    AWS has DMS with a serverless option which is great for CDC

    • nxm 11 hours ago

      Except when a new column is added upstream and DMS inserts it not at the end but before the DMS timestamp fields, breaking the index-based loads of the DMS files.

rockwotj 12 hours ago

Why confluent instead of something like MSK, Redpanda or one of the new leaderless, direct to S3 Kafka implementations?

  • thecleaner 11 hours ago

    I think they do mean Kafka. Anyway, there are connectors from Kafka to a bunch of things, so I think it's a reasonable choice.

reillyse 12 hours ago

How much did this cost in engineering time and how much will it cost to maintain? How about when you need to add a new feature? It seems like you saved roughly 1.5k per month, which pays for a couple of days of engineering time (ignoring product/mgmt and the costs of maintaining the software).

  • grayhatter 11 hours ago

    No idea how many hours it took to build, but I maintain something similar (different stack though) and it's so trivial I don't even count the hours; call it half a day of maintenance every 3 months?

    Even if you needed to invent a new feature, you could invent a month's worth of features every year and still save money.

    • nxm 11 hours ago

      Keeping components up to date and dealing with schema changes will easily take more than half a day every 3 months.

  • reillyse 10 hours ago

    I know this is off the topic of the actual post, but I'm confused as to why I've been downvoted; other people have since made similar comments without being downvoted. Help me out here: why?

mosselman 12 hours ago

You'd think that pushing all of the data into any OLAP database, especially some of the newer Postgres-based ones, would give you all the performance you need at 10% of the cost? Let alone sparing you all the maintenance of the mind-boggling architecture drawing.

vivahir215 15 hours ago

Good read.

I do have a question on BigQuery. If you were experiencing unpredictable query costs or customization issues, that sounds like user error. There are ways to optimize, or to commit slots to reduce the cost. Did you try that?

  • jchandra 15 hours ago

    As for BigQuery, while it's a great tool, we faced challenges with high-volume, small queries where costs became unpredictable because it is priced per data volume scanned. Clustered tables and materialized views helped to some extent, but they didn't fully mitigate the overhead for our specific workloads. There are ways to overcome and optimize this for sure, so I wouldn't exactly blame GBQ or any of its limitations.

    It's always a trade-off, and we made the call that best fit our scale, workloads, and long-term plans.

    • throwaway7783 12 hours ago

      Did you consider the slots-based pricing model for BQ?

    • vivahir215 15 hours ago

      Hmm, Okay.

      I am not sure whether managing a Kafka Connect cluster is too expensive in the long term. This solution might work for you based on your needs, but I would suggest looking at alternatives.

thecleaner 11 hours ago

This is a little bit of a word soup. It's hard to see why the various redesigns were done without a set of requirements. I don't get why you'd trigger Airflow workflows for doing CDC; these things were designed for large-scale batch jobs rather than doing CDC on some Google Sheets. Either way, without scale numbers it's hard to see why PG was used or why the shift to BigQuery. Anyway, the site uses Hugo, which actually sticks out for me.

65 10 hours ago

Why is data engineering so complicated?

I'm not a data engineer but was tasked with building an ETL pipeline for a large company. It's all just Step Functions, looping through file streams in a Lambda, transforming, then putting the data into Snowflake for the analytics team to view. My pipeline processes billions of rows from many different sources. Each pipeline runs daily on a cron job (maybe that's the key differentiator, we don't need live streaming data, it's a lot of point of sale data).

Whenever I hear actual data engineers talk about pipelines there are always a million different tools and complicated-sounding processes. What am I missing?
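
For reference, the shape described above is roughly this per file (a hedged sketch only; bucket, table, column and credential names are placeholders, not the commenter's actual code):

    # Sketch: one Lambda step that pulls a file from S3, applies a row-level
    # transform, and loads the batch into Snowflake.
    import csv
    import io

    import boto3
    import snowflake.connector

    def handler(event, context):
        s3 = boto3.client("s3")
        obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
        body = obj["Body"].read().decode("utf-8")          # fine for per-file batches
        reader = csv.DictReader(io.StringIO(body))

        # Transform: normalise the point-of-sale rows before loading.
        rows = [(r["store_id"], r["sku"], float(r["amount"])) for r in reader]

        conn = snowflake.connector.connect(
            account="xy12345", user="etl_user", password="********",
            warehouse="LOAD_WH", database="ANALYTICS", schema="POS",
        )
        try:
            conn.cursor().executemany(
                "INSERT INTO sales (store_id, sku, amount) VALUES (%s, %s, %s)", rows
            )
        finally:
            conn.close()
        return {"loaded": len(rows)}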

  • dijksterhuis 8 hours ago

    > What am I missing?

    Hi, data engineer here (of sorts). I'll take a crack.

    > It's all just Step Functions, looping through file streams in a Lambda, transforming, then putting the data into Snowflake for the analytics team to view.

    So, you're right. Data going from A -> B is the most important part of the job. Like, the pipeline you've got is like the 60-80% critical bit. Get data from systems A through Y and put it into system Z. Without it, there's no point talking about anything else.

    But what happens when you need to roll back all of the data from system F for the last week? What if you need to roll back system C for the last 2 days as well? How do you keep track of whether data from systems G, H and I is still "good" quality? How do you handle frequently changing source systems, which could drop, rename, or add columns on a daily basis? Do you ignore those columns forever? Or do you siphon that data off to a special place for manual review? Do you process all previously seen data every single night? If you only process the newest subset of data, what do you do when you notice a bug in the pipeline code? Process everything again and spend a whole bunch of money every time we fix buggy pipeline code? Or do a "migration" on already-processed data? How do you track which subsets were processed with which version of your pipeline? How did that run get triggered -- was it a Glue crawler on a scheduled cron job, or did someone trigger it manually? Does any of this actually matter? Does anyone even look at the dashboards any more?

    Getting the pipeline running is the first bit. With non-volatile/slow-changing source data/systems, you're fine with just the pipeline most of the time. Depending on the volume of the data, you could probably get away with just rerunning the pipeline over all previously seen data when there's been a bugfix and just eat the cost. You don't need anything more complicated than that.

    But if the source data/systems are volatile [0], then keeping it working, keeping it accurate and keeping it cheap comes next. And the complexity of that depends on how complex X -> Y is.

    A simple "back of a napkin" way to determine complexity is to sit down and ask some questions about the 6 V's: Volume; Velocity; Variety; Veracity; Value; Variability [1]. It sounds like [2] you have high volume, some velocity (nightly), some variety, low veracity, high value and low variability data. There's a lot of data coming from multiple source systems, and it's important data. But it doesn't sound like it changes that much, and it's probably all in the same or similar formats [3] and good quality. So... a simple pipeline triggered nightly works in that case.

    When things get more complicated, that's when the system moving data from A -> B gets more complicated.

    ---

    Having said all that, there has been a lot of resume driven development in the industry. At $LAST_JOB the previous team thought it would be a great idea to roll their own job orchestrator product [4]. from scratch. in django. not a library that django called, not some django API endpoints that forwarded requests to some Airflow backend. nope. a job orchestrator written from scratch in django.

    Because "bare metal" apparently.

    They were also obsessed with this idea of reproducibility, but what they actually wanted was idempotency... but that's a story for another reply.

    ---

    [0]: even if each upstream is slow to change, there could still be at least one upstream change breaking something every day, so the cumulative effect is that the upstreams together are volatile even if individually they are not

    [1]: https://www.geeksforgeeks.org/5-vs-of-big-data/

    [2]: I have no idea if any of this is accurate, this is literally me taking a punt based on like a sentence in an HN comment.

    [3]: by formats i don't mean different table structures, i mean table vs raw binary files vs JSON vs raw text with no delimiters

    [4]: all those engineers had left by this point due to covid furlough and some other stuff

moandcompany 11 hours ago

> "We are a fintech startup helping SMEs raise capital from our platform where we provide diverse financial products ranging from Term Loan, Revenue Based Financing to Syndication, we face one unique data challenge: Our data comes from everywhere."

Reading carefully: the result of this work yields an expected $21,000 USD in annual operating cost savings for infrastructure services.

Is this resume driven development?

What was the opportunity cost of this work? Is the resulting system more or less maintainable by future employees/teammates?