As someone who has consulted with companies helping them build out a datalake, I always seem to…

Mar 25, 2023

As someone who has consulted with companies to help them build out data lakes, I always seem to come to the same conclusion: for most companies, Databricks is the "easy button".

It always depends on context. Do you need Delta Lake, Iceberg, Hudi, etc.? Do you have the skills to build things out natively in AWS (and maintain them), or would EMR be the better answer? Glue can work for many companies and use cases, and I will always say start there, especially if you are just getting started with "big data". It kind of goes back to the YAGNI design principle.
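To make that line of questioning concrete, here is a rough sketch of how I might write it down as code. Everything in it is an assumption for illustration: the field names, the order of the checks, and the fallback to EMR are my own shorthand, not an official decision tree from AWS or Databricks.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical inputs; these field names are illustrative only."""
    has_aws_skills: bool        # can the team build and maintain things natively in AWS?
    new_to_big_data: bool       # just getting started with "big data"?
    glue_covers_use_case: bool  # does Glue handle the workload as it stands today?


def suggest_starting_point(ctx: Context) -> str:
    """Return a suggested starting point (an assumed ordering of the questions)."""
    # Without the skills to build and maintain a native stack, take the "easy button".
    if not ctx.has_aws_skills:
        return "Databricks"
    # YAGNI: start with Glue when you are new to this or Glue already covers the need.
    if ctx.new_to_big_data or ctx.glue_covers_use_case:
        return "Glue"
    # Otherwise look at EMR for the use cases Glue does not fit.
    return "EMR"


if __name__ == "__main__":
    team = Context(has_aws_skills=True, new_to_big_data=True, glue_covers_use_case=False)
    print(suggest_starting_point(team))
```

A real evaluation weighs far more than three booleans (cost, table format, governance, existing contracts), so treat this as a conversation starter rather than a recommendation engine.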

Another thing to note: if you look at the progression of AWS's data and analytics (D&A) offerings, they continue to mature their products. Hence we are now at Glue 4.0, and you have EMR Serverless for those use cases. And now we are seeing DataZone, so I think the catalog issue goes away very soon as well, because lineage is near the top of the DataZone team's priorities. You have to take that into consideration too, which is why I like the native AWS approach as a starting point. At any given point in time, the analysis might favor a non-native approach, but you have to remember that these are long-running projects and time is a slowly changing dimension. And I will typically bet that AWS has more financial resources to commit to product development than any ISV or startup in the space.

With all that said, usually, the easy button answer is Databricks.

Written by Nathan Hanks