4 Limitations of Snowflake
Data
Jul 7, 2024


Julian Gilyadov

A bit of context...

In today’s data-driven world, cloud data warehouses such as Snowflake and BigQuery have become staples for powering business intelligence (BI) dashboards, helping thousands of companies make strategic decisions. These warehouses excel at aggregating data for BI analysis. In fact, we have been using Snowflake to support our BI efforts for years.

Yet, these cloud data warehouses have a critical limitation: the inability to handle real-time data.

And as the demand for real-time data processing grows, this capability is no longer just a luxury, but a necessity for companies to remain competitive in today’s market.

The most impactful data is often only useful when it is acted upon quickly after it is generated. Traditional data warehouses, with their batch architecture, fail to capture that value in time. This growing gap in the market for an architecture optimized for real-time is the reason we built Airfold.

“In today’s rapidly evolving market, real-time data isn't just a cherry on top; it's the whole sundae.”

Let's dive into why cloud data warehouses like Snowflake, for all their merits, fall short on real-time workloads, and how Airfold can turn any cloud data warehouse into a powerful backend for real-time applications.

Note, we'll use Snowflake as an example, but this applies to all cloud data warehouses, and even data lakehouses.

Snowflake is a financial avalanche

For any data company, Snowflake is already one of the largest, if not the largest, costs to manage. Every data team knows that Snowflake can become very expensive very fast.

Since Snowflake charges per second of compute, keeping Snowflake running 24/7 racks up an exorbitant bill. So companies try to keep Snowflake paused for as long as possible. However, you can't run queries on Snowflake while it's paused. So to reduce costs, companies accumulate all the queries they wish to run and activate Snowflake only a few times a day to run them all at once. This approach works well for batch and BI scenarios - it’s what Snowflake was designed for.

However, if you intend to use Snowflake as a backend for a real-time application or to serve many external users, you’ll have to keep Snowflake running 24/7.

Moreover, each Snowflake warehouse can handle only a limited number of concurrent requests - even with multiple clusters, meeting a growing workload means adding more costly warehouses. Running just three X-Large instances 24/7 will cost you roughly $110k per month.
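As a sanity check on that figure, here's a back-of-envelope calculation. The 16-credits-per-hour rate for an X-Large warehouse comes from Snowflake's published size scale; the price per credit varies by edition and cloud region, so the $3.10 used below is an assumption, not a quote.

```python
# Back-of-envelope check of the "$110k per month" figure for three
# X-Large warehouses running 24/7.

XLARGE_CREDITS_PER_HOUR = 16   # Snowflake's published rate for an X-Large warehouse
PRICE_PER_CREDIT = 3.10        # assumed (varies by edition/region, USD)
HOURS_PER_MONTH = 730          # average month (8,760 hours / 12)

def monthly_cost(num_warehouses: int) -> float:
    """Cost of keeping `num_warehouses` X-Large warehouses on 24/7."""
    credits = num_warehouses * XLARGE_CREDITS_PER_HOUR * HOURS_PER_MONTH
    return credits * PRICE_PER_CREDIT

print(f"${monthly_cost(3):,.0f} per month")  # roughly $108,624
```

Under these assumptions, three always-on X-Large warehouses land right around the $110k/month mark quoted above.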

That’s why Snowflake should never be used as an application backend. It’s costly enough as it is.

Snowflake cannot handle high concurrency & low latency

Even if you decide to simply “throw money at it” and keep Snowflake running 24/7 as an application backend, you will soon encounter constraints related to freshness, latency, and concurrency.

Snowflake processes queries through a job pool, which delays execution - even trivial queries that consume practically zero resources, like SELECT 1, can take 1-2 seconds to respond.

In addition, each warehouse can handle only a few concurrent requests. If there are thousands of concurrent users making dozens of API requests per minute, be prepared for either a costly bill or a frustrating user experience.
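To see how quickly this adds up, here's a rough sizing sketch. Snowflake's default concurrency limit of 8 queries per warehouse (the MAX_CONCURRENCY_LEVEL parameter) is real; the user count, request rate, and query latency below are illustrative assumptions, not measurements.

```python
import math

# Rough sizing of the concurrency problem under assumed workload numbers.
CONCURRENT_QUERIES_PER_WAREHOUSE = 8  # Snowflake's default MAX_CONCURRENCY_LEVEL
AVG_QUERY_SECONDS = 1.5               # assumed: even trivial queries queue (see above)

users = 2_000                         # hypothetical active users
requests_per_user_per_min = 24        # "dozens of API requests per minute"

requests_per_sec = users * requests_per_user_per_min / 60
warehouses_needed = math.ceil(
    requests_per_sec * AVG_QUERY_SECONDS / CONCURRENT_QUERIES_PER_WAREHOUSE
)

print(f"{requests_per_sec:.0f} req/s -> {warehouses_needed} warehouses")
# 800 req/s -> 150 warehouses
```

Even with these modest assumptions, serving the load directly from Snowflake would take on the order of 150 warehouses - which is exactly the "costly bill or frustrating user experience" trade-off.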

Developing directly on top of Snowflake is not a feasible solution, even if you pour money into it.

Snowflake is not an application backend

Even if you accept Snowflake's technical limitations as an application backend, you would still need to build and deploy a web service to serve the data. Snowflake can’t publish a query as a “REST API”.

So how is it done nowadays?

Engineers often attempt to pre-aggregate metrics in Snowflake and then synchronize them to Redis or RDS using scheduled Airflow and dbt jobs.

Then they build a web service to expose the data over a secure “REST API”.

However, this approach does not address the underlying concurrency issues with Snowflake, and the scheduled sync jobs greatly compromise data freshness.
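A minimal sketch of that "sync then serve" pattern, using only the Python standard library - the Snowflake query, the Redis/RDS store, and the Airflow schedule are all stubbed out with placeholders:

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

CACHE: dict = {}  # stands in for Redis/RDS

def fetch_metrics_from_snowflake() -> dict:
    """Placeholder for the expensive pre-aggregation query in Snowflake."""
    return {"daily_active_users": 1234, "revenue": 5678.9}

def sync_job() -> None:
    """Run by a scheduler (e.g. Airflow) a few times a day."""
    CACHE.update(fetch_metrics_from_snowflake())
    CACHE["synced_at"] = time.time()  # freshness is bounded by the schedule

class MetricsHandler(BaseHTTPRequestHandler):
    """Thin 'REST API' that only ever reads the cached copy."""
    def do_GET(self):
        body = json.dumps(CACHE).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

sync_job()  # in production this runs on a schedule, not at startup
server = HTTPServer(("localhost", 0), MetricsHandler)  # port 0: any free port
# server.serve_forever()  # left commented so the sketch exits cleanly
server.server_close()
```

Note how the freshness problem is baked into the design: every client sees data as of the last scheduled sync, no matter how fast the web service itself is.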

Building and maintaining all that infrastructure requires hiring additional engineers and a significant time effort, which can be excessively costly for many companies. And yet, it is still not real-time.

We know this because we have tried it ourselves, and so have many of our users.

Granting access to Snowflake tables requires a lot of red tape

“Seasoned data teams are all too familiar with the financial gymnastics required to keep Snowflake costs from ballooning.”

Every query in Snowflake incurs a cost based on its execution time. As a result, there is significant risk in allowing unrestricted querying of Snowflake. Even a few unoptimized or innocent-looking queries, such as SELECT *, can rack up quite a bill. Consequently, data teams require approval for anyone to query Snowflake, leading to a lengthy and frustrating iteration process.

How to turn any warehouse into an application backend?

With Airfold, any cloud data warehouse can be used as a powerful backend for real-time applications.

Airfold is capable of handling thousands of concurrent requests 24/7, while being 10x cheaper than Snowflake. It achieves this by being purpose-built for real-time workloads.

Airfold and Snowflake are better together. Use Airfold to automatically sync your Snowflake data into Airfold, and convert your SQL queries into APIs capable of handling thousands of concurrent requests.

Think of Airfold as dbt, but for real-time processing. This allows engineers to easily build real-time applications on top of Snowflake data while ensuring that the source of truth data remains secure in Snowflake.

This, of course, means that the served data may not be fully up-to-date, since there's a delay between each sync job from Snowflake. To react to events as they happen, Airfold offers a fast ingestion API that allows data to be ingested directly into Airfold. This data can be transformed and served in real-time, 24/7, handling thousands of concurrent requests.