Friday, January 14, 2022

Ruminating on Snowflake Architecture

The following video is an excellent tutorial on how Snowflake can serve as both a Data Lake and a Data Warehouse.

https://www.youtube.com/watch?v=jmVnZPeClag

The following articles on Snowflake are also worth a perusal:

https://www.snowflake.com/workloads/data-warehouse-modernization/

https://www.snowflake.com/guides/data-lake

The following key concepts are important to understand to appreciate how Snowflake works:

  • Snowflake separates compute from storage, and each can be scaled independently.
  • For storage, Snowflake leverages distributed cloud storage services (AWS S3, Azure Blob Storage, Google Cloud Storage). This is cool since these services are already battle-tested for reliability, scalability and redundancy. Snowflake compresses the data it stores in these cloud storage buckets.
  • For compute, Snowflake has a concept called a "virtual warehouse". A virtual warehouse is simply a bundle of compute (CPU) and memory (RAM) with some temporary storage. All SQL queries are executed in a virtual warehouse.
  • Snowflake can be queried using plain SQL, so no specialized skills are required.
  • Data that queries touch frequently is cached in the virtual warehouse (in memory and on its temporary local storage). This cache is the magic that enables fast ad-hoc queries against the data.
  • Snowflake enables a unified data architecture for the enterprise since it can be used as a Data Lake as well as a Data Warehouse. The VARIANT data type can store JSON, and this JSON can be queried with SQL.
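
To make the separation of compute and storage and the role of the cache concrete, here is a minimal toy model in Python. It is only a sketch and not Snowflake's API: the class names (SharedStorage, VirtualWarehouse), the warehouse names and the cache_slots parameter are all made up for illustration, and the least-recently-used eviction is an assumption rather than a description of how Snowflake manages its cache.

```python
from collections import OrderedDict


class SharedStorage:
    """Hypothetical stand-in for cloud object storage (for example an S3 bucket)."""

    def __init__(self, tables):
        self.tables = tables
        self.reads = 0  # counts how often a warehouse had to go to remote storage

    def read(self, table):
        self.reads += 1
        return self.tables[table]


class VirtualWarehouse:
    """Hypothetical bundle of compute with a small local cache; it owns no data."""

    def __init__(self, name, storage, cache_slots=2):
        self.name = name
        self.storage = storage
        self.cache_slots = cache_slots
        self.cache = OrderedDict()  # table name -> rows, least recently used first

    def scan(self, table):
        if table in self.cache:
            self.cache.move_to_end(table)  # cache hit: no remote read needed
            return self.cache[table]
        rows = self.storage.read(table)  # cache miss: fetch from shared storage
        self.cache[table] = rows
        if len(self.cache) > self.cache_slots:
            self.cache.popitem(last=False)  # evict the least recently used table (assumed policy)
        return rows

    def total(self, table, column):
        return sum(row[column] for row in self.scan(table))


# One storage layer, two independent warehouses reading from it.
storage = SharedStorage({"orders": [{"amount": 10}, {"amount": 32}]})
etl = VirtualWarehouse("etl_wh", storage)
reporting = VirtualWarehouse("reporting_wh", storage)

print(reporting.total("orders", "amount"))  # first run reads remote storage
print(reporting.total("orders", "amount"))  # repeat run is served from the cache
print(etl.total("orders", "amount"))        # a separate warehouse has its own cache
print(storage.reads)                        # remote reads so far
```

In this sketch, adding another warehouse does not copy or move any data, and the repeated query on reporting_wh never goes back to remote storage, which is the behaviour the bullets above describe.
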
Virtual warehouses give Snowflake a kind of dynamic scalability. The following snippets are from the Snowflake documentation:

"The number of queries that a warehouse can concurrently process is determined by the size and complexity of each query. As queries are submitted, the warehouse calculates and reserves the compute resources needed to process each query. If the warehouse does not have enough remaining resources to process a query, the query is queued, pending resources that become available as other running queries complete. If queries are queuing more than desired, another warehouse can be created and queries can be manually redirected to the new warehouse. In addition, resizing a warehouse can enable limited scaling for query concurrency and queuing; however, warehouse resizing is primarily intended for improving query performance. 
With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, additional warehouses to make a larger pool of compute resources available."
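
The reserve-or-queue behaviour described in the snippet can be sketched with a small Python model. This is an illustration under stated assumptions, not Snowflake's scheduler: the Warehouse class, the abstract "capacity" and "cost" units, and the first-in-first-out queue order are all hypothetical.

```python
from collections import deque


class Warehouse:
    """Hypothetical warehouse with a fixed pool of abstract compute units."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.running = {}      # query id -> units reserved for it
        self.queued = deque()  # (query id, units) waiting for resources

    def free(self):
        return self.capacity - sum(self.running.values())

    def submit(self, query_id, cost):
        if cost > self.capacity:
            raise ValueError("query needs more than the whole warehouse")
        # Run at once only if nothing is already waiting and the units fit.
        if not self.queued and cost <= self.free():
            self.running[query_id] = cost
            return "running"
        self.queued.append((query_id, cost))
        return "queued"

    def complete(self, query_id):
        del self.running[query_id]  # release the reserved units
        # Start waiting queries in arrival order while they fit (assumed FIFO).
        while self.queued and self.queued[0][1] <= self.free():
            waiting_id, cost = self.queued.popleft()
            self.running[waiting_id] = cost


wh = Warehouse("reporting_wh", capacity=8)
states = [wh.submit("q1", 5), wh.submit("q2", 3), wh.submit("q3", 2)]
print(states)              # q3 has to wait: q1 and q2 reserved all 8 units

wh.complete("q2")          # frees 3 units, so q3 can start
print(sorted(wh.running))

# If queuing is more than desired, a second warehouse can take the extra load.
adhoc = Warehouse("adhoc_wh", capacity=8)
print(adhoc.submit("q4", 6))
```

Resizing corresponds to raising capacity on one warehouse, while the multi-cluster idea corresponds to adding more Warehouse instances behind the same name; the sketch only models the manual version of the latter.
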
