Snowflake 101: Intro to the Snowflake Data Cloud
https://s7280.pcdn.co/snowflake-intro/

With data’s consistent rise in volume and velocity, organizations seek solutions to process big data and handle its related challenges. One of the first decisions many organizations make? Adopting a cloud-based model that offers flexibility, scalability, and high performance.

Snowflake is one cloud-based data warehouse platform that is gaining popularity thanks to its numerous features and efficiency.

In this article, we delve into Snowflake’s architecture, key features, and the problems it solves.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

What is Snowflake?

Snowflake is a SaaS-based data warehouse (DWH) platform that runs on AWS or Microsoft Azure cloud infrastructure. (You might hear this called data warehouse as a service.)

Unlike other warehouse solutions, Snowflake utilizes an enhanced ANSI-compliant SQL engine that is designed to work solely on the cloud.

Fundamentally, Snowflake’s core architecture enables it to run on the public cloud, using virtual compute instances and efficient storage buckets, making it a highly scalable and cost-efficient solution to process enormous amounts of big data.

(Understand the differences between data warehouses & databases.)

Key features of Snowflake

When compared to legacy DWH technologies, Snowflake offers a number of features, including:


Standard & extended SQL support

As a SQL-based data warehouse, it supports the standard data definition language (DDL) and data manipulation language (DML) commands of SQL. It also provides advanced DML commands for multi-table operations such as multi-table INSERT, MERGE, and MULTI-MERGE.

With Snowflake, users can:

  • Set up temporary and transient tables for short-term data
  • Use analytical and statistical aggregate functions and lateral views
  • Create user-defined functions (UDFs) to extend functionality in both SQL and JavaScript
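
For example, a minimal sketch of both features (the table and function here are invented for illustration):

-- Transient table: persists until dropped, but has no Fail-safe period
create transient table staging_events (id int, payload varchar);

-- A simple SQL UDF
create or replace function fahrenheit_to_celsius(f float)
  returns float
  as $$ (f - 32) * 5 / 9 $$;

select fahrenheit_to_celsius(72.5);   -- returns 22.5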

(Compare SQL & no-SQL data storage.)

Web-based graphical user interface (GUI)

Snowflake provides a web interface for users to interact with the data cloud. With the web GUI, users can:

  • Manage their account and other general settings
  • Monitor resources and system usage
  • Query data

Command-line client (CLI)

Snowflake provides a Python-based CLI called SnowSQL for connecting to the DWH. It is a separate downloadable and installable terminal tool for executing all queries, including data definition and data manipulation queries for loading and unloading data.
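
A typical invocation looks like this (the account, user, and database names are placeholders):

snowsql -a myorg-myaccount -u myuser -d inventory -s public \
        -q "select current_version();"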

(Get started with our Python introduction.)

Rich set of client connectors

Snowflake provides a wide range of connectors and drivers that users can use to connect to their data cloud. Some of these client connectors include:

  • Python Connector, a programming interface for writing Python apps that connect to Snowflake
  • Node.js driver
  • ODBC driver for C/C++ development
  • JDBC driver for Java programming

Extensive third-party plugins

In addition to the programmatic interfaces mentioned above, several other big data tools integrate with Snowflake. These tools range from business intelligence tools to data integration, machine learning, security, and governance software.

Bulk loading & unloading data

Snowflake allows data loading in different formats and from various data sources – as long as the data uses a supported character encoding. Users can load data from:

  • Compressed files
  • AWS S3 data sources
  • Local files
  • Flat data files like CSV and TSV
  • Data files in Avro, JSON, ORC, Parquet, and XML formats

Additionally, with Snowpipe, users can continuously load data in micro-batches from Snowflake stages, AWS S3, or Azure storage.
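
A sketch of a pipe, assuming a stage and target table like the ones created in the CSV-loading tutorial later in this guide (auto_ingest additionally requires event notifications configured on the bucket):

create or replace pipe weather_pipe
  auto_ingest = true
  as
  copy into weather
  from @paphosweather
  file_format = (type = csv skip_header = 1);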

Adequate data protection & security implementation

With Snowflake, users can:

  • Set regions for data storage to comply with regulatory guidelines
  • Adjust their security levels based on requirements

Snowflake also automatically encrypts data. Object-level access control offers granular control on who can access what.
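
Access control is role-based. A sketch with illustrative names:

-- Create a role and grant it read access down the object hierarchy
create role if not exists analyst;
grant usage on database inventory to role analyst;
grant usage on schema inventory.public to role analyst;
grant select on table inventory.public.orders to role analyst;
grant role analyst to user some_user;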

Snowflake architecture

Snowflake follows a hybrid of shared-disk and shared-nothing database architecture. It consists of:

  • A central repository that persists data
  • Compute nodes within the data warehouse that can all access that central storage

For executing queries, Snowflake uses distributed massively parallel processing (MPP) cluster nodes, each with its own CPU, memory, and local storage for caching portions of the data.

Snowflake’s framework is typically segregated across three layers. All of these layers are independent of each other and can be scaled, configured, and managed individually. These layers include:

  • Storage layer
  • Compute layer
  • Cloud services layer

Storage layer

This is the layer where the central repository lies. Any data loaded into the system is partitioned and reorganized into Snowflake’s compressed, internally optimized columnar format, encrypted using AES-256, and then stored in cloud storage. Snowflake does the partitioning automatically but provides settings for users to configure partitioning parameters.

Data stored in this layer is central, and all nodes in the cluster can access it. Snowflake manages all aspects of data storage, so users interact with the underlying data only through SQL queries.

Compute layer

The compute layer handles the execution of queries. It does this using virtual warehouses—independent MPP compute clusters made up of multiple compute nodes.

Snowflake provisions these compute nodes from the chosen cloud provider for each user. The clusters are autonomous—each has its own CPU, memory, and local storage—so the performance of one does not affect the others.
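
Creating a virtual warehouse is itself a single SQL statement. A minimal sketch (the name and size are illustrative):

create warehouse if not exists analytics_wh
  with warehouse_size = 'XSMALL'
  auto_suspend = 300   -- seconds of inactivity before the warehouse suspends
  auto_resume = true;  -- wake automatically when a query arrives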

Cloud services layer

Snowflake provides a collection of services for administering and managing a Snowflake data cloud. This layer is where several activities happen:

  • Access control
  • Authentication
  • Infrastructure management
  • Metadata management
  • Query parsing
  • Optimization

Why use Snowflake?

There are plenty of reasons organizations opt for Snowflake. Here are the top reasons:

  • Hybrid architecture offers users the best of both worlds. Users pay separately for the underlying central repository and as much compute power as they require.
  • SQL-based for fast learning. A SQL-based implementation ensures developers do not have to go through a steep learning curve to understand new technology.
  • Data first. Supports zero-copy data cloning and secure data sharing (see the cloning example below).
  • No infrastructure configuration. Snowflake does not require any infrastructure configuration; instead, Snowflake handles it automatically once you’ve chosen your preferred cloud service provider.
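
For instance, cloning a table is one statement; the clone shares the original's storage until either copy changes (the table names are illustrative):

create table orders_dev clone orders;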

Getting started with Snowflake

Ready to get started? Snowflake currently offers a 30-day free trial to new users. Once you get access, you can start creating databases, loading data, and running queries right away.

Snowflake is cloud native

Cloud-native services are the new normal.

Snowflake is a DWH service built specifically for the cloud. It lets organizations handle enormous big data storage and processing workloads by scaling compute and storage independently. For faster query execution and improved performance, Snowflake lets users scale up or add data warehouses, providing extra compute resources as required.

While offering enhanced DWH features, Snowflake helps cut down the costs of provisioning infrastructure and the redundant effort of managing it, allowing organizations to focus on generating efficient analytics—the whole purpose of the data.

Using Stored Procedures in Snowflake
https://www.bmc.com/blogs/snowflake-stored-procedures/

Snowflake supports stored procedures. Stored procedures let you write a series of commands and store them for later use.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

When to use a stored procedure

We’ve previously covered user defined functions (UDFs), which you use on a column. A stored procedure runs by itself. In other words, it goes off and “does something,” like updating a table. That’s why stored procedures are good for batch-like actions.

You can also conditionally tie stored procedures to database events, not unlike what’s called a trigger in other database products.

Data engineers can make a data pipeline with stored procedures, too. (A subject we will explore in depth in following posts.)
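
For example, here is a sketch of scheduling the procedure defined below with a Snowflake task (the task and warehouse names and the schedule are placeholders):

create or replace task nightly_setprice
  warehouse = compute_wh
  schedule = 'USING CRON 0 2 * * * UTC'
as
  call setprice('489');

-- Tasks are created suspended; start the task explicitly
alter task nightly_setprice resume;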

JavaScript for stored procedures

Snowflake stored procedures must be written in JavaScript. It makes sense that you must use a programming language besides SQL, since SQL does not support variable assignment, and you need variables to run calculations and the like.

Don’t worry if you don’t know JavaScript—you can simply copy boilerplate code and put your SQL into the proper location. The SQL is, for the most part, the only thing that varies, so there is not much you need to understand.

Snowflake JavaScript is bare-bones JavaScript. It does not let you import libraries that are external to the language. So, you can create arrays, variables, and simple objects, and there is error handling. But you could not, for example, pull in an external npm module.

Line by line tutorial: Stored procedure

We will put one simple example here and explain each line of the code. (So, you don’t need any sample data.)

Look at the function below. Note the following:

  • You pass parameters to the function: functionname(parameters type)
  • You call the function by writing call functionname(parameters).
  • The function must return some value, even if it is just doing an update. Otherwise you will get the error NULL result in a non-nullable column. This is because the worksheet editor in Snowflake needs something to display, even if it’s a null (NaN) value.
  • The basic procedure is to use execute() to run SQL code that you have stored in a string. Database programmers know that this creates what is called a result set. So, you need to pull the first returned value into scope by calling next(). There is a result set for a SELECT statement as well as for DELETE, INSERT, and even UPDATE—even though you would not expect those to return any values.
  • If the SQL statement returns more than one row, as in a SELECT, you would use while (rs.next()) to loop through the results (see the sketch after the example below).
  • The parameters can be lowercase in the procedure signature but must be uppercase inside the JavaScript code, because Snowflake uppercases unquoted identifiers. You will get an error if you try to use lowercase letters.
create or replace procedure setprice(ORDERNUMBER varchar(100))
    returns float
    not null
    language javascript
    as
    $$
    // Build the SQL statement as a string. ORDERNUMBER must be uppercase here.
    var sql_command = "update orders set price = 2 where ordernumber = " + ORDERNUMBER;

    var stmt = snowflake.createStatement(
        {
            sqlText: sql_command
        }
    );

    // Run the statement and pull the first row of the result set into scope.
    var res = stmt.execute();
    res.next();

    // For an UPDATE, the single result column holds the number of rows updated.
    var price = res.getColumnValue(1);
    return price;
    $$
    ;

call setprice(489);
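
As promised above, here is a minimal sketch of looping over a multi-row result set with while (rs.next()). It assumes the same orders table with its price column:

create or replace procedure totalprice()
    returns float
    language javascript
    as
    $$
    var stmt = snowflake.createStatement({sqlText: "select price from orders"});
    var rs = stmt.execute();
    var total = 0;
    // next() advances one row at a time until the result set is exhausted
    while (rs.next()) {
        total = total + rs.getColumnValue(1);
    }
    return total;
    $$
    ;

call totalprice();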

Creating & Using Snowflake Streams
https://www.bmc.com/blogs/snowflake-table-streams/

In this tutorial, we’ll show how to create and use streams in Snowflake.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Streams in Snowflake explained

A Snowflake stream—short for table stream—keeps track of changes to a table. You can use Snowflake streams to:

  • Emulate triggers in Snowflake (unlike triggers, streams don’t fire immediately)
  • Gather changes in a staging table and update some other table based on those changes at some frequency

Tutorial use case

Here we create a sample scenario: an inventory replenishment system. When we receive replenishment orders, we need to increase on-hand inventory.

We run this task manually. In actual use, you would want to run it as a Snowflake task on some kind of fixed schedule.
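
A sketch of what such a task could look like (the task and warehouse names are placeholders, and replenish_onhand is a hypothetical procedure wrapping the update statement shown later in this tutorial; system$stream_has_data skips runs when the stream is empty):

create or replace task replenish_inventory
  warehouse = compute_wh
  schedule = '5 minute'
  when system$stream_has_data('ORDERS_STREAM')
as
  call replenish_onhand();  -- hypothetical wrapper around the update below

alter task replenish_inventory resume;  -- tasks are created suspended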

Create the data, stream & tables

In order to follow along, create the orders and products tables:

  • Orders are inventory movements.
  • Products holds the inventory on-hand quantity.

If you start with 25 items and make three replenishment orders of 25, 25, and 25, you would have 100 items on hand at the end. Sum those three orders and add 75 to the starting balance of 25 to get 100.

Create these two tables:

CREATE TABLE orders
  (
     customernumber varchar(100) PRIMARY KEY,
     ordernumber varchar(100),
     comments varchar(200),
     orderdate date,
     ordertype varchar(10),
     shipdate date,
     discount number,
     quantity int,
     productnumber varchar(50)
  );

create table products (
     productnumber varchar(50) primary key,
     movementdate datetime,
     quantity number,
     movementtype varchar(10));

Now, add a product to the products table and give it a starting 100 units on-hand inventory.

insert into products(productnumber, quantity) values ('EE333', 100);

Now create a stream on the orders table. Snowflake will start tracking changes to that table.

CREATE OR REPLACE STREAM orders_STREAM on table orders;

Now create an order.

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('855','533','jqplygemaq','2020-10-08','sale','2020-10-18','0.10503143596496034','65','EE333')

Then query the orders_stream table:

select * from orders_stream

Query the stream and you will see the one record we added, along with the metadata columns (METADATA$ACTION, METADATA$ISUPDATE, and METADATA$ROW_ID) that describe the change.

Update on-hand inventory

We have received 65 more items into inventory, so we need to update the inventory balance. The procedure is as follows.

Start a transaction using the begin statement.

begin;

Then run this update statement, which basically:

  • Sums the orders for each product
  • Adds the sum of order quantities to the original inventory balance

This update statement gets the product numbers from the orders stream table. That’s the table that tells Snowflake which products need to have their inventory updated.

update products
     set quantity = z.onhand
     from
      (select distinct p.productnumber,
             p.quantity as dquantity,
             o.quantity as oquantity,
             p.quantity + o.quantity as onhand
       from products p
       inner join orders o on
              p.productnumber = o.productnumber) as z
     where products.productnumber = z.productnumber
       and z.productnumber in (select productnumber from orders_stream);
commit;

At this point the orders_stream table is emptied: consuming a stream in a DML statement inside a committed transaction advances its offset, clearing the changes it was holding. (A plain SELECT against a stream does not consume it.)

(Note: Begin and commit make a transaction, which is a logically related set of SQL statements. They lock the tables involved. Without that, you could end up with a mismatched situation, like an incorrect inventory balance, because one statement worked and the other did not.)

Now query orders_stream and you will see that the table is empty.

User Defined Functions (UDFs) in Snowflake
https://www.bmc.com/blogs/snowflake-user-defined-functions/

In this tutorial, we show you how to create user defined functions (UDF) in Snowflake.

In Snowflake, you can create:

  • Functions in SQL and JavaScript languages
  • Functions that return a single value (scalar)
  • Functions that return multiple values (table)

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Create data

If you want to follow the tutorials below, use the instructions from this tutorial on statistical functions to load some data into Snowflake. The data is 41 days of hourly weather data from Paphos, Cyprus.

Snowflake UDF SQL function

We start with a SQL example.

The code below takes the input weather conditions, described in the table column main, and converts that to an integer. This solves a common problem with machine learning: converting categorical data to an integer.

Notice that the function has parameters (dt varchar(20)) and a return value (int). The rest of it is just a SQL select statement.

The code below uses the iff() function and the regexp operator to see whether the word rain, cloud, etc., appears in the main column. It works by adding the numbers 1 through 9: since only one of these conditions will be true, the sum will be one of the values 1 to 9, which encodes the weather condition.

create or replace function weathercategorical (dt varchar(20))
  returns int
  as $$select (iff(main regexp '.*Clear.*', 1, 0) +
      iff(main regexp '.*Clouds.*', 2, 0) +
      iff(main regexp '.*Rain.*', 3, 0) +
      iff(main regexp '.*Thunderstorm.*', 4, 0) +
      iff(main regexp '.*Mist.*', 5, 0) +
      iff(main regexp '.*Fog.*', 6, 0) +
      iff(main regexp '.*Squall.*', 7, 0) +
      iff(main regexp '.*Tornado.*', 8, 0) +
      iff(main regexp '.*Haze.*', 9, 0))
        from weather as w where w.dt = dt$$;

The date and time is in epoch time format. The SQL statement below calls the function weathercategorical for the date January 1, 2000, returning the scalar value 1, meaning clear weather.

select weathercategorical(946684800) from weather where dt = 946684800;

Snowflake table function

Here we show how to return more than one value, which Snowflake calls a table.

Create these two tables:

CREATE TABLE customers
  ( 
     customernumber     varchar(100) PRIMARY KEY, 
    customername varchar(50),
    phonenumber varchar(50),
    postalcode varchar(50),
    locale varchar(10),
    datecreated date,
    email varchar(50)
  );


CREATE TABLE orders
  ( 
     customernumber    varchar(100) ,
    ordernumber varchar(100) PRIMARY KEY,
    comments varchar(200),
    orderdate date,
    ordertype varchar(10),
    shipdate date,
discount float,
quantity int,
    productnumber varchar(50)
);

Then copy and paste this data.

Scalar vs table function

Now we create a function to look up the customer name and email given a record from the order table. Orders don’t contain customer information, so it’s like doing a join. But since it’s a function, it’s far less wordy and more convenient than creating a join every time you need customer information with the order.

create or replace function getcustomer (customernumber number )
returns table (customername varchar, email varchar)
as 'select customername, email from customers
    where customers.customernumber = customernumber';

Given the customer number from the orders table, this statement gets:

  • The customer’s name
  • Order number
  • Email
select c.customername, c.email, o.ordernumber
from orders as o,
     table(getcustomer(o.customernumber)) as c
where o.customernumber = '948';

JavaScript UDFs

You can use JavaScript in a user defined function. Just put language javascript.

Let’s calculate n factorial (n!), since Snowflake does not have that math function. n! = n * (n - 1) * (n - 2) * … * 2 * 1. For example: 3! = 3 * 2 * 1 = 6.

Notice below that we use variant as the data type, since JavaScript does not have integer types.

CREATE OR REPLACE FUNCTION factorial(n variant)
  RETURNS variant
  LANGUAGE JAVASCRIPT
  AS '
     var f = n;
     for (var i = n - 1; i > 0; i--) {
        f = f * i;
     }
     return f;
  ';

Run it and it calculates the value 6.

select factorial(3)

Note that 33 is the largest number this function can handle: 33! = 8683317618811886495518194401280000000.

Snowflake: Using Analytics & Statistical Functions https://www.bmc.com/blogs/snowflake-analytics-statistical-functions/ Thu, 15 Oct 2020 00:00:41 +0000 https://www.bmc.com/blogs/?p=18928 Snowflake does not do machine learning. It only has simple linear regression and basic statistical functions. (If you want to do machine learning with Snowflake, you need to put the data into Spark or another third-party product.) You can, however, do analytics in Snowflake, armed with some knowledge of mathematics and aggregate functions and windows […]]]>

Snowflake does not do machine learning. It only has simple linear regression and basic statistical functions. (If you want to do machine learning with Snowflake, you need to put the data into Spark or another third-party product.)

You can, however, do analytics in Snowflake, armed with some knowledge of mathematics, aggregate functions, and window functions. Basic analytics is all you need in most situations, and it is the first step towards more elaborate analysis.

In this tutorial, we show you how to use Snowflake statistical functions with some examples. We will demonstrate covariance, correlation, the basic aggregates (average, maximum, minimum, and standard deviation), and rank.

First, let’s set up our work.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Sample data

You need some data to work through this example. Download 1,000 hourly weather records from here.

(We purchased 20 years of weather data for Paphos, Cyprus from OpenWeather. This is just a small subset of 1,000 records, converted to CSV format. We have worked with this data in other tutorials, including Loading CSV Files from S3 to Snowflake.)

Upload this data to Amazon S3 like this:

aws s3 cp paphosWeather.csv s3://gluebmcwalkerrowe/paphosWeather.csv

Create table in Snowflake

Unfortunately, Snowflake does not read the header record and create the table for you. (Presumably that is because they would prefer that you define the column data types and number precision and size yourself.)

create table weather(
dt integer,
temp decimal(6,2),
temp_min decimal(6,2), 
temp_max decimal(6,2), 
pressure int, 
humidity int, 
speed decimal(6,2), 
deg int, 
main varchar(50),
description varchar(50))

Copy data file to Snowflake stage area

Then create a Snowflake stage area like this.

(The credentials here are not your Snowflake login; they are your AWS credentials. The stage reads from your S3 bucket, so Snowflake needs an AWS IAM access key and secret key that are authorized to read that bucket.)

create or replace stage paphosweather 
url='s3://gluebmcwalkerrowe/paphosWeather.csv'
  credentials=(aws_key_id='xxxxxxx' aws_secret_key='xxxxxxx')

Now copy the data into the table you created above.

copy into weather
  from 's3://gluebmcwalkerrowe/paphosWeather.csv' 
  credentials=(aws_key_id='xxxxxxx' aws_secret_key='xxxxxxxx')
  file_format = (type = csv field_delimiter = ',' skip_header = 1);

Covariance

Covariance measures how two variables are related to each other: the covariance of x and y is greater than 0 when an increase in x tends to come with an increase in y. Let’s show how to calculate it in Snowflake.

Each of the queries below wraps an inner query in an outer query. This is necessary because:

  • The inner query runs some calculations, producing a smaller set that contains the results of those calculations.
  • The outer query runs next, using the columns passed to it from the inner query. This makes the query a two-step process: in step 1 we create some set A; in step 2 we produce either another set B or a scalar, meaning a single number.

The data we have is hourly weather data from one location: Paphos, Cyprus. Meteorologists say that falling barometric pressure results in an increase in wind speed. So, let’s measure that by calculating the covariance between the change in air pressure and the change in wind speed.

(Remember that we ran the tutorial below using 117,000 hourly weather data records from OpenWeather for a single city, Paphos, Cyprus, over 20 years. You can download 1,000 records, 41 days, from here. Or, purchase data for your location to investigate historical conditions there.)

We calculate:

  1. The change in air pressure, using the lag() window function to look at the pressure at two points in time. A window function runs a calculation over adjacent rows in query results. So, we use lag(pressure, 8) to calculate the change in air pressure over the previous 8 hours, i.e., 8 rows behind the current row. Each record in the database is 1 hour.
  2. Then we calculate the change in wind speed over the same 8 hours using lag() as well.

We use simple subtraction to calculate the change, and we order the data by the date column dt: since we are working with time, the dates must be in order.

select to_timestamp(dt),
pressure, main,
lag(pressure, 1) over (order by dt) as pressure1,
lag(pressure, 8) over (order by dt) as pressure8,
pressure - pressure8 as pressurechange,
speed,
lag(speed, 1) over (order by dt) as speed1,
lag(speed, 8) over (order by dt) as speed8,
speed8 - speed1 as windchange
from weather
order by dt desc

Now we wrap the inner query inside the outer query. Note that:

  • Since the inner query named the calculations pressurechange and windchange, we simply write covar_pop(pressurechange, windchange).
  • We add the column main (the weather condition, such as clouds, rain, or thunderstorm) to show the covariance for each weather condition.
select main, covar_pop(pressurechange, windchange) as covariance from
(select to_timestamp(dt),
pressure, main,
lag(pressure, 1) over (order by dt) as pressure1,
lag(pressure, 8) over (order by dt) as pressure8,
pressure - pressure8 as pressurechange,
speed,
lag(speed, 1) over (order by dt) as speed1,
lag(speed, 8) over (order by dt) as speed8,
speed8 - speed1 as windchange
from weather
order by dt desc)
group by main

Here we see that a decrease in barometric pressure results in an increase in wind speed, as we would expect. That effect is strongest in thunderstorms.

MAIN          COVARIANCE
Clouds        2.682560523
Clear         2.578284619
Thunderstorm  3.578449916
Rain          3.328280913
Dust          1.697657949
Squall        2.921487603
Mist          2.513627372
Fog           2.9256
Haze          3.565528153
Tornado       -1.425

Average, maximum, minimum, standard deviation

The average, maximum, minimum, and standard deviation are aggregate functions, which means they expect a group by clause. These are the most basic statistics: they give you an idea of the average value of, and the variation in, some metric over time. Note that:

  • to_timestamp() converts the epoch time (which is seconds since 1970/01/01) to yyyy-mm-dd hh:mm:ss.
  • We use date_part() to pull out the month and year.
select  round(avg(temp),2) as average, round(stddev(temp),2) as std, max(temp) as maxtemp, 
min(temp) as mintemp, date_part(year, to_timestamp(dt)) as year,  date_part(month,to_timestamp(dt)) as month
from weather
where to_timestamp(dt) > '2017-01-01'
group by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt))
order by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt))

Here are the results:

AVERAGE STD MAXTEMP MINTEMP YEAR MONTH
53.4 5.57 64.08 37.42 2017 1
54.4 6.37 72.84 40.44 2017 2
59.17 5.1 70.84 45.91 2017 3
63.79 5.36 76.78 51.67 2017 4
79.28 0.1 79.39 79.21 2017 8
78.35 4.9 88.5 67.21 2017 9
71.86 4.86 82.2 62.13 2017 10
64.08 5.29 78.31 50.61 2017 11
59.55 5.69 69.87 45.84 2017 12
56.5 4.99 66.42 44.31 2018 1
58.81 4.47 68.04 47.88 2018 2
61.63 5.27 77.61 50.56 2018 3
65.91 6.07 78.75 51.94 2018 4
75.88 1.43 77.54 75.04 2018 8
79.21 5.01 88.81 66.67 2018 9
73.28 5.97 85.15 58.28 2018 10

We can get a year-by-year comparison by sorting the results by month first and then year:

select  round(avg(temp),2) as average, round(stddev(temp),2) as std, max(temp) as maxtemp, 
min(temp) as mintemp, date_part(year, to_timestamp(dt)) as year,  date_part(month,to_timestamp(dt)) as month
from weather
group by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt))
order by date_part(month,to_timestamp(dt)) ,date_part(year, to_timestamp(dt)) 

Correlation

Obviously the month is correlated with the temperature, as it gets hotter in summer and colder in winter. How strong is that correlation? You might think it would be close to 100%. Here, it’s closer to 60% in the mild climate of Cyprus.

This results in a scalar statistic, since corr(month, average) returns one value instead of rows.

select corr(month, average) from 
(select  round(avg(temp),2) as average, round(stddev(temp),2) as std, max(temp) as maxtemp, 
min(temp) as mintemp, date_part(year, to_timestamp(dt)) as year,  date_part(month,to_timestamp(dt)) as month
from weather
where to_timestamp(dt) > '2017-01-01'
group by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt))
order by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt)))

This results in the scalar value below. So, they are positively correlated, but less strongly than we might imagine.

CORR(MONTH, AVERAGE)
0.6127760999

Rank

Now let’s rank the hottest average months since 2017. We could compute the averages with a regular SQL aggregate, but if we use the rank() window function we can number each row in the resulting set. That number is the rank.

This is a two-step query, too, because we need to convert hourly weather records to monthly ones so that we don’t have too many lines:

  1. Calculate the average temperature for each month.
  2. Select the metric we want ranked. This is given by putting the column average in the (order by average desc [descending]) statement.
select year, month, average, rank() over (order by average desc) as hottest from 
(select  round(avg(temp),2) as average, round(stddev(temp),2) as std, max(temp) as maxtemp, 
min(temp) as mintemp, date_part(year, to_timestamp(dt)) as year,  date_part(month,to_timestamp(dt)) as month
from weather
where to_timestamp(dt) > '2017-01-01' 
group by date_part(year, to_timestamp(dt)), date_part(month,to_timestamp(dt)))

Here are the results (in Fahrenheit).

YEAR MONTH AVERAGE HOTTEST
2020 9 81.73 1
2019 8 79.36 2
2017 8 79.28 3
2018 9 79.21 4
2017 9 78.35 5
2019 9 77.74 6
2020 8 76.48 7
2018 8 75.88 8
2019 10 73.79 9
2018 10 73.28 10
2017 10 71.86 11
2019 11 68.13 12
2018 11 66.25 13
2018 4 65.91 14

Loading CSV Files from S3 to Snowflake
https://www.bmc.com/blogs/snowflake-load-csv-files/

In this tutorial, we show how to load a CSV file from Amazon S3 into a Snowflake table.

We’ve also covered how to load JSON files to Snowflake.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Sample data

You need some data to work through this example. Download 1,000 weather records from here. (We purchased 20 years of weather data for Paphos, Cyprus, from OpenWeather. This is just a small subset of 1,000 records converted to CSV format.)

Upload this data to Amazon S3 like this:

aws s3 cp paphosWeather.csv s3://gluebmcwalkerrowe/paphosWeather.csv

Create table in Snowflake

Unfortunately, Snowflake does not read the header record and create the table for you. (Presumably that is because they would prefer that you define the column data types, number precision, and size.)

create table weather(
dt integer,
temp decimal(6,2),
temp_min decimal(6,2),
temp_max decimal(6,2),
pressure int,
humidity int,
speed decimal(6,2),
deg int,
main varchar(50),
description varchar(50))

Copy data file to Snowflake stage area

Then create a Snowflake stage area like this.

The credentials here are not your Snowflake login; they are your AWS credentials. Snowflake needs an AWS IAM access key and secret key that can read the S3 bucket, regardless of how you log in to Snowflake itself.

create or replace stage paphosweather
url='s3://gluebmcwalkerrowe/paphosWeather.csv'
credentials=(aws_key_id='xxxxxxx' aws_secret_key='xxxxxxx')

Now copy the data into the table you created above.

copy into weather
  from 's3://gluebmcwalkerrowe/paphosWeather.csv' 
  credentials=(aws_key_id='xxxxxxx' aws_secret_key='xxxxxxxx')
  file_format = (type = csv field_delimiter = ',' skip_header = 1);

Convert the epoch time to readable format

The dt column is epoch time, which is the number of seconds since January 1, 1970. You can convert it to readable format (e.g., 2000-01-01 01:00:00.000) like this.

select to_timestamp(dt) from weather

Snowflake date and time formats

Snowflake seems to have some limits on importing date and time values. The data I had included the epoch time as well as the time in this format:

2020-09-12 23:00:00 +0000 UTC

which I converted to this format, where the +00:00 means the time zone offset in hh:mm:

2020-09-12 23:00:00+00:00

This format is also available, which drops the time zone altogether:

2020-09-12 23:00:00

Because I could not get any of those values to load, I used the epoch time. (If you can make it work write to us at blogs@bmc.com.)

2020-09-12 23:00:00+00:00 should match this formatting statement, which you set using an alter session statement prior to loading the CSV file:

alter session set TIMESTAMP_INPUT_FORMAT = 'yyyy-mm-dd HH24:MI:SS+TZH:TZM'

But Snowflake threw an error saying they did not match. That makes no sense since this statement, which tests that, worked with no problem:

select to_timestamp('2000-01-01 00:00:00+00:00', 'yyyy-mm-dd HH24:MI:SS+TZH:TZM');

Looking further, I saw in the documentation that the Snowflake datetime format does not support milliseconds or time zone offset. But I still got an error when I dropped that part of the date. The timestamp format did not work either. (Again, write to us if you do get this to work at blogs@bmc.com.)

Snowflake Window Functions: Partition By and Order By
https://www.bmc.com/blogs/snowflake-windows-functions-partition-by-order-by/

Snowflake supports window functions. Think of a window function as running over a subset of rows, except the results are returned for every row. That’s different from the traditional SQL group by, where there is one result for each group.

A window function could be useful in examples such as:

  • A running sum
  • The average values over some number of previous rows
  • A percentile ranking of each row among all rows.

The topic of window functions in Snowflake is large and complex. This tutorial serves as a brief overview and we will continue to develop additional tutorials.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Snowflake definitions

Snowflake defines a window as a group of related rows. A window is defined by the over() statement, which signals to Snowflake that you wish to use a window function instead of the traditional SQL function, as some functions work in both contexts.

A window frame is a subgroup of a window. Window frames require an order by statement, since the rows must be in a known order.

Window frames can be cumulative or sliding, which are extensions of the order by statement. Cumulative means across the whole window frame so far. Sliding means adding some offset, such as +/- n rows.

A window can also have a partition statement. A partition is a group of rows, like the traditional group by statement.
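
For example, a sketch using the orders table created below, contrasting a cumulative frame with a sliding one:

select shipdate,
       -- cumulative frame: everything from the start of the window up to this row
       sum(quantity) over (order by shipdate
            rows between unbounded preceding and current row) as running_total,
       -- sliding frame: this row plus the two rows before it
       sum(quantity) over (order by shipdate
            rows between 2 preceding and current row) as sliding_total
from orders;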

Window functions vs regular SQL

For example, if you group sales by product and you have 4 rows in a table, you might have two rows in the result:

Regular SQL group by

select count(*), product from sales group by product;

10 product A
20 product B

Window function

With the window function, you still have the count across the two groups, but each of the 4 rows in the database is listed. Yet the sum is for the whole group, because we use the partition statement.

count 10 product A
count 10 product A
count 20 product B
count 20 product B

Create some sample data

To study this, first create these two tables.

CREATE TABLE customers
  ( 
     customernumber     varchar(100) PRIMARY KEY, 
    customername varchar(50),
    phonenumber varchar(50),
    postalcode varchar(50),
    locale varchar(10),
    datecreated date,
    email varchar(50)
  );


CREATE TABLE orders
  (
     customernumber varchar(100) PRIMARY KEY,
     ordernumber varchar(100),
     comments varchar(200),
     orderdate date,
     ordertype varchar(10),
     shipdate date,
     discount number,
     quantity int,
     productnumber varchar(50)
  );

Then paste in this SQL data. The top of the data looks like this:

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('440','tiqthogsjwsedifisiir','3077854','vdew','','2020-09-27','twtp@entt.com');

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('440','402','swgstdhmju','2020-09-27','sale','2020-10-01','0.7005950240358919','61','BB111');

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('802','hrdngzutwelfhgwcyznt','1606845','rnmk','','2020-09-27','ympv@zfze.com');

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('802','829','jybwzvoyzb','2020-09-27','sale','2020-10-06','0.3702248922841853','75','FF4444');

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('199','ogvaevvhhqtjcqggafnv','8452159','hyxm','','2020-09-27','znqo@rftp.com');

Partition by

A partition creates subsets within a window. Here, we have the sum of quantity by product.

select customernumber, ordernumber, productnumber,quantity, 
        sum(quantity) over (partition by productnumber) as prodqty
               from orders 
               order by ordernumber

This produces the same results as this SQL statement in which the orders table is joined with itself:

select customernumber, 
        ordernumber,
        productnumber,quantity, 
        (select sum(quantity) from orders as o2 where o1.productnumber = o2.productnumber) as prodqty
               from orders as o1
         order by ordernumber

Order by

The sum() function does not make sense as a plain window function because it is for a group, not an ordered set. Yet Snowflake lets you use sum with a window frame—i.e., a statement with an order by statement—thus yielding results that can be difficult to interpret.

Let’s look at the rank() function, one that is relevant to ordering. Here, we use a window function to rank our most valued customers. These are the ones who have made the largest purchases.

The rank() function takes no arguments. The window is ordered by quantity in descending order. We limit the output to 10 rows so it fits on the page below.

select customernumber, quantity, rank() over (order by quantity desc) from orders  limit 10

Here is the output. The customer who has purchased the most is listed first.

Snowflake Lag Function and Moving Averages
https://www.bmc.com/blogs/snowflake-lag-function/

This tutorial shows you how to use the lag() function to calculate moving averages in Snowflake.

It builds upon work we shared in Snowflake SQL Aggregate Functions & Table Joins and Snowflake Window Functions: Partition By and Order By.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Using lag to calculate a moving average

We can use the lag() function to calculate a moving average. We use a moving average when we want to spot trends or reduce the volatility of a daily number, since it can vary widely.

In other words, it’s better to look at a week of sales versus one day to see how a product is performing.

Create sample data

To study this, first create these two tables.

CREATE TABLE customers
  ( 
     customernumber     varchar(100) PRIMARY KEY, 
    customername varchar(50),
    phonenumber varchar(50),
    postalcode varchar(50),
    locale varchar(10),
    datecreated date,
    email varchar(50)
  );


CREATE TABLE orders
  (
     customernumber varchar(100) PRIMARY KEY,
     ordernumber varchar(100),
     comments varchar(200),
     orderdate date,
     ordertype varchar(10),
     shipdate date,
     discount number,
     quantity int,
     productnumber varchar(50)
  );

Then paste in this SQL data. The top of the data looks like this:

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('440','tiqthogsjwsedifisiir','3077854','vdew','','2020-09-27','twtp@entt.com');

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('440','402','swgstdhmju','2020-09-27','sale','2020-10-01','0.7005950240358919','61','BB111');

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('802','hrdngzutwelfhgwcyznt','1606845','rnmk','','2020-09-27','ympv@zfze.com');

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('802','829','jybwzvoyzb','2020-09-27','sale','2020-10-06','0.3702248922841853','75','FF4444');

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('199','ogvaevvhhqtjcqggafnv','8452159','hyxm','','2020-09-27','znqo@rftp.com');

Write SQL statement

Now we want to calculate the moving average total sales over the previous four days.

Here, we have a select statement inside a select statement because we want one order total per day. Then the lag statements look over those records to reach the previous days.

select shipdate, (quantity + lag(quantity, 1) over (order by shipdate) +
lag(quantity, 2) over (order by shipdate) +
lag(quantity, 3) over (order by shipdate) +
lag(quantity, 4) over (order by shipdate)) / 5 as movingaverage from

(select shipdate, sum(quantity) as quantity from orders group by shipdate);

Here is the moving average. The first rows are null because the lag function looks back further than the data extends, to rows that don’t exist.
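
An equivalent way to express the same five-day window is avg() with an explicit sliding frame. One difference: instead of returning null, the first rows average over however many rows exist so far:

select shipdate,
       avg(quantity) over (order by shipdate
            rows between 4 preceding and current row) as movingaverage
from (select shipdate, sum(quantity) as quantity from orders group by shipdate);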

We can prove that this calculation is correct by calculating it another way.

Let’s sum orders by ship date.

select  shipdate, sum(quantity)
from orders group by shipdate
order by shipdate;

Then we copy the results into a spreadsheet:

  • On the left is the window function.
  • On the right is the query above.

I have added a column using the spreadsheet function average() to show that the numbers are the same. So, you can easily see how the window lag function works.

window function	sum and group by	=AVERAGE(E3:E7)
SHIPDATE	MOVINGAVERAGE	SHIPDATE	SUM(QUANTITY)	moving average
2020-09-30		2020-09-30	427	
2020-10-01		2020-10-01	230	
2020-10-02		2020-10-02	657	
2020-10-03		2020-10-03	604	
2020-10-04	488.6	2020-10-04	525	488.6
2020-10-05	462	2020-10-05	294	462
2020-10-06	547.2	2020-10-06	656	547.2
2020-10-07	485.2	2020-10-07	347	485.2
2020-10-08	470.8	2020-10-08	532	470.8
2020-10-09	486.2	2020-10-09	602	486.2
2020-10-10	495.2	2020-10-10	339	495.2
2020-10-11	465.8	2020-10-11	509	465.8

Snowflake SQL Aggregate Functions & Table Joins
https://www.bmc.com/blogs/snowflake-sql-aggregate-functions/

In this article, we explain how to use aggregate functions with Snowflake.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

What are aggregate functions?

Aggregate functions are those that perform some calculation over all the rows or subsets of rows in a table.

For example, the simplest aggregate function is count(). You could count all the customers in a table using count(*) with no group by or where clause. The * tells Snowflake to count rows; count(column) would count only the rows where that column is not null, which amounts to the same thing when the column has no null values.

select count(*) from orders

But if you want to count over some subset, you could, for example, count orders by order type:

select ordertype, count(*) from orders
group by ordertype;

Create some sample data

Let’s create some sample data in order to explore some of these functions. Log into Snowflake and click the Create Database button to create a database called inventory. Next, open the worksheet editor and paste in these two SQL commands:

CREATE TABLE customers
  ( 
     customernumber     varchar(100) PRIMARY KEY, 
    customername varchar(50),
    phonenumber varchar(50),
    postalcode varchar(50),
    locale varchar(10),
    datecreated date,
    email varchar(50)
  );


CREATE TABLE orders
  (
     customernumber varchar(100) PRIMARY KEY,
     ordernumber varchar(100),
     comments varchar(200),
     orderdate date,
     ordertype varchar(10),
     shipdate date,
     discount number,
     quantity int,
     productnumber varchar(50)
  );

Then paste in this data. The data looks like this:

insert into customers (customernumber,customername,phonenumber,postalcode,locale,datecreated,email) values ('ee56d97a-fcaa-11ea-ab7a-0ec120e133fc','zopvxqhwocrtsonemrcf','3119110','vqlx','','2020-09-22','mnst@yoaq.com');

insert into orders (customernumber,ordernumber,comments,orderdate,ordertype,shipdate,discount,quantity,productnumber) values ('ee56d97a-fcaa-11ea-ab7a-0ec120e133fc','ee56d97b-fcaa-11ea-ab7a-0ec120e133fc','shsyuaraxxftdzooafbg','2020-09-22','sale','2020-10-01','0.7751890540939359','40','ee56d97c-fcaa-11ea-ab7a-0ec120e133fc');

Joining tables

The customers and orders tables are related by customer number. Obviously, you need to bring them together in one set when you need customer and order data together. You do this with a join, which creates that set temporarily.

You join the two tables on the column element customer number. Note that:

  • We use as to create an alias, abbreviating the table names to make them easier to type.
  • We write join instead of inner join, since they are the same thing and spelling out inner just confuses matters. (Tutorials often write inner join to contrast it with left, right, and full outer joins, which also keep unmatched rows, and with the cross join, which is a cartesian product: each of n orders tacked onto each of m customers, creating a set of n*m rows.)
select c.customernumber, c.customername, o.ordernumber,  c.datecreated, o.orderdate,  o.shipdate from customers as c
join orders as o on c.customernumber = o.customernumber;

Standard deviation

Let’s calculate the standard deviation in shipping times. We do this in three steps:

  1. Join the customer and order tables.
  2. Use the datediff() function to calculate the shipping time, meaning how long the customer must wait.
  3. Wrap each inner query in parentheses so that the outer query can refer to it. The query is thus built up in stages.

Here is the complete query. See below to see how it is broken down.

select
  avg(shiptime),
  stddev_pop(shiptime)
from
  (
    select
      customernumber,
      customername,
      orderdate,
      shipdate,
      datediff(days, orderdate, shipdate) as shiptime
    from
      (
        select
          c.customernumber,
          c.customername,
          o.ordernumber,
          c.datecreated,
          o.orderdate,
          o.shipdate
        from
          customers as c
          join orders as o on c.customernumber = o.customernumber
      )
    order by
      shiptime desc
  )

We build up the query in stages. Start at the bottom (innermost) query and work upwards:

  1. Join the customers and orders tables so that we have the customer and order details in one set and can list both. (We could have skipped this step since we only end up using the orders table.)
(
  select
    c.customernumber,
    c.customername,
    o.ordernumber,
    c.datecreated,
    o.orderdate,
    o.shipdate
  from
    customers as c
    join orders as o on c.customernumber = o.customernumber
)

  2. Calculate the shipping time using the datediff() function:
select
  count(*),
  datediff(days, orderdate, shipdate) as shiptime
from
  orders
group by
  shiptime
order by
  shiptime

  3. Calculate the standard deviation over the shipping times:
select avg(shiptime), stddev_pop(shiptime) from (the query in step 2)

Here are the results:

AVG(SHIPTIME)	STDDEV_POP(SHIPTIME)
8.539063	3.512038155

Discrete percentile

A value at the 95th percentile is greater than or equal to 95% of the population. That’s a common statistic, as data outside that range is generally considered to be outliers.

Here we show how to calculate the 25th percentile:

select customernumber ,  PERCENTILE_disc( 0.25 ) within group (order by quantity)
from orders
  where customernumber = '5d2b742e-fcaa-11ea-ab7a-0ec120e133fc'
  group by customernumber
  order by customernumber

Results in:

CUSTOMERNUMBER	PERCENTILE_DISC( 0.25 ) WITHIN GROUP (ORDER BY QUANTITY)
5d2b742e-fcaa-11ea-ab7a-0ec120e133fc	9

Do a check and you can see that the order quantities are 92, 55, and 9. So the only one in the bottom 25th percentile is 9.

select quantity from orders
where customernumber = '5d2b742e-fcaa-11ea-ab7a-0ec120e133fc'
order by quantity desc;

Here are the results:

QUANTITY
92
55
9

listagg

The listagg function concatenates the order numbers for a customer into a single delimited string, a format you could use, for example, in a where clause that calls for a list of elements.

select listagg(ordernumber, '|')   
from orders
where customernumber = '5d2b742e-fcaa-11ea-ab7a-0ec120e133fc'

Here are the results:

LISTAGG(ORDERNUMBER, '|')
5d2b742f-fcaa-11ea-ab7a-0ec120e133fc|5d2b7431-fcaa-11ea-ab7a-0ec120e133fc|5d2b7433-fcaa-11ea-ab7a-0ec120e133fc

When you run queries, you should cross-check them with other queries to double-check your work. Here we list the order numbers straight up and down in rows.

select ordernumber from orders
where customernumber = '5d2b742e-fcaa-11ea-ab7a-0ec120e133fc';

mode

The mode() function shows the most frequent value:

select mode(quantity)   
from orders

Results in:

MODE(QUANTITY)
13

How To Query JSON Data in Snowflake
https://www.bmc.com/blogs/snowflake-query-json-data/

We’ve already shown you how to create a variant column in a Snowflake table, where variant means JSON. In this tutorial, we show how to query those JSON columns.

(This article is part of our Snowflake Guide. Use the right-hand menu to navigate.)

Create a table with a JSON column

First create a database or use the inventory one we created in the last post and then create a table with one column of type variant:

use database inventory;
create table jsonRecord(jsonRecord variant);

Add JSON data to Snowflake

Then, add some data. We will add simple JSON, nested JSON, and JSON arrays (i.e. JSON objects inside brackets []) to show how to query each type. Notice the parse_json() function.

INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON('{"customer": "Walker"}');
INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON('{"customer": "Stephen"}');
INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON('{"customer": "Aphrodite", "age": 32}');

These records include a JSON array, orders.

INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON(' {
            "customer": "Aphrodite",
            "age": 32,
            "orders": [{
                                    "product": "socks",
                                    "quantity": 4
                        },
                        {
                                    "product": "shoes",
                                    "quantity": 3
                        }
            ]
 }');
INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON(' {
            "customer": "Nina",
            "age": 52,
            "orders": [{
                                    "product": "socks",
                                    "quantity": 3
                        },
                        {
                                    "product": "shirt",
                                    "quantity": 2
                        }
            ]
 }');

This record includes nested JSON, meaning an attribute, address, whose value is another JSON object.

INSERT INTO JSONRECORD (jsonrecord) select PARSE_JSON(' {
            "customer": "Maria",
            "age": 22,
     "address" : { "city": "Paphos", "country": "Cyprus"},                                                   
            "orders": [{
                                    "product": "socks",
                                    "quantity": 3
                        },
                        {
                                    "product": "shirt",
                                    "quantity": 2
                        }
            ]
 }');

Now run select * from JSONRECORD to show all the records. Note that the JSON keys inside the variant data are case-sensitive, while unquoted Snowflake identifiers (function, column, and table names) are not: Snowflake resolves them as uppercase.


How to select JSON data in Snowflake

The format for selecting data includes all of the following:

  • tableName:attribute
  • tableName.attribute.JsonKey
  • tableName.attribute.JsonKey[arrayIndex]
  • tableName.attribute['JsonKey']
  • get_path(tableName, attribute)

Here we select the customer key from the JSON record. In JSON we call the items key value pairs, like: {“key”: “value”}.

select jsonrecord:customer from JSONRECORD;

The results show the value of the customer key for each row.

We can also use the get_path() function:

select get_path(jsonrecord, 'address') from JSONRECORD;

Here we add a where clause, using the same colon (:) and dot (.) notation as on the select side of the statement.

select jsonrecord:address.city from JSONRECORD where jsonrecord:customer = 'Maria';

We use an alternate approach. We get nested JSON objects by putting the keys in brackets [].

select jsonrecord['address']['city'] from JSONRECORD where jsonrecord:customer = 'Maria';

Values which do not exist are shown as NULL.

Here we pick the first element from the orders array using the array index (it starts at 0).

select jsonrecord['orders'][0] from JSONRECORD where jsonrecord:customer = 'Maria';

Here we use the colon (:) notation to get the same element.

select jsonrecord:orders[0] from JSONRECORD where jsonrecord:customer = 'Maria';

Results:

{ "product": "socks", "quantity": 3 }

Here, we flatten the array. This record has two order JSON records, so it shows two rows in the results, with each order attached to the other attributes.

In other words, flatten() explodes the record out to array_size rows, filling out the other columns in the select statement with the non-array values. Think of it as an easy way to show all the orders a customer made, where the customer data is repeated on each row to make it easy to see:

select jsonrecord:customer, jsonrecord:orders  from JSONRECORD ,
   lateral flatten(input => jsonrecord:orders) prod ;
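
To pull individual fields out of each flattened array element, reference the value column that flatten() produces:

select jsonrecord:customer as customer,
       prod.value:product as product,
       prod.value:quantity as quantity
from JSONRECORD,
   lateral flatten(input => jsonrecord:orders) prod;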

BMC, Control-M support Snowflake

BMC is a member of the Snowflake Technology Alliance Partner program. Together, Snowflake’s market-leading, built-for-cloud data platform and Control-M, our enterprise application workflow orchestration platform, help customers accelerate their data-driven enterprise.
