Amazon Redshift Guide – BMC Software | Blogs

Creating Redshift User Defined Function (UDF) in Python
https://www.bmc.com/blogs/amazon-redshift-udf-python/

You can create user-defined functions (UDFs) in Amazon Redshift in Python. If you use AWS Lambda serverless functions, you can write UDFs in additional languages as well. (Lambda UDFs add some usage costs, but unless you're budget-conscious, that's no reason not to use them.)

You can also use third-party libraries. In the case of Python, you could use Pandas and NumPy, for example (a sketch using NumPy appears after the example below).

UDF example

Let’s walk through a simple example. This is a scalar function, meaning it returns a single value.

First create a table:

create table orders(
  customernumber integer,
  ordernumber integer,
  orderdate date,
  quantity smallint,
  discount decimal(3,2),
  price decimal(8,2),
  primary key(customernumber, ordernumber));

Then add one record to it.

insert into orders(customernumber, ordernumber, orderdate, quantity, discount, price)
values(123, 456, '2020-10-20', 100, 0, 30);

Then create a function. Notice the odd language name plpythonu. The name is historical: it is what PostgreSQL calls its Python procedural language, and Redshift, which descends from PostgreSQL, keeps the same name even though Redshift is not PostgreSQL.

The function definition is basically functionName(arguments …), followed by a return type:

create function revenue (price float, quantity float)
  returns float
stable
as $$
  return price * quantity
$$ language plpythonu;

Now run that function over the price and quantity columns in the orders table.

select price, quantity, revenue(price, quantity)
from orders;

Here are the results:

price  quantity  revenue
30.00  100       3000.0
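To illustrate the earlier point about third-party libraries, here is a minimal sketch of a second UDF that uses NumPy, which Amazon lists among the libraries preinstalled for Python UDFs. The function name discounted_price is just a hypothetical example; it works against the price and discount columns of the orders table above.

create function discounted_price (price float, discount float)
  returns float
stable
as $$
  # NumPy ships preinstalled in the Redshift Python UDF environment
  import numpy as np
  # apply the discount, round to two decimal places,
  # and cast back to a plain Python float before returning
  return float(np.round(price * (1 - discount), 2))
$$ language plpythonu;

You would call it the same way as revenue():

select price, discount, discounted_price(price, discount)
from orders;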

Writing SQL Statements in Amazon Redshift
https://www.bmc.com/blogs/amazon-redshift-write-sql-statements/

In this tutorial, we show how to write Amazon Redshift SQL statements. Since this topic is large and complex, we start with the basics.

This tutorial will show you how to:

  • Use the query editor
  • Aggregate rows using group by
  • Convert dates to year and month
  • Export the results to a csv file

Redshift query editor

To open the query editor, click Editor on the clusters screen. Redshift will then ask you for credentials to connect to a database. One nice feature is the option to generate temporary credentials, so you don't have to remember your password; it's enough to have a login to the AWS Console.

Below we have one cluster which we are resuming after having it in a paused state (to reduce Amazon billing charges).

 

You write the SQL statement in the editor window. Only one statement is allowed at a time, since Redshift can only display one set of results at a time. To work on more than one statement, click the plus (+) sign to add an additional tab.

When you run each query, it takes a few seconds as it submits the job and then runs it. So, it’s not instantaneous, as you might expect with other products.

The results are shown at the bottom where you can export those as a CSV, TXT, or HTML. You can also chart the results.

Get table schema

For this tutorial, we use a table of weather data. (See more on loading data to Amazon Redshift from S3.) This is 20 years of weather data for Paphos, Cyprus. It has four columns:

  • dt_iso
  • temp
  • temp_min
  • temp_max

dt_iso is of type timestamp and serves as the table's distribution key and sort key. One nice thing about Redshift is that you can load dates in almost any format and Redshift understands them. Redshift then provides the to_char() function to print out any part of the date you want, such as the hour, year, or minute.
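For example, here is a quick sketch pulling a few parts out of dt_iso (YYYY, MM, and HH24 are standard to_char format patterns; the limit just keeps the output short):

select dt_iso,
       to_char(dt_iso, 'YYYY') as year,
       to_char(dt_iso, 'MM') as month,
       to_char(dt_iso, 'HH24') as hour
from paphos
limit 5;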

To look at the table schema, query the pg_table_def table.

SELECT *
FROM pg_table_def
WHERE tablename = 'paphos'
AND schemaname = 'public';

Here is the schema.

schemaname  tablename  column    type                         encoding  distkey  sortkey  notnull
public      paphos     dt_iso    timestamp without time zone  none      t        1        t
public      paphos     temp      real                         none      f        0        f
public      paphos     temp_min  real                         none      f        0        f
public      paphos     temp_max  real                         none      f        0        f

Aggregate SQL statements

This query calculates the average temperature per month for the summer months May through September. Notice:

  • to_char() extracts any portion of the date that you want, such as the YYYY year or the MM month number.
  • We use the in() clause to select only the summer months.
  • The order by clause uses a 1, meaning order by the first column returned by the query. That's an alternative to typing the column name.
  • We group by the year and month since we want to calculate the average [avg()] for each month within each year.
  • We use the round() function to round to two decimal places. Otherwise Redshift gives too many decimal places.
  • As with other databases, the as keyword gives an alias to the calculated column. Without it the column would not have a descriptive name. Here we call the average temperature aveTemp.

select round(avg(temp),2) as aveTemp,
       to_char(dt_iso,'YYYY') as year,
       to_char(dt_iso,'MM') as month
from paphos
where month in ('05','06','07','08','09')
group by year, month
order by 1 desc;

Here are the results, showing the hottest months across the 20 years of data (the display is cut off to keep it short). For example, August 2010 was the hottest month in the whole period. We grouped by year and then month because the raw data is individual weather observations, and we want one average per month within each year.

avetemp year month
84.11 2010 8
83.12 2012 8
83.05 2012 7
82.9 2015 8
82.39 2017 7
82.04 2014 8
81.85 2007 7
81.73 2020 9
81.72 2013 8
81.72 2008 8
81.62 2000 7
81.61 2009 8
81.49 2017 8
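If you only want that single hottest month, a minimal variation on the same query is to add limit 1:

select round(avg(temp),2) as aveTemp,
       to_char(dt_iso,'YYYY') as year,
       to_char(dt_iso,'MM') as month
from paphos
where month in ('05','06','07','08','09')
group by year, month
order by 1 desc
limit 1;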

We export the data to CSV format using the button to the right of the results, then import it into a spreadsheet to make the results easier to read and format.

Here are the hottest years. We get that by dropping the month from the aggregation.

select round(avg(temp),2) as aveTemp,
       to_char(dt_iso,'YYYY') as year
from paphos
group by year
order by 1 desc;

How to Copy JSON Data to an Amazon Redshift Table
https://www.bmc.com/blogs/amazon-redshift-copy-json-data/

Here we show how to load JSON data into Amazon Redshift. In this example, Redshift parses the JSON data into individual columns. (It is possible to store JSON in char or varchar columns, but that’s another topic.)

First, review this introduction on how to stage the JSON data in S3 and instructions on how to get the Amazon IAM role that you need to copy the JSON file to a Redshift table.

In this example, we load 20 years of temperature data for Paphos, Cyprus. We purchased that data for $10 from OpenWeather. Of course, you could use any data.

Create a Redshift Table

First we create a table. We only want the date and the three temperature columns. We will give Redshift a JSONPaths configuration file telling it where to find these elements, so it will discard the others.

create table paphos (
  dt_iso timestamp not null distkey sortkey,
  temp real,
  temp_min real,
  temp_max real
);

The weather data looks like this:

{
  "city_name": "Paphos Castle",
  "lat": 34.753637,
  "lon": 32.406951,
  "main": {
    "temp": 55.35,
    "temp_min": 51.8,
    "temp_max": 65.53,
    "feels_like": 49.44,
    "pressure": 1016,
    "humidity": 73
  },
  "wind": {
    "speed": 9.17,
    "deg": 20
  },
  "clouds": {
    "all": 1
  },
  "weather": [{
    "id": 800,
    "main": "Clear",
    "description": "sky is clear",
    "icon": "01n"
  }],
  "dt": 946684800,
  "timezone": 7200
}

Here is one record shown in the JSON Power Editor for Mac.

Note: I recommend this editor if you work with JSON a lot, as it makes editing JSON files much easier. You can work with objects in the right-hand pane, which creates the text in the left-hand pane. That saves you the trouble of fixing syntax errors and lining up curly brackets.

Create JSONPath file

We create a JSONPath file, which tells Redshift which elements to get. We have to spell out the full path to each item. In other words, we can't put just the top-level key main and expect it to get temp, temp_min, and temp_max; we have to give the full path of JSON keys, such as main->temp.

We don't extract anything from an array in this example, but JSONPaths supports that with [array index] notation; for example, $['weather'][0]['description'] would pick the description out of the first element of the weather array.

{
  "jsonpaths": [
    "$['dt_iso']",
    "$['main']['temp']",
    "$['main']['temp_min']",
    "$['main']['temp_max']"
  ]
}

Prepare and upload JSON file to S3

This text file is 64 MB of daily weather records for the past 20 years. Unfortunately, it comes as one large JSON array, so we have to remove the opening and closing bracket characters from the file.

Also, Redshift seems to require that each record end with a line feed, so we need to insert one after each record. Use these three sed statements to do that.

Note: this record-per-line format means one JSON object right after another (sometimes called newline-delimited JSON). Taken together, the file is not a valid JSON object; it's just a way to put many JSON objects into one file for bulk loading.

sed 's/^.//'

sed 's/.$//'

sed 's/,{"city/\n{"city/g'

Copy this file and the JSONPaths file to S3 using:

aws s3 cp (file)  s3://(bucket)

Load the data into Redshift

We use this command to load the data into Redshift. paphosWeather.json is the data we uploaded. paphosWeatherJsonPaths.json is the JSONPath file.

copy paphos
from 's3://gluebmcwalkerrowe/paphosWeather.json'
credentials 'aws_iam_role=arn:aws:iam::xxxxx:role/Redshift'
region 'eu-west-3'
json 's3://gluebmcwalkerrowe/paphosWeatherJsonPaths.json';

Common errors

If you have formatted the data file or the JSONPaths file incorrectly, you will get one of these errors.

The first says to look into select * from stl_load_errors for more details:

ERROR: Load into table 'paphos' failed. Check 'stl_load_errors' system table for details.
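As a sketch of that follow-up, you can pull just the most useful columns from the stl_load_errors system table and look at the most recent failure:

select starttime, filename, line_number, colname, err_reason
from stl_load_errors
order by starttime desc
limit 1;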

The next error says the JSONPaths file has five elements while the target table has 13 columns; the two counts have to match, so go back and fix one or the other.

ERROR: Number of jsonpaths and the number of columns should match. JSONPath size: 5, Number of columns in table or column list: 13 Detail: ----------------------------------------------- error: Number of jsonpaths and the number of columns should match. JSONPath size: 5, Number of columns in table or column list: 13 code: 8001 context: query: 273 location: s3_utility.cpp:780 process: padbmaster [pid=20575] -----------------------------------------------

If you leave all your JSON data in one array instead of the record-per-line format, it will be too large and you might get:

String length exceeds DDL length

Check the loaded data

Here we look at the first 10 records:

select * from paphos limit 10;

Here we count them. As you can see there are 181,456 weather records.

select count(*) from paphos;
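As one more quick sanity check, you can confirm that the loaded data really spans the 20-year period by looking at the earliest and latest timestamps:

select min(dt_iso) as first_observation,
       max(dt_iso) as last_observation
from paphos;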

How To Load Data to Amazon Redshift from S3
https://www.bmc.com/blogs/amazon-redshift-load-data/

There are several ways to load data into Amazon Redshift. In this tutorial, we’ll show you one method: how to copy JSON data from S3 to Amazon Redshift, where it will be converted to SQL format.

What is Amazon Redshift?

Amazon Redshift is a data warehouse that is known for its incredible speed. Redshift can handle large volumes of data as well as database migrations.

(Infamously, Amazon came up with the name Redshift in response to Oracle’s database dominance. Oracle is informally known as “Big Red”.)

Other methods for loading data to Redshift

Here are other methods for loading data into Redshift:

  • Write a program and use a JDBC or ODBC driver.
  • Paste SQL insert statements into the query editor (see the sketch after this list).
  • Write data to Redshift from AWS Glue.
  • Use EMR.
  • Copy JSON, CSV, or other data from S3 to Redshift.
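As a sketch of the second option, you could paste insert statements straight into the query editor. The values here are made up, and the statement assumes the orders table created later in this tutorial:

insert into orders (customernumber, ordernumber, comments, orderdate, ordertype, shipdate, discount, quantity, productnumber)
values ('cust-001', 'order-001', 'example row', '2020-09-03', 'sale', '2020-09-16', 0.1, 5, 'prod-001');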

Now, onto the tutorial.

Getting started

We will upload two JSON files to S3. Download them from here:

Note the format of these files:

  • JSON
  • There is no comma between records.
  • It is not a JSON array. Just JSON records one after another.

The orders JSON file looks like this. It only has two records. Notice that there is no comma between records.

{
	"customernumber": "d5d5b72c-edd7-11ea-ab7a-0ec120e133fc",
	"ordernumber": "d5d5b72d-edd7-11ea-ab7a-0ec120e133fc",
	"comments": "syjizruunqxuaevyiaqx",
	"orderdate": "2020-09-03",
	"ordertype": "sale",
	"shipdate": "2020-09-16",
	"discount": 0.1965497953690316,
	"quantity": 29,
	"productnumber": "d5d5b72e-edd7-11ea-ab7a-0ec120e133fc"
} {
	"customernumber": "d5d5b72f-edd7-11ea-ab7a-0ec120e133fc",
	"ordernumber": "d5d5b730-edd7-11ea-ab7a-0ec120e133fc",
	"comments": "uixjbivlhdtmaelfjlrn",
	"orderdate": "2020-09-03",
	"ordertype": "sale",
	"shipdate": "2020-09-16",
	"discount": 0.6820749537170963,
	"quantity": 42,
	"productnumber": "d5d5b731-edd7-11ea-ab7a-0ec120e133fc"
}

IAM role

You need to give your Redshift cluster a role that grants it permission to read S3. You don't give the permission to an IAM (Identity and Access Management) user; you attach the role to the cluster, i.e., the machine where Amazon installs and starts Redshift for you.

Create the role in IAM and give it a name (I used Redshift). Give it the AmazonS3ReadOnlyAccess permission, then paste the role's ARN into the cluster configuration. It will look like this:

arn:aws:iam::xxxxxxxxx:role/Redshift

Create connection to a database

After you start a Redshift cluster and open the editor to enter SQL commands, log in as the awsuser user. The default database is dev. Use the option to connect with a temporary password.

Create tables

Paste in these two SQL commands to create the customers and orders tables in Redshift.

create table customers (
    customerNumber char(40) not null distkey sortkey,
    customerName varchar(50),
    phoneNumber varchar(14),
    postalCode varchar(4),
    locale varchar(11),
    dateCreated timestamp,
    email varchar(20));

create table orders (
    customerNumber char(40) not null distkey sortkey,
    orderNumber char(40) not null,
    comments varchar(200),
    orderDate timestamp,
    orderType varchar(20),
    shipDate timestamp,
    discount real,
    quantity integer,
    productNumber varchar(50));

Upload JSON data to S3

Create an S3 bucket if you don't already have one. If you have installed the AWS CLI and run aws configure, you can do that with aws s3 mb s3://(bucket name). Then copy the JSON files to S3 like this:

aws s3 cp customers.json s3://(bucket name)

aws s3 cp orders.json s3://(bucket name)

 

Copy S3 data into Redshift

Use these SQL commands to load the data into Redshift. Some items to note:

  • Use the ARN string copied from IAM as the aws_iam_role value in the credentials clause.
  • You don't need the region clause unless your Redshift cluster is in a different AWS region than your S3 bucket.
  • json 'auto' means that Redshift will match the JSON keys to the SQL column names on its own. Otherwise you would have to create a JSON-to-SQL mapping (JSONPaths) file.
copy customers
from 's3://gluebmcwalkerrowe/customers.json'
credentials 'aws_iam_role=arn:aws:iam::xxxxxxx:role/Redshift' 
region 'eu-west-3'
json 'auto';

copy orders
from 's3://gluebmcwalkerrowe/orders.json'
credentials 'aws_iam_role=arn:aws:iam::xxxx:role/Redshift' 
region 'eu-west-3'
json 'auto';

Now you can run this query:

select * from orders;

And it will list the two order records we loaded.

Repeat the query for the customers table as well.
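As a quick sketch of a final check, count the rows in each table (run each statement in its own query editor tab, since the editor runs one statement at a time):

select count(*) from customers;
select count(*) from orders;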
