How To Connect Amazon Glue to a JDBC Database

Here we explain how to connect Amazon Glue to a Java Database Connectivity (JDBC) database.

The reason you would do this is to be able to run ETL jobs on data stored in various systems. For example, you could:

  • Read .CSV files stored in S3 and write those to a JDBC database.
  • Write database data to Amazon Redshift, JSON, CSV, ORC, Parquet, or Avro files in S3.
  • Once the JDBC database metadata is created, you can write Python or Scala scripts that create Spark DataFrames and Glue DynamicFrames to do ETL transformations and then save the results (see the sketch after this list).
  • Since a Glue crawler can span multiple data sources, you can bring disparate data together and join it for purposes of preparing data for machine learning, running other analytics, deduping a file, and doing other data cleansing. However, you are limited to the Python packages preinstalled in the Glue PySpark environment; you cannot add more.
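
Here is the sketch referenced above: a minimal Glue job that reads a catalog table created by a JDBC crawler and writes the results to S3 as Parquet. The database, table, and bucket names are placeholders, not values from this tutorial.

from pyspark.context import SparkContext
from awsglue.context import GlueContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the catalog table the JDBC crawler created (names are assumptions)
orders = glueContext.create_dynamic_frame.from_catalog(
    database="jdbc_catalog_db",
    table_name="inventory_orders")

# Write the records out to S3 as Parquet files
glueContext.write_dynamic_frame.from_options(
    frame=orders,
    connection_type="s3",
    connection_options={"path": "s3://example-output-bucket/orders/"},
    format="parquet")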

In this tutorial, we use PostgreSQL running on an EC2 instance. Glue supports Postgres, MySQL, Redshift, and Aurora databases. To use other databases, you would have to provide your own JDBC jar file.

Amazon VPC

Unfortunately, configuring Glue to crawl a JDBC database requires that you understand how to work with Amazon VPC (Virtual Private Cloud). I say unfortunately because application programmers don’t tend to understand networking. Amazon requires this so that your traffic does not travel over the public internet.

Fortunately, EC2 creates these networking pieces (a default VPC and subnet) for you when you spin up virtual machines. All you need to do is set the firewall rules in the default security group for your virtual machine.

If you do this step wrong, or skip it entirely, you will get the error:

ERROR : At least one security group must open all ingress ports. To limit traffic, the source security group in your inbound rule can be restricted to the same security group

Glue can only crawl networks in the same AWS region—unless you create your own NAT gateway.

Configure firewall rule

Look at the EC2 instance where your database is running and note the VPC ID and Subnet ID.

Go to Security Groups and pick the default one. You might have to clear out the filter at the top of the screen to find that.

Add an inbound rule that allows all TCP traffic, and set its source to the ID of that same default security group (a self-referencing rule).

Amazon Glue security groups

Don’t use your Amazon console root login; use an IAM user. For all Glue operations, that user will need the AWSGlueServiceRole and AmazonS3FullAccess policies, or some subset thereof.

The ARN of your Glue service role will look something like this:

arn:aws:iam::(XXXX):role/service-role/AWSGlueServiceRole-S3IAMRole

Create a JDBC connection

In Amazon Glue, create a JDBC connection. It should look something like this:

Type	JDBC
JDBC URL	jdbc:postgresql://xxxxxx:5432/inventory
VPC Id	vpc-xxxxxxx
Subnet	subnet-xxxxxx
Security groups	sg-xxxxxx
Require SSL connection	false
Description	-
Username	xxxxxxxx
Created	30 August 2020 9:37 AM UTC+3
Last modified	30 August 2020 4:01 PM UTC+3
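
If you prefer to script this step, the same connection can be created with boto3. This is a minimal sketch; the connection name, endpoint, credentials, subnet, security group, and availability zone are all placeholders.

import boto3

glue = boto3.client("glue")

# All values below are placeholders; substitute your own endpoint, VPC details, and credentials
glue.create_connection(
    ConnectionInput={
        "Name": "postgres-inventory",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:postgresql://xxxxxx:5432/inventory",
            "USERNAME": "xxxxxxxx",
            "PASSWORD": "xxxxxxxx"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-xxxxxx",
            "SecurityGroupIdList": ["sg-xxxxxx"],
            "AvailabilityZone": "eu-west-3a"
        }
    })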

Define crawler

Create a Glue database. In Glue this is basically just a name with no other parameters; it’s a container for catalog tables rather than a database in the usual sense.

Next, define a crawler to run against the JDBC database. The include path is the database/table in the case of PostgreSQL.

For other databases, look up the JDBC connection string.
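
As a scripted alternative to the console, a JDBC crawler can be defined with boto3 along these lines. The crawler name, role, catalog database, connection name, and include path are assumptions; adjust the path to your own database and schema.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="inventory-jdbc-crawler",                   # assumed crawler name
    Role="AWSGlueServiceRole-S3IAMRole",             # the Glue service role
    DatabaseName="inventory_catalog_db",             # Glue database that will hold the metadata tables
    Targets={
        "JdbcTargets": [{
            "ConnectionName": "postgres-inventory",  # the JDBC connection created above
            "Path": "inventory/%"                    # include path: database/table (% works as a wildcard)
        }]
    })

glue.start_crawler(Name="inventory-jdbc-crawler")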

Run the crawler

When you run the crawler, it provides a link to the logs stored in CloudWatch. Look there for errors or success.

If you have done everything correctly, it will generate metadata tables in the Glue database. This is not the data itself; it’s just the schema for your tables.

How To Run Machine Learning Transforms in AWS Glue

Here we show you how to do a machine learning transformation with Amazon Glue. Previous Glue tutorials include:

  • How To Make a Crawler in Amazon Glue
  • How To Join Tables in Amazon Glue
  • How To Define and Run a Job in AWS Glue
  • AWS Glue ETL Transformations

Now, let’s get started.

Amazon’s machine learning

A fully managed service from Amazon, AWS Glue handles data operations like ETL to get your data prepared and loaded for analytics activities. Glue can crawl S3, DynamoDB, and JDBC data sources.

Amazon calls this offering machine learning, but it includes only one ML-type function, FindMatches. It uses an ML algorithm, but Amazon does not say which one. Amazon even boasts on its web page that you don’t need to know, though a data scientist would certainly want to know.

You can study their execution log to gain some insight into what their code is doing. Suffice it to say it is doing a type of clustering algorithm and using Apache Spark as a platform to execute that.

The process: Amazon Glue machine learning

Here is the general process for running machine learning transformations:

  1. Upload a CSV file to an S3 bucket. Then you set up a crawler to crawl all the files in the designated S3 bucket. For each file it finds, it will create a metadata (i.e., schema) table in Glue that contains the column names.
  2. Set up a FindMatches machine learning task in Glue. It’s an iterative process. It takes your input data, catalogued in the crawler step, and makes a label file. The grouping works something like a k-means clustering algorithm: it looks at the input data and all of the columns in the data set, then puts the data into groups, each labeled with a labeling_set_id. (A boto3 sketch of this step appears after this list.)
  3. Download the label file. There will be an empty column called label. You are invited to add your own label to classify data however you see fit. For example, it could be borrower risk rating, whether or not a patient has diabetes, or whatever. Labels should be a single value, like A, B, C or 1, 2, 3. A data scientist would say they must be categorical.
  4. Upload the labelled file to a different S3 bucket. Do not use the same bucket where you put the original input data, as the crawler will attempt to crawl that and create another metadata file.
  5. Rerun Step 2, above, and it creates another labelled file. Do this iteratively until it supplies the most accurate result. In this example, there was no improvement from one run to the next. Repeating machine learning runs is standard practice for improving accuracy. However, at some point, the gain in accuracy will level off.
  6. Generate and then inspect the Quality Metrics. Perhaps change some of the parameters and run the Tune operation, which means to run the algorithm again.
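
For reference, step 2 can also be scripted. The following is a minimal boto3 sketch of creating a FindMatches transform; the transform name, catalog database and table, role, capacity, and tuning values are assumptions.

import boto3

glue = boto3.client("glue")

glue.create_ml_transform(
    Name="diabetes-find-matches",              # assumed transform name
    Role="AWSGlueServiceRole-S3IAMRole",
    InputRecordTables=[{
        "DatabaseName": "diabetes_db",         # database and table created by the crawler (assumed names)
        "TableName": "sagemakerwalkerml"
    }],
    Parameters={
        "TransformType": "FIND_MATCHES",
        "FindMatchesParameters": {
            "PrimaryKeyColumnName": "recordid",   # the primary key column Glue requires
            "PrecisionRecallTradeoff": 0.5,       # 0 favors recall, 1 favors precision
            "AccuracyCostTradeoff": 0.5           # 0 favors lower cost, 1 favors accuracy
        }
    },
    GlueVersion="1.0",
    MaxCapacity=10.0)                             # number of DPUs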

Tutorial: Amazon Glue machine learning

Now, let’s run an example to show you how it works.

I have copied the Pima Native American database from Kaggle and put it on GitHub, here. You have to add a primary key column to that data, which Glue requires. Download the data here. I have also copied the input data and the first and second label files here, in a Google Sheet, so that you can see the before and after process.
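
If your copy of the Kaggle file lacks that key column, a small pandas snippet like this can add one (starting at 2 simply matches the sample rows shown below):

import pandas as pd

df = pd.read_csv("diabetes.csv")
df.insert(0, "recordID", list(range(2, len(df) + 2)))  # add an integer primary key as the first column
df.to_csv("diabetes.csv", index=False)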

The data looks like this:

recordID,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
2,6,148,72,35,0,33.6,0.627,50,1
3,1,85,66,29,0,26.6,0.351,31,0
4,8,183,64,0,0,23.3,0.672,32,1
5,1,89,66,23,94,28.1,0.167,21,0
6,0,137,40,35,168,43.1,2.288,33,1

Then copy it to an Amazon S3 bucket as shown below. You need to have installed the AWS CLI (command line interface) and run aws configure to set up your credentials. Importantly, the data must be in the same AWS region as the Glue resources that will use it.

aws s3 cp diabetes.csv s3://sagemakerwalkerml

Add a label

The diabetes data is already labelled in the column outcome. So, I used Google Sheets to copy that value into the label column. You do this after the first run, like this:

  1. Upload the original data.
  2. Run the training model (the Train operation).
  3. Download the resulting labels file.

At that point you can populate the label with some kind of categorical data. You might put the outcome of logistic regression on your input data set into this label, but that’s optional. You don’t need a label at all.

The algorithm does not require a label the first time it runs. Glue says:

As you can see, the scope of the labels is limited to the labeling_set_id. So, labels do not cross labeling_set_id boundaries.

In other words, when there is no label, it groups records by labeling_set_id without regard to the label value. When there is a label, the labeling_set_id is scoped within the label.

In other words, given this:

labeling_set_id label other columns
123 blank
123 blank
456 blank

The first two rows are grouped together. But if we add a label:

labeling_set_id label other columns
123 A
123 A
456 B
456 B
789 A

Then the first two rows are matched together, while the rows with labeling_set_id 456, even if they would have matched without the label, are grouped separately. Remember, Amazon said labels “don’t cross boundaries.”

Of course, that does not mean only the label determines what is considered to be a match. (That would be of little use.) It’s the other columns that determine what matches. The label just confines that matching to records with that label. So, it’s matching within a subset of records. It’s like having n number of files with no label instead of one file with n labels, so you can run the process one time and not n times.

Anyway, that’s the conclusion I draw from this design. Perhaps yours will differ.

Crawl S3

We start with the crawlers. Here is the metadata extracted from the diabetes.csv file in S3:

It created these tables in the database.

Pick an IAM role that has access to S3 and give the transformation a name.

The data must have a primary key. The matching algorithm requires that to do its matching logic.

Then it asks you to tune the transformation. These are tradeoffs between cost and accuracy:

  • Cost, in one sense, is financial: what you pay to run the transform.
  • Cost, in the other sense, is the cost function from data science and computing that the algorithm trades off against accuracy.

(Pricing is based on resources (DPUs) you consume, which I cover below.)

The data science-related tuning parameters are between recall and precision.

Recall is the fraction of true matches that the transform actually finds: true positives / (true positives + false negatives).

Precision is the fraction of records flagged as matches that really are matches: true positives / (true positives + false positives).

Here is a summary of the parameters:

  • The first time let it generate a label file for you. It will match records based on all of the data points taken together.
  • The second time it will incorporate labels in its matching algorithm should you choose to add one.

Edit the file in Excel or Google Sheets to review it and, optionally, add a label. Copy it back to S3, putting it in a different bucket than the original upload file. Then run the transformation again (the Train operation). It will produce yet another label file, which contains the results of the matching, i.e., grouping, process.

It asks for the bucket name:

You download the labels from this screen.

Here is the first label file it created. You can’t see all of the columns because it’s too wide. But you can see the labeling_set_id, thus how it grouped the data:

Evaluation metrics

This screen lets you calculate accuracy. I have yet to figure out where you can see the results as the screen mentioned in the documentation does not exist. (I will update this tutorial once I get a response on the user forum.)

Pricing

Pricing is per DPU. I used 10 DPUs for about 30 minutes, at $0.44 per DPU for each full or partial hour. So presumably I spent $0.44*10=$4.40.
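
In other words, under the assumption that any fraction of an hour bills as a full hour, the estimate works out as:

dpus = 10
billed_hours = 1              # ~30 minutes, rounded up under the stated assumption
price_per_dpu_hour = 0.44
print(f"${dpus * billed_hours * price_per_dpu_hour:.2f}")   # $4.40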

These ETL jobs run on Amazon’s Spark and Yarn infrastructure. If you want to write code to do transformations you need to set up a Development Endpoint. Basically, the development endpoint is a virtual machine configured to run Spark and Glue. We explained how to use a Development Endpoint here. Then you can run Python or Scala and optionally use a Jupyter Notebook.

Important note: When you don’t need your development endpoint, be sure to delete it—it gets expensive quickly! (I spent $1,200 on that in a month.)

AWS Glue ETL Transformations

In this article, we explain how to do ETL transformations in Amazon’s Glue. For background material please consult How To Join Tables in AWS Glue. You first need to set up the crawlers in order to create some data.

By this point you should have created a titles DynamicFrame using the code below. Now we can show some ETL transformations.

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *


glueContext = GlueContext(SparkContext.getOrCreate())

titles = glueContext.create_dynamic_frame.from_catalog(database="moviesandratings", table_name="movieswalker")

Select fields

This ETL transformation creates a new DynamicFrame by taking the fields in the paths list. We use toDF().show() to turn it into a Spark DataFrame and print the results.

titles.select_fields(paths=["tconst","primaryTitle"]).toDF().show()

Map

The map function iterates over every record (called a DynamicRecord) in the DynamicFrame and runs a function over it.

First create a function that takes a DynamicRecord as an argument and returns the DynamicRecord. Here we take one column and make it uppercase:

 
def upper(rec):
    rec["tconst"] = rec["tconst"].upper()
    return rec

Then call that function on the DynamicFrame titles.

Map.apply(frame=titles,f=upper).toDF().show()

Apply mapping

This method changes column names and types. mappings is an array of tuples ("oldName", "oldType", "newName", "newType").
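
For example, assuming the titles DynamicFrame from above, a mapping like this would rename tconst and primaryTitle (the new names are just illustrative):

mapped = ApplyMapping.apply(
    frame=titles,
    mappings=[
        ("tconst", "string", "title_id", "string"),
        ("primaryTitle", "string", "primary_title", "string")
    ])
mapped.toDF().show()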

DynamicFrameCollection

A Dynamic Frame collection is a dictionary of Dynamic Frames. We can create one using the split_fields function. Then you can run the same map, flatmap, and other functions on the collection object. Glue provides methods for the collection so that you don’t need to loop through the dictionary keys to do that individually.

Here we create a DynamicFrame Collection named dfc. The first DynamicFrame splitoff has the columns tconst and primaryTitle. The second DynamicFrame remaining holds the remaining columns.

dfc=titles.split_fields(paths=["tconst","primaryTitle"],name1="splitoff",name2="remaining")
>>> dfc.keys()
dict_keys(['splitoff', 'remaining'])

We show the DynamicFrame splitoff below:

dfc['splitoff'].toDF().show()
+----------+-------------------+                                                
|    tconst|       primaryTitle|
+----------+-------------------+
| tt0276132|      The Fetishist|
| tt0279481|        Travel Daze|
| tt0305295|        Bich bozhiy|

Create a Dynamic DataFrame from a Spark DataFrame

Just as we can turn DynamicFrames into Spark DataFrames, we can go the other way around. We can create a DynamicFrame by first creating a Spark DataFrame and then using the fromDF function.

We use the Apache Spark SQL Row object.

 
from pyspark.sql import Row
from awsglue.dynamicframe import DynamicFrame

walker = Row(name='Walker', age=59)
stephen = Row(name='Stephen', age=40)
students = [walker, stephen]

# Build a Spark DataFrame, then convert it to a Glue DynamicFrame.
# spark is the SparkSession available in the Glue PySpark shell
# (you can also use glueContext.spark_session).
students_df = spark.createDataFrame(students)
students_dynamic_frame = DynamicFrame.fromDF(students_df, glueContext, "students")

How To Define and Run a Job in AWS Glue

Here we show how to run a simple job in Amazon Glue.

The basic procedure, which we’ll walk you through, is to:

  • Create a Python script file (or PySpark)
  • Copy it to Amazon S3
  • Give the Amazon Glue user access to that S3 bucket
  • Run the job in AWS Glue
  • Inspect the logs in Amazon CloudWatch

Create Python script

First we create a simple Python script:

arr=[1,2,3,4,5]

for i in range(len(arr)):
    print(arr[i])

Copy to S3

Then use the Amazon CLI to create an S3 bucket and copy the script to that folder.

aws s3 mb s3://movieswalker/jobs
aws s3 cp counter.py s3://movieswalker/jobs

Configure and run job in AWS Glue

Log into the Amazon Glue console. Go to the Jobs tab and add a job. Give it a name and then pick an Amazon Glue role. The role AWSGlueServiceRole-S3IAMRole should already be there. If it is not, add it in IAM and attach it to the user ID you have logged in with. See the instructions at the end of this article with regard to the role.

Configure and run job in AWS Glue
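
The console steps above can also be scripted with boto3. This is a minimal sketch, assuming a job name of counter-job and the script location from the earlier aws s3 cp command; a Python shell job is used since the script is plain Python.

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="counter-job",                                  # assumed job name
    Role="AWSGlueServiceRole-S3IAMRole",
    Command={
        "Name": "pythonshell",                           # plain Python job (use "glueetl" for PySpark jobs)
        "ScriptLocation": "s3://movieswalker/jobs/counter.py",
        "PythonVersion": "3"
    })

run = glue.start_job_run(JobName="counter-job")
print(run["JobRunId"])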

The script editor in Amazon Glue lets you change the Python code.

script editor

This screen shows that you can pass run-time parameters to the job:

Run-time Parameters
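
Inside the script, those run-time parameters can be read with the getResolvedOptions helper from the Glue libraries. The parameter name my_param here is just an example.

import sys
from awsglue.utils import getResolvedOptions

# For a job started with --my_param some_value
args = getResolvedOptions(sys.argv, ["my_param"])
print(args["my_param"])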

Run the job. If there is any error, you are directed to CloudWatch, where you can see it. The error below is an S3 permissions error:

S3 Permissions Error

Here is the job run history.

Job Run History

Here is the log showing that the Python code ran successfully. In this simple example it just printed out the numbers 1,2,3,4,5. Click the Logs link to see this log.

Logs

Give Glue user access to S3 bucket

If you have run any of our other tutorials, like running a crawler or joining tables, then you might already have the AWSGlueServiceRole-S3IAMRole. What’s important for running a Glue job is that the role has access to the S3 bucket where the Python script is stored.

In this example, I added that manually using the JSON Editor in the IAM roles screen and pasted in this policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}

If you don’t do this, or do it incorrectly, you will get this error:

File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden

Here we show that the AWSGlueServiceRole-S3IAMRole role has the AWSGlueServiceRole policy and the S3 policy we just added. That role, of course, must be attached to your IAM user ID.

AWSGlueServiceRole-S3IAMRole

How To Join Tables in Amazon Glue

Here we show how to join two tables in Amazon Glue. We make a crawler and then write Python code to create a Glue Dynamic Dataframe to join the two tables.

First, we’ll share some information on how joins work in Glue, then we’ll move onto the tutorial. You can start with the basics on Amazon Glue Crawlers, but we are going to modify the procedure described there to fit the data we have prepared below.

Brief intro to Amazon Glue

Glue is not a database. It basically contains nothing but metadata. You point it at a data source and it vacuums up the schema. Or you create the schema manually. The data itself stays in the underlying data store, such as S3 or a JDBC database.

Glue processes data sets using Apache Spark, an in-memory data processing engine. Then you can write the resulting data out to S3 or to MySQL, PostgreSQL, Amazon Redshift, SQL Server, or Oracle.

Glue can crawl these data types:

  • JSON
  • CSV
  • Parquet
  • Avro
  • XML

What is a join?

First, a join means taking two tables and combining them on a common element. Joining two tables is an important step in lots of ETL operations.

A join is a SQL operation that you could not perform on most NoSQL databases, like DynamoDB or MongoDB. NoSQL databases don’t usually allow joins because they are expensive operations that take a lot of time, disk space, and memory.

Amazon Glue joins

Glue does the joins using Apache Spark, which runs in memory.
In this example, it pulls JSON data from S3 and uses the metadata schema created by the crawler to identify the attributes in the files so that it can work with those.

Set up Amazon Glue Crawler in S3 to get sample data

We will use a small subset of the IMDB database (just seven records). We have converted the data to JSON format and put it on S3. First check that you can open these files by opening one of each of these:

The movie titles look like this:

{"tconst":  "tt0276132","titleType":  "movie","primaryTitle":  "The Fetishist","originalTitle":  "The Fetishist","isAdult":  "0","startYear":  "2019","endYear":  "\\N","runtimeMinutes":  "\\N","genres":  "Animation"}

The ratings look like this:

{"tconst": "tt0305295", "averageRating": "6.1", "numVotes": "16"}

The goal is to rate movies and TV shows. We have to do that with a join operation since the rating and the title are in separate datasets. The common element is the unique element tconst.

Set up a crawler in Amazon Glue and crawl these two folders:

  • s3://walkerimdbratings
  • s3://movieswalker/

Make sure you select Create Single Schema so that it makes just one table for each S3 folder and not one for each file.
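
If you script the crawler instead of using the console, the Create Single Schema checkbox appears to correspond to the crawler's grouping configuration, roughly as sketched below. The crawler name, role, and database name are assumptions.

import boto3
import json

glue = boto3.client("glue")

glue.create_crawler(
    Name="movies-and-ratings-crawler",             # assumed crawler name
    Role="AWSGlueServiceRole-S3IAMRole",
    DatabaseName="moviesandratings",
    Targets={"S3Targets": [
        {"Path": "s3://walkerimdbratings"},
        {"Path": "s3://movieswalker/"}
    ]},
    # Combine compatible schemas into a single table per S3 path
    Configuration=json.dumps({
        "Version": 1.0,
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"}
    }))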

Start Amazon Glue Virtual Machine

The Glue development environment is essentially a virtual machine running Spark and the Glue libraries. We are using it here through the Glue PySpark CLI; PySpark is the Spark Python shell. You can also attach a Zeppelin notebook to it or perform limited operations in the web console, like creating the database. And you can use Scala.

Glue supports two languages: Scala and Python. That’s because it rides atop Apache Spark, which supports those two languages as well (and, for the most part, only those two). Glue is basically an Apache Spark instance with Glue libraries attached.

Set up The Development Endpoint

Next, set a billing alarm in your Amazon AWS account. When you start an endpoint, you will incur charges from Amazon, since it’s a virtual machine. (You can download Glue and use it on a local machine if you don’t want to incur charges. But then you can’t use the GUI.)

Fill out these screens from the Glue console as follows. You will have to create a new public key in order to access the Glue VM from ssh. You cannot use the root Amazon credentials.

Once the endpoint is created, change the path to point to your key file and open the shell over ssh using the address Amazon gave you:

ssh -i /home/ubuntu/.ssh/glue glue@ec2-15-236-145-246.eu-west-3.compute.amazonaws.com -t gluepyspark3

That will open PySpark, which will be familiar to those who have used Apache Spark.

Python 3.6.11 (default, Jul 20 2020, 22:15:17) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2020-08-06 10:03:08,828 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.

Use Python code to join the tables

The code below is here. Basically, the code creates two Glue DynamicFrames, uses the Join function to connect the two on the common element tconst, and then converts the result to a Spark DataFrame to display it.

The first step in an Apache Spark program is to get a SparkContext, meaning connect to an instance of Spark:

glueContext = GlueContext(SparkContext.getOrCreate())

Next we create DynamicFrames. Those are Glue objects that don’t exist in Spark.

The database is the one you created manually in the GUI. The crawler created the tables and gave them the same names as the S3 buckets.

titles = glueContext.create_dynamic_frame.from_catalog(database="moviesandratings", table_name="movieswalker")
 

ratings = glueContext.create_dynamic_frame.from_catalog(database="moviesandratings", table_name="walkerimdbratings")

Now we create a new DynamicFrame using the Join object. You pass the two DynamicFrames to join and their common attribute, i.e., the primary key field.

ratingsTitles =   Join.apply(titles, ratings, 'tconst','tconst')

Then we convert that to a Spark Dataframe with toDF() so that we can use the select() method to pick the title and rating from the joined data.

ratingsTitles.toDF().select(['originalTitle','averageRating']).show()

The result is:

+-------------------+-------------+                                             
|      originalTitle|averageRating|
+-------------------+-------------+
|Motherless Brooklyn|          6.8|
|       Carnival Row|          7.9|
|      Cine Manifest|          7.2|
|       Pet Sematary|          5.7|
|           The Dirt|          7.0|
|         Dirt Music|          5.3|
|        Bich bozhiy|          6.1|
+-------------------+-------------+

The complete code

Here is the complete code:

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import *

from awsglue.transforms import Join


glueContext = GlueContext(SparkContext.getOrCreate())


titles = glueContext.create_dynamic_frame.from_catalog(database="moviesandratings", table_name="movieswalker")
 

ratings = glueContext.create_dynamic_frame.from_catalog(database="moviesandratings", table_name="walkerimdbratings")
 
ratingsTitles = Join.apply(titles, ratings, 'tconst', 'tconst')
ratingsTitles.toDF().select(['originalTitle','averageRating']).show()



How To Make a Crawler in Amazon Glue

In this tutorial, we show how to make a crawler in Amazon Glue.

A fully managed service from Amazon, AWS Glue handles data operations like ETL (extract, transform, load) to get the data prepared and loaded for analytics activities. Glue can crawl S3, DynamoDB, and JDBC data sources.

What is a crawler?

A crawler is a job defined in Amazon Glue. It crawls databases and S3 buckets and then creates tables in Amazon Glue, together with their schemas.

Then, you can perform your data operations in Glue, like ETL.

Sample data

We need some sample data. Because we want to show how to join data in Glue, we need to have two data sets that have a common element.

The data we use is from IMDB. We have selected a small subset (24 records) of that data and put it into JSON format. (Specifically, they have been formatted to load into DynamoDB, which we will do later.)

One file has the description of a movie or TV series. The other has ratings on that series or movie. Since the data is in two files, it is necessary to join that data in order to get ratings by title. Glue can do that.

Download these two JSON data files:

  • Download title data here.
  • Download ratings data here.
wget https://raw.githubusercontent.com/werowe/dynamodb/master/100.basics.json
wget https://raw.githubusercontent.com/werowe/dynamodb/master/100.ratings.tsv.json

Upload the data to Amazon S3

Create these buckets in S3 using the Amazon AWS command line client. (Don’t forget to run aws configure to store your access key and secret on your computer so you can access Amazon AWS.)

Below we create the titles and ratings folders inside the movieswalker bucket. The reason for this is that Glue will create a separate table schema if we put the data in separate folders.

(Your top-level bucket name must be unique across all of Amazon. That’s an Amazon requirement, since you refer to the bucket by URL. No two customers can have the same URL.)

aws s3 mb s3://movieswalker
aws s3 mb s3://movieswalker/titles
aws s3 mb s3://movieswalker/ratings

Then copy the title basics and ratings file to their respective buckets.

 
aws s3 cp 100.basics.json s3://movieswalker/titles
aws s3 cp 100.ratings.tsv.json s3://movieswalker/ratings

Configure the crawler in Glue

Log into the Glue console for your AWS region. (Mine is European West.)

Then go to the crawler screen and add a crawler:

Next, pick a data store. A better name would be data source, since we are pulling data from there and storing its schema in Glue.

Then pick the top-level movieswalker folder we created above.

Notice that the data store can be S3, DynamoDB, or JDBC.

Then start the crawler. When it’s done you can look at the logs.

If you get this error it’s an S3 policy error. You can make the tables public just for purposes of this tutorial if you don’t want to dig into IAM policies. In this case, I got this error because I uploaded the files as the Amazon root user while I tried to access it using a user created with IAM.

ERROR : Error Access Denied (Service: Amazon S3; Status Code: 403; Error Code: AccessDenied; Request ID: 16BA170244C85551; S3 Extended Request ID: y/JBUpMqsdtf/vnugyFZp8k/DK2cr2hldoXP2JY19NkD39xiTEFp/R8M+UkdO5X1SjrYXuJOnXA=) retrieving file at s3://movieswalker/100.basics.json. Tables created did not infer schemas from this file.

View the crawler log. Here you can see each step of the process.

View tables created in Glue

Here are the tables created in Glue.

If you click on them you can see the schema.

It has these properties. The item of interest here is that the table definition uses Hive-style storage properties (input and output formats and a SerDe), because the Glue Data Catalog is compatible with the Hive metastore; the data itself stays in S3.

{
	"StorageDescriptor": {
		"cols": {
			"FieldSchema": [
				{
					"name": "title",
					"type": "array<struct<PutRequest:struct<Item:struct<tconst:struct,titleType:struct,primaryTitle:struct,originalTitle:struct,isAdult:struct,startYear:struct,endYear:struct,runtimeMinutes:struct,genres:struct>>>>",
					"comment": ""
				}
			]
		},
		"location": "s3://movieswalker/100.basics.json",
		"inputFormat": "org.apache.hadoop.mapred.TextInputFormat",
		"outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
		"compressed": "false",
		"numBuckets": "-1",
		"SerDeInfo": {
			"name": "",
			"serializationLib": "org.openx.data.jsonserde.JsonSerDe",
			"parameters": {
				"paths": "title"
			}
		},
		"bucketCols": [],
		"sortCols": [],
		"parameters": {
			"sizeKey": "7120",
			"UPDATED_BY_CRAWLER": "S3 Movies",
			"CrawlerSchemaSerializerVersion": "1.0",
			"recordCount": "1",
			"averageRecordSize": "7120",
			"CrawlerSchemaDeserializerVersion": "1.0",
			"compressionType": "none",
			"classification": "json",
			"typeOfData": "file"
		},
		"SkewedInfo": {},
		"storedAsSubDirectories": "false"
	},
	"parameters": {
		"sizeKey": "7120",
		"UPDATED_BY_CRAWLER": "S3 Movies",
		"CrawlerSchemaSerializerVersion": "1.0",
		"recordCount": "1",
		"averageRecordSize": "7120",
		"CrawlerSchemaDeserializerVersion": "1.0",
		"compressionType": "none",
		"classification": "json",
		"typeOfData": "file"
	}
}
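
To pull this table definition programmatically rather than reading it in the console, a boto3 call along these lines works; the database and table names are assumptions based on this tutorial.

import boto3

glue = boto3.client("glue")

response = glue.get_table(DatabaseName="moviesandratings", Name="titles")
sd = response["Table"]["StorageDescriptor"]
print(sd["Location"])                              # the S3 path the crawler registered
print([col["Name"] for col in sd["Columns"]])      # the column names it inferred

The response contains the same storage descriptor and parameters shown above.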
