
How To Define and Run a Job in AWS Glue

Walker Rowe
3 minute read

Here we show how to run a simple job in AWS Glue.

The basic procedure, which we’ll walk you through, is to:

  • Create a Python script file (or PySpark)
  • Copy it to Amazon S3
  • Give the AWS Glue IAM role access to that S3 bucket
  • Run the job in AWS Glue
  • Inspect the logs in Amazon CloudWatch

Create Python script

First we create a simple Python script:

arr = [1, 2, 3, 4, 5]

for n in arr:
    print(n)
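
The steps above mention PySpark as an alternative. If you create a Spark-based Glue job instead of a plain Python script, the script normally starts with Glue's standard boilerplate before your own transformations. Here is a minimal sketch of that boilerplate:

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Glue passes the job name in as a run-time argument
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# ... your Spark transformations go here ...

job.commit()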

Copy to S3

Then use the AWS CLI to create an S3 bucket and copy the script to the jobs folder in that bucket.

aws s3 mb s3://movieswalker/jobs
aws s3 cp counter.py s3://movieswalker/jobs

Configure and run job in AWS Glue

Log into the AWS Glue console. Go to the Jobs tab and add a job. Give it a name and then pick an IAM role for Glue. The role AWSGlueServiceRole-S3IAMRole should already be in the list. If it is not, create it in IAM; see the instructions at the end of this article on giving the role access to the script bucket.

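You can also define and start the same job from code instead of the console. The sketch below uses boto3; the job name counter is just an example, and it assumes the script sits at the S3 path used earlier:

import boto3

glue = boto3.client('glue')

# Define a Python shell job that runs the script we uploaded to S3
glue.create_job(
    Name='counter',  # example job name
    Role='AWSGlueServiceRole-S3IAMRole',
    Command={
        'Name': 'pythonshell',
        'ScriptLocation': 's3://movieswalker/jobs/counter.py',
        'PythonVersion': '3'
    }
)

# Start a run and keep the run id so we can check its status later
run = glue.start_job_run(JobName='counter')
print(run['JobRunId'])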

The script editor in AWS Glue lets you change the Python code.


The job configuration screen also lets you pass run-time parameters to the job.

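Inside the script, run-time parameters can be read with Glue's getResolvedOptions helper. A minimal sketch, using a hypothetical parameter named --source_bucket:

import sys
from awsglue.utils import getResolvedOptions

# Glue passes job parameters on the command line; getResolvedOptions
# parses them into a dictionary keyed by parameter name.
args = getResolvedOptions(sys.argv, ['source_bucket'])
print(args['source_bucket'])

On the job parameters screen the key is written with the leading dashes (--source_bucket); if you start the job from code, pass the same key in the Arguments parameter of start_job_run.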

Run the job. If there is any error, you are directed to Amazon CloudWatch, where you can see it. A typical failure is an S3 permissions error, like the 403 Forbidden traceback shown at the end of this article.


The console also shows the job run history, with the status of each run.


Click the Logs link to see the log showing that the Python code ran successfully. In this simple example, it just prints the numbers 1, 2, 3, 4, 5.

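You can also pull the same output from CloudWatch Logs with boto3 rather than clicking through the console. This sketch assumes the default log group that Glue Python shell jobs write to (/aws-glue/python-jobs/output); Spark jobs use a different group:

import boto3

logs = boto3.client('logs')

# Fetch recent events from the Glue job output log group
resp = logs.filter_log_events(
    logGroupName='/aws-glue/python-jobs/output',
    limit=50
)
for event in resp['events']:
    print(event['message'])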

Give the Glue role access to the S3 bucket

If you have run any of our other tutorials, like running a crawler or joining tables, then you might already have the AWSGlueServiceRole-S3IAMRole. What’s important for running a Glue job is that the role has access to the S3 bucket where the Python script is stored.

In this example, I added that access manually, using the JSON editor on the IAM role screen, and pasted in this policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": "*"
        }
    ]
}
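
The policy above grants s3:* on every resource, which is broader than the job needs. A tighter option is to scope the statement to the script bucket and attach it as an inline policy on the role. A sketch using boto3, where the policy name glue-s3-script-access is arbitrary:

import json
import boto3

iam = boto3.client('iam')

# Allow the role to list the bucket and read objects in it, nothing more
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::movieswalker",
            "arn:aws:s3:::movieswalker/*"
        ]
    }]
}

iam.put_role_policy(
    RoleName='AWSGlueServiceRole-S3IAMRole',
    PolicyName='glue-s3-script-access',  # arbitrary name for the inline policy
    PolicyDocument=json.dumps(policy)
)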

If you don’t do this, or do it incorrectly, you will get this error:

File "/usr/local/lib/python3.6/site-packages/botocore/client.py", line 661, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
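
The 403 in this traceback comes from an S3 HeadObject call, which fails when the script object cannot be read. One quick sanity check is to issue the same call with boto3; note that this tests the credentials of whoever runs it, not necessarily the Glue role:

import boto3

s3 = boto3.client('s3')

# Raises botocore.exceptions.ClientError with a 403 if access is denied
s3.head_object(Bucket='movieswalker', Key='jobs/counter.py')
print("Script is readable")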

At this point the AWSGlueServiceRole-S3IAMRole role has both the AWSGlueServiceRole managed policy and the S3 policy we just added. That is the role you select when you configure the job.






About the author

Walker Rowe

Walker Rowe is an American freelance tech writer and programmer living in Cyprus. He writes tutorials on analytics and big data and specializes in documenting SDKs and APIs. He is the founder of the Hypatia Academy Cyprus, an online school that teaches secondary school children programming.