28. September 2023 By Attila Papp
Reading Athena views in Glue ETL
Inside Glue ETL, you have several options for reading data from S3 and the Glue Catalog.
However, when you want to read something through an Athena view, you must use the Athena driver. Thankfully, PySpark is just as capable of reading with a JDBC driver as the inbuilt Glue options.
Including the Athena driver
You must supply the `--extra-jars` parameter to include the Athena driver. Here's an example:
```
"--extra-jars": "s3://athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.1.1.1000/AthenaJDBC42-2.1.1.1000.jar"
```
You can set this from IaC, the CLI, or the console under the job parameters. The S3 URI above points to the official AWS bucket for the Athena JDBC drivers; you can browse it with `aws s3 ls s3://athena-downloads/drivers/JDBC/` to find the latest driver.
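If you manage the job with boto3 (or generate something similar from IaC), the parameter goes into `DefaultArguments`. Here is a minimal sketch; the job name, role ARN, and S3 paths are placeholders you would replace with your own:
```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="read-athena-view",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/read_athena_view.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        # Pulls in the Athena JDBC driver for the Spark JDBC reads
        "--extra-jars": "s3://athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.1.1.1000/AthenaJDBC42-2.1.1.1000.jar",
        # Used later in the script as the Athena query result location
        "--temp_dir": "s3://my-bucket/athena-results/",
    },
)
```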
Reading through JDBC
To read Athena views through JDBC, you create a standard PySpark JDBC connection using the Athena driver:
```python
raw_data = (
    glueContext.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "AwsCredentialsProviderClass",
        "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .option("url", "jdbc:awsathena://athena.eu-central-1.amazonaws.com:443")
    .option("dbtable", "AwsDataCatalog.database.table_name")
    .option("S3OutputLocation", temp_dir)
    .load()
)
```
This code reads the `table_name` view from the `database` database in Athena. The `S3OutputLocation` option (here, the `temp_dir` variable) tells Athena where to store the query results. When you specify filters on the DataFrame, they are pushed down into the Athena query. How cool!
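For example, a filter like the one below ends up in the WHERE clause of the query Athena executes, rather than being applied in Spark after all rows have been transferred. This is just a sketch: it assumes the view has a `created_at` column, which is not part of the original example.
```python
from pyspark.sql import functions as F

# This predicate is pushed down into the JDBC (Athena) query,
# so only matching rows are transferred to the Glue job.
recent_rows = raw_data.filter(F.col("created_at") >= "2023-01-01")
recent_rows.show(10)
```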
Clearing temp dir
After reading the Athena view, you should clear the temp dir so stale query results don't pile up in S3. Here's an example:
```python
glueContext.purge_s3_path(temp_dir, {"retentionPeriod": 0})
```
Final code
Finally, let's look at the code all together:
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# temp_dir is passed in as the --temp_dir job parameter and is used
# as the S3 location for Athena's query results
args = getResolvedOptions(sys.argv, ["JOB_NAME", "temp_dir"])
job.init(args["JOB_NAME"], args)
temp_dir = args["temp_dir"]

# Read the Athena view over JDBC
raw_data = (
    glueContext.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "AwsCredentialsProviderClass",
        "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .option("url", "jdbc:awsathena://athena.eu-central-1.amazonaws.com:443")
    .option("dbtable", "AwsDataCatalog.database.table_name")
    .option("S3OutputLocation", temp_dir)
    .load()
)

# ... transformations and writes on raw_data go here ...

# Clean up the Athena query results and finish the job
glueContext.purge_s3_path(temp_dir, {"retentionPeriod": 0})
job.commit()
```
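When starting the job, the `--temp_dir` parameter has to point at an S3 prefix Athena can write to. A quick sketch with boto3; the job name and bucket are placeholders:
```python
import boto3

glue = boto3.client("glue")

# Start the job and pass the S3 prefix Athena should use for query
# results (read by getResolvedOptions in the script above)
glue.start_job_run(
    JobName="read-athena-view",
    Arguments={"--temp_dir": "s3://my-bucket/athena-results/"},
)
```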
In this quick guide, we looked at reading data from Athena views in Glue ETL PySpark jobs using the Athena JDBC driver. It is important to note that reading directly from S3 or the Glue Catalog will always be more performant; however, those options cannot read Athena views. So, in use cases where further processing of Athena views is required, this is a simple option.