28. September 2023 By Attila Papp
Reading Athena views in Glue ETL
Inside Glue ETL, you have several options for reading data from S3 and the Glue Catalog.
However, when you want to read something through an Athena view, you must use the Athena driver. Thankfully, PySpark is just as capable of reading with a JDBC driver as the inbuilt Glue options.
Including the Athena driver
You must supply the `--extra-jars` parameter to include the Athena driver. Here's an example:
```
"--extra-jars": "s3://athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.1.1.1000/AthenaJDBC42-2.1.1.1000.jar"
```
You can set this from IaC, the CLI, or the console under the job parameters. The S3 URI above points to the official AWS bucket for the Athena JDBC drivers; you can browse it with `aws s3 ls s3://athena-downloads/drivers/JDBC/` to find the latest driver.
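If you manage the job with boto3 (or generate something similar from IaC), the parameter goes into `DefaultArguments`. Here is a minimal sketch; the job name, role ARN, and S3 paths are placeholders you would replace with your own:
```python
import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="read-athena-view",  # placeholder job name
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/read_athena_view.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="4.0",
    DefaultArguments={
        # Pulls in the Athena JDBC driver for the Spark JDBC reads
        "--extra-jars": "s3://athena-downloads/drivers/JDBC/SimbaAthenaJDBC-2.1.1.1000/AthenaJDBC42-2.1.1.1000.jar",
        # Used later in the script as the Athena query result location
        "--temp_dir": "s3://my-bucket/athena-results/",
    },
)
```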
Reading through JDBC
To read Athena views through JDBC, you create a standard PySpark JDBC connection using the Athena driver:
```python
raw_data = (
    glueContext.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "AwsCredentialsProviderClass",
        "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .option("url", "jdbc:awsathena://athena.eu-central-1.amazonaws.com:443")
    .option("dbtable", "AwsDataCatalog.database.table_name")
    .option("S3OutputLocation", temp_dir)
    .load()
)
```
This code reads the `table_name` view from the `database` database in Athena. The `S3OutputLocation` option (here, the `temp_dir` variable) tells Athena where to store the query results. When you specify filters on the DataFrame, they are pushed down into the Athena query. How cool!
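For example, a filter like the one below ends up in the WHERE clause of the query Athena executes, rather than being applied in Spark after all rows have been transferred. This is just a sketch: it assumes the view has a `created_at` column, which is not part of the original example.
```python
from pyspark.sql import functions as F

# This predicate is pushed down into the JDBC (Athena) query,
# so only matching rows are transferred to the Glue job.
recent_rows = raw_data.filter(F.col("created_at") >= "2023-01-01")
recent_rows.show(10)
```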
Clearing temp dir
After reading the Athena view, you should clear the temp dir so stale query results don't pile up in S3. Here's an example:
```python
glueContext.purge_s3_path(temp_dir, {"retentionPeriod": 0})
```
Final code
Finally, let's look at the code all together:
```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job setup
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# temp_dir is passed in as the --temp_dir job parameter and is used
# as the S3 location for Athena's query results
args = getResolvedOptions(sys.argv, ["JOB_NAME", "temp_dir"])
job.init(args["JOB_NAME"], args)
temp_dir = args["temp_dir"]

# Read the Athena view over JDBC
raw_data = (
    glueContext.read.format("jdbc")
    .option("driver", "com.simba.athena.jdbc.Driver")
    .option(
        "AwsCredentialsProviderClass",
        "com.simba.athena.amazonaws.auth.InstanceProfileCredentialsProvider",
    )
    .option("url", "jdbc:awsathena://athena.eu-central-1.amazonaws.com:443")
    .option("dbtable", "AwsDataCatalog.database.table_name")
    .option("S3OutputLocation", temp_dir)
    .load()
)

# ... transformations and writes on raw_data go here ...

# Clean up the Athena query results and finish the job
glueContext.purge_s3_path(temp_dir, {"retentionPeriod": 0})
job.commit()
```
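When starting the job, the `--temp_dir` parameter has to point at an S3 prefix Athena can write to. A quick sketch with boto3; the job name and bucket are placeholders:
```python
import boto3

glue = boto3.client("glue")

# Start the job and pass the S3 prefix Athena should use for query
# results (read by getResolvedOptions in the script above)
glue.start_job_run(
    JobName="read-athena-view",
    Arguments={"--temp_dir": "s3://my-bucket/athena-results/"},
)
```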
In this quick guide, we looked at reading data from Athena views in Glue ETL PySpark jobs using the Athena JDBC driver. It is important to note that reading directly from S3 or the Glue Catalog will always be more performant; however, those options cannot read Athena views. So, in use cases where further processing of Athena views is required, this is a simple option.