spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Riccardo Ferrari <>
Subject PySpark Streaming S3 checkpointing
Date Wed, 02 Aug 2017 09:34:15 GMT
Hi list!

I am working on a pyspark streaming job (ver 2.2.0) and I need to enable
checkpointing. At high level my python script goes like this:

class StreamingJob():

def __init__(..):

def doJob(self):
   ssc = StreamingContext.getOrCreate('<S3-location>', <function to create

and I run it:

myJob = StreamingJob(...)

The problem is that StreamingContext.getOrCreate is not able to have access
to hadoop configuration configured in the constructor and fails to load
from checkpoint with

"com.amazonaws.AmazonClientException: Unable to load AWS credentials from
any provider in the chain"

If I export AWS credentials to the system ENV before starting the script it

I see the Scala version has an option to provide the hadoop configuration
that is not available in python

I don't have the whole Hadoop, just Spark, so I don't really want to
configure hadoop's xmls and such

What is the cleanest way to achieve my goal?


View raw message