Hi Steve,

Thank you for your answer, much appreciated.

From reading the code (I have to admit I have not tested/debugged it, so this might not be completely accurate; checkpointing is not my highest priority), my impression is that I cannot provide those properties via code: on recovery a new configuration gets instantiated and read from the system properties, so whatever I set on the currently running context is ignored (or at least that is what happens in Python).

What do you (or any in the list) think?


On Wed, Aug 2, 2017 at 6:04 PM, Steve Loughran <stevel@hortonworks.com> wrote:

On 2 Aug 2017, at 10:34, Riccardo Ferrari <ferrarir@gmail.com> wrote:

Hi list!

I am working on a PySpark streaming job (Spark 2.2.0) and I need to enable checkpointing. At a high level my Python script goes like this:

class StreamingJob():

    def __init__(self, ..):
        ...

    def doJob(self):
        ssc = StreamingContext.getOrCreate('<S3-location>', <function to create ssc>)

and I run it:

myJob = StreamingJob(...)

The problem is that StreamingContext.getOrCreate does not have access to the Hadoop configuration set up in the constructor, and fails to load from the checkpoint with:

"com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain"

If I export AWS credentials to the system ENV before starting the script it works!
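Concretely, what works for me looks like this (placeholder values, not real keys; the script name is hypothetical):

```shell
# Export the AWS credentials before launching the job; the s3a connector's
# default credential provider chain picks these variables up.
export AWS_ACCESS_KEY_ID="<access-key>"
export AWS_SECRET_ACCESS_KEY="<secret-key>"

# then start the script as usual, e.g.:
# spark-submit streaming_job.py
```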

Spark magically copies the env vars over for you when you launch a job

I see the Scala version has an option to provide the Hadoop configuration, but it is not available in Python.

I don't have a full Hadoop installation, just Spark, so I'd rather not configure Hadoop's XML files and such.

Set them when you set up the context, as in spark-defaults.conf:

spark.hadoop.fs.s3a.access.key=<access key>
spark.hadoop.fs.s3a.secret.key=<secret key>
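Equivalently, the same spark.hadoop.* options can be passed at launch time without editing spark-defaults.conf (hypothetical script name, placeholder keys):

```shell
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key="<access key>" \
  --conf spark.hadoop.fs.s3a.secret.key="<secret key>" \
  streaming_job.py
```

Every spark.hadoop.-prefixed option is copied into the Hadoop Configuration Spark creates, which is how the s3a connector sees them.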

Reminder: do keep your secret key a secret, and avoid checking it into any form of revision control.