spark-user mailing list archives

From Steve Loughran <ste...@hortonworks.com>
Subject Re: SPARK Issue in Standalone cluster
Date Fri, 04 Aug 2017 10:50:55 GMT

> On 3 Aug 2017, at 19:59, Marco Mistroni <mmistroni@gmail.com> wrote:
> 
> Hello
>  my 2 cents here, hope it helps
> If you just want to play around with Spark, I'd leave Hadoop out; it's an unnecessary
> dependency that you don't need for just running a Python script.
> Instead do the following:
> - go to the root of your master / slave node and create a directory /root/pyscripts
> - place your csv file there as well as the python script
> - run the script to replicate the whole directory across the cluster (I believe it's
> called copy-script.sh)
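> - then run your spark-submit; it will be something like the following (note that --master
> has to come before the script, otherwise it is passed to the script as an argument)
>     ./spark-submit --master spark://ip-172-31-44-155.us-west-2.compute.internal:7077 \
>       /root/pyscripts/mysparkscripts.py file:///root/pyscripts/tree_addhealth.csv 10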
> - in your Python script, as part of your processing, write the parquet file to the directory
> /root/pyscripts
> 

That's going to hit the commit problem discussed: only the Spark driver executes the final
commit process, so the output written on the other servers' local disks doesn't get picked
up and promoted. You need a shared store (NFS is the easy one).
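
To make that concrete, here is a toy model of a rename-based job commit. It is a
sketch, not Spark's actual committer code: the names Host, run_task, commit_job and
run_job are made up for illustration, each filesystem is just a dict, and the output
directory /root/pyscripts/out is an assumed path. It only shows why a driver-side
commit cannot promote files it cannot see.

```python
# Toy model of a rename-based job commit. Not Spark's code; every name here
# (Host, run_task, commit_job, run_job) is hypothetical.

class Host:
    """A machine with its own filesystem, modelled as a dict of path -> data."""

    def __init__(self, name, fs=None):
        self.name = name
        self.fs = fs if fs is not None else {}


def run_task(host, task_id, data, out_dir):
    # Each task writes under a temporary directory on whatever filesystem
    # its own host sees at out_dir.
    host.fs[f"{out_dir}/_temporary/task_{task_id}/part-{task_id:05d}"] = data


def commit_job(driver, out_dir):
    # Only the driver runs this step: it promotes every temporary file that
    # it can see into the final directory, then writes a success marker.
    prefix = f"{out_dir}/_temporary/"
    for path in sorted(p for p in driver.fs if p.startswith(prefix)):
        final = f"{out_dir}/{path.rsplit('/', 1)[1]}"
        driver.fs[final] = driver.fs.pop(path)
    driver.fs[f"{out_dir}/_SUCCESS"] = b""
    return sorted(p for p in driver.fs if p.startswith(out_dir + "/part-"))


def run_job(hosts, out_dir="/root/pyscripts/out"):
    # hosts[0] plays the driver and also runs task 0.
    for task_id, host in enumerate(hosts):
        run_task(host, task_id, b"rows", out_dir)
    return commit_job(hosts[0], out_dir)


# Separate local disks: the driver only sees, and only commits, its own file.
local = run_job([Host("driver"), Host("worker1"), Host("worker2")])

# One shared store (for example an NFS mount): all hosts write into the same dict.
shared_fs = {}
shared = run_job([Host(n, shared_fs) for n in ("driver", "worker1", "worker2")])

print(len(local), "file(s) committed with local disks")
print(len(shared), "file(s) committed with a shared store")
```

With separate local disks the job still gets its success marker, but two of the three
part files stay stranded in the temporary directories on the workers.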


> If you have an AWS account and you are well versed in it (you need to set up bucket
> permissions etc.), you can just
> - store your file in one of your S3 buckets
> - create an EMR cluster
> - connect to master or slave
> - run your script that reads from the S3 bucket and writes to the same S3 bucket


Aah, and now we are into the problem of implementing a safe commit protocol for an inconsistent
filesystem....

My current stance there is that, out of the box, S3 isn't safe to use as the direct output of
work; Azure is. S3 mostly works for a small experiment, but I wouldn't use it in production.
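
As a rough illustration of why an inconsistent store and a commit-by-rename don't mix,
here is another toy model. It assumes the inconsistency in question is a listing that
can lag behind writes; LaggyStore and commit_by_listing are hypothetical names, not
part of any S3 client or of Spark, and the keys under out/ are made up.

```python
# Toy model only: LaggyStore is a hypothetical class, not an S3 client.

class LaggyStore:
    """Object store whose listings may omit recently written keys."""

    def __init__(self):
        self.objects = {}     # what has actually been written
        self.visible = set()  # what a listing currently reports

    def put(self, key, data):
        # The write succeeds immediately, but the key stays out of
        # listings until settle() is called.
        self.objects[key] = data

    def settle(self):
        # Listings catch up with everything written so far.
        self.visible = set(self.objects)

    def list(self, prefix):
        return sorted(k for k in self.visible if k.startswith(prefix))

    def rename(self, src, dst):
        # Modelled as copy-then-delete, which is not atomic.
        self.objects[dst] = self.objects.pop(src)
        self.visible.discard(src)


def commit_by_listing(store, tmp_prefix, final_prefix):
    # Promote whatever the listing reports; unlisted files are silently skipped.
    promoted = []
    for src in store.list(tmp_prefix):
        dst = final_prefix + src[len(tmp_prefix):]
        store.rename(src, dst)
        promoted.append(dst)
    return promoted


def run(settle_before_commit):
    store = LaggyStore()
    store.put("out/_temporary/part-00000", b"rows")
    store.settle()
    store.put("out/_temporary/part-00001", b"rows")  # not yet listed
    if settle_before_commit:
        store.settle()
    return commit_by_listing(store, "out/_temporary/", "out/")


lagging = run(settle_before_commit=False)
consistent = run(settle_before_commit=True)
print(len(lagging), "file(s) promoted with a lagging listing")
print(len(consistent), "file(s) promoted with a consistent listing")
```

In the lagging run both files were written successfully, yet only one is promoted and
nothing reports an error, which is the kind of failure that tends to go unnoticed in a
small experiment.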

Simplest: work on one machine; if you go to 2-3 machines for exploratory work, use NFS.



