spark-user mailing list archives

From: Andrew Or <and...@databricks.com>
Subject: Re: spark-submit --py-files remote: "Only local additional python files are supported"
Date: Tue, 20 Jan 2015 18:40:39 GMT
Hi Vladimir,

Yes, as the error message suggests, PySpark currently only supports local
files. This does not mean it only runs in local mode, however; you can
still run PySpark on any cluster manager (though only in client mode). All
this means is that your Python files must be on your local file system.
Until remote files are supported, the straightforward workaround is to
copy the files to your local machine first.
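
For example, here is a rough sketch, assuming the dependency is the
s3://pathtomybucket/mylibrary.py from your error message and that the AWS
CLI is available on the machine you submit from (adjust the paths and
master to your setup):

aws s3 cp s3://pathtomybucket/mylibrary.py ./mylibrary.py
aws s3 cp s3://pathtomybucket/tasks/demo/main.py ./main.py
bin/spark-submit --master yarn-client --py-files ./mylibrary.py ./main.py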

-Andrew

2015-01-20 7:38 GMT-08:00 Vladimir Grigor <vladimir@kiosked.com>:

> Hi all!
>
> I ran into this problem when I tried to run a Python application on
> Amazon's EMR YARN cluster.
>
> It is possible to run the bundled example applications on EMR, but I
> cannot figure out how to run a slightly more complex Python application
> that depends on some other Python scripts. I tried adding those files
> with '--py-files'; it works fine in local mode, but it fails with the
> following message when run on EMR:
> "Error: Only local python files are supported:
> s3://pathtomybucket/mylibrary.py".
>
> Simplest way to reproduce locally:
> bin/spark-submit --py-files s3://whatever.path.com/library.py main.py
>
> Actual commands to run it on EMR:
> #launch cluster
> aws emr create-cluster --name SparkCluster --ami-version 3.3.1
> --instance-type m1.medium --instance-count 2 --ec2-attributes
> KeyName=key20141114 --log-uri s3://pathtomybucket/cluster_logs
> --enable-debugging --use-default-roles --bootstrap-action
> Name=Spark,Path=s3://pathtomybucket/bootstrap-actions/spark/install-spark,Args=["-s","http://pathtomybucket/bootstrap-actions/spark","-l","WARN","-v","1.2","-b","2014121700","-x"]
> #{
> #   "ClusterId": "j-2Y58DME79MPQJ"
> #}
>
> #run application
> aws emr add-steps --cluster-id "j-2Y58DME79MPQJ" --steps
> ActionOnFailure=CONTINUE,Name=SparkPy,Jar=s3://eu-west-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--py-files,s3://pathtomybucket/tasks/demo/main.py,main.py]
> #{
> #    "StepIds": [
> #        "s-2UP4PP75YX0KU"
> #    ]
> #}
> And in the stderr of that step I get "Error: Only local python files are
> supported: s3://pathtomybucket/tasks/demo/main.py".
>
> What is the workaround or correct way to do this? Should I use Hadoop's
> distcp to copy the dependency files from S3 to the nodes as a separate
> preliminary step?
>
> Regards, Vladimir
>
