spark-user mailing list archives

From Felix Cheung <felixcheun...@hotmail.com>
Subject Re: Spark on K8s - using files fetched by init-container?
Date Tue, 27 Feb 2018 09:24:36 GMT
Yes you were pointing to HDFS on a loopback address...

________________________________
From: Jenna Hoole <jenna.hoole@gmail.com>
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?

Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li <liyinan926@gmail.com> wrote:
OK, it looks like you will need to use `file:///var/spark-data/spark-files/flights.csv` instead.
The `file://` scheme must be given explicitly, since paths without a scheme appear to resolve against the default `hdfs` filesystem in your setup.
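Putting that together with Jenna's original command below, the full submission would pass the localized path with an explicit scheme as the application argument. This is a sketch: the API server address, deploy mode, and any image/auth flags are placeholders for whatever this deployment actually uses; only the paths come from the thread.

```shell
# Stage flights.csv via --files; the init-container localizes it to
# /var/spark-data/spark-files/ inside the pod (the Spark 2.3 default).
# The application argument uses an explicit file:// scheme so the path is
# not resolved against the default hdfs:// filesystem.
spark-submit \
  --master k8s://https://<api-server>:6443 \
  --deploy-mode cluster \
  --files hdfs://192.168.0.1:8020/user/jhoole/flights.csv \
  local:///opt/spark/examples/src/main/r/data-manipulation.R \
  file:///var/spark-data/spark-files/flights.csv
```

Note the distinction between the two schemes: `local://` points at a file already baked into the container image (the example script), while `file://` points at a file on the container's local filesystem at runtime (the localized CSV).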

On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole <jenna.hoole@gmail.com> wrote:
Thank you for the quick response! However, I'm still having problems.

When I try to look for /var/spark-data/spark-files/flights.csv I get told:

Error: Error in loadDF : analysis error - Path does not exist: hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
    at org.apache.spark.deploy.RRunner.main(RRunner.scala)

And when I try to look for local:///var/spark-data/spark-files/flights.csv, I get:

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'local:///var/spark-data/spark-files/flights.csv': No such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
    at org.apache.spark.deploy.RRunner.main(RRunner.scala)

I can see from a kubectl describe that the directory is getting mounted.

    Mounts:

      /etc/hadoop/conf from hadoop-properties (rw)

      /var/run/secrets/kubernetes.io/serviceaccount from spark-token-pxz79 (ro)

      /var/spark-data/spark-files from download-files (rw)

      /var/spark-data/spark-jars from download-jars-volume (rw)

      /var/spark/tmp from spark-local-dir-0-tmp (rw)

Is there something else I need to be doing in my set up?

Thanks,
Jenna

On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li <liyinan926@gmail.com> wrote:
The files specified through --files are localized by the init-container to /var/spark-data/spark-files
by default. So in your case, the file should be located at /var/spark-data/spark-files/flights.csv
locally in the container.
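For reference, that default download directory is configurable. A sketch, assuming the `spark.kubernetes.mountDependencies.*` properties from the Spark 2.3 Kubernetes documentation (verify the property names against your Spark version before relying on them):

```shell
# Where the init-container places --files and --jars is controlled by two
# properties; the values shown are the defaults, matching the paths above.
spark-submit \
  --conf spark.kubernetes.mountDependencies.filesDownloadDir=/var/spark-data/spark-files \
  --conf spark.kubernetes.mountDependencies.jarsDownloadDir=/var/spark-data/spark-jars \
  ...
```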

On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole <jenna.hoole@gmail.com> wrote:
This is probably stupid user error, but I can't for the life of me figure out how to access
the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R which requires the path to its datafile.
I supply the hdfs location via --files and then the full hdfs path.


--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R hdfs://192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file hdfs://192.168.0.1:8020/user/jhoole/flights.csv at hdfs://192.168.0.1:8020/user/jhoole/flights.csv with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching hdfs://192.168.0.1:8020/user/jhoole/flights.csv to /var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'hdfs://192.168.0.1:8020/user/jhoole/flights.csv': No such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
    at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error:

--files hdfs://192.168.0.1:8020/user/jhoole/flights.csv local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv


Error: Error in loadDF : analysis error - Path does not exist: hdfs://192.168.0.1:8020/user/root/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
    at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)
    at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If the path /user/root/flights.csv does exist and I supply only "flights.csv" as the file
path, it runs to completion successfully. However, if I provide the file path as "hdfs://192.168.0.1:8020/user/root/flights.csv",
I get the same "No such file or directory" error as I do initially.

Since I obviously can't put all my hdfs files under /user/root, how do I get it to use the
file that the init-container is fetching?

Thanks,
Jenna




