spark-user mailing list archives

From gy8 <garry.yanggu...@gmail.com>
Subject sqlCtx.load a single big csv file from s3 in parallel
Date Thu, 04 Jun 2015 22:54:29 GMT
Hi there!

I'm trying to read a large .csv file (14 GB) into a dataframe from S3 via the
spark-csv package. I want to load this data in parallel, utilizing all 20
executors that I have; however, by default only 3 executors are used
(they downloaded 5 GB / 5 GB / 4 GB respectively).

Here is my script (I'm using PySpark):

lol_file = sqlCtx.load(source="com.databricks.spark.csv",
                       header="false",
                       path=lol_file_path)

I have tried adding the option flags 1) minSplits=120 and 2) minPartitions=120,
but neither worked. I tried reading the source code, but I'm a noob at Scala
and could not figure out how the options are used :(
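From what I can tell, plain-text inputs eventually go through Hadoop's old FileInputFormat, which picks a split size from the total size, the requested number of splits, and the block size. Here is my rough sketch of that computation (the 64 MB block size is just a hypothetical; I don't know what S3 actually reports for my file):

```python
def compute_split_size(total_size, num_splits, block_size, min_size=1):
    # Mirrors FileInputFormat.computeSplitSize:
    #   splitSize = max(minSize, min(goalSize, blockSize))
    # where goalSize is the total size divided by the requested splits.
    goal_size = total_size // max(num_splits, 1)
    return max(min_size, min(goal_size, block_size))

GB, MB = 1 << 30, 1 << 20

# 14 GB file, 120 requested splits, hypothetical 64 MB block size:
split = compute_split_size(14 * GB, 120, 64 * MB)
num_parts = (14 * GB + split - 1) // split  # ceil(total / split) = 224
```

So if minSplits were honored here, I'd expect a couple hundred partitions rather than 3 — which makes me suspect the reported block size (or the spark-csv load path) is the problem, not the math. This also assumes the file is splittable (plain .csv, not gzip).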

Thank you for reading and any help is much appreciated!

Guang



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/sqlCtx-load-a-single-big-csv-file-from-s3-in-parallel-tp23163.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

