spark-user mailing list archives

From Ai He <heai0...@gmail.com>
Subject Re: multiple hdfs folder & files input to PySpark
Date Wed, 06 May 2015 06:06:10 GMT
Hi Oleg,

For 1, RDD#union will help. You can iterate over the folders and union the resulting RDDs as you go.
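
A minimal sketch of that approach (the HDFS folder paths below are hypothetical):

    from pyspark import SparkContext

    sc = SparkContext(appName="TAD")
    # Hypothetical folder list; substitute your own day1/day2/... paths.
    folders = ["hdfs:///data/day1", "hdfs:///data/day2"]
    lines = sc.union([sc.textFile(f) for f in folders])

Note that sc.textFile also accepts a comma-separated list of paths as well as glob patterns (e.g. "hdfs:///data/day*"), since the paths are handed to Hadoop's FileInputFormat, so you may not need the loop at all.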

For 2, the parameter is minPartitions, a hint for the minimum number of partitions to split the input into; it seems the resulting partitioning is not deterministic, according to this discussion: http://stackoverflow.com/questions/24871044/in-spark-what-does-the-parameter-minpartitions-works-in-sparkcontext-textfile
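
A short illustration (the input path is hypothetical); since minPartitions is only a lower bound, the actual partition count can come out higher:

    from pyspark import SparkContext

    sc = SparkContext(appName="TAD")
    # Hypothetical path; 4 is the requested minimum number of partitions.
    rdd = sc.textFile("hdfs:///data/day1", minPartitions=4)
    print(rdd.getNumPartitions())  # often 4, but may be more, depending on input splits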

Thanks
> On May 5, 2015, at 5:59 AM, Oleg Ruchovets <oruchovets@gmail.com> wrote:
> 
> Hi,
>    We are using PySpark 1.3, and the input is text files located on HDFS.
> 
> File structure:
>     <day1>
>                 file1.txt
>                 file2.txt
>     <day2>
>                 file1.txt
>                 file2.txt
>      ...
> 
> Question:
> 
>    1) What is the way to provide multiple files located in multiple folders (on HDFS) as input to a PySpark job?
> Using the textFile method works fine for a single file or folder, but how can I do it with multiple folders?
> Is there a way to pass an array or list of files?
>    
>    2) What is the meaning of the partition parameter in the textFile method?
> 
>   sc = SparkContext(appName="TAD")
>   lines = sc.textFile(<my input>, 1)
> 
> Thanks
> Oleg.



