spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From ÐΞ€ρ@Ҝ (๏̯͡๏) <deepuj...@gmail.com>
Subject Re: Unable to specify multiple directories as input
Date Fri, 26 Jun 2015 17:18:09 GMT
So for each directory you create one RDD and then union them all.

On Fri, Jun 26, 2015 at 10:05 AM, Bahubali Jain <bahubali@gmail.com> wrote:

> oh..my use case is not very straight forward.
> The input can have multiple directories...
>
> On Fri, Jun 26, 2015 at 9:30 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
> wrote:
>
>> Yes, only working solution was to use union.
>>
>>     val inputRecordsDay1 = getInputRecords(startDate, endDate, inputDir,
>> sc, inputType)
>>
>>     val inputRecordsDay2 =
>> getInputRecords(DateUtil.addDaysToDate(startDate, 1),
>> DateUtil.addDaysToDate(endDate, 1), inputDir, sc, inputType)
>>
>>     val inputRecords = inputRecordsDay1.union(inputRecordsDay2)
>>
>>
>> On Fri, Jun 26, 2015 at 7:52 AM, Bahubali Jain <bahubali@gmail.com>
>> wrote:
>>
>>> Hi Deepak,
>>> were you able to find a solution to this?
>>>
>>> Thanks,
>>> Baahu
>>>
>>> On Wed, Apr 8, 2015 at 9:50 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>> wrote:
>>>
>>>> Spark Version 1.3
>>>> Command:
>>>>
>>>> ./bin/spark-submit -v --master yarn-cluster --driver-class-path
>>>> /apache/hadoop/share/hadoop/common/hadoop-common-2.4.1-company-2.jar:/apache/hadoop/lib/hadoop-lzo-0.6.0.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/yarn/lib/guava-11.0.2.jar:/apache/hadoop-2.4.1-2.1.3.0-2-company/share/hadoop/hdfs/hadoop-hdfs-2.4.1-company-2.jar
>>>> --num-executors 100 --driver-memory 4g --driver-java-options
>>>> "-XX:MaxPermSize=4G" --executor-memory 8g --executor-cores 1 --queue
>>>> hdmi-express --class com. company.ep.poc.spark.reporting.SparkApp
>>>> /home/dvasthimal/spark1.3/spark_reporting-1.0-SNAPSHOT.jar
>>>> startDate=2015-04-6 endDate=2015-04-7
>>>> input=/user/dvasthimal/epdatasets_small/exptsession subcommand=viewItem
>>>> output=/user/dvasthimal/epdatasets/viewItem
>>>>
>>>> On Wed, Apr 8, 2015 at 9:49 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) <deepujain@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello,
>>>>> I have two HDFS directories each containing multiple avro files. I
>>>>> want to specify these two directories as input. In Hadoop world, one
can
>>>>> specify list of comma separated directories. In Spark that does not work.
>>>>>
>>>>>
>>>>> Logs
>>>>> ====
>>>>>
>>>>> 15/04/07 21:10:11 INFO storage.BlockManagerMaster: Updated info of
>>>>> block broadcast_2_piece0
>>>>>
>>>>> 15/04/07 21:10:11 INFO spark.SparkContext: Created broadcast 2 from
>>>>> sequenceFile at DataUtil.scala:120
>>>>>
>>>>> 15/04/07 21:10:11 ERROR yarn.ApplicationMaster: User class threw
>>>>> exception: Input path does not exist:
>>>>> hdfs://namenode_host_name:8020/user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>>
>>>>> org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input
>>>>> path does not exist:
>>>>> hdfs://namenode_host_name:8020/user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>>
>>>>> at
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:320)
>>>>>
>>>>>  at
>>>>> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:263)
>>>>>
>>>>> ====
>>>>>
>>>>>
>>>>> Input Code:
>>>>>
>>>>> sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
>>>>> AvroKeyInputFormat[GenericRecord]](path)
>>>>>
>>>>> Here path is:
>>>>>  /user/dvasthimal/epdatasets_small/exptsession/2015/04/06,/user/dvasthimal/epdatasets_small/exptsession/2015/04/07
>>>>>
>>>>>
>>>>> --
>>>>> Deepak
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Deepak
>>>>
>>>>
>>>
>>>
>>> --
>>> Twitter:http://twitter.com/Baahu
>>>
>>>
>>
>>
>> --
>> Deepak
>>
>>
>
>
> --
> Twitter:http://twitter.com/Baahu
>
>


-- 
Deepak

Mime
View raw message