spark-user mailing list archives

From Sean Owen <so...@cloudera.com>
Subject Re: Why are there different "parts" in my CSV?
Date Sat, 14 Feb 2015 21:08:42 GMT
No, they appear as directories + files to everything. Lots of tools
are used to taking an input that is a directory of part files though.
You can certainly point MR, Hive, etc at a directory of these files.
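
To make the "directory of part files" convention concrete, here is a minimal sketch of how a consumer can treat such a directory as one logical text file. It uses only the JDK and assumes the output directory has been copied to the local filesystem; the class and method names (PartFileReader, listPartFiles, readAllLines) are hypothetical and are not part of Hadoop or Spark.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hadoop or Spark: reads a local output
// directory of part files as if it were a single text file.
final class PartFileReader {

    // Returns the part files in name order, skipping marker files such as
    // _SUCCESS and hidden checksum files.
    static List<Path> listPartFiles(Path outputDir) throws IOException {
        try (Stream<Path> entries = Files.list(outputDir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // Concatenates the lines of every part file, in part order.
    static List<String> readAllLines(Path outputDir) throws IOException {
        List<String> lines = new ArrayList<>();
        for (Path part : listPartFiles(outputDir)) {
            lines.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java PartFileReader /local/copy/of/foo-10000001.csv
        for (String line : readAllLines(Paths.get(args[0]))) {
            System.out.println(line);
        }
    }
}
```

Tools that accept a directory as input do roughly this on the cluster's filesystem, which is why the part files rarely need to be merged just to be read.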

On Sat, Feb 14, 2015 at 9:05 PM, Su She <suhshekar52@gmail.com> wrote:
> Thanks Sean and Akhil! I will take out the repartition(1). Please let me
> know if I understood this correctly. Spark Streaming writes data like this:
>
> foo-10000001.csv/part-xxxxx, part-xxxxx
> foo-10000002.csv/part-xxxxx, part-xxxxx
>
> When I see this in Hue, the CSVs appear to me as directories, but if I
> understand correctly, they will appear as CSV files to other Hadoop
> ecosystem tools? And, if I understand Tathagata's answer correctly, other
> Hadoop-based tools, such as Hive, will be able to create a table based
> on the multiple foo-100000x.csv "directories"?
>
> Thank you, I really appreciate the help!
>
> On Sat, Feb 14, 2015 at 3:20 AM, Sean Owen <sowen@cloudera.com> wrote:
>>
>> Keep in mind that if you repartition to 1 partition, you are only
>> using 1 task to write the output, and potentially only 1 task to
>> compute some parent RDDs. You lose parallelism.  The
>> files-in-a-directory output scheme is standard for Hadoop and for a
>> reason.
>>
>> Therefore I would consider separating this concern and merging the
>> files afterwards if you need to.
>>
>> On Sat, Feb 14, 2015 at 8:39 AM, Akhil Das <akhil@sigmoidanalytics.com>
>> wrote:
>> > Simplest way would be to merge the output files at the end of your job
>> > like:
>> >
>> > hadoop fs -getmerge /output/dir/on/hdfs/ /desired/local/output/file.txt
>> >
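>> > If you want to do it programmatically, you can use the
>> > FileUtil.copyMerge API, like:
>> >
>> > FileUtil.copyMerge(
>> >     srcFs,                        // FileSystem of the source (HDFS)
>> >     new Path("/output-location"), // directory holding the part files
>> >     dstFs,                        // FileSystem of the destination (HDFS)
>> >     new Path("/merged-output"),   // path of the merged file
>> >     true,                         // delete the original directory
>> >     conf,                         // Hadoop Configuration
>> >     null);                        // no string appended between files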
>> >
>> >
>> >
>> > Thanks
>> > Best Regards
>> >
>> > On Sat, Feb 14, 2015 at 2:18 AM, Su She <suhshekar52@gmail.com> wrote:
>> >>
>> >> Thanks Akhil for the suggestion, it is now only giving me one
>> >> part-xxxx. Is there any way I can just create a file rather than a
>> >> directory? It doesn't seem like there is a plain saveAsTextFile option
>> >> for JavaPairReceiverInputDStream.
>> >>
>> >> Also, for the copy/merge API, how would I add that to my Spark job?
>> >>
>> >> Thanks Akhil!
>> >>
>> >> Best,
>> >>
>> >> Su
>> >>
>> >> On Thu, Feb 12, 2015 at 11:51 PM, Akhil Das
>> >> <akhil@sigmoidanalytics.com>
>> >> wrote:
>> >>>
>> >>> For a streaming application, each batch creates a new directory and
>> >>> puts its data there. If you don't want multiple part-xxxx files
>> >>> inside the directory, you can repartition before the saveAs* call.
>> >>>
>> >>>
>> >>>
>> >>> messages.repartition(1).saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> >>> String.class, (Class) TextOutputFormat.class);
>> >>>
>> >>>
>> >>> Thanks
>> >>> Best Regards
>> >>>
>> >>> On Fri, Feb 13, 2015 at 11:59 AM, Su She <suhshekar52@gmail.com>
>> >>> wrote:
>> >>>>
>> >>>> Hello Everyone,
>> >>>>
>> >>>> I am writing simple word counts to hdfs using
>> >>>>
>> >>>> messages.saveAsHadoopFiles("hdfs://user/ec2-user/","csv",String.class,
>> >>>> String.class, (Class) TextOutputFormat.class);
>> >>>>
>> >>>> 1) However, every 2 seconds I get a new directory that is named like
>> >>>> a CSV file. So I'll have test.csv, which will be a directory that has
>> >>>> two files inside of it called part-00000 and part-00001 (something
>> >>>> like that). This obviously makes it very hard for me to read the data
>> >>>> stored in the CSV files. I am wondering if there is a better way to
>> >>>> store the JavaPairReceiverInputDStream and JavaPairDStream?
>> >>>>
>> >>>> 2) I know there is a copy/merge Hadoop API for merging files. Can
>> >>>> this be done inside Java? I am not sure of the logic behind this API
>> >>>> if I am using Spark Streaming, which is constantly making new files.
>> >>>>
>> >>>> Thanks a lot for the help!
>> >>>
>> >>>
>> >>
>> >
>
>
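
The merge step discussed in the quoted messages (hadoop fs -getmerge, FileUtil.copyMerge) boils down to concatenating the part files in order into one file. The following is a rough local-filesystem sketch of that idea using only the JDK. It is not the Hadoop implementation; the class name LocalPartMerger and its merge method are hypothetical, and it assumes the merged file is written outside the source directory.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical local analogue of a part-file merge; not a Hadoop API.
final class LocalPartMerger {

    // Appends every part-* file in outputDir, in name order, to mergedFile.
    // Returns the number of part files that were merged.
    static int merge(Path outputDir, Path mergedFile) throws IOException {
        List<Path> parts;
        try (Stream<Path> entries = Files.list(outputDir)) {
            parts = entries
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().startsWith("part-"))
                .sorted()
                .collect(Collectors.toList());
        }
        // Creates mergedFile, or truncates it if it already exists.
        try (OutputStream out = Files.newOutputStream(mergedFile)) {
            for (Path part : parts) {
                Files.copy(part, out); // raw byte copy, no separator added
            }
        }
        return parts.size();
    }

    public static void main(String[] args) throws IOException {
        // Usage: java LocalPartMerger <output-dir> <merged-file>
        int merged = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("Merged " + merged + " part files");
    }
}
```

Because the merge runs after the job has finished, the job itself can keep writing its part files in parallel, which is the separation of concerns suggested in the thread.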


