spark-user mailing list archives

From Mohit Anchlia <mohitanch...@gmail.com>
Subject Re: Too many files/dirs in hdfs
Date Mon, 24 Aug 2015 21:51:57 GMT
Any help would be appreciated

On Wed, Aug 19, 2015 at 9:38 AM, Mohit Anchlia <mohitanchlia@gmail.com>
wrote:

> My question was how to do this in Hadoop. Could somebody point me to some
> examples?
>
> On Tue, Aug 18, 2015 at 10:43 PM, UMESH CHAUDHARY <umesh9794@gmail.com>
> wrote:
>
>> Of course, Java or Scala can do that:
>> 1) Create a FileWriter with an append or roll-over option
>> 2) For each RDD, build a StringBuilder after applying your filters
>> 3) Write the StringBuilder to the file when you want to (the roll-over
>> duration can be defined as a condition); see the sketch below
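
A minimal sketch of this append-per-batch idea, assuming a DStream[String]
named `dstream`; the output path and the filter are hypothetical, and HDFS
append requires the cluster to support it (dfs.support.append):

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.streaming.dstream.DStream

    def appendBatches(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { rdd =>
        // Apply your filters, then bring the batch to the driver
        // (reasonable only for small per-batch volumes).
        val records = rdd.filter(_.nonEmpty).collect()
        if (records.nonEmpty) {
          val fs  = FileSystem.get(new Configuration())
          val out = new Path("/data/stream/current.log") // hypothetical path
          // Append if the file already exists, otherwise create it.
          val os = if (fs.exists(out)) fs.append(out) else fs.create(out)
          try records.foreach(r => os.writeBytes(r + "\n"))
          finally os.close()
        }
      }
    }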
>>
>> On Tue, Aug 18, 2015 at 11:05 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>> wrote:
>>
>>> Is there a way to store all the results in one file and keep the file
>>> roll-over independent of the Spark Streaming batch interval?
>>>
>>> On Mon, Aug 17, 2015 at 2:39 AM, UMESH CHAUDHARY <umesh9794@gmail.com>
>>> wrote:
>>>
>>>> In Spark Streaming you can simply check whether your RDD contains any
>>>> records, and if it does, save them using a FileOutputStream:
>>>>
>>>> DStream.foreachRDD(t => { val count = t.count(); if (count > 0) {
>>>> /* SAVE YOUR STUFF */ } })
>>>>
>>>> This avoids creating unnecessary 0-byte files.
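
A runnable version of the same guard; `rdd.isEmpty()` (available since Spark
1.3) avoids counting the whole RDD just to test for emptiness, and the output
prefix below is an assumption:

    import org.apache.spark.streaming.dstream.DStream

    def saveNonEmpty(dstream: DStream[String]): Unit = {
      dstream.foreachRDD { (rdd, time) =>
        if (!rdd.isEmpty()) {  // cheaper guard than count() > 0
          // One output directory per non-empty batch; prefix is hypothetical.
          rdd.saveAsTextFile(s"/data/out/batch-${time.milliseconds}")
        }
      }
    }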
>>>>
>>>> On Mon, Aug 17, 2015 at 2:51 PM, Akhil Das <akhil@sigmoidanalytics.com>
>>>> wrote:
>>>>
>>>>> Currently, Spark Streaming creates a new directory for every batch
>>>>> and stores the data in it (whether it has anything or not). There is
>>>>> no direct append call as of now, but you can achieve this either with
>>>>> FileUtil.copyMerge
>>>>> <http://apache-spark-user-list.1001560.n3.nabble.com/save-spark-streaming-output-to-single-file-on-hdfs-td21124.html#a21167>
>>>>> or with a separate program that does the clean-up for you.
>>>>>
>>>>> Thanks
>>>>> Best Regards
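
For reference, a sketch of the FileUtil.copyMerge route mentioned above.
copyMerge concatenates the files sitting directly under a source directory
(e.g. the part-NNNNN files of one batch) into a single destination file; both
paths here are assumptions:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)
    FileUtil.copyMerge(
      fs, new Path("/data/out/batch-1439700000000"), // one batch directory (hypothetical)
      fs, new Path("/data/out/merged.txt"),          // single merged file (hypothetical)
      true,                                          // delete the sources after merging
      conf,
      null)                                          // no separator between merged files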
>>>>>
>>>>> On Sat, Aug 15, 2015 at 5:20 AM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Spark Streaming seems to be creating 0-byte files even when there is
>>>>>> no data. Also, I have 2 concerns here:
>>>>>>
>>>>>> 1) Extra, unnecessary files are being created in the output
>>>>>> 2) Hadoop doesn't work really well with too many files, and I see
>>>>>> that it is creating a directory with a timestamp every 1 second. Is
>>>>>> there a better way of writing the files, maybe some kind of append
>>>>>> mechanism, so that one doesn't have to change the batch interval?
>>>>>>
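
To make concern 2 concrete: DStream.saveAsTextFiles writes a new
prefix-<timestampMs> directory every batch interval, whether or not the batch
held any records. A minimal self-contained illustration, with the source and
paths as assumptions:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc    = new StreamingContext(new SparkConf().setAppName("demo"), Seconds(1))
    val stream = ssc.socketTextStream("localhost", 9999) // hypothetical source

    // Creates /data/out/events-<timestampMs>/part-NNNNN every second,
    // one directory per batch, even when the batch is empty.
    stream.saveAsTextFiles("/data/out/events")

    ssc.start()
    ssc.awaitTermination()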
>>>>>
>>>>>
>>>>
>>>
>>
>
