spark-user mailing list archives

From Tathagata Das <t...@databricks.com>
Subject Re: Partitioning in spark streaming
Date Wed, 12 Aug 2015 19:54:34 GMT
Yes.

On Wed, Aug 12, 2015 at 12:12 PM, Mohit Anchlia <mohitanchlia@gmail.com>
wrote:

> Thanks! To write to hdfs I do need to use saveAs method?
>
> On Wed, Aug 12, 2015 at 12:01 PM, Tathagata Das <tdas@databricks.com>
> wrote:
>
>> This is how Spark does it. It writes the task output to a uniquely named
>> temporary file and then, after the task completes successfully, atomically
>> renames the temp file to the expected file name <file>/<partition-XXX>.
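
A minimal sketch of this write-then-rename pattern (plain Python stdlib, not Spark's actual commit protocol; names like `commit_task_output` are illustrative): each "task" writes to its own unique temp file, then atomically renames it to its final part-file name, so concurrent writers never clobber each other and readers never see partial output.

```python
import os
import tempfile

def commit_task_output(out_dir, partition_id, data):
    """Write data to a unique temp file, then atomically rename it to the
    final part file, mimicking the task-commit pattern described above."""
    os.makedirs(out_dir, exist_ok=True)
    final_name = os.path.join(out_dir, f"part-{partition_id:05d}")
    # A unique temp file per task attempt avoids collisions between writers.
    fd, tmp_name = tempfile.mkstemp(dir=out_dir, prefix="_temporary-")
    with os.fdopen(fd, "w") as f:
        f.write(data)
    # os.rename is atomic on POSIX filesystems within the same directory.
    os.rename(tmp_name, final_name)
    return final_name

out = tempfile.mkdtemp()
for pid, chunk in enumerate(["a", "b", "c"]):
    commit_task_output(out, pid, chunk)
print(sorted(os.listdir(out)))  # ['part-00000', 'part-00001', 'part-00002']
```
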
>>
>>
>> On Tue, Aug 11, 2015 at 9:53 PM, Mohit Anchlia <mohitanchlia@gmail.com>
>> wrote:
>>
>>> Thanks for the info. When data is written to HDFS, how does Spark keep
>>> the filenames written by multiple executors unique?
>>>
>>> On Tue, Aug 11, 2015 at 9:35 PM, Hemant Bhanawat <hemant9379@gmail.com>
>>> wrote:
>>>
>>>> Posting a comment from my previous mail post:
>>>>
>>>> When data is received from a stream source, the receiver creates blocks
>>>> of data. A new block of data is generated every blockInterval
>>>> milliseconds, so N blocks of data are created during the batchInterval,
>>>> where N = batchInterval/blockInterval. An RDD is created on the driver
>>>> for the blocks created during the batchInterval, and those blocks are
>>>> the partitions of the RDD.
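
A concrete check of the arithmetic above (the interval values are illustrative, not Spark defaults for the batch interval):

```python
# Number of blocks, and hence RDD partitions, per batch:
#   N = batchInterval / blockInterval
batch_interval_ms = 2000  # illustrative batch interval chosen by the app
block_interval_ms = 200   # spark.streaming.blockInterval (200 ms by default)
n_partitions = batch_interval_ms // block_interval_ms
print(n_partitions)  # 10 blocks -> 10 partitions in each batch's RDD
```

Shrinking blockInterval (or growing batchInterval) raises N and hence the parallelism of each batch's RDD, at the cost of more, smaller tasks.
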
>>>>
>>>> Now if you want to repartition based on a key, a shuffle is needed.
>>>>
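
Repartitioning by key boils down to hash partitioning: each record is routed to a partition by its key's hash, which forces a shuffle because records from any input partition may map to any output partition. A toy illustration in plain Python (not Spark's HashPartitioner; `partition_for` is a hypothetical helper):

```python
def partition_for(key, num_partitions):
    """Route a key to a partition the way a hash partitioner would."""
    return hash(key) % num_partitions

# Records tagged by category, as in the time-series question below.
records = [("video", 1), ("music", 2), ("video", 3), ("books", 4)]
num_partitions = 4
shuffled = {p: [] for p in range(num_partitions)}
for key, value in records:
    shuffled[partition_for(key, num_partitions)].append((key, value))
# After the shuffle, all records sharing a key sit in the same partition.
```

Note that Python randomizes string hashing per process, so the exact partition numbers vary between runs, but the guarantee that matters, same key to same partition, always holds within a run.
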
>>>> On Wed, Aug 12, 2015 at 4:36 AM, Mohit Anchlia <mohitanchlia@gmail.com>
>>>> wrote:
>>>>
>>>>> How does partitioning in Spark work when it comes to streaming? What's
>>>>> the best way to partition time series data grouped by a certain tag,
>>>>> like categories of product (video, music, etc.)?
>>>>>
>>>>
>>>>
>>>
>>
>
