spark-user mailing list archives

From Marcelo Valle <marcelo.va...@ktech.com>
Subject Re: custom rdd - do I need a hadoop input format?
Date Wed, 18 Sep 2019 09:16:06 GMT
To implement a custom RDD with getPartitions, I have to extend
`NewHadoopRDD` and pass it the Hadoop input format class, right?
Which input format could I use so that the file isn't read all at once and
my getPartitions method can split it by block?
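
For reference, the two methods named in the reply quoted below (getPartitions and compute) are declared on the base class `org.apache.spark.rdd.RDD`, so a custom RDD can extend `RDD` directly. The following is a minimal, Spark-independent sketch of the logic each method would hold, not a definitive implementation. It assumes a block is a fixed number of lines, as in the original question, and that the file is on a local path readable from wherever the code runs. The names `BlockSplits`, `BlockRange`, `planBlocks` and `readBlock` are hypothetical, and header and summary handling is left out.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

class BlockSplits {

    /** Hypothetical descriptor of one block: its index and byte range in the file. */
    static final class BlockRange {
        final int index;
        final long start;
        final long length;

        BlockRange(int index, long start, long length) {
            this.index = index;
            this.start = start;
            this.length = length;
        }
    }

    /**
     * Driver-side planning (the job getPartitions would do): stream through the
     * file once, counting newlines, and record the byte range of every block of
     * linesPerBlock lines. Only offsets are kept, never the file contents.
     */
    static List<BlockRange> planBlocks(Path file, int linesPerBlock) throws IOException {
        List<BlockRange> ranges = new ArrayList<>();
        try (InputStream in = new BufferedInputStream(Files.newInputStream(file))) {
            long pos = 0;        // number of bytes consumed so far
            long blockStart = 0; // offset where the current block begins
            int lines = 0;       // complete lines seen in the current block
            int b;
            while ((b = in.read()) != -1) {
                pos++;
                if (b == '\n' && ++lines == linesPerBlock) {
                    ranges.add(new BlockRange(ranges.size(), blockStart, pos - blockStart));
                    blockStart = pos;
                    lines = 0;
                }
            }
            if (pos > blockStart) {
                // Trailing block with fewer than linesPerBlock lines.
                ranges.add(new BlockRange(ranges.size(), blockStart, pos - blockStart));
            }
        }
        return ranges;
    }

    /**
     * Task-side read (the job compute would do): seek to the start of the block
     * and read only that block's bytes.
     */
    static String readBlock(Path file, BlockRange range) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] buf = new byte[(int) range.length];
            raf.seek(range.start);
            raf.readFully(buf);
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

In an RDD subclass, each `BlockRange` would presumably become one `Partition` returned from getPartitions, and compute would open the file, seek to the range, and hand the bytes to the existing InputStream-based block parser. For a file on HDFS, the same seek-and-read approach would need the Hadoop `FileSystem` API instead of `RandomAccessFile`.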

On Tue, 17 Sep 2019 at 18:53, Arun Mahadevan <arunm@apache.org> wrote:

> You can do it with custom RDD implementation.
> You will mainly implement "getPartitions" - the logic to split your input
> into partitions and "compute" to compute and return the values from the
> executors.
>
> On Tue, 17 Sep 2019 at 08:47, Marcelo Valle <marcelo.valle@ktech.com>
> wrote:
>
>> Just to be more clear about my requirements, what I have is actually a
>> custom format, with header, summary and multi-line blocks. I want to create
>> tasks per block and not per line. I already have a library that reads an
>> InputStream and outputs an Iterator of Block, but now I need to integrate
>> this with Spark.
>>
>> On Tue, 17 Sep 2019 at 16:28, Marcelo Valle <marcelo.valle@ktech.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I want to create a custom RDD which will read n lines in sequence from a
>>> file, which I call a block, and each block should be converted to a Spark
>>> DataFrame to be processed in parallel.
>>>
>>> Question - do I have to implement a custom Hadoop input format to
>>> achieve this? Or is it possible to do it only with RDD APIs?
>>>
>>> Thanks,
>>> Marcelo.
>>>
>>
>> This email is confidential [and may be protected by legal privilege]. If
>> you are not the intended recipient, please do not copy or disclose its
>> content but contact the sender immediately upon receipt.
>>
>> KTech Services Ltd is registered in England as company number 10704940.
>>
>> Registered Office: The River Building, 1 Cousin Lane, London EC4R 3TE,
>> United Kingdom
>>
>
