spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mohammad Tariq <donta...@gmail.com>
Subject Re: Does DataFrame.collect() maintain the underlying schema?
Date Thu, 03 Mar 2016 00:05:27 GMT
Cool. Here is it how it goes...

I am reading Avro objects from a Kafka topic as a DStream, converting it
into a DataFrame so that I can filter out records based on some conditions
and finally do some aggregations on these filtered records. During the
process I also need to tag each record based on the value of a particular
column, and for this I am iterating over Array[Row] returned by
DataFrame.collect().

I am good as far as these things are concerned. The only thing which I am
not getting is the reason behind changed column ordering within each Row.
Say my actual record is [Tariq, IN, APAC]. When I
do println(row.mkString("~")) it shows [IN~APAC~Tariq].

I hope I was able to explain my use case to you!



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasainath@gmail.com>
wrote:

> Hi Tariq,
>
> Can you tell in brief what kind of operation you have to do? I can try
> helping you out with that.
> In general, if you are trying to use any group operations you can use
> window operations.
>
> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
>
>> Hi Sainath,
>>
>> Thank you for the prompt response!
>>
>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>> get this. What kind of operation I can perform using SQLContext? It just
>> helps us during things like DF creation, schema application etc, IMHO.
>>
>>
>>
>> [image: http://]
>>
>> Tariq, Mohammad
>> about.me/mti
>> [image: http://]
>> <http://about.me/mti>
>>
>>
>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasainath@gmail.com>
>> wrote:
>>
>>> Instead of collecting the data frame, you can try using a sqlContext on
>>> the data frame. But it depends on what kind of operations are you trying to
>>> perform.
>>>
>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <dontariq@gmail.com>
>>> wrote:
>>>
>>>> Hi list,
>>>>
>>>> *Scenario :*
>>>> I am creating a DStream by reading an Avro object from a Kafka topic
>>>> and then converting it into a DataFrame to perform some operations on the
>>>> data. I call DataFrame.collect() and perform the intended operation on each
>>>> Row of Array[Row] returned by DataFrame.collect().
>>>>
>>>> *Problem : *
>>>> Calling DataFrame.collect() changes the schema of the underlying
>>>> record, thus making it impossible to get the columns by index(as the order
>>>> gets changed).
>>>>
>>>> *Query :*
>>>> Is it the way DataFrame.collect() behaves or am I doing something wrong
>>>> here? In former case is there any way I can maintain the schema while
>>>> getting each Row?
>>>>
>>>> Any pointers/suggestions would be really helpful. Many thanks!
>>>>
>>>>
>>>> [image: http://]
>>>>
>>>> Tariq, Mohammad
>>>> about.me/mti
>>>> [image: http://]
>>>> <http://about.me/mti>
>>>>
>>>>
>>>
>>>
>>
>

Mime
View raw message