spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Mohammad Tariq <donta...@gmail.com>
Subject Re: Does DataFrame.collect() maintain the underlying schema?
Date Thu, 03 Mar 2016 00:59:36 GMT
I think this could be the reason :

DataFrame sorts the column of each record lexicographically if we do a *select
**. So, if we wish to maintain a specific column ordering while processing
we should use do *select col1, col2...* instead of select *.

However, this is just what I feel. Let's wait for comments from the gurus.



[image: http://]

Tariq, Mohammad
about.me/mti
[image: http://]
<http://about.me/mti>


On Thu, Mar 3, 2016 at 5:35 AM, Mohammad Tariq <dontariq@gmail.com> wrote:

> Cool. Here is it how it goes...
>
> I am reading Avro objects from a Kafka topic as a DStream, converting it
> into a DataFrame so that I can filter out records based on some conditions
> and finally do some aggregations on these filtered records. During the
> process I also need to tag each record based on the value of a particular
> column, and for this I am iterating over Array[Row] returned by
> DataFrame.collect().
>
> I am good as far as these things are concerned. The only thing which I am
> not getting is the reason behind changed column ordering within each Row.
> Say my actual record is [Tariq, IN, APAC]. When I
> do println(row.mkString("~")) it shows [IN~APAC~Tariq].
>
> I hope I was able to explain my use case to you!
>
>
>
> [image: http://]
>
> Tariq, Mohammad
> about.me/mti
> [image: http://]
> <http://about.me/mti>
>
>
> On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasainath@gmail.com>
> wrote:
>
>> Hi Tariq,
>>
>> Can you tell in brief what kind of operation you have to do? I can try
>> helping you out with that.
>> In general, if you are trying to use any group operations you can use
>> window operations.
>>
>> On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <dontariq@gmail.com>
>> wrote:
>>
>>> Hi Sainath,
>>>
>>> Thank you for the prompt response!
>>>
>>> Could you please elaborate your answer a bit? I'm sorry I didn't quite
>>> get this. What kind of operation I can perform using SQLContext? It just
>>> helps us during things like DF creation, schema application etc, IMHO.
>>>
>>>
>>>
>>> [image: http://]
>>>
>>> Tariq, Mohammad
>>> about.me/mti
>>> [image: http://]
>>> <http://about.me/mti>
>>>
>>>
>>> On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasainath@gmail.com>
>>> wrote:
>>>
>>>> Instead of collecting the data frame, you can try using a sqlContext on
>>>> the data frame. But it depends on what kind of operations are you trying
to
>>>> perform.
>>>>
>>>> On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <dontariq@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi list,
>>>>>
>>>>> *Scenario :*
>>>>> I am creating a DStream by reading an Avro object from a Kafka topic
>>>>> and then converting it into a DataFrame to perform some operations on
the
>>>>> data. I call DataFrame.collect() and perform the intended operation on
each
>>>>> Row of Array[Row] returned by DataFrame.collect().
>>>>>
>>>>> *Problem : *
>>>>> Calling DataFrame.collect() changes the schema of the underlying
>>>>> record, thus making it impossible to get the columns by index(as the
order
>>>>> gets changed).
>>>>>
>>>>> *Query :*
>>>>> Is it the way DataFrame.collect() behaves or am I doing something
>>>>> wrong here? In former case is there any way I can maintain the schema
while
>>>>> getting each Row?
>>>>>
>>>>> Any pointers/suggestions would be really helpful. Many thanks!
>>>>>
>>>>>
>>>>> [image: http://]
>>>>>
>>>>> Tariq, Mohammad
>>>>> about.me/mti
>>>>> [image: http://]
>>>>> <http://about.me/mti>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

Mime
View raw message