I think this could be the reason :

DataFrame sorts the column of each record lexicographically if we do a select *. So, if we wish to maintain a specific column ordering while processing we should use do select col1, col2... instead of select *.

However, this is just what I feel. Let's wait for comments from the gurus.


On Thu, Mar 3, 2016 at 5:35 AM, Mohammad Tariq <dontariq@gmail.com> wrote:
Cool. Here is it how it goes...

I am reading Avro objects from a Kafka topic as a DStream, converting it into a DataFrame so that I can filter out records based on some conditions and finally do some aggregations on these filtered records. During the process I also need to tag each record based on the value of a particular column, and for this I am iterating over Array[Row] returned by DataFrame.collect().

I am good as far as these things are concerned. The only thing which I am not getting is the reason behind changed column ordering within each Row. Say my actual record is [Tariq, IN, APAC]. When I do println(row.mkString("~")) it shows [IN~APAC~Tariq].

I hope I was able to explain my use case to you!


On Thu, Mar 3, 2016 at 5:21 AM, Sainath Palla <pallasainath@gmail.com> wrote:
Hi Tariq,

Can you tell in brief what kind of operation you have to do? I can try helping you out with that. 
In general, if you are trying to use any group operations you can use window operations.

On Wed, Mar 2, 2016 at 6:40 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
Hi Sainath,

Thank you for the prompt response!

Could you please elaborate your answer a bit? I'm sorry I didn't quite get this. What kind of operation I can perform using SQLContext? It just helps us during things like DF creation, schema application etc, IMHO.


On Thu, Mar 3, 2016 at 4:59 AM, Sainath Palla <pallasainath@gmail.com> wrote:
Instead of collecting the data frame, you can try using a sqlContext on the data frame. But it depends on what kind of operations are you trying to perform. 

On Wed, Mar 2, 2016 at 6:21 PM, Mohammad Tariq <dontariq@gmail.com> wrote:
Hi list,

Scenario :
I am creating a DStream by reading an Avro object from a Kafka topic and then converting it into a DataFrame to perform some operations on the data. I call DataFrame.collect() and perform the intended operation on each Row of Array[Row] returned by DataFrame.collect().

Problem : 
Calling DataFrame.collect() changes the schema of the underlying record, thus making it impossible to get the columns by index(as the order gets changed).

Query :
Is it the way DataFrame.collect() behaves or am I doing something wrong here? In former case is there any way I can maintain the schema while getting each Row?

Any pointers/suggestions would be really helpful. Many thanks!