spark-user mailing list archives

From Jan Štěrba <i...@jansterba.com>
Subject Re: adding rows to a DataFrame
Date Fri, 11 Mar 2016 20:51:51 GMT
It very much depends on the logic that generates the new rows. If it is
per row (i.e. without context), then you can just convert to an RDD and
perform a map operation on each row.

// group by whatever key your logic needs, most likely the ID column
JavaPairRDD<Object, Iterable<Row>> grouped =
    dataFrame.javaRDD().groupBy(row -> row.get(0));

JavaPairRDD<Object, List<Row>> expanded = grouped.mapValues(rowsIt -> {
    List<Row> rows = Lists.newArrayList(rowsIt);
    // build and return the new list of rows according to your logic
    return rows;
});

Then convert back to a DataFrame using flatMap (to flatten the grouped
values into rows) and SQLContext.createDataFrame.
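For the per-row (no context) case, the expansion function itself is plain Java; the same logic would sit inside the FlatMapFunction<Row, Row> handed to dataFrame.javaRDD().flatMap(...) (which returns an Iterable<Row> in Spark 1.x) before going back through createDataFrame. A minimal sketch with Object[] standing in for Spark's Row; the one-child-row-per-100 rule and all names are invented for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandRows {
    // Stand-in for Spark's Row: a plain Object[] holding (_1, _2, _3).
    // In Spark this body would live inside the FlatMapFunction<Row, Row>
    // passed to dataFrame.javaRDD().flatMap(...).
    static List<Object[]> expand(Object[] row) {
        List<Object[]> out = new ArrayList<>();
        out.add(row); // keep the original row
        // hypothetical rule: emit one child row per 100 in column _2
        int children = (Integer) row[1] / 100;
        for (int i = 1; i <= children; i++) {
            out.add(new Object[] { row[0] + String.valueOf(i), null, null });
        }
        return out;
    }

    public static void main(String[] args) {
        for (Object[] r : expand(new Object[] { "ID2", 200, 2.2 })) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```

Note that the child rows here carry nulls in the columns they do not use, matching the ragged shape of the desired output below; the real schema and fill values depend on your data.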
--
Jan Sterba
https://twitter.com/honzasterba | http://flickr.com/honzasterba |
http://500px.com/honzasterba


On Fri, Mar 11, 2016 at 8:49 PM, Michael Armbrust
<michael@databricks.com> wrote:
> Or look at explode on DataFrame
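The explode suggestion applies when each parent row carries (or is first given, e.g. via a UDF) an array of its generated children; in Spark 1.x the call would be along the lines of df.select(col("_1"), explode(col("children")).as("child")) using org.apache.spark.sql.functions.explode. The sketch below models only the flattening step in plain Java, without a Spark dependency; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExplodeSketch {
    // Conceptual model of explode: a row whose field holds a list becomes
    // one output row per element of that list.
    static List<String[]> explodeRow(String id, List<String> children) {
        List<String[]> out = new ArrayList<>();
        for (String child : children) {
            out.add(new String[] { id, child });
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] r : explodeRow("ID1", Arrays.asList("id11", "id12"))) {
            System.out.println(Arrays.toString(r));
        }
    }
}
```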
>
> On Fri, Mar 11, 2016 at 10:45 AM, Stefan Panayotov <spanayotov@msn.com>
> wrote:
>>
>> Hi,
>>
>> I have a problem that requires me to go through the rows in a DataFrame
>> (or possibly through rows in a JSON file) and conditionally add rows
>> depending on a value in one of the columns in each existing row. So, for
>> example if I have:
>>
>>
>> +---+---+---+
>> | _1| _2| _3|
>> +---+---+---+
>> |ID1|100|1.1|
>> |ID2|200|2.2|
>> |ID3|300|3.3|
>> |ID4|400|4.4|
>> +---+---+---+
>>
>> I need to be able to get:
>>
>>
>> +---+---+---+--------------------+---+
>> | _1| _2| _3|                  _4| _5|
>> +---+---+---+--------------------+---+
>> |ID1|100|1.1|ID1 add text or d...| 25|
>> |id11 ..|21 |
>> |id12 ..|22 |
>> |ID2|200|2.2|ID2 add text or d...| 50|
>> |id21 ..|33 |
>> |id22 ..|34 |
>> |id23 ..|35 |
>> |ID3|300|3.3|ID3 add text or d...| 75|
>> |id31 ..|11 |
>> |ID4|400|4.4|ID4 add text or d...|100|
>> |id41 ..|51 |
>> |id42 ..|52 |
>> |id43 ..|53 |
>> |id44 ..|54 |
>> +---+---+---+--------------------+---+
>>
>> How can I achieve this in Spark without doing DF.collect(), which would bring
>> everything to the driver and cause an OOM for a big data set?
>> BTW, I know how to use withColumn() to add new columns to the DataFrame; I
>> also need to add new rows.
>> Any help will be appreciated.
>>
>> Thanks,
>>
>> Stefan Panayotov, PhD
>> Home: 610-355-0919
>> Cell: 610-517-5586
>> email: spanayotov@msn.com
>> spanayotov@outlook.com
>> spanayotov@comcast.net
>>
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

