spark-user mailing list archives

From Andrea Esposito <and1...@gmail.com>
Subject Re: Efficient Aggregation over DB data
Date Thu, 01 May 2014 16:47:58 GMT
Hi Sai,

I honestly can't see where you are using RDDs here (the split method you call
isn't defined on them). In any case, you should use the map function instead
of foreach to build your result: foreach runs purely for its side effects,
which are not idempotent in your code, and Spark may recompute some
partitions, executing the function multiple times.

What you may be looking for is something like:

val input = sc.textFile(inputFile)
val result = input.map(line => line.split("\\s+")(2).toInt)
result.max
result.min
result.filter(...)

(sc.textFile already gives you one record per line, so there is no need to
split on "\\n" yourself.)
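To see the same extract-a-column-then-aggregate pattern in isolation, here is
a sketch using plain Scala collections in place of an RDD, so it runs without
a Spark cluster (the sample rows and the column index 2 are made up for
illustration; on an RDD the map/filter calls have the same shape):

```scala
// Sample whitespace-separated "rows", standing in for lines of the CSV dump.
val lines = Seq(
  "alice 10 42",
  "bob    7 17",
  "carol  3 99"
)

// Extract the third column (index 2) as Ints -- the local-collection
// analogue of input.map(line => line.split("\\s+")(2).toInt) on an RDD.
val column = lines.map(line => line.trim.split("\\s+")(2).toInt)

println(column.max)            // 99
println(column.min)            // 17
println(column.sum)            // 158
println(column.filter(_ > 20)) // List(42, 99)
```

Note that splitting on "\\s+" (one or more whitespace characters) rather than
"\\s" avoids empty tokens when columns are separated by multiple spaces.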

Best,
EA



2014-04-22 11:02 GMT+02:00 Sai Prasanna <ansaiprasanna@gmail.com>:

> Hi All,
>
> I want to access a particular column of a DB table stored in a CSV format
> and perform some aggregate queries over it. I wrote the following query in
> scala as a first step.
>
> *var add=(x:String)=>x.split("\\s+")(2).toInt*
> *var result=List[Int]()*
>
> *input.split("\\n").foreach(x=>result::=add(x)) *
> *[Queries:]result.max/min/filter/sum...*
>
> But is there an efficient way, or a built-in function, in Spark to access a
> particular column value or an entire column? A built-in implementation
> would presumably be more efficient!
>
> Thanks.
>
