spark-user mailing list archives

From Mungeol Heo <mungeol....@gmail.com>
Subject Re: Is spark a right tool for updating a dataframe repeatedly
Date Tue, 18 Oct 2016 01:38:46 GMT
First of all, thank you for your comments.
Actually, what I mean by "update" is generating a new data frame with modified data.
A more detailed version of the while loop looks like the one below.

var continue = 1
var dfA = "a data frame"
dfA.persist

while (continue > 0) {
  // build a modified copy of dfA, materialize it, then drop the old cache
  val temp = "modified dfA"
  temp.persist
  temp.count
  dfA.unpersist

  // build the next dfA from temp, then swap the caches back
  dfA = "modified temp"
  dfA.persist
  dfA.count
  temp.unpersist

  if ("dfA is not modified") {
    continue = 0
  }
}
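To make the shape of this loop concrete, here is a minimal, Spark-free sketch of the same iterate-until-fixed-point pattern on a plain Scala collection. The capping transform is a made-up stand-in for "modified dfA", not the real workload.

```scala
// Iterate a transform until a pass stops changing the data: the same
// control flow as the DataFrame loop above, minus the persist/unpersist
// bookkeeping. The transform (increment, capped at 10) is illustrative only.
def untilFixedPoint(initial: Seq[Int]): Seq[Int] = {
  var current = initial
  var changed = true
  while (changed) {
    val next = current.map(v => math.min(v + 1, 10)) // placeholder "modification"
    changed = next != current  // loop ends once a pass changes nothing
    current = next
  }
  current
}

val result = untilFixedPoint(Seq(3, 7, 10))
println(result) // every element has converged to the cap
```

With real DataFrames the extra persist/count/unpersist steps in the loop above matter, because without them each pass would re-evaluate the entire lineage from the beginning.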

The problem is that it eventually causes an OOM error.
Also, the number of skipped stages increases every iteration, though
I am not sure whether that is what causes the OOM.
Maybe I need to check the source code of one of the Spark ML algorithms.
Again, thank you all.


On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh
<jthakrar@conversantmedia.com> wrote:
> Yes, iterating over a dataframe and making changes is not uncommon.
>
> Of course RDDs, dataframes and datasets are immutable, but there is some
> optimization in the optimizer that can potentially help to dampen the
> effect/impact of creating a new rdd, df or ds.
>
> Also, the use-case you cited is similar to what is done in regression,
> clustering and other algorithms.
>
> I.e. you iterate, making a change to a dataframe/dataset, until the desired
> condition is met.
>
> E.g. see this -
> https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
> and the setting of the iteration ceiling
>
>
>
> // instantiate the base classifier
> val classifier = new LogisticRegression()
>   .setMaxIter(params.maxIter)
>   .setTol(params.tol)
>   .setFitIntercept(params.fitIntercept)
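The setMaxIter/setTol pair quoted above bounds exactly this kind of loop: stop when consecutive results agree to within a tolerance, or when the iteration ceiling is hit, whichever comes first. A minimal sketch of that stopping rule, using a Newton step for sqrt(2) as a made-up stand-in for the real per-iteration work:

```scala
// Convergence loop with both a tolerance (tol) and a ceiling (maxIter),
// mirroring the roles of setTol and setMaxIter. The Newton step is an
// illustrative placeholder, not what Spark ML actually computes.
def iterateWithCeiling(start: Double, maxIter: Int, tol: Double): (Double, Int) = {
  var x = start
  var iters = 0
  var converged = false
  while (!converged && iters < maxIter) {
    val next = (x + 2.0 / x) / 2.0        // Newton step toward sqrt(2)
    converged = math.abs(next - x) < tol  // tolerance check, like setTol
    x = next
    iters += 1
  }
  (x, iters)
}

val (root, iters) = iterateWithCeiling(1.0, 50, 1e-9)
// root approximates sqrt(2); iters stays well under the 50-iteration ceiling
```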
>
>
>
> Now the impact of that depends on a variety of things.
>
> E.g. if the data is completely contained in memory and there is no spill
> over to disk, it might not be a big issue (of course there will still be
> memory, CPU and network overhead/latency).
>
> If you are looking at storing the data on disk (e.g. as part of a checkpoint
> or explicit storage), then there can be substantial I/O activity.
>
>
>
>
>
>
>
> From: Xi Shen <davidshen84@gmail.com>
> Date: Monday, October 17, 2016 at 2:54 AM
> To: Divya Gehlot <divya.htconex@gmail.com>, Mungeol Heo
> <mungeol.heo@gmail.com>
> Cc: "user @spark" <user@spark.apache.org>
> Subject: Re: Is spark a right tool for updating a dataframe repeatedly
>
>
>
> I think most of the "big data" tools, like Spark and Hive, are not designed
> to edit data. They are only designed to query data. I wonder in what
> scenario you need to update a large volume of data repetitively.
>
>
>
>
>
> On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot <divya.htconex@gmail.com>
> wrote:
>
> If my understanding of your query is correct:
>
> In Spark, dataframes are immutable, so you can't update a dataframe in place.
>
> You have to create a new dataframe to "update" the current dataframe.
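That immutability point can be sketched with plain Scala values standing in for DataFrames: the "update" never mutates the original, it derives a new value. Row and the score floor below are made-up names for illustration only.

```scala
// Immutable "update": derive a new collection instead of mutating in place,
// just as a DataFrame transformation yields a new DataFrame.
case class Row(id: Int, score: Double)

val original = Vector(Row(1, 0.2), Row(2, 0.9))
// original is never touched; updated is a brand-new collection
val updated = original.map(r => if (r.score < 0.5) r.copy(score = 0.5) else r)

println(original) // unchanged by the "update"
println(updated)
```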
>
>
>
>
>
> Thanks,
>
> Divya
>
>
>
>
>
> On 17 October 2016 at 09:50, Mungeol Heo <mungeol.heo@gmail.com> wrote:
>
> Hello, everyone.
>
> As I mentioned in the title, I wonder whether Spark is the right tool for
> updating a data frame repeatedly until there is no more data to
> update.
>
> For example.
>
> while ("there was an update in the last pass") {
>   update data frame A
> }
>
> If it is the right tool, then what is the best practice for this kind of
> work?
> Thank you.
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscribe@spark.apache.org
>
>
>
> --
>
>
> Thanks,
> David S.
