spark-user mailing list archives

From Mike Metzger <m...@flexiblecreations.com>
Subject Re: Is spark a right tool for updating a dataframe repeatedly
Date Tue, 18 Oct 2016 03:28:49 GMT
I've not done this in Scala yet, but in PySpark I've run into a similar
issue where having too many dataframes cached does cause memory issues.
Unpersist by itself did not clear the memory usage, but rather setting the
variable equal to None allowed all the references to be cleared and the
memory issues went away.

I do not fully understand Scala yet, but you may be able to set one of your
dataframes to null to accomplish the same.
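
For what it's worth, here is a rough sketch of that pattern in Python. The
helper name iterate_until_stable and its step / is_changed callbacks are made
up for illustration. It only relies on persist(), count() and unpersist(),
which PySpark DataFrames provide, so treat it as a starting point rather than
a tested recipe for your job.

```python
def iterate_until_stable(df, step, is_changed, max_iter=100):
    """Replace `df` with `step(df)` until `is_changed(old, new)` is False.

    `df` can be any object exposing persist(), count() and unpersist(),
    as PySpark DataFrames do. `step` and `is_changed` are hypothetical
    callbacks supplied by the caller.
    """
    df.persist()
    df.count()  # force evaluation so the cache is actually populated

    for _ in range(max_iter):
        new_df = step(df)
        new_df.persist()
        new_df.count()  # materialize the new frame while the old one is still cached

        changed = is_changed(df, new_df)

        df.unpersist()  # release the old frame's cached blocks
        # Rebinding drops this function's reference to the old frame,
        # which has the same effect as setting the variable to None.
        df = new_df

        if not changed:
            break

    return df
```

With real dataframes the call might look like
iterate_until_stable(df_a, step=my_update, is_changed=my_check), where
my_update and my_check stand in for your own logic. If the caller keeps its
own variable pointing at the initial dataframe, that variable would also need
to be set to None (or reassigned to the result) for the reference to go away.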

Mike


On Mon, Oct 17, 2016 at 8:38 PM, Mungeol Heo <mungeol.heo@gmail.com> wrote:

> First of all, thank you for your comments.
> Actually, what I mean by "update" is generating a new data frame with
> modified data.
> A more detailed version of the while loop would look something like this.
>
> var continue = 1
> var dfA = "a data frame"
> dfA.persist
>
> while (continue > 0) {
>   val temp = "modified dfA"
>   temp.persist
>   temp.count
>   dfA.unpersist
>
>   dfA = "modified temp"
>   dfA.persist
>   dfA.count
>   temp.unpersist
>
>   if ("dfA is not modified") {
>     continue = 0
>   }
> }
>
> The problem is that it eventually causes an OOM.
> Also, the number of skipped stages increases every time, even though
> I am not sure whether this is the reason for the OOM.
> Maybe, I need to check the source code of one of the spark ML algorithms.
> Again, thank you all.
>
>
> On Mon, Oct 17, 2016 at 10:54 PM, Thakrar, Jayesh
> <jthakrar@conversantmedia.com> wrote:
> > Yes, iterating over a dataframe and making changes is not uncommon.
> >
> > Of course RDDs, dataframes and datasets are immutable, but there is some
> > optimization in the optimizer that can potentially help to dampen the
> > effect/impact of creating a new rdd, df or ds.
> >
> > Also, the use-case you cited is similar to what is done in regression,
> > clustering and other algorithms.
> >
> > I.e. you iterate, making a change to a dataframe/dataset until the desired
> > condition is met.
> >
> > E.g. see this -
> > https://spark.apache.org/docs/1.6.1/ml-classification-regression.html#linear-regression
> > and the setting of the iteration ceiling
> >
> >
> >
> > // instantiate the base classifier
> > val classifier = new LogisticRegression()
> >   .setMaxIter(params.maxIter)
> >   .setTol(params.tol)
> >   .setFitIntercept(params.fitIntercept)
> >
> >
> >
> > Now the impact of that depends on a variety of things.
> >
> > E.g. if the data is completely contained in memory and there is no spill
> > over to disk, it might not be a big issue (of course there will still be
> > memory, CPU and network overhead/latency).
> >
> > If you are looking at storing the data on disk (e.g. as part of a checkpoint
> > or explicit storage), then there can be substantial I/O activity.
> >
> >
> >
> >
> >
> >
> >
> > From: Xi Shen <davidshen84@gmail.com>
> > Date: Monday, October 17, 2016 at 2:54 AM
> > To: Divya Gehlot <divya.htconex@gmail.com>, Mungeol Heo
> > <mungeol.heo@gmail.com>
> > Cc: "user @spark" <user@spark.apache.org>
> > Subject: Re: Is spark a right tool for updating a dataframe repeatedly
> >
> >
> >
> > I think most of the "big data" tools, like Spark and Hive, are not designed
> > to edit data. They are only designed to query data. I wonder in what
> > scenario you need to update a large volume of data repetitively.
> >
> >
> >
> >
> >
> > On Mon, Oct 17, 2016 at 2:00 PM Divya Gehlot <divya.htconex@gmail.com>
> > wrote:
> >
> > If my understanding of your question is correct:
> >
> > In Spark, dataframes are immutable, so you can't update a dataframe in place.
> >
> > You have to create a new dataframe that reflects the update.
> >
> >
> >
> >
> >
> > Thanks,
> >
> > Divya
> >
> >
> >
> >
> >
> > On 17 October 2016 at 09:50, Mungeol Heo <mungeol.heo@gmail.com> wrote:
> >
> > Hello, everyone.
> >
> > As I mentioned in the title, I wonder whether Spark is the right tool for
> > updating a data frame repeatedly until there is no more data to
> > update.
> >
> > For example.
> >
> > while (there was an update) {
> >   update data frame A
> > }
> >
> > If it is the right tool, then what is the best practice for this kind of
> > work?
> > Thank you.
> >
> > ---------------------------------------------------------------------
> > To unsubscribe e-mail: user-unsubscribe@spark.apache.org
> >
> >
> >
> > --
> >
> >
> > Thanks,
> > David S.
