spark-user mailing list archives

From Duy Huynh <duy.huynh....@gmail.com>
Subject Re: what is the best way to implement mini batches?
Date Thu, 11 Dec 2014 19:41:34 GMT
the dataset i'm working on has about 100,000 records, and the batches we're
training on have a size of around 10.  can you use repartition(10000) to split
it into 10,000 partitions?
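
A rough sketch of the arithmetic in the question, in plain Python rather than Spark: 100,000 records in batches of 10 gives 10,000 batches. The helper name `mini_batches` is hypothetical. It chunks an iterator into fixed-size lists, which is the shape of function that PySpark's `RDD.mapPartitions` accepts, so batching inside each partition may be an alternative to creating one partition per batch. That use is an assumption and is not confirmed anywhere in this thread.

```python
from itertools import islice


def mini_batches(records, batch_size):
    """Yield consecutive lists of up to batch_size records from an iterable."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


# Stand-in for the 100,000-record dataset (assumption: records are just ints here).
records = range(100_000)
batches = list(mini_batches(records, 10))
num_batches = len(batches)
```

With PySpark this could presumably be applied as `rdd.mapPartitions(lambda it: mini_batches(it, 10))`. Batches produced that way would never span two partitions, and the last batch of a partition may be shorter than 10.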

On Thu, Dec 11, 2014 at 2:36 PM, Matei Zaharia <matei.zaharia@gmail.com>
wrote:

> You can just do mapPartitions on the whole RDD, and then call sliding()
> on the iterator in each one to get a sliding window. One problem is that
> you will not be able to slide "forward" into the next partition at
> partition boundaries. If this matters to you, you need to do something more
> complicated to get those, such as the repartition that you said (where you
> map each record to the partition it should be in).
>
> Matei
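
To illustrate the partition-boundary problem described above, here is a minimal sketch in plain Python, with no Spark involved. The two lists stand in for two partitions of an RDD, and the hypothetical `sliding` helper is a simplified stand-in for an iterator's sliding window that emits only full windows. The data and window size are made up for the example.

```python
from collections import deque


def sliding(iterator, size):
    """Yield every full window of `size` consecutive items from an iterator."""
    window = deque(maxlen=size)
    for item in iterator:
        window.append(item)
        if len(window) == size:
            yield list(window)


# Two hypothetical partitions holding the records 1..8 in order.
partitions = [[1, 2, 3, 4], [5, 6, 7, 8]]

# Windows computed independently inside each partition (the mapPartitions pattern).
per_partition = [w for part in partitions for w in sliding(iter(part), 3)]

# Windows over the whole dataset, as if there were no partition boundary.
whole = list(sliding(iter(range(1, 9)), 3))

# Windows that straddle the boundary and are therefore lost per partition.
missing = [w for w in whole if w not in per_partition]
```

Here `missing` holds the windows that cross from the first partition into the second. Recovering them needs the extra step mentioned above, such as mapping each record to every partition whose windows need it.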
>
> > On Dec 11, 2014, at 10:16 AM, ll <duy.huynh.uiv@gmail.com> wrote:
> >
> > any advice/comment on this would be much appreciated.
