spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Dillon Dukek <>
Subject Re: Train ML models on each partition
Date Thu, 09 May 2019 06:49:03 GMT
Hi Qian,

The way that I have gotten around this type of problem in the past is to do
a groupBy on the dimensions that you want to build a model for and then
initialize, and train a model using a package like scikit learn  for each
group in something like a group map pandas udf. If you need these models
persisted you could then write them out to S3. Alternatively, you could
also switch the scheduler mode to fair instead of FIFO and spawn a thread
pool that trains a spark model on the appropriate data from the source


On Wed, May 8, 2019 at 10:34 PM Qian He <> wrote:

> I have a 1TB dataset with 100 columns. The first column is a user_id,
> there are about 1000 unique user_ids in this 1TB dataset.
> The use case: I want to train a ML model for each user_id on this user's
> records (approximately 1GB records per user). Say the ML model is a
> Decision Tree. But it is not feasible to create 1000 Spark applications to
> achieve this. Can I launch just one Spark application and accomplish the
> trainings of these 1000 DT models? How?
> Can I just partition the 1TB data by user_id, and then train model for
> each partition?
> Thanks!

View raw message