spark-user mailing list archives

From Qian He
Subject Train ML models on each partition
Date Thu, 09 May 2019 05:27:50 GMT
I have a 1TB dataset with 100 columns. The first column is user_id, and there
are about 1000 unique user_ids in the dataset.

The use case: I want to train an ML model for each user_id on that user's
records (approximately 1GB of records per user). Say the ML model is a
decision tree (DT). It is not feasible to create 1000 Spark applications to
achieve this. Can I launch just one Spark application and train all 1000 DT
models in it? How?

Can I just partition the 1TB data by user_id and then train a model for each
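
One possible way to do what the question describes inside a single Spark application is to group the rows by user_id and fit one single-machine model per group. The sketch below is only an illustration under several assumptions that are not in the original post: Spark 3.0 or later with PyArrow installed (`applyInPandas`; on Spark 2.4 the rough equivalent is a grouped-map `pandas_udf`), scikit-learn available on the executors, a hypothetical `label` column, numeric feature columns, a `long`-typed user_id, and enough executor memory to hold one user's roughly 1GB of rows at a time.

```python
import pickle

# Column names are assumptions; the post only names user_id.
KEY_COL = "user_id"
LABEL_COL = "label"  # hypothetical target column


def train_one_user(pdf):
    """Fit one decision tree on a single user's rows (a pandas DataFrame)."""
    # Imported inside the function so the modules are loaded on the executors.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    # Every column other than the key and the label is treated as a feature.
    features = pdf.drop(columns=[KEY_COL, LABEL_COL])
    model = DecisionTreeClassifier().fit(features, pdf[LABEL_COL])

    # Return one row per user: the key and the pickled model.
    return pd.DataFrame(
        {KEY_COL: [pdf[KEY_COL].iloc[0]], "model": [pickle.dumps(model)]}
    )


def train_all_users(df):
    """Return a Spark DataFrame with one (user_id, model) row per user."""
    # Spark shuffles each user's rows to one task and calls train_one_user once per user.
    return df.groupBy(KEY_COL).applyInPandas(
        train_one_user, schema=f"{KEY_COL} long, model binary"
    )
```

Calling `train_all_users(df)` on the full DataFrame would yield about 1000 rows, one pickled model per user_id, which could then be written out or collected. Note that this swaps Spark MLlib's distributed decision tree for a single-machine scikit-learn tree per user; that only makes sense while each user's data fits in the memory of one executor.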