spark-dev mailing list archives

From Saikat Kanjilal <>
Subject ML Repo using spark
Date Fri, 21 Apr 2017 17:17:53 GMT
I've been building out a large machine learning repository that uses Spark as the compute platform, running on Hadoop with YARN. I was wondering if folks have best-practice-oriented thoughts on unit testing and integration testing this application. I am using spark-submit with a configuration file to enable a dynamic workflow, so that we can build different ML repos for each of our models. The ML repos consist of Parquet files and eventually Hive tables. I want to be able to unit test this application using ScalaTest or some other recommended utility, and I also want to integration test it in our int environment. Specifically, we have dev and int environments, and eventually a prod environment, each consisting of Spark running on Hadoop using YARN.

The ideal workflow in my mind would be:
1) unit tests run upon every check-in in our dev environment
2) the application gets propagated to our int environment
3) integration tests run successfully in our int environment
4) the application gets propagated to our prod environment
5) the Hive table/Parquet file gets generated and consumed by Scala notebooks running on top of the Spark cluster
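For concreteness, the kind of unit test I have in mind for step 1 looks something like the sketch below, using ScalaTest with a local SparkSession so no YARN cluster is needed. The job object, column names, and aggregation are illustrative placeholders, not code from our actual repo:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.scalatest.BeforeAndAfterAll
import org.scalatest.funsuite.AnyFunSuite

// Hypothetical transformation under test: a stand-in for the kind of
// logic our ML repo builder applies before writing Parquet/Hive output.
object FeatureJob {
  def aggregateFeatures(df: DataFrame): DataFrame =
    df.groupBy("model_id").agg(avg("score").as("avg_score"))
}

class FeatureJobSuite extends AnyFunSuite with BeforeAndAfterAll {
  // local[*] master keeps the test self-contained on the build machine
  private lazy val spark: SparkSession = SparkSession.builder()
    .master("local[*]")
    .appName("FeatureJobSuite")
    .getOrCreate()

  override def afterAll(): Unit = spark.stop()

  test("aggregateFeatures averages scores per model") {
    import spark.implicits._
    val input = Seq(("m1", 2.0), ("m1", 4.0), ("m2", 1.0))
      .toDF("model_id", "score")

    val result = FeatureJob.aggregateFeatures(input)
      .collect()
      .map(r => (r.getString(0), r.getDouble(1)))
      .toMap

    assert(result == Map("m1" -> 3.0, "m2" -> 1.0))
  }
}
```

Running suites like this under sbt on every check-in would cover step 1, and the same style of suite pointed at real int data could back the integration tests in step 3.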

**Caveat: I wasn't sure whether this was more appropriate for the dev or the user mailing list, but since I only follow dev I sent it here.

Best Regards
