spark-dev mailing list archives

From: Xiangrui Meng <men...@gmail.com>
Subject: Re: [sample code] deeplearning4j for Spark ML (@DeveloperAPI)
Date: Thu, 18 Jun 2015 00:12:24 GMT
Hi Eron,

Please register your Spark Package on http://spark-packages.org, which
helps users find your work. Do you have some performance benchmark to
share? Thanks!

Best,
Xiangrui

On Wed, Jun 10, 2015 at 10:48 PM, Nick Pentreath
<nick.pentreath@gmail.com> wrote:
> Looks very interesting, thanks for sharing this.
>
> I haven't had much chance to do more than a quick glance over the code.
> Quick question - are the Word2Vec and GloVe implementations fully parallel
> on Spark?
>
> On Mon, Jun 8, 2015 at 6:20 PM, Eron Wright <ewright@live.com> wrote:
>>
>>
>> The deeplearning4j framework provides a variety of distributed, neural
>> network-based learning algorithms, including convolutional nets, deep
>> auto-encoders, deep-belief nets, and recurrent nets. We're working on
>> integration with the Spark ML pipeline, leveraging the developer API.
>> This announcement is to share some code and get feedback from the Spark
>> community.
>>
>> The integration code is located in the dl4j-spark-ml module in the
>> deeplearning4j repository.
>>
>> Major aspects of the integration work:
>>
>> - ML algorithms. To bind the dl4j algorithms to the ML pipeline, we
>> developed a new classifier and a new unsupervised learning estimator.
>> - ML attributes. We strove to interoperate well with other pipeline
>> components. ML attributes are column-level metadata that enable
>> information sharing between pipeline components; for example, the
>> classifier reads label metadata from a column produced by the new
>> StringIndexer (a short sketch follows after this list).
>> - Large binary data. It is challenging to work with large binary data
>> in Spark. An effective approach is to leverage PrunedScan and to
>> carefully control partition sizes. We explored this with a custom data
>> source based on the new relation API (sketched below).
>> - Column-based record readers. We explored how to construct rows from
>> a Hadoop input split by composing a number of column-level readers,
>> with pruning support.
>> - UDTs. With Spark SQL it is possible to introduce new data types. We
>> prototyped an experimental Tensor type (the UDT mechanism is sketched
>> below).
>> - Spark Package. We developed a Spark Package to make it easy to use
>> the dl4j framework in spark-shell and with spark-submit. See the
>> deeplearning4j/dl4j-spark-ml repository for useful snippets involving
>> the sbt-spark-package plugin (an illustrative build configuration
>> appears below).
>> - Example code. Examples demonstrate how the standardized ML API
>> simplifies interoperability, such as with label preprocessing and
>> feature scaling (see the pipeline sketch below). See the
>> deeplearning4j/dl4j-spark-ml-examples repository for an expanding set
>> of example pipelines.
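>>
>> To make the ML attributes point concrete, here is a minimal sketch
>> against the ml.attribute API as of Spark 1.4; the DataFrame df and its
>> "category" column are assumed for illustration:
>>
>>     import org.apache.spark.ml.attribute.{Attribute, NominalAttribute}
>>     import org.apache.spark.ml.feature.StringIndexer
>>
>>     // Index a string column; StringIndexer attaches nominal (categorical)
>>     // metadata to its output column.
>>     val indexed = new StringIndexer()
>>       .setInputCol("category")
>>       .setOutputCol("label")
>>       .fit(df)
>>       .transform(df)
>>
>>     // A downstream estimator can recover the number of classes from the
>>     // column metadata instead of rescanning the data.
>>     val numClasses = Attribute.fromStructField(indexed.schema("label")) match {
>>       case nominal: NominalAttribute => nominal.getNumValues.getOrElse(-1)
>>       case _ => -1
>>     }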
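>>
>> On the large-binary-data point, a bare-bones PrunedScan relation might
>> look like the following; the class name, columns, and path handling are
>> illustrative only, and the real data source in dl4j-spark-ml is more
>> involved:
>>
>>     import org.apache.spark.rdd.RDD
>>     import org.apache.spark.sql.{Row, SQLContext}
>>     import org.apache.spark.sql.sources.{BaseRelation, PrunedScan}
>>     import org.apache.spark.sql.types._
>>
>>     // Skeleton relation over large binary records. Only the requested
>>     // columns are materialized, so a query selecting just "label" never
>>     // deserializes the (potentially large) "data" column.
>>     class BinaryRecordRelation(val sqlContext: SQLContext, path: String)
>>         extends BaseRelation with PrunedScan {
>>
>>       override def schema: StructType = StructType(Seq(
>>         StructField("label", StringType, nullable = false),
>>         StructField("data", BinaryType, nullable = false)))
>>
>>       override def buildScan(requiredColumns: Array[String]): RDD[Row] = {
>>         // A real implementation would read from path, tune partition
>>         // sizes so only a handful of large records land in each partition,
>>         // and emit only the columns named in requiredColumns.
>>         sqlContext.sparkContext.emptyRDD[Row]
>>       }
>>     }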
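>>
>> The UDT mechanism itself is roughly as follows. This is only an
>> illustration of the SQL UserDefinedType extension point (a DeveloperApi
>> whose details vary across Spark versions); the Tensor class and the
>> serialized layout shown are not the prototype's exact definitions:
>>
>>     import org.apache.spark.sql.Row
>>     import org.apache.spark.sql.types._
>>
>>     // A dense tensor held as a shape plus a flat value array (illustrative).
>>     class Tensor(val shape: Array[Int], val values: Array[Double])
>>       extends Serializable
>>
>>     // Maps Tensor to and from a struct of two arrays so that tensors can
>>     // be stored in a DataFrame column.
>>     class TensorUDT extends UserDefinedType[Tensor] {
>>
>>       override def sqlType: DataType = StructType(Seq(
>>         StructField("shape", ArrayType(IntegerType, containsNull = false)),
>>         StructField("values", ArrayType(DoubleType, containsNull = false))))
>>
>>       override def serialize(obj: Any): Any = obj match {
>>         case t: Tensor => Row(t.shape.toSeq, t.values.toSeq)
>>       }
>>
>>       override def deserialize(datum: Any): Tensor = datum match {
>>         case row: Row =>
>>           new Tensor(row.getSeq[Int](0).toArray, row.getSeq[Double](1).toArray)
>>       }
>>
>>       override def userClass: Class[Tensor] = classOf[Tensor]
>>     }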
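>>
>> For the packaging piece, the sbt-spark-package plugin boils down to a
>> few build settings. The plugin version and the package coordinates
>> below are illustrative; see the dl4j-spark-ml repository for the actual
>> configuration:
>>
>>     // project/plugins.sbt
>>     resolvers += "Spark Packages repo" at "https://dl.bintray.com/spark-packages/maven/"
>>     addSbtPlugin("org.spark-packages" % "sbt-spark-package" % "0.2.3") // version illustrative
>>
>>     // build.sbt
>>     spName := "deeplearning4j/dl4j-spark-ml"  // org/name as registered on spark-packages.org
>>     sparkVersion := "1.4.0"                   // Spark version to build against
>>     sparkComponents ++= Seq("mllib", "sql")   // adds spark-mllib and spark-sql as provided deps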
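>>
>> Finally, on the examples: because the estimators follow the
>> standardized ML API, they slot into an ordinary pipeline next to stock
>> stages. A minimal sketch, with the dl4j classifier stage elided and the
>> trainingDF/testDF DataFrames assumed:
>>
>>     import org.apache.spark.ml.{Pipeline, PipelineStage}
>>     import org.apache.spark.ml.feature.{StandardScaler, StringIndexer}
>>
>>     // Label preprocessing and feature scaling are plain pipeline stages;
>>     // a dl4j classifier would simply be appended as one more stage.
>>     val indexer = new StringIndexer()
>>       .setInputCol("species")
>>       .setOutputCol("label")
>>     val scaler = new StandardScaler()
>>       .setInputCol("rawFeatures")
>>       .setOutputCol("features")
>>
>>     val stages: Array[PipelineStage] = Array(indexer, scaler /*, dl4j classifier */)
>>     val pipeline = new Pipeline().setStages(stages)
>>
>>     val model = pipeline.fit(trainingDF)      // trainingDF: assumed labeled DataFrame
>>     val predictions = model.transform(testDF) // testDF: assumed held-out DataFrame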
>>
>> Hope this proves useful to the community as we transition to exciting
>> new concepts in Spark SQL and Spark ML. Meanwhile, we have Spark
>> working with multiple GPUs on AWS, and we're looking forward to
>> optimizations that will speed neural net training even more.
>>
>> Eron Wright
>> Contributor | deeplearning4j.org
>>
>

