spark-user mailing list archives

From Jaonary Rabarisoa <>
Subject Re: Some questions after playing a little with the new ml.Pipeline.
Date Sun, 01 Mar 2015 09:23:29 GMT
Hi Joseph,

Thank you for the tips. I understand what I should do when my data is
represented as an RDD. What I can't figure out is how to do the
same thing when the data is viewed as a DataFrame and I need to add the
output of my pre-trained model as a new column in the DataFrame. Precisely,
I want to implement the following transformer:

class DeepCNNFeature extends Transformer ... {
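
A minimal sketch, in plain Scala so it runs without a cluster, of the per-partition logic such a transformer could wrap. `PretrainedNet`, `NetStats`, `predictBatch` and `PartitionFeaturizer` are hypothetical names introduced for illustration, and the dummy `predictBatch` only stands in for a real Caffe call; this is not the actual body of `DeepCNNFeature`.

```scala
// Counts how many networks were built, so the sketch can be checked locally.
object NetStats { var created = 0 }

// Hypothetical stand-in for a pre-trained network (e.g. a Caffe wrapper).
// The class name and the predictBatch signature are assumptions.
final class PretrainedNet {
  NetStats.created += 1

  // Dummy "features": (sum of inputs, number of inputs) for each vector.
  def predictBatch(batch: Seq[Array[Double]]): Seq[Array[Double]] =
    batch.map(v => Array(v.sum, v.length.toDouble))
}

object PartitionFeaturizer {
  // Same shape as the function RDD.mapPartitions takes:
  // Iterator[T] => Iterator[U].
  def featurize(batchSize: Int)(rows: Iterator[Array[Double]]): Iterator[Array[Double]] =
    if (!rows.hasNext) Iterator.empty // skip the costly setup on empty partitions
    else {
      val net = new PretrainedNet // instantiated once for the whole partition
      // Feed the network fixed-size batches instead of single rows.
      rows.grouped(batchSize).flatMap(batch => net.predictBatch(batch))
    }
}
```

The function has the `Iterator[T] => Iterator[U]` shape that `RDD.mapPartitions` expects, so under the assumptions above it could be passed as `myRDD.mapPartitions(PartitionFeaturizer.featurize(32))`. For a DataFrame, one possible route (again an assumption, not a confirmed recipe) is to apply it to `df.rdd` after extracting the input column from each row, then rebuild a DataFrame with the extra column.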


On Sun, Mar 1, 2015 at 1:32 AM, Joseph Bradley <> wrote:

> Hi Jao,
> You can use external tools and libraries if they can be called from your
> Spark program or script (with appropriate conversion of data types, etc.).
> The best way to apply a pre-trained model to a dataset would be to call the
> model from within a closure, e.g.:
> myRDD.map { myDatum => preTrainedModel.predict(myDatum) }
> If your data is distributed in an RDD (myRDD), then the above call will
> distribute the computation of prediction using the pre-trained model.  It
> will require that all of your Spark workers be able to run the
> preTrainedModel; that may mean installing Caffe and dependencies on all
> nodes in the compute cluster.
> For the second question, I would modify the above call as follows:
> myRDD.mapPartitions { myDataOnPartition =>
>   val myModel = // instantiate neural network on this partition
>   myDataOnPartition.map { myDatum => myModel.predict(myDatum) }
> }
> I hope this helps!
> Joseph
> On Fri, Feb 27, 2015 at 10:27 PM, Jaonary Rabarisoa <>
> wrote:
>> Dear all,
>> We mainly do large-scale computer vision tasks (image classification,
>> retrieval, ...). The pipeline API is really well suited to that. We're trying
>> to reproduce the tutorial given on that topic during the latest Spark
>> summit (
>> using the master version of the Spark Pipeline and DataFrame APIs. The tutorial
>> shows different examples of feature extraction stages before running
>> machine learning algorithms. Even though the tutorial is straightforward to
>> reproduce with this new API, we still have some questions:
>>    - Can one use external tools (e.g. via pipe) as a pipeline stage? An
>>    example use case is extracting features learned with a convolutional neural
>>    network. In our case, this corresponds to a neural network pre-trained with
>>    the Caffe library ( .
>>    - The second question is about the performance of the pipeline.
>>    Libraries such as Caffe process data in batches, and instantiating one Caffe
>>    network can be time-consuming when the network is very deep. So, we can
>>    gain performance if we minimize the number of Caffe networks created and
>>    feed data to the network in batches. In the pipeline, this corresponds to running
>>    transformers that work on a per-partition basis and give the whole partition to
>>    a single Caffe network. How can we create such a transformer?
>> Best,
>> Jao
