spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From manueslapera <manueslap...@hotmail.com>
Subject Spark Streaming: How to load a Pipeline on a Stream?
Date Sun, 02 Oct 2016 22:01:43 GMT
I am implementing a lambda architecture system for stream processing. I have
no issue creating a Pipeline with GridSearch in Spark batch:

    pipeline = Pipeline(stages=[data1_indexer, data2_indexer, ...,
assembler, logistic_regressor])
  
    paramGrid = ( 
        ParamGridBuilder()
        .addGrid(logistic_regressor.regParam, (0.01, 0.1))
        .addGrid(logistic_regressor.tol, (1e-5, 1e-6)) 
        ...etcetera
    ).build()

    cv = CrossValidator(
        estimator=pipeline, 
        estimatorParamMaps=paramGrid,
        evaluator=BinaryClassificationEvaluator(), 
        numFolds=4)

    pipeline_cv = cv.fit(raw_train_df) 
    model_fitted = pipeline_cv.getEstimator().fit(raw_validation_df)  
    model_fitted.write().overwrite().save("pipeline")

However, I cant seem to find how to plug the pipeline in the Spark Streaming
Process.

I am using kafka as the DStream source and my code as of now is as follows:

    import json 
    from pyspark.ml import PipelineModel
    from pyspark.streaming.kafka import KafkaUtils
    from pyspark.streaming import StreamingContext

    ssc = StreamingContext(sc, 1) 
    kafkaStream = KafkaUtils.createStream(
        ssc,
        "localhost:2181",
        "spark- streaming-consumer",
        {"kafka_topic": 1}
    )

    model = PipelineModel.load('pipeline/') parsed_stream =
kafkaStream.map(lambda x: json.loads(x[1]))

    CODE MISSING GOES HERE 

    ssc.start()
    ssc.awaitTermination()

and now I need to find some way of doing the actual prediction on the
StreamingContext.

 Based on the documentation found in the gitbooks Twitter Streaming Example
(https://databricks.gitbooks.io/databricks-spark-reference-applications/content/twitter_classifier/predict.html)
) it seems like the model needs to implement the method predict in order to
be able to use it on an rdd object (and hopefully on a kafkastream?) 

How could I use the pipeline on the Streaming context? The reloaded
PipelineModel only seems to implement transform Does that mean the only way
to use batch models in a Streaming context is to use pure models , and no
pipelines?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-How-to-load-a-Pipeline-on-a-Stream-tp27828.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Mime
View raw message