spark-user mailing list archives

From pun <>
Subject mllib - CountVectorizer + LogisticRegression unexpected behavior
Date Wed, 04 Oct 2017 18:29:14 GMT
Hello,

I have a model which uses CountVectorizer and LogisticRegression.
Everything seems to work fine, except that when I run the last step to get
results and predictions, the document ids (doc_id) are changed completely.
Do you know why that is? Am I doing something wrong?

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val countVectorizer = new CountVectorizer()
  //.setVocabSize(50000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

// Apply the fitted pipeline to the test documents.
val results = model.transform(test)

Training and test are two DFs with the following structure:

root
 |-- doc_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = false)

Thanks in advance!
