spark-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Brian Gawalt <>
Subject Re: Using String Dataset for Logistic Regression
Date Fri, 16 May 2014 22:30:56 GMT

Correct, the logistic regression engine is set up to perform classification
tasks that take feature vectors (arrays of real-valued numbers) that are
given a class label, and learning a linear combination of those features
that divide the classes. As the above commenters have mentioned, there's
lots of different ways to turn string data into feature vectors. 

For instance, if you're classifying documents between, say, spam or valid
email, you may want to start with a bag-of-words model
( ) or the rescaled variant
TF-IDF ( ). You'd turn a single
document into a single, high-dimensional, sparse vector whose element j
encodes the number of appearance term j. Maybe you want to try the
experiment by featurizing on bigrams, trigrams, etc...

Or if you're just trying to tell "english language tweets" from "non-english
language tweets", in which case the bag of words might be overkill: you
could instead try featurizing on just the counts of each pair of consecutive
characters. E.g., the first element counts "aa" appearances, then the second
"ab"...., then "zy" then "zz". Those will be smaller feature vectors,
capturing less information, but it's probably sufficient for the simpler
task, and you'll be able to fit the model with less data than trying to fit
a whole-word-based model.

Different applications are going to need more or less context from your
strings -- whole words? n-grams? just characters? treat them as ENUMs as in
the days of week example? -- so it might not make sense for Spark to come
with "a direct way" to turn a string attribute into a vector for use in
logistic regression. You'll have to settle on the featurization approach
that's right for your domain before you try training the logistic regression
classifier on your labelled feature vectors. 


View this message in context:
Sent from the Apache Spark User List mailing list archive at

View raw message