spark-user mailing list archives

From Mohammed Guller <moham...@glassbeam.com>
Subject RE: Feature Generation On Spark
Date Sat, 18 Jul 2015 18:16:12 GMT
Try this (replace ... with the appropriate values for your environment):

import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector

val sc = new SparkContext(...)
val documents = sc.wholeTextFiles(...)
val tokenized = documents.map { case (path, document) => (path, document.split("\\s+")) }
val numFeatures = 100000
val hashingTF = new HashingTF(numFeatures)
// hash each token array into a fixed-length term-frequency vector
val featurized = tokenized.map { case (path, words) => (path, hashingTF.transform(words)) }
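If you want TF-IDF weights rather than raw term counts (the usual next step for document classification), MLlib's IDF can be chained onto the output above. A minimal sketch, assuming the featurized RDD from the snippet above:

import org.apache.spark.mllib.feature.IDF

val tf = featurized.map { case (_, vector) => vector }
tf.cache()                          // IDF passes over the data once to fit, again to transform
val idfModel = new IDF().fit(tf)    // compute inverse document frequencies
val tfidf = idfModel.transform(tf)  // reweight term counts into TF-IDF vectors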


Mohammed

From: rishikesh thakur [mailto:rishikeshthakur@hotmail.com]
Sent: Friday, July 17, 2015 12:33 AM
To: Mohammed Guller
Subject: Re: Feature Generation On Spark


Thanks, I did look at the example. I am using Spark 1.2. I guess the modules mentioned
there are not in 1.2; the imports are failing.


Rishi

________________________________
From: Mohammed Guller <mohammed@glassbeam.com>
Sent: Friday, July 10, 2015 2:31 AM
To: rishikesh thakur; ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark


Take a look at the examples here:

https://spark.apache.org/docs/latest/ml-guide.html



Mohammed



From: rishikesh thakur [mailto:rishikeshthakur@hotmail.com]
Sent: Saturday, July 4, 2015 10:49 PM
To: ayan guha; Michal Čizmazia
Cc: user
Subject: RE: Feature Generation On Spark



I have one document per file and each file is to be converted to a feature vector. Pretty
much like standard feature construction for document classification.



Thanks

Rishi

________________________________

Date: Sun, 5 Jul 2015 01:44:04 +1000
Subject: Re: Feature Generation On Spark
From: guha.ayan@gmail.com
To: micizma@gmail.com
CC: rishikeshthakur@hotmail.com; user@spark.apache.org

Do you have one document per file, or multiple documents per file?

On 4 Jul 2015 23:38, "Michal Čizmazia" <micizma@gmail.com>
wrote:

SparkContext has a method wholeTextFiles. Is that what you need?
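For context: unlike sc.textFile, which yields one record per line, wholeTextFiles returns one (path, content) pair per file, so a document is never split across records. A minimal sketch, assuming a hypothetical input directory:

val documents = sc.wholeTextFiles("hdfs:///docs")   // hypothetical input path
documents.take(1).foreach { case (path, text) =>
  println(s"$path -> ${text.take(80)}")             // one record == one whole file
}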

On 4 July 2015 at 07:04, rishikesh <rishikeshthakur@hotmail.com>
wrote:
> Hi
>
> I am new to Spark and am working on document classification. Before model
> fitting I need to do feature generation. Each document is to be converted to
> a feature vector. However, I am not sure how to do that. While testing
> locally, I have a static list of tokens; when I parse a file, I do a lookup
> and increment counters.
>
> In the case of Spark, I can create an RDD which loads all the documents;
> however, I am not sure whether one file goes to one executor or is split
> across multiple. If a file is split, then the feature vectors need to be
> merged, but I am not able to figure out how to do that.
>
> Thanks
> Rishi
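Since wholeTextFiles keeps each file on a single record, the per-document vectors never need merging. If you prefer the fixed-token-list counting described in the question over hashing, a minimal sketch, assuming a hypothetical vocabulary and the documents RDD produced by wholeTextFiles:

import org.apache.spark.mllib.linalg.Vectors

val vocab = Seq("spark", "feature", "vector")        // hypothetical token list
val index = sc.broadcast(vocab.zipWithIndex.toMap)   // token -> column index

val vectors = documents.map { case (path, text) =>
  val counts = text.split("\\s+")
    .flatMap(w => index.value.get(w))                // drop tokens outside the list
    .groupBy(identity)
    .map { case (i, hits) => (i, hits.length.toDouble) }
  (path, Vectors.sparse(index.value.size, counts.toSeq))
}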

