spark-user mailing list archives

From Donni Khan <prince.don...@googlemail.com>
Subject Calculate co-occurring terms
Date Fri, 23 Mar 2018 07:57:08 GMT
Hi,

I have a collection of text documents, and I have extracted the list of
significant terms from that collection.
I want to calculate the co-occurrence matrix for the extracted terms using
Spark.

I have stored the collection of text documents in a DataFrame:

StructType schema = new StructType(new StructField[] {
    new StructField("ID", DataTypes.StringType, false, Metadata.empty()),
    new StructField("text", DataTypes.StringType, false, Metadata.empty()) });

// Create a DataFrame wrt a new schema
DataFrame preProcessedDF = sqlContext.createDataFrame(jrdd, schema);

I can extract the list of terms from "preProcessedDF" into a List, an RDD,
or a DataFrame.
For each pair (term_i, term_j), I want to calculate the related frequency from
the original dataset "preProcessedDF".
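
To make the target computation concrete, here is a minimal plain-Java sketch
(no Spark) of one possible reading of it. It assumes that two terms co-occur
when both appear in the same document, that each document is counted at most
once per pair, that terms are single lowercase tokens, and that text can be
tokenized by splitting on non-word characters. The class name
CooccurrenceSketch and the method count are hypothetical and exist only for
illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

public class CooccurrenceSketch {

    // For every unordered pair of terms, counts the documents that contain both.
    // Keys have the form "termA,termB" with termA < termB alphabetically.
    public static Map<String, Long> count(List<String> documents, Set<String> terms) {
        Map<String, Long> counts = new TreeMap<>();
        for (String text : documents) {
            // Distinct significant terms present in this document, in sorted order
            // (assumption: terms are lowercase single tokens).
            TreeSet<String> present = new TreeSet<>();
            for (String token : text.toLowerCase().split("\\W+")) {
                if (terms.contains(token)) {
                    present.add(token);
                }
            }
            // Emit each unordered pair once per document and sum the counts.
            List<String> sorted = new ArrayList<>(present);
            for (int i = 0; i < sorted.size(); i++) {
                for (int j = i + 1; j < sorted.size(); j++) {
                    counts.merge(sorted.get(i) + "," + sorted.get(j), 1L, Long::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample documents and terms, for illustration only.
        List<String> docs = Arrays.asList(
                "Spark builds an RDD lineage",
                "An RDD in Spark is immutable",
                "A DataFrame wraps an RDD");
        Set<String> terms = new HashSet<>(Arrays.asList("spark", "rdd", "dataframe"));
        count(docs, terms).forEach((pair, n) -> System.out.println(pair + " -> " + n));
    }
}
```

The loop has the shape of a pair emission followed by a sum per key, so on an
RDD of document texts the same logic could plausibly be expressed with
flatMapToPair and reduceByKey; whether that scales depends on how many
significant terms appear per document, since the number of emitted pairs
grows quadratically with that count.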

Does anyone have a scalable solution?

Thank you,
Donni
