spark-user mailing list archives

From Georg Heiler <georg.kf.hei...@gmail.com>
Subject Re: run huge number of queries in Spark
Date Wed, 04 Apr 2018 09:14:57 GMT
See https://gist.github.com/geoHeil/e07922229860262ceebf830859716bbf in
particular:

You will probably want to use Spark's imperative (non-SQL) API:
.rdd
.reduceByKey { (count1, count2) => count1 + count2 }
.map { case ((word, path), n) => (word, (path, n)) }
.toDF
i.e. this builds an inverted index, which then lets you easily calculate
TF/IDF.
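
For a self-contained sketch of the same idea aimed directly at your
document-frequency problem (all names here, including "docs", are
illustrative assumptions, not code from the gist):

// a minimal sketch: assumes docs is an RDD[(String, String)] of
// (path, text) pairs
val counts = docs
  .flatMap { case (path, text) =>
    text.toLowerCase.split("\\s+").map(word => ((word, path), 1))
  }
  .reduceByKey(_ + _)                                  // term count per (word, path)
  .map { case ((word, path), n) => (word, (path, n)) } // the inverted index

// document frequency: number of documents containing each word,
// computed for all terms in one distributed job instead of one
// driver-side query per term
val documentFrequency = counts.mapValues(_ => 1L).reduceByKey(_ + _)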
But Spark also comes with
https://spark.apache.org/docs/latest/mllib-feature-extraction.html#tf-idf, which
might help you achieve the desired result more easily.
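
A rough sketch of that route, following the example on the linked page
(the RDD-based spark.mllib API; "docs.txt" is a placeholder path):

import org.apache.spark.mllib.feature.{HashingTF, IDF}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// load each document as a sequence of terms
val documents: RDD[Seq[String]] = sc.textFile("docs.txt").map(_.split(" ").toSeq)

val hashingTF = new HashingTF()
val tf: RDD[Vector] = hashingTF.transform(documents) // term frequencies
tf.cache()                                           // IDF makes two passes over the data
val tfidf: RDD[Vector] = new IDF().fit(tf).transform(tf)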

Donni Khan <prince.donnii@googlemail.com> wrote on Wed., 4 Apr 2018 at
10:56:

> Hi all,
>
> I want to run a huge number of queries on a DataFrame in Spark. I have a
> large collection of text documents; I loaded all the documents into a
> Spark DataFrame and created a temp table.
>
> dataFrame.registerTempTable("table1");
>
> I have more than 50,000 terms, and I want to get the document frequency
> for each one by querying "table1".
>
> I use the following:
>
> DataFrame df=sqlContext.sql("select count(ID) from table1 where text like
> '%"+term+"%'");
>
> but this approach takes a long time to finish because I have to run it
> from the Spark driver once for each term.
>
>
> Does anyone have an idea how I can run all the queries in a distributed
> way?
>
> Thank you && Best Regards,
>
> Donni
