lucene-solr-user mailing list archives

From "G, Rajesh" ...@cebglobal.com>
Subject RE: Facet ignoring repeated word
Date Fri, 06 May 2016 05:44:18 GMT
Hi,

Can you please help? If Solr has a built-in solution, that will be easiest; otherwise I will
have to write a Python script that processes the results from TermVectorComponent and groups
them by word across documents to get the word counts. The script would take the exported
Solr result as input.

Thanks
Rajesh



CEB India Private Limited. Registration No: U741040HR2004PTC035324. Registered office: 6th
Floor, Tower B, DLF Building No.10 DLF Cyber City, Gurgaon, Haryana-122002, India.

This e-mail and/or its attachments are intended only for the use of the addressee(s) and may
contain confidential and legally privileged information belonging to CEB and/or its subsidiaries,
including SHL. If you have received this e-mail in error, please notify the sender and immediately,
destroy all copies of this email and its attachments. The publication, copying, in whole or
in part, or use or dissemination in any other way of this e-mail and attachments by anyone
other than the intended person(s) is prohibited.

-----Original Message-----
From: G, Rajesh [mailto:rg@cebglobal.com]
Sent: Thursday, May 5, 2016 4:29 PM
To: Ahmet Arslan <iorixxx@yahoo.com>; solr-user@lucene.apache.org; erickerickson@gmail.com
Subject: RE: Facet ignoring repeated word

Hi,

TermVectorComponent works. I am able to find the repeated words within the same document, which
faceting was not able to do. The problem is that TermVectorComponent reports results per document
(see the two entries below), so I have to combine the counts myself, e.g. the count of the word
"my" across these documents is 4 + 2 = 6. Can you please suggest a way to group the counts by
word across documents? Basically, we want to build a word cloud from the Solr results.

<lst name="1675">
<str name="uniqueKey">1675</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">4</int>
                </lst>
        </lst>
</lst>

<lst name="1781">
<str name="uniqueKey">1781</str>
        <lst name="comments">
                <lst name="my">
                        <int name="tf">2</int>
                </lst>
        </lst>
</lst>

http://localhost:8182/solr/dev/tvrh?q=*:*&tv=true&tv.fl=comments&tv.tf=true&fl=comments&rows=1000
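
A minimal sketch of the client-side grouping described above, assuming the tvrh response is
requested as XML and has the nested <lst> shape shown in the two entries. The sample string,
the function name total_term_counts and the extra term "work" are hypothetical; only the "my"
counts (4 and 2) come from this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample shaped like the tvrh fragments quoted in this message.
SAMPLE = """
<response>
  <lst name="termVectors">
    <lst name="1675">
      <str name="uniqueKey">1675</str>
      <lst name="comments">
        <lst name="my"><int name="tf">4</int></lst>
      </lst>
    </lst>
    <lst name="1781">
      <str name="uniqueKey">1781</str>
      <lst name="comments">
        <lst name="my"><int name="tf">2</int></lst>
        <!-- made-up extra term, only to show more than one word -->
        <lst name="work"><int name="tf">1</int></lst>
      </lst>
    </lst>
  </lst>
</response>
"""


def total_term_counts(xml_text, field="comments"):
    """Sum the per-document tf values of every term in `field`."""
    root = ET.fromstring(xml_text)
    totals = Counter()
    for field_node in root.iter("lst"):
        if field_node.get("name") != field:
            continue
        # Each direct <lst> child of the field node is one term.
        for term_node in field_node.findall("lst"):
            tf = term_node.find("int[@name='tf']")
            if tf is not None:
                totals[term_node.get("name")] += int(tf.text)
    return totals


counts = total_term_counts(SAMPLE)
print(counts.most_common())  # word-cloud weights, largest first
```

The resulting Counter maps each word to its total across all returned documents, which is the
weight a word cloud needs. With rows=1000 the response can get large, so this is only a
starting point.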


Hi Erick,
I need the counts of repeated words to build the word cloud.

Thanks
Rajesh




-----Original Message-----
From: Ahmet Arslan [mailto:iorixxx@yahoo.com]
Sent: Tuesday, May 3, 2016 6:19 AM
To: solr-user@lucene.apache.org; G, Rajesh <rg@cebglobal.com>
Subject: Re: Facet ignoring repeated word

Hi,

TermsComponent does not respect the query parameter. However, you can feed a function query
(e.g., termfreq) to StatsComponent.

Instead consider using TermVectors or MLT's interesting terms.


https://cwiki.apache.org/confluence/display/solr/The+Term+Vector+Component
https://cwiki.apache.org/confluence/display/solr/MoreLikeThis
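
For the function-query route, a minimal sketch that only builds the request URL, assuming the
host, core and field names used elsewhere in this thread and the {!func} local-params form of
stats.field. The helper name termfreq_stats_url is hypothetical. As far as I know, the "sum"
value in the stats section of the response is then the total number of occurrences of the term
in the documents matching q.

```python
from urllib.parse import urlencode


def termfreq_stats_url(base, field, term, query):
    """Build a /select URL that runs StatsComponent over termfreq(field, term)."""
    params = {
        "q": query,      # restricts the stats to the matching documents
        "rows": 0,       # only the stats section is of interest
        "stats": "true",
        "stats.field": "{!func}termfreq(%s,'%s')" % (field, term),
        "wt": "json",
    }
    return base + "/select?" + urlencode(params)


url = termfreq_stats_url(
    "http://localhost:8182/solr/dev", "comments", "my", "questionid:3956"
)
print(url)
```

This needs one request per word, so it suits checking a handful of terms rather than building
a whole cloud.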

Ahmet


On Monday, May 2, 2016 9:31 AM, "G, Rajesh" <rg@cebglobal.com> wrote:
Hi Erick/ Ahmet,

Thanks for your suggestions. Can TermsComponent be restricted by a query? I need the word counts
of the comments for a single question id, not for all of them. When I include the query
q=questionid=123 I still see the counts for everything:

http://localhost:8182/solr/dev/terms?terms.fl=comments&terms=true&terms.limit=1000&q=questionid=123

StatsComponent does not support text fields:

Field type textcloud_en{class=org.apache.solr.schema.TextField,analyzer=org.apache.solr.analysis.TokenizerChain,args={positionIncrementGap=100,
class=solr.TextField}} is not currently supported

  <fieldType name="textcloud_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

Thanks
Rajesh





-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com]
Sent: Friday, April 29, 2016 9:16 PM
To: solr-user <solr-user@lucene.apache.org>; Ahmet Arslan <iorixxx@yahoo.com>
Subject: Re: Facet ignoring repeated word

That's the way faceting is designed to work. It counts the _documents_ matching your query in
which a term appears; if a word appears multiple times in a doc, it is still counted only
once.
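
To make the distinction concrete, here is a small sketch in plain Python (not Solr code) that
contrasts the two kinds of count. The two sample documents and the simplistic tokenizer are
made up for illustration.

```python
import re
from collections import Counter

# Two made-up documents; "my" occurs twice in the first and once in the second.
docs = [
    "my work and my reward",
    "my peers work harder",
]


def tokens(text):
    """Very rough stand-in for an analyzer: lowercase, keep letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


doc_freq = Counter()   # what a facet count reports: documents containing the term
term_freq = Counter()  # what a word cloud usually wants: total occurrences

for doc in docs:
    toks = tokens(doc)
    term_freq.update(toks)
    doc_freq.update(set(toks))  # set() collapses repeats within one document

print(doc_freq["my"], term_freq["my"])  # prints: 2 3
```

The facet-style count can never exceed the number of matching documents, while the occurrence
count can.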

For the general use-case it'd be unsettling for a user to see a facet count of 500, then click
on it and discover that the number of docs in the corpus was really 345 or something.

Ahmet's hints might help, but I'd ask whether counting words multiple times really satisfies
the use case.

Best,
Erick

On Fri, Apr 29, 2016 at 7:10 AM, Ahmet Arslan <iorixxx@yahoo.com.invalid> wrote:
> Hi,
>
> Depending on your requirements, StatsComponent, TermsComponent, or LukeRequestHandler can
> also be used.
>
>
> https://cwiki.apache.org/confluence/display/solr/The+Terms+Component
> https://wiki.apache.org/solr/LukeRequestHandler
> https://cwiki.apache.org/confluence/display/solr/The+Stats+Component
> Ahmet
>
>
>
> On Friday, April 29, 2016 11:56 AM, "G, Rajesh" <rg@cebglobal.com> wrote:
> Hi,
>
> I am trying to implement word cloud<https://www.google.co.uk/imgres?imgurl=https%3A%2F%2Fwww.whitehouse.gov%2Fsites%2Fdefault%2Ffiles%2Fother%2Fsotu_wordle.png&imgrefurl=https%3A%2F%2Fwww.whitehouse.gov%2Fblog%2F2011%2F01%2F26%2Fstate-union-word-cloud-jobs-america-people-new&docid=eZ_HvQpd9FRBKM&tbnid=qyIc-elv6z-0iM%3A&w=895&h=406&bih=643&biw=1366&ved=0ahUKEwie_8XjurPMAhXLaRQKHWiFDFAQMwgyKAAwAA&iact=mrc&uact=8>
> using Solr. The problem I have is that a Solr facet query ignores repeated words in a
> document, e.g.:
>
> I have indexed the text:
> It seems that the harder I work, the more work I get for the same compensation and reward.
> The more work I take on gets absorbed into my "normal" workload and I'm not recognized for
> working harder than my peers, which makes me not want to work to my potential. I am very
> underwhelmed by the evaluation process and bonus structure. I don't believe the current
> structure rewards strong performers. I am confident that the company could not hire someone
> with my talent to replace me if I left, but I don't think the company realizes that.
>
> The indexed content contains the word "my" and its count is 3, but when I run the query
> http://localhost:8182/solr/dev/select?facet=true&facet.field=comments&rows=0&indent=on&q=questionid:3956&wt=json
> the count of the word "my" is 1 and not 3. Can you please help?
>
> Also, please suggest whether there is a better way to implement a word cloud in Solr than
> using facets.
>
>     "facet_fields":{
>       "comments":[
>         "absorbed",1,
>         "am",1,
>         "believe",1,
>         "bonus",1,
>         "company",1,
>         "compensation",1,
>         "confident",1,
>         "could",1,
>         "current",1,
>         "don't",1,
>         "evaluation",1,
>         "get",1,
>         "gets",1,
>         "harder",1,
>         "hire",1,
>         "i",1,
>         "i'm",1,
>         "left",1,
>         "makes",1,
>         "me",1,
>         "more",1,
>         "my",1,
>         "normal",1,
>         "peers",1,
>         "performers",1,
>         "potential",1,
>         "process",1,
>         "realizes",1,
>         "recognized",1,
>         "replace",1,
>         "reward",1,
>         "rewards",1,
>         "same",1,
>         "seems",1,
>         "someone",1,
>         "strong",1,
>         "structure",1,
>         "take",1,
>         "talent",1,
>         "than",1,
>         "think",1,
>         "underwhelmed",1,
>         "very",1,
>         "want",1,
>         "which",1,
>         "work",1,
>         "working",1,
>         "workload",1]
>     }