lucene-solr-user mailing list archives

From Andy Pickler <andy.pick...@gmail.com>
Subject Re: Top 10 Terms in Index (by date)
Date Tue, 02 Apr 2013 00:58:43 GMT
I need "total number of occurrences" across all documents for each term.
Imagine this...

Post #1: "I think, therefore I am like you"
Reply #1: "You think too much"
Reply #2: "I think that I think much as you"

Each of those "documents" is put into 'content'.  Pretending I don't have
stop words, the top term query (not considering dateCreated in this
example) would result in something like...

"think": 4
"I": 4
"you": 3
"much": 2
...

Thus, just a "number of documents" approach doesn't work, because if a word
occurs more than one time in a document it needs to be counted that many
times.  That seemed to rule out faceting like you mentioned, as well as the
TermsComponent (which, as I understand it, also only counts "documents").
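
To make the difference concrete, here is a minimal sketch in plain Python (not Solr) that computes both numbers for the three example posts above. The lowercasing regex tokenizer is an assumption standing in for whatever analysis 'text_general' actually does, and the variable names are made up for illustration.

```python
import re
from collections import Counter

# The three example "documents" from above.
docs = [
    "I think, therefore I am like you",
    "You think too much",
    "I think that I think much as you",
]

def tokenize(text):
    # Assumption: lowercase and split on non-word characters, no stop words.
    return re.findall(r"\w+", text.lower())

total_occurrences = Counter()   # what the report needs
document_frequency = Counter()  # what a "number of documents" count gives

for doc in docs:
    tokens = tokenize(doc)
    total_occurrences.update(tokens)        # every occurrence counts
    document_frequency.update(set(tokens))  # at most once per document

for term in ("think", "i", "you", "much"):
    print(term, total_occurrences[term], document_frequency[term])
```

For "think" the two counts differ (4 occurrences, but only 3 documents), which is exactly the gap I'm trying to close.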

Thanks,
Andy Pickler

On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:

> So you have one document per user comment? Why not use faceting plus
> filtering on the "dateCreated" field? That would count "number of
> documents" for each term (so, in your case, if a term is used twice in one
> comment it would only count once). Is that what you are looking for?
>
> Tomás
>
>
> On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler <andy.pickler@gmail.com>
> wrote:
>
> > Our company has an application that is "Facebook-like" for usage by
> > enterprise customers.  We'd like to do a report of "top 10 terms entered
> > by users over (some time period)".  With that in mind I'm using the
> > DataImportHandler to put all the relevant data from our database into a
> > Solr 'content' field:
> >
> > <field name="content" type="text_general" indexed="true" stored="false"
> > multiValued="false" required="true" termVectors="true"/>
> >
> > Along with the content is the 'dateCreated' for that content:
> >
> > <field name="dateCreated" type="tdate" indexed="true" stored="false"
> > multiValued="false" required="true"/>
> >
> > I'm struggling with the TermVectorComponent documentation to understand
> > how I can put together a query that answers the 'report' mentioned above.
> > For each document I need each term counted however many times it is entered
> > (content of "I think what I think" would report 'think' as used twice).
> > Does anyone have any insight as to whether I'm headed in the right
> > direction and then what my query would be?
> >
> > Thanks,
> > Andy Pickler
> >
>
