lucene-solr-user mailing list archives

From Andy Pickler <andy.pick...@gmail.com>
Subject Re: Top 10 Terms in Index (by date)
Date Tue, 02 Apr 2013 14:12:22 GMT
A key problem with those approaches as well as Lucene's HighFreqTerms class
(http://lucene.apache.org/core/4_2_0/misc/org/apache/lucene/misc/HighFreqTerms.html)
is that none of them seem to have the ability to combine with a date range
query...which is key in my scenario.  I'm kinda thinking that what I'm
asking to do just isn't supported by Lucene or Solr, and that I'll have to
pursue another avenue.  If anyone has any other suggestions, I'm all ears.
I'm starting to wonder if I need to have some nightly batch job that
executes against my database and builds up "that day's top terms" in a
table or something.
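
For what it's worth, that batch-job idea could be sketched roughly like this
in Python. This is only an illustration under assumptions: the rows are taken
to be (dateCreated, content) pairs already pulled from the database, the
tokenizer is a crude lowercase regex rather than Solr's text_general analysis,
and top_terms is a hypothetical helper name, not anything Lucene or Solr
provides.

```python
import re
from collections import Counter
from datetime import datetime


def top_terms(docs, start, end, n=10):
    """Return the n most frequent terms among docs created in [start, end).

    docs is an iterable of (created, text) pairs. Every occurrence of a term
    is counted, so a word used twice in one document adds 2 to its total.
    """
    counts = Counter()
    for created, text in docs:
        if start <= created < end:
            # Assumption: simple lowercase word split, no stop-word removal.
            counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return counts.most_common(n)


# Sample data: the three posts from the example quoted below, plus one
# older document that falls outside the date range and must be ignored.
posts = [
    (datetime(2013, 4, 1, 9, 0), "I think, therefore I am like you"),
    (datetime(2013, 4, 1, 9, 5), "You think too much"),
    (datetime(2013, 4, 1, 9, 9), "I think that I think much as you"),
    (datetime(2013, 3, 1, 12, 0), "think think think"),
]

result = top_terms(posts, datetime(2013, 4, 1), datetime(2013, 4, 2), n=4)
for term, total in result:
    print(term, total)
```

The date filter and the per-occurrence count happen in the same pass, which is
the combination I can't seem to get out of the existing components.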

Thanks,
Andy Pickler

On Tue, Apr 2, 2013 at 7:16 AM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:

> Oh, I see, essentially you want to get the sum of the term frequencies for
> every term in a subset of documents (instead of the document frequency as
> the FacetComponent would give you). I don't know of an easy/out of the box
> solution for this. I know the TermVectorComponent will give you the tf for
> every term in a document, but I'm not sure if you can filter or sort on it.
> Maybe you can do something like:
> https://issues.apache.org/jira/browse/LUCENE-2393
> or what's suggested here:
> http://search-lucene.com/m/of5Fn1PUOHU/
> but I have never used something like that.
>
> Tomás
>
>
>
> On Mon, Apr 1, 2013 at 9:58 PM, Andy Pickler <andy.pickler@gmail.com>
> wrote:
>
> > I need "total number of occurrences" across all documents for each term.
> > Imagine this...
> >
> > Post #1: "I think, therefore I am like you"
> > Reply #1: "You think too much"
> > Reply #2: "I think that I think much as you"
> >
> > Each of those "documents" is put into 'content'.  Pretending I don't have
> > stop words, the top term query (not considering dateCreated in this
> > example) would result in something like...
> >
> > "think": 4
> > "I": 4
> > "you": 3
> > "much": 2
> > ...
> >
> > Thus, just a "number of documents" approach doesn't work, because if a word
> > occurs more than one time in a document it needs to be counted that many
> > times.  That seemed to rule out faceting like you mentioned as well as the
> > TermsComponent (which as I understand also only counts "documents").
> >
> > Thanks,
> > Andy Pickler
> >
> > On Mon, Apr 1, 2013 at 4:31 PM, Tomás Fernández Löbbe <tomasflobbe@gmail.com> wrote:
> >
> > > So you have one document per user comment? Why not use faceting plus
> > > filtering on the "dateCreated" field? That would count "number of
> > > documents" for each term (so, in your case, if a term is used twice in one
> > > comment it would only count once). Is that what you are looking for?
> > >
> > > Tomás
> > >
> > >
> > > On Mon, Apr 1, 2013 at 6:32 PM, Andy Pickler <andy.pickler@gmail.com>
> > > wrote:
> > >
> > > > Our company has an application that is "Facebook-like" for usage by
> > > > enterprise customers.  We'd like to do a report of "top 10 terms entered by
> > > > users over (some time period)".  With that in mind I'm using the
> > > > DataImportHandler to put all the relevant data from our database into a
> > > > Solr 'content' field:
> > > >
> > > > <field name="content" type="text_general" indexed="true" stored="false"
> > > > multiValued="false" required="true" termVectors="true"/>
> > > >
> > > > Along with the content is the 'dateCreated' for that content:
> > > >
> > > > <field name="dateCreated" type="tdate" indexed="true" stored="false"
> > > > multiValued="false" required="true"/>
> > > >
> > > > I'm struggling with the TermVectorComponent documentation to understand how
> > > > I can put together a query that answers the 'report' mentioned above.  For
> > > > each document I need each term counted however many times it is entered
> > > > (content of "I think what I think" would report 'think' as used twice).
> > > > Does anyone have any insight as to whether I'm headed in the right
> > > > direction and then what my query would be?
> > > >
> > > > Thanks,
> > > > Andy Pickler
> > > >
> > >
> >
>
