lucene-solr-user mailing list archives

From Wael Kader <w...@softech-lb.com>
Subject Re: Faceting Word Count
Date Wed, 08 Nov 2017 14:58:11 GMT
Hi,

I want to know the best option for getting a word cloud in Solr.
Is it saving the data as multivalued, using vector, or JSON faceting (which
didn't work for me)? Terms doesn't work because I can't provide any criteria.

I don't mind changing the design, but I need to know the most feasible
approach that won't cause problems in the long run.
I want to be able to get the word frequency based on criteria. Facets are
currently taking around 1 minute to return data.

Regards,
Wael
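
For reference, a word count restricted by criteria, as asked about above, could be expressed through Solr's JSON Request API roughly as follows. This is only a sketch under assumptions: the field names `user`, `created_at` and `words` are hypothetical, and it assumes `words` is a multivalued string field with docValues, as suggested further down the thread.

```python
import json


def build_word_cloud_request(user, start, end, field="words", limit=50):
    """Build a JSON Request API body: filter by user and date, facet on `field`.

    `user`, `created_at` and the default `words` field are hypothetical names.
    """
    return {
        "query": "*:*",
        # Filters narrow the 100M documents down to the ones of interest
        # before any facet counting happens.
        "filter": [
            f'user:"{user}"',
            f"created_at:[{start} TO {end}]",
        ],
        "limit": 0,  # return no documents, only the facet counts
        "facet": {
            "word_cloud": {
                "type": "terms",  # terms facet, sorted by count by default
                "field": field,
                "limit": limit,
            }
        },
    }


body = build_word_cloud_request(
    "some_user", "2017-11-01T00:00:00Z", "2017-11-08T00:00:00Z"
)
print(json.dumps(body, indent=2))
```

The resulting body would be sent as a POST to a search handler of the collection (for example `/query`, if it is configured). Whether this is fast enough on a given index still depends on how the faceted field is indexed.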

On Wed, Nov 8, 2017 at 11:06 AM, Emir Arnautović <
emir.arnautovic@sematext.com> wrote:

> Hi Wael,
> You can try out JSON faceting - it’s not just about the request/response
> format; it uses a different implementation as well. In any case you will
> have to index documents differently in order to be able to use docValues.
>
> HTH
> Emir
> --
> Monitoring - Log Management - Alerting - Anomaly Detection
> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
> > On 7 Nov 2017, at 09:26, Wael Kader <wael@softech-lb.com> wrote:
> >
> > Hi,
> >
> > The whole index has 100M documents, but when I add the criteria, it
> > filters the data down to at most about 10k rows.
> > The facet isn't working now that the index holds 100M records, but it
> > was working at 5M.
> >
> > I have social media & RSS data in the index and I am trying to get the
> > word count for a specific user over specific date intervals.
> >
> > Regards,
> > Wael
> >
> > On Mon, Nov 6, 2017 at 3:42 PM, Erick Erickson <erickerickson@gmail.com>
> > wrote:
> >
> >> _Why_ do you want to get the word counts? Faceting on all of the
> >> tokens for 100M docs isn't something Solr is ordinarily used for. As
> >> Emir says it'll take a huge amount of memory. You can use one of the
> >> function queries (termfreq IIRC) that will give you the count of any
> >> individual term you have and will be very fast.
> >>
> >> But getting all of the word counts in the index is probably not
> >> something I'd use Solr for.
> >>
> >> This may be an XY problem, you're asking how to do something specific
> >> (X) without explaining what the problem you're trying to solve is (Y).
> >> Perhaps there's another way to accomplish (Y) if we knew more about
> >> what it is.
> >>
> >> Best,
> >> Erick
> >>
> >>
> >>
> >> On Mon, Nov 6, 2017 at 4:15 AM, Emir Arnautović
> >> <emir.arnautovic@sematext.com> wrote:
> >>> Hi Wael,
> >>> You are faceting on an analyzed field. This results in the field being
> >>> uninverted - fieldValueCache being built - on the first call after every
> >>> commit. This is both time and memory consuming (you can check in the
> >>> admin console stats how much memory it took).
> >>> What you need to do is create a multivalued string field (not text),
> >>> parse the values (do the analysis steps) on the client side, and store
> >>> them like that. This will allow you to enable docValues on that field
> >>> and avoid building fieldValueCache.
> >>>
> >>> HTH,
> >>> Emir
> >>> --
> >>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>>
> >>>
> >>>
> >>>> On 6 Nov 2017, at 13:06, Wael Kader <wael@softech-lb.com> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I am using a custom field. Below is the field definition.
> >>>> I am using this because I don't want stemming.
> >>>>
> >>>>
> >>>>   <fieldType name="text_no_stem2" class="solr.TextField"
> >>>> positionIncrementGap="100">
> >>>>     <analyzer type="index">
> >>>>       <charFilter class="solr.MappingCharFilterFactory"
> >>>> mapping="mapping-ISOLatin1Accent.txt"/>
> >>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>
> >>>>       <filter class="solr.StopFilterFactory"
> >>>>               ignoreCase="true"
> >>>>               words="stopwords.txt"
> >>>>               enablePositionIncrements="true"
> >>>>               />
> >>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>               protected="protwords.txt"
> >>>>               generateWordParts="0"
> >>>>               generateNumberParts="1"
> >>>>               catenateWords="1"
> >>>>               catenateNumbers="1"
> >>>>               catenateAll="0"
> >>>>               splitOnCaseChange="1"
> >>>>               preserveOriginal="1"/>
> >>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>
> >>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>     </analyzer>
> >>>>     <analyzer type="query">
> >>>>       <charFilter class="solr.MappingCharFilterFactory"
> >>>> mapping="mapping-ISOLatin1Accent.txt"/>
> >>>>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
> >>>>       <filter class="solr.SynonymFilterFactory"
> >>>>               synonyms="synonyms.txt"
> >>>>               ignoreCase="true" expand="true"/>
> >>>>       <filter class="solr.StopFilterFactory"
> >>>>               ignoreCase="true"
> >>>>               words="stopwords.txt"
> >>>>               enablePositionIncrements="true"
> >>>>               />
> >>>> <!--ORIGINAL                generateNumberParts="1"-->
> >>>>       <filter class="solr.WordDelimiterFilterFactory"
> >>>>               protected="protwords.txt"
> >>>>               generateWordParts="0"
> >>>>               catenateWords="0"
> >>>>               catenateNumbers="0"
> >>>>               catenateAll="0"
> >>>>               splitOnCaseChange="1"
> >>>>               preserveOriginal="1"/>
> >>>>       <filter class="solr.LowerCaseFilterFactory"/>
> >>>>       <!-- ORIGINAL filter class="solr.SnowballPorterFilterFactory"
> >>>>            language="English" protected="protwords.txt"/-->
> >>>>       <!-- Webel: switch off Porter-stemmer algorithm to enforce whole
> >>>>            word match -->
> >>>>       <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >>>>     </analyzer>
> >>>>   </fieldType>
> >>>>
> >>>>
> >>>> Regards,
> >>>> Wael
> >>>>
> >>>> On Mon, Nov 6, 2017 at 10:29 AM, Emir Arnautović <
> >>>> emir.arnautovic@sematext.com> wrote:
> >>>>
> >>>>> Hi Wael,
> >>>>> Can you provide your field definition and a sample query?
> >>>>>
> >>>>> Thanks,
> >>>>> Emir
> >>>>> --
> >>>>> Monitoring - Log Management - Alerting - Anomaly Detection
> >>>>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On 6 Nov 2017, at 08:30, Wael Kader <wael@softech-lb.com> wrote:
> >>>>>>
> >>>>>> Hello,
> >>>>>>
> >>>>>> I have an index with around 100 million documents.
> >>>>>> I have a multivalued field in which I store big chunks of text. The
> >>>>>> server has around 20 GB of RAM and 4 CPUs.
> >>>>>>
> >>>>>> I was faceting on that field to get a word cloud, and it took around
> >>>>>> 1 second to return when the index held 5-10 million documents.
> >>>>>> Now I have more data and it takes minutes to get the results (that is,
> >>>>>> if it returns at all and Solr doesn't crash). What's the best way to
> >>>>>> make it run, or is it simply not scalable on my current schema and
> >>>>>> design with news articles?
> >>>>>>
> >>>>>> I am looking for the best solution for this. Maybe I could create
> >>>>>> another index to split the data while inserting it, or maybe it would
> >>>>>> perform better if I changed some settings in SolrConfig or added
> >>>>>> some RAM.
> >>>>>>
> >>>>>> --
> >>>>>> Regards,
> >>>>>> Wael
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Regards,
> >>>> Wael
> >>>
> >>
> >
> >
> >
> > --
> > Regards,
> > Wael
>
>


-- 
Regards,
Wael
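
The suggestion quoted above (do the analysis steps on the client side and store the result in a multivalued string field with docValues) could be sketched like this. It is a rough approximation, not a port of the `text_no_stem2` chain: the stopword list is a hypothetical stand-in for `stopwords.txt`, accent folding uses Unicode decomposition rather than `mapping-ISOLatin1Accent.txt`, the punctuation stripping is an added assumption, and the WordDelimiterFilter behaviour is not reproduced at all.

```python
import unicodedata

# Hypothetical stopword list; a real setup would load the same stopwords.txt
# that the Solr analyzer uses.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def fold_accents(text):
    """Strip combining marks so that e.g. 'café' becomes 'cafe'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def to_facet_terms(text, stopwords=STOPWORDS):
    """Split on whitespace, lowercase, drop stopwords and repeated terms.

    Repeats within one document are dropped because a terms facet counts
    matching documents per term, not occurrences inside a document.
    """
    terms = []
    seen = set()
    for token in fold_accents(text).split():
        # Assumption: trim surrounding punctuation, which the original
        # whitespace tokenizer would not do by itself.
        token = token.lower().strip(".,;:!?\"'()[]")
        if not token or token in stopwords or token in seen:
            continue
        seen.add(token)
        terms.append(token)
    return terms


# 'words' is a hypothetical multivalued string field with docValues enabled.
doc = {
    "id": "example-1",
    "words": to_facet_terms("The café in Beirut is the BEST café!"),
}
print(doc["words"])
```

A document built this way would be indexed alongside the original text field, so searching still uses the analyzed text while faceting uses the pre-tokenized `words` values.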
