lucene-solr-user mailing list archives

From Yavar Husain <yavarhus...@gmail.com>
Subject Solr Clustering component different results than Carrot workbench
Date Mon, 18 Aug 2014 09:30:29 GMT
Although I am already discussing this with Dawid (the creator of Carrot2) on
the Carrot2 mailing list, I wanted to post my problem to a wider audience.

I am using Solr 4.7 (on both Windows and Linux) with a lingo-attributes.xml
file that I saved from the Carrot2 Workbench. Note that for testing I have
just one Solr index, and all queries are run against it.

The clusters I get are good in the Carrot2 Workbench but very poor in Solr.
In the Jetty logs I can see:

    Loaded Solr resource: clustering/carrot2/lingo-attributes.xml

which indicates that my attributes file is being loaded.

I am really confused about what accounts for the difference between the two
outputs (Workbench vs. Solr). To reiterate, the data source is the same in
both cases (a single Solr index, the same queries, 100 results each). This
happens on both Linux and Windows.

Given below is my search component and request handler configuration:

<searchComponent name="clustering"
                   enable="${solr.clustering.enabled:true}"
                   class="solr.clustering.ClusteringComponent" >
    <lst name="engine">
      <str name="name">lingo</str>

      <!-- Class name of a clustering algorithm compatible with the Carrot2 framework.

           Currently available open source algorithms are:
           * org.carrot2.clustering.lingo.LingoClusteringAlgorithm
           * org.carrot2.clustering.stc.STCClusteringAlgorithm
           * org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm

           See http://project.carrot2.org/algorithms.html for more information.

           A commercial algorithm Lingo3G (needs to be installed separately) is defined as:
           * com.carrotsearch.lingo3g.Lingo3GClusteringAlgorithm
        -->
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <str name="LingoClusteringAlgorithm.desiredClusterCountBase">30</str>


      <!-- Override location of the clustering algorithm's resources
           (attribute definitions and lexical resources).

           A directory from which to load algorithm-specific stop words,
           stop labels and attribute definition XMLs.

           For an overview of Carrot2 lexical resources, see:

http://download.carrot2.org/head/manual/#chapter.lexical-resources

           For an overview of Lingo3G lexical resources, see:

http://download.carrotsearch.com/lingo3g/manual/#chapter.lexical-resources
       -->
      <str name="carrot.resourcesDir">clustering/carrot2</str>
    </lst>


  </searchComponent>

  <!-- A request handler for demonstrating the clustering component

       This is purely as an example.

       In reality you will likely want to add the component to your
       already specified request handlers.
    -->
  <requestHandler name="/clustering"
                  enable="${solr.clustering.enabled:true}"
                  class="solr.SearchHandler">
    <lst name="defaults">
      <bool name="clustering">true</bool>
      <bool name="clustering.results">true</bool>
      <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
      <str name="carrot.resourcesDir">clustering/carrot2</str>
      <!-- Field name with the logical "title" of each document (optional) -->
      <str name="carrot.title">film_id</str>
      <!-- Field name with the logical "content" of each document (optional) -->
      <str name="carrot.snippet">description</str>
      <!-- Apply highlighter to the title/content and use this for clustering. -->
      <bool name="carrot.produceSummary">true</bool>
      <!-- the maximum number of labels per cluster -->
      <!--<int name="carrot.numDescriptions">5</int>-->
      <!-- produce sub clusters -->
      <bool name="carrot.outputSubClusters">false</bool>
      <str name="rows">100</str>
    </lst>
    <arr name="last-components">
      <str>clustering</str>
    </arr>
  </requestHandler>
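
For anyone trying to reproduce the Solr side of this comparison outside the browser, here is a minimal sketch of how the /clustering handler above could be queried and its cluster labels listed next to the Workbench output. The base URL and core name (`http://localhost:8983/solr/collection1`) are placeholders, not values from my setup, and the helper names are hypothetical. The sketch also assumes the JSON response carries a top-level "clusters" list whose entries have "labels" and "docs" keys.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: placeholder base URL and core name, not taken from the post.
SOLR_BASE = "http://localhost:8983/solr/collection1"


def build_clustering_url(query, rows=100, base=SOLR_BASE):
    """Build a request URL for the /clustering handler defined above."""
    params = {"q": query, "rows": rows, "wt": "json"}
    return base + "/clustering?" + urlencode(params)


def cluster_labels(response):
    """Return (labels, document count) pairs from a parsed response.

    Assumes a top-level "clusters" list whose entries carry "labels"
    and "docs"; missing keys are treated as empty.
    """
    return [
        (cluster.get("labels", []), len(cluster.get("docs", [])))
        for cluster in response.get("clusters", [])
    ]


def fetch_clusters(query, rows=100, base=SOLR_BASE):
    """Query Solr and return the cluster summary for one query."""
    with urlopen(build_clustering_url(query, rows, base)) as resp:
        return cluster_labels(json.load(resp))
```

Running `fetch_clusters` with the same query string and `rows=100` that were used in the Workbench gives a list of labels and cluster sizes that can be compared side by side with the Workbench clusters.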
