lucene-solr-user mailing list archives

From David Smith <dsmiths...@yahoo.com.INVALID>
Subject Trouble getting "langid.map.individual" setting to work in Solr 5.0.x
Date Mon, 03 Aug 2015 14:56:21 GMT
I am trying to use the “langid.map.individual” setting so that field “a” is detected as,
say, English and mapped to “a_en”, while in the same document field “b” is detected
as, say, German and mapped to “b_de”.

What happens in my tests is that the global language is detected (for example, German), and
BOTH fields are mapped to “_de” as a result.  I cannot get individual detection or mapping
to work.  Am I misunderstanding the purpose of this setting?

Here is the resulting document from my test:

----------------
      {
        "id": "1005!22345",
        "language": [
          "de"
        ],
        "a_de": "A title that should be detected as English with high confidence",
        "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
        "_version_": 1508494723734569000
      }
----------------

I expected “a_de” to be “a_en”, and the “language” multi-valued field to have
“en” and “de”.

Here is my configuration in solrconfig.xml:

--------------------
    <updateRequestProcessorChain name="langid" default="true">
        <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
            <lst name="defaults">
                <str name="langid">true</str>
                <str name="langid.fl">a,b</str>
                <str name="langid.map">true</str>
                <str name="langid.map.individual">true</str>
                <str name="langid.langField">language</str>
                <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
                <str name="langid.fallback">en</str>
            </lst>
        </processor>
        <processor class="solr.LogUpdateProcessorFactory" />
        <processor class="solr.RunUpdateProcessorFactory" />
    </updateRequestProcessorChain>
--------------------


The debug output of lang detect, during indexing, is as follows:

-------------------
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Detected main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor;
Appending field a
DEBUG - 2015-08-03 14:37:54.451; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor;
Appending field b
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Mapping field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Doing mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; loaded class
org.apache.solr.common.SolrInputField from WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Removing old field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor;
Appending field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor;
Appending field b
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Language detected de with certainty 0.9999980402022373
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Mapping field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Doing mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor;
Removing old field b
-------------

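From this, my takeaway is that every time the LangDetectLanguageIdentifierUpdateProcessor
is asked to detect the language, it uses fields a AND b together: “Appending field a” and
“Appending field b” are both logged before each individual mapping.  But I can’t be sure
from this output alone.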
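
To make that suspicion concrete, here is a toy Python sketch. It is not Solr or langdetect code: the stopword-counting detect() function and both mapping functions are hypothetical, written only to show why detecting on the concatenation of a and b would give “_de” for both fields, while detecting each field on its own would give the result I expected.

```python
# Toy illustration only: NOT Solr's or langdetect's implementation.
# detect() is a hypothetical detector that counts known stopwords.

STOPWORDS = {
    "en": {"a", "that", "should", "be", "as", "with", "the", "of"},
    "de": {"die", "der", "und", "in", "von", "für", "einer", "ist", "hat", "mit"},
}


def detect(text, fallback="en"):
    """Return the language whose stopwords occur most often in text."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else fallback


def map_fields_concatenated(doc, fields):
    """Suspected behaviour: every detection call sees all fields at once."""
    combined = " ".join(doc[f] for f in fields)
    return {f"{f}_{detect(combined)}": doc[f] for f in fields}


def map_fields_individually(doc, fields):
    """Expected behaviour: each field is detected on its own content."""
    return {f"{f}_{detect(doc[f])}": doc[f] for f in fields}


doc = {
    "a": "A title that should be detected as English with high confidence",
    "b": "Die Einführung einer anlasslosen Speicherung von Passagierdaten "
         "für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt "
         "ist näher gerückt.",
}

print(sorted(map_fields_concatenated(doc, ["a", "b"])))   # ['a_de', 'b_de']
print(sorted(map_fields_individually(doc, ["a", "b"])))   # ['a_en', 'b_de']
```

In this toy setup the longer German text outweighs the short English title whenever the two are concatenated, which matches the document I got back.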

Any insight appreciated.

Regards,

David


