lucene-dev mailing list archives

From Tomás Fernández Löbbe <tomasflo...@gmail.com>
Subject Re: Language Detection Individual Field Mapping Bug
Date Fri, 27 Jan 2017 17:27:09 GMT
Thanks Will,
This does look like a bug and I also couldn't find a Jira issue for it.
Feel free to create one.

Tomás

On Mon, Jan 23, 2017 at 10:37 PM, Will Martin <williammartinthird@gmail.com>
wrote:

> Hello,
>
> While using Solr 6.0.4 I noticed that
> org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor
> does not respect the "langid.map.individual" parameter in solrconfig.xml.
> The documentation for langid.map.individual
> <https://wiki.apache.org/solr/LanguageDetection#langid.map.individual>
> specifies:
>
>> If you require detecting languages separately for each field, supply
>> langid.map.individual=true. The supplied fields will then be renamed
>> according to detected language on an individual field basis.
>>
>
> However, when this parameter is set to "true" the fields are still mapped
> to the language code of the entire document. For example, with the
> following snippet from solrconfig.xml
>
> <processor class="org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessorFactory">
>    <lst name="defaults">
>      <str name="langid.fl">title,text</str>
>      <str name="langid.langField">language_s</str>
>      <bool name="langid.map">true</bool>
>      <bool name="langid.map.individual">true</bool>
>    </lst>
> </processor>
>
> a document that takes the form
>
> {
>   "title": "This is an English title",
>   "text": "Pero el texto de este documento está en español."
> }
>
> will be turned into
>
> {
>   "title_es": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es"]
> }
>
> rather than
>
> {
>   "title_en": "This is an English title",
>   "text_es": "Pero el texto de este documento está en español.",
>   "language_s": ["es","en"]
> }
>
> during processing.
>
> This bug seems to have been introduced in SOLR-3881
> <https://issues.apache.org/jira/browse/SOLR-3881> when the abstract
> method (LangDetectLanguageIdentifierUpdateProcessor.java:52)
>
> protected List<DetectedLanguage> detectLanguage(String content)
>
> was changed to the signature
>
> protected List<DetectedLanguage> detectLanguage(SolrInputDocument doc)
>
> which does not allow one to recognize individual fields while performing
> language detection. As it stands, the entire document is analysed per
> individual field (included in the "langid.fl" or "langid.map.individual.fl"
> parameters) and the field is mapped to the language of the entire document.
>
> I searched the Apache Jira for a ticket tracking this bug but did not find
> anything that seemed related. I thought before filing a new ticket I would
> ping this mailing list to see if anyone knows about work relating to this
> issue or if there is already a ticket for it (not directly related to the
> term "langid.map.individual" perhaps). If not I can go ahead and file the
> ticket.
>
>
> Thanks,
>
> -William Martin
>
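
For anyone picking this up, the difference between the observed and the documented behaviour in the quoted report can be illustrated with a minimal, self-contained Java sketch. This is not Solr code: the class LangIdMappingSketch, its methods, and the stop-word "detector" are hypothetical stand-ins invented for illustration, and the only assumption carried over from the report is that detection currently runs once over all langid.fl fields combined.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy illustration only: none of these names are Solr APIs. */
class LangIdMappingSketch {

    // Hypothetical stand-in for a real detector: a few stop words per language.
    private static final Set<String> ENGLISH = Set.of("this", "is", "an", "the", "title");
    private static final Set<String> SPANISH = Set.of("pero", "el", "de", "este", "en");

    /** Returns "es" if more Spanish than English stop words are found, else "en". */
    static String detect(String content) {
        int en = 0;
        int es = 0;
        for (String token : content.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (ENGLISH.contains(token)) en++;
            if (SPANISH.contains(token)) es++;
        }
        return es > en ? "es" : "en";
    }

    /** Behaviour described in the report: one detection over all fields, applied to each. */
    static Map<String, String> mapByDocument(Map<String, String> doc) {
        String lang = detect(String.join(" ", doc.values()));
        Map<String, String> out = new LinkedHashMap<>();
        doc.forEach((name, value) -> out.put(name + "_" + lang, value));
        return out;
    }

    /** Behaviour the documentation describes for langid.map.individual=true. */
    static Map<String, String> mapIndividually(Map<String, String> doc) {
        Map<String, String> out = new LinkedHashMap<>();
        doc.forEach((name, value) -> out.put(name + "_" + detect(value), value));
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("title", "This is an English title");
        doc.put("text", "Pero el texto de este documento esta en espanol.");
        System.out.println(mapByDocument(doc).keySet());   // [title_es, text_es]
        System.out.println(mapIndividually(doc).keySet()); // [title_en, text_es]
    }
}
```

The point of the sketch is only that per-field mapping needs the detector to see one field's content at a time, which a detectLanguage(SolrInputDocument doc) signature does not obviously provide.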
