lucene-dev mailing list archives

From Jan Høydahl (JIRA) <>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Tue, 07 Dec 2010 16:57:11 GMT


Jan Høydahl commented on SOLR-1979:

>> I have a plan to add profiles for the Norwegian and Sami languages when time allows: TIKA-491 TIKA-492
> Did you plan to also upgrade Tika from 639-1 for the Sami languages? The only 639-1 code I see is "se", but this seems to be appropriate only for North Sami.

Exactly. That is one example that needs a wider range of codes. I was planning to use
ISO 639-2 for languages that have no two-letter code, but it will now be BCP 47 (although
the end result may be more or less the same).

We also need to detect whether a language belongs to a macro language and, if so, add both
codes to a multivalued languages field. It should be possible to filter on Norwegian (no)
without specifying both nn and nb, and on Sami (smi) without listing each specific Sami language.
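
To make the macro language idea concrete, here is a minimal sketch in Java. It is not code from the patch: the class name, the method name, and the hard-coded mapping table are assumptions made for illustration. The nb/nn to no and se to smi relationships come from the discussion above; the sma and smj entries are added as further examples of specific Sami codes.

```java
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of SOLR-1979: expands a detected language
// code into the values that a multivalued language field could hold.
final class MacroLanguageExpander {

    // Illustrative mapping from a specific language code to its parent code.
    private static final Map<String, String> PARENT = new HashMap<>();
    static {
        PARENT.put("nb", "no");   // Norwegian Bokmål  -> Norwegian
        PARENT.put("nn", "no");   // Norwegian Nynorsk -> Norwegian
        PARENT.put("se", "smi");  // Northern Sami     -> Sami languages
        PARENT.put("sma", "smi"); // Southern Sami     -> Sami languages
        PARENT.put("smj", "smi"); // Lule Sami         -> Sami languages
    }

    // Returns the detected code first, followed by its parent code if one is known.
    static Set<String> expand(String detected) {
        Set<String> values = new LinkedHashSet<>();
        values.add(detected);
        String parent = PARENT.get(detected);
        if (parent != null) {
            values.add(parent);
        }
        return values;
    }

    public static void main(String[] args) {
        System.out.println(expand("nb")); // prints [nb, no]
        System.out.println(expand("se")); // prints [se, smi]
        System.out.println(expand("en")); // prints [en]
    }
}
```

With both values indexed, a filter on no would match documents detected as either nb or nn, and a filter on smi would match any of the listed Sami languages.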

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
> We need the ability to detect the language of arbitrary text in order to act upon it, such as indexing the content into language-aware fields. Another use case is filtering or faceting on language for unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform language identification, and output the ISO code for the detected language in the outputField. If no language is detected, the fallback language is used.
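
The flow described in the issue (read the input fields, identify the language, write the output field, fall back if nothing is detected) can be sketched roughly as follows. This is a simplified illustration, not the attached patch: it uses a plain Map in place of a Solr document, and the detector is a toy stand-in function rather than the Tika LanguageIdentifier. All class and method names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the tagging flow; names are illustrative only.
final class LanguageTaggingSketch {

    // Concatenates the input fields, runs the detector, and stores the result
    // (or the fallback code) in the output field of a copy of the document.
    static Map<String, String> tag(Map<String, String> doc,
                                   String[] inputFields,
                                   String outputField,
                                   String fallback,
                                   Function<String, Optional<String>> detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String language = detector.apply(text.toString().trim()).orElse(fallback);
        Map<String, String> tagged = new HashMap<>(doc);
        tagged.put(outputField, language);
        return tagged;
    }

    // Toy stand-in for a real language identifier: reports Norwegian when the
    // text contains the word "og", and otherwise reports no detection.
    static Optional<String> toyDetector(String text) {
        return text.contains(" og ") ? Optional.of("no") : Optional.<String>empty();
    }

    public static void main(String[] args) {
        Map<String, String> doc = new HashMap<>();
        doc.put("id", "1");
        doc.put("name", "Kaffe og kake");
        doc.put("subject", "Frokost");

        Map<String, String> tagged = tag(doc, new String[] {"name", "subject"},
                "language_s", "en", LanguageTaggingSketch::toyDetector);
        System.out.println(tagged.get("language_s")); // prints no
    }
}
```

The parameter names mirror the inputFields, outputField, and fallback settings in the configuration above; the idField setting is not modeled in this sketch.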
