lucene-dev mailing list archives

From "Robert Muir (JIRA)" <>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Sun, 05 Dec 2010 21:05:13 GMT


Robert Muir commented on SOLR-1979:

bq. Yeah, that makes sense, however, I believe Tika returns 639.

Right, but 639 is just a subset of 3066 etc. 

So ignore what Tika does: its 639 identifiers are also valid 3066 tags.

Our API should be at least 3066. Java 7 and ICU already support BCP 47 locale identifiers, so
you get the normalization there for free.
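
To illustrate the "normalization for free" point, here is a minimal sketch using the java.util.Locale API that Java 7 added. The class and method names (LanguageTagNormalizer, normalize) are made up for this example and are not part of any patch; returning "und" for ill-formed input is also an assumption.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Hypothetical helper, not part of SOLR-1979.
final class LanguageTagNormalizer {

    // Assumption: ill-formed tags are reported as "und" (undetermined).
    static final String UNDETERMINED = "und";

    /** Returns the tag in canonical BCP 47 casing, or "und" if it is not well-formed. */
    static String normalize(String tag) {
        try {
            // Locale.Builder rejects ill-formed tags instead of silently dropping parts.
            Locale locale = new Locale.Builder().setLanguageTag(tag).build();
            return locale.toLanguageTag();
        } catch (IllformedLocaleException e) {
            return UNDETERMINED;
        }
    }
}
```

With this, a plain 639-1 code such as "EN" and a longer tag such as "zh-hant-tw" go through the same call and come back consistently cased.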

It would probably also be nice to be able to map a number of languages to a single field.
Say you have a single analyzer that can handle CJK; then you may want that whole collection
of languages mapped to a single _cjk field.

And just because you can detect a language doesn't mean you know how to handle it differently,
so there should also be an optional catch-all that handles all languages not specifically mapped.

Both of these are good reasons why we must avoid 639-1.
We should be able to use things like macrolanguages and undetermined language.
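
The many-to-one mapping plus catch-all could look roughly like the sketch below. Everything in it is hypothetical (the LanguageFieldMapper class, its methods, and the "_general" suffix in the usage note); it only assumes that detected languages arrive as BCP 47 tags and that mapping is keyed on the primary language subtag.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch, not part of SOLR-1979.
final class LanguageFieldMapper {

    private final Map<String, String> suffixByLanguage = new HashMap<String, String>();
    private final String catchAllSuffix;

    /** catchAllSuffix is used for every language that has no explicit mapping. */
    LanguageFieldMapper(String catchAllSuffix) {
        this.catchAllSuffix = catchAllSuffix;
    }

    /** Maps several primary language subtags to one field suffix, e.g. "_cjk". */
    LanguageFieldMapper map(String suffix, String... languages) {
        for (String language : languages) {
            suffixByLanguage.put(language.toLowerCase(Locale.ROOT), suffix);
        }
        return this;
    }

    /** Returns the target field name for a base field and a detected BCP 47 tag. */
    String fieldFor(String baseField, String languageTag) {
        // Only the primary language subtag decides the mapping here.
        String primary = Locale.forLanguageTag(languageTag).getLanguage();
        String suffix = suffixByLanguage.get(primary);
        return baseField + (suffix != null ? suffix : catchAllSuffix);
    }
}
```

For example, new LanguageFieldMapper("_general").map("_cjk", "zh", "ja", "ko") would send "zh-Hant-TW" to text_cjk, while an unmapped language such as "de" and the undetermined tag "und" would both land in text_general.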

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
> We need the ability to detect the language of some random text in order to act upon it, such
> as indexing the content into language-aware fields. Another use case is to be able to filter/facet
> on language on random unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor
> is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform language identification,
> and output the ISO code for the detected language in the outputField. If no language was detected,
> the fallback language is used.
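
The flow described above (read the input fields, identify, write the output field, fall back when nothing is detected) can be sketched as follows. This is not the code from the attached patch: the Detector interface stands in for whatever wraps the Tika LanguageIdentifier, and the document is modelled as a plain Map rather than a Solr input document.

```java
import java.util.Map;

// Hypothetical sketch of the described flow, not the SOLR-1979 patch.
final class LanguageIdSketch {

    /** Stand-in for the real detector; returns a language code, or null when unsure. */
    interface Detector {
        String detect(String text);
    }

    /** Concatenates the input fields, detects the language, and stores it in outputField. */
    static void process(Map<String, Object> doc, String[] inputFields,
                        String outputField, String fallback, Detector detector) {
        StringBuilder text = new StringBuilder();
        for (String field : inputFields) {
            Object value = doc.get(field);
            if (value != null) {
                text.append(value).append(' ');
            }
        }
        String language = detector.detect(text.toString().trim());
        // Use the configured fallback when the detector gives no answer.
        doc.put(outputField, language != null ? language : fallback);
    }
}
```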

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
