lucene-dev mailing list archives

From "Grant Ingersoll (JIRA)" <>
Subject [jira] Updated: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Sun, 05 Dec 2010 16:16:13 GMT


Grant Ingersoll updated SOLR-1979:

    Attachment: SOLR-1979.patch

I took Jan's and Tommaso's patches and reworked them a bit.  It seems to me that there isn't
much point in merely identifying the language if you aren't going to do something with the
result.  So this patch builds on what Jan and Tommaso did and then remaps the input fields to
new per-language fields (note: we could make this optional).  I also tried to standardize
the input parameters a bit: I dropped the outputField setting and a number of other settings,
and I made language detection per input field.  The gist is that if you pass in two fields,
name and subject, the processor detects the language of each field and then attempts to map
each one to a new field.  The new field name is the original field name concatenated with
"_" and the ISO 639 code.  For example, if en is the detected language, the new field for
name would be name_en.  If that field doesn't exist, the processor falls back to the original
field (i.e. name).
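
To make the remapping rule concrete, here is a minimal sketch in plain Java.  It is not code
from the patch: the class LangFieldMapper, its methods, the Set standing in for the schema,
and the Function standing in for the language detector are all hypothetical names chosen for
illustration.  It only shows the "field + _ + ISO code, else fall back" logic described above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Illustrative sketch only; every name here is hypothetical and not taken from the patch. */
class LangFieldMapper {

    /** Returns field + "_" + lang if the schema has that field, otherwise the original field. */
    static String mapFieldName(String field, String lang, Set<String> schemaFields) {
        if (lang == null || lang.isEmpty()) {
            return field; // nothing detected: keep the original field
        }
        String candidate = field + "_" + lang;
        return schemaFields.contains(candidate) ? candidate : field;
    }

    /** Detects the language of each input field separately and moves its value to the mapped field. */
    static Map<String, String> remap(Map<String, String> doc,
                                     Iterable<String> inputFields,
                                     Function<String, String> detector,
                                     Set<String> schemaFields) {
        Map<String, String> out = new LinkedHashMap<>(doc);
        for (String field : inputFields) {
            String value = doc.get(field);
            if (value == null) {
                continue; // the document has no value for this input field
            }
            String target = mapFieldName(field, detector.apply(value), schemaFields);
            if (!target.equals(field)) {
                out.remove(field);
                out.put(target, value);
            }
        }
        return out;
    }
}
```

With a schema containing name, subject and name_en, and a detector that returns en for both
fields, name would move to name_en while subject would stay where it is, because subject_en
does not exist.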

Left to do:
# Fix the tests.  I don't like how we currently test UpdateProcessorChains.  It should not
require writing your own little piece of update mechanism.  You should be able to simply set up
the appropriate configuration, hook it into an update handler, and then hit that update handler.
# Need to check the license headers, builds, etc.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>                 Key: SOLR-1979
>                 URL:
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
> We need the ability to detect the language of arbitrary text in order to act upon it, such
> as indexing the content into language-aware fields. Another use case is to be able to
> filter/facet on language for arbitrary unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor
> is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields name and subject, perform language
> identification, and output the ISO code for the detected language in the outputField. If no
> language is detected, the fallback language is used.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
