lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Grant Ingersoll (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Sun, 05 Dec 2010 23:24:13 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12967046#action_12967046
] 

Grant Ingersoll commented on SOLR-1979:
---------------------------------------

bq. @Grant: I actually planned to do the regEx based field name mapping in a separate UpdateProcessor,
to make things more flexible

I don't really see that it makes it any more flexible.  If it was a general purpose mapper,
maybe, but since it is tied to the language field, why not just put in the language processor?
 I've already got the method that choose the output field as a protected.  With that, one
merely would need to extend it to provide an alternate method from what you have proposed.

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan H√łydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect language of some random text in order to act upon it, such
as indexing the content into language aware fields. Another usecase is to be able to filter/facet
on language on random unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor
is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from inputFields name and subject, perform language identification
and output the ISO code for the detected language in the outputField. If no language was detected,
fallback language is used.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org


Mime
View raw message