lucene-dev mailing list archives

From "Yonik Seeley (JIRA)" <j...@apache.org>
Subject [jira] Commented: (SOLR-1979) Create LanguageIdentifierUpdateProcessor
Date Mon, 06 Dec 2010 22:15:11 GMT

    [ https://issues.apache.org/jira/browse/SOLR-1979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12968445#action_12968445 ]

Yonik Seeley commented on SOLR-1979:
------------------------------------

bq. In skimming the current patch, it looks like fields get mapped no matter what. What if
I just want the language detected and added as another field, but no field mapping desired?

Yeah, that's sort of in line with my:
bq. And just because you can detect a language doesn't mean you know how to handle it differently...
so also have an optional catchall that handles all languages not specifically mapped.

So for all unmapped languages, you may want to map to a single generic field, or not map at
all (leave the field as is).
I guess it also depends on the general strategy... if we are detecting language on the "body"
field, are we using a copyField-style approach (keeping "body" as the stored field and also
indexing it as "body_enText"), or are we moving the field from "body" to "body_enText"?
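
To make the options concrete, here is a rough sketch in plain Java. It is not code from the
attached patch: the class LangFieldMapper, the catch-all suffix, and the keepOriginal flag
are all made up for illustration, and a document is modelled as a simple Map. Only the
"body_enText" naming pattern comes from the discussion above.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper for illustration; not part of the SOLR-1979 patch. */
final class LangFieldMapper {
    private final Set<String> mappedLanguages; // languages that get a dedicated field
    private final Optional<String> catchAll;   // suffix for unmapped languages; empty = leave field as is

    LangFieldMapper(Set<String> mappedLanguages, Optional<String> catchAll) {
        this.mappedLanguages = mappedLanguages;
        this.catchAll = catchAll;
    }

    /** Returns the target field name, or the original name when no mapping applies. */
    String targetField(String field, String lang) {
        if (mappedLanguages.contains(lang)) {
            return field + "_" + lang + "Text";  // e.g. body -> body_enText
        }
        return catchAll.map(suffix -> field + "_" + suffix).orElse(field);
    }

    /** Maps one field on a copy of the document; keepOriginal=true copies, false moves. */
    Map<String, Object> apply(Map<String, Object> doc, String field, String lang,
                              boolean keepOriginal) {
        Map<String, Object> out = new LinkedHashMap<>(doc);
        String target = targetField(field, lang);
        if (!target.equals(field) && out.containsKey(field)) {
            Object value = keepOriginal ? out.get(field) : out.remove(field);
            out.put(target, value);
        }
        return out;
    }
}
```

With an empty catch-all, an unmapped language leaves the document untouched, which covers
the "detect only, no field mapping" case raised in the quoted question.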

bq. Also, if there are multiple input fields, the current patch would create multiple language
field values requiring that field to be multi-valued. Is the goal here to identify a single
language for a document?

I could see both making sense.
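
The two readings could look roughly like this. This is a sketch only: the detector is passed
in as a plain function so that nothing here depends on a particular language-identification
library, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Hypothetical illustration of the two granularities; not from the patch. */
final class LanguageGranularity {

    /** One language for the whole document: detect on the concatenated field values. */
    static String perDocument(List<String> fieldValues, Function<String, String> detector) {
        return detector.apply(String.join(" ", fieldValues));
    }

    /**
     * One language per input field: the result has one entry per value,
     * so the output field would need to be multi-valued.
     */
    static List<String> perField(List<String> fieldValues, Function<String, String> detector) {
        List<String> languages = new ArrayList<>();
        for (String value : fieldValues) {
            languages.add(detector.apply(value));
        }
        return languages;
    }
}
```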

> Create LanguageIdentifierUpdateProcessor
> ----------------------------------------
>
>                 Key: SOLR-1979
>                 URL: https://issues.apache.org/jira/browse/SOLR-1979
>             Project: Solr
>          Issue Type: New Feature
>          Components: update
>            Reporter: Jan Høydahl
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch, SOLR-1979.patch
>
>
> We need the ability to detect the language of arbitrary text in order to act upon it, such
> as indexing the content into language-aware fields. Another use case is to be able to
> filter/facet on language for unstructured content.
> To do this, we wrap the Tika LanguageIdentifier in an UpdateProcessor. The processor
> is configurable like this:
> {code:xml} 
>   <processor class="org.apache.solr.update.processor.LanguageIdentifierUpdateProcessorFactory">
>     <str name="inputFields">name,subject</str>
>     <str name="outputField">language_s</str>
>     <str name="idField">id</str>
>     <str name="fallback">en</str>
>   </processor>
> {code} 
> It will then read the text from the inputFields (here name and subject), perform language
> identification, and output the ISO code for the detected language in the outputField. If no
> language is detected, the fallback language is used.
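
The flow described above can be sketched in plain Java as follows. This is an illustration
under stated assumptions, not the processor from the patch: the document is a plain Map, the
class LanguageTagger is hypothetical, and the detector is an injected function that returns
an empty Optional when it cannot decide (standing in for the wrapped Tika identifier).

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical sketch of the described flow; not the actual update processor. */
final class LanguageTagger {

    /** Returns a copy of doc with outputField set to the detected (or fallback) language. */
    static Map<String, Object> tag(Map<String, Object> doc,
                                   List<String> inputFields,
                                   String outputField,
                                   String fallback,
                                   Function<String, Optional<String>> detector) {
        // Concatenate the values of all configured input fields that are present.
        String text = inputFields.stream()
                .map(doc::get)
                .filter(Objects::nonNull)
                .map(Object::toString)
                .collect(Collectors.joining(" "));

        // Use the fallback language when the detector cannot decide.
        String language = detector.apply(text).orElse(fallback);

        Map<String, Object> out = new LinkedHashMap<>(doc);
        out.put(outputField, language);
        return out;
    }
}
```

With the configuration shown above, the call would pass inputFields = [name, subject],
outputField = "language_s", and fallback = "en".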

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

