lucene-dev mailing list archives

From Erick Erickson <erickerick...@gmail.com>
Subject Re: [jira] [Commented] (SOLR-11869) Remote streaming UpdateRequestProcessor
Date Thu, 18 Jan 2018 15:57:58 GMT
Dirk:

Just skimmed your first post. At a bit higher level, if you're running
Tika on the Solr server, that usually doesn't scale well for two
reasons:
1> it puts a lot of CPU-intensive work on the Solr box
2> Tika sometimes hits OOMs, loops and the like. It has to deal with a
_ton_ of wonky implementations of ill-defined specs.

I'm not quite sure if this is germane to your question, but if it is and
you can move your Tika processing off to an external client or service,
that might be a better way to go...
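
To make that concrete, here is a rough sketch of the client-side flow,
not a definitive implementation. It assumes the "text_ref" field from your
example and a caller-supplied extract_text(url) function that fetches the
document and runs Tika somewhere off the Solr box. The names
resolve_text_refs, to_update_body and extract_text are made up for
illustration. A document whose extraction fails is skipped and reported,
so one bad file doesn't sink the whole batch.

```python
import json
from typing import Callable, Dict, List, Tuple


def resolve_text_refs(
    docs: List[Dict],
    extract_text: Callable[[str], str],
) -> Tuple[List[Dict], List[Tuple[object, str]]]:
    """Replace each hypothetical `text_ref` URL with text extracted client-side.

    `extract_text` is an assumption: any callable that takes a URL and returns
    plain text, for example a wrapper around an external Tika service.
    Returns (documents ready to index, [(id, error message), ...]).
    """
    resolved = []
    failed = []
    for original in docs:
        doc = dict(original)  # shallow copy so the caller's dicts are untouched
        url = doc.pop("text_ref", None)
        if url is not None:
            try:
                doc["text"] = extract_text(url)
            except Exception as exc:
                # The parser choked on this file: record it and move on.
                failed.append((doc.get("id"), str(exc)))
                continue
        resolved.append(doc)
    return resolved, failed


def to_update_body(docs: List[Dict]) -> bytes:
    """Serialize documents as a JSON array, a form Solr's JSON update handling accepts."""
    return json.dumps(docs).encode("utf-8")
```

The resulting bytes would then be POSTed to the collection's update
handler with Content-Type: application/json. Solr only ever sees plain
text, and any OOM or runaway parse happens in the client or extraction
service instead of on the Solr node.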

Best,
Erick

On Thu, Jan 18, 2018 at 6:15 AM, Dirk Rudolph (JIRA) <jira@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/SOLR-11869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16330553#comment-16330553 ]
>
> Dirk Rudolph commented on SOLR-11869:
> -------------------------------------
>
> I see. So I will start without taking care of the document being fully read into memory or not.
>
> Anyway, would that kind of UpdateRequestProcessor be interesting for solr or am I the only one facing that use case?
>
>> Remote streaming UpdateRequestProcessor
>> ---------------------------------------
>>
>>                 Key: SOLR-11869
>>                 URL: https://issues.apache.org/jira/browse/SOLR-11869
>>             Project: Solr
>>          Issue Type: Improvement
>>      Security Level: Public(Default Security Level. Issues are Public)
>>          Components: UpdateRequestProcessors
>>            Reporter: Dirk Rudolph
>>            Priority: Minor
>>
>> When indexing documents from content management systems (or digital asset management systems), they usually have fields for metadata given by an editor and, in the case of PDFs, docx or any other text formats, they may also contain the binary content, which might be parsed to plain text using Tika. This is what's currently supported by the ExtractingRequestHandler.
>> We are now facing situations where we are indexing batches of documents using the UpdateRequestHandler and want to send the binary content of the documents mentioned above as part of the single request to the UpdateRequestHandler. As those documents might be of unknown size and it's difficult to send streams along the wire with javax.json APIs, I thought about sending the URL of the document itself, letting Solr fetch the document and letting it be parsed by Tika, using a RemoteStreamingUpdateRequestProcessor.
>> Example:
>> {code:json}
>> {
>>  "add": { "id": "doc1", "meta": "foo", "meta": "bar", "text": "Short text" },
>>  "add": { "id": "doc2", "meta": "will become long", "text_ref": "http://..." }
>> }
>> {code}
>
>
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
