I have no idea whether it's possible, but I'd at least try returning an ArrayList of rows instead of
just a single row. And if it doesn't work, which is probably the case, how about filing an
issue in Jira?
Reading the docs on the matter, I think it should be (or should be made) possible to return multiple
rows as an ArrayList.
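For what it's worth, below is the kind of script function I have in mind. It would sit inside a <script> element in the DIH data-config.xml and be referenced from an entity via transformer="script:splitLog" (a wiring sketch follows further down, after Lance's DIH explanation). This is untested; the function name and the field names (plainText, fileAbsolutePath, id, filename, body) are only illustrative, and the open question is exactly whether DIH accepts the returned java.util.ArrayList of row maps instead of a single row:

    <script><![CDATA[
      // Called once per row; PlainTextEntityProcessor puts the whole file into
      // 'plainText' and FileListEntityProcessor provides 'fileAbsolutePath'.
      function splitLog(row) {
        var text = '' + row.get('plainText');       // coerce the Java String to a JS string
        var name = '' + row.get('fileAbsolutePath');
        var chunks = text.split('\n\n');            // naive split on blank lines; tune as needed
        var rows = new java.util.ArrayList();       // multiple rows instead of one
        for (var i = 0; i < chunks.length; i++) {
          var r = new java.util.HashMap();
          r.put('id', name + '_' + i);
          r.put('filename', name);                  // common field tying chunks back to their file
          r.put('body', chunks[i]);
          rows.add(r);
        }
        return rows;                                // the unverified part: a list rather than one row
      }
    ]]></script>

If returning a list really isn't supported, that would be the thing to raise in Jira.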
-----Original message-----
From: Peter Spam <pspam@mac.com>
Sent: Tue 17-08-2010 00:47
To: solr-user@lucene.apache.org;
Subject: Re: Solr searching performance issues, using large documents
Still stuck on this - any hints on how to write the JavaScript to split a document? Thanks!
-Pete
On Aug 5, 2010, at 8:10 PM, Lance Norskog wrote:
> You may have to write your own javascript to read in the giant field
> and split it up.
>
> On Thu, Aug 5, 2010 at 5:27 PM, Peter Spam <pspam@mac.com> wrote:
>> I've read through the DataImportHandler page a few times, and still can't figure out how to separate a large document into smaller documents. Any hints? :-) Thanks!
>>
>> -Peter
>>
>> On Aug 2, 2010, at 9:01 PM, Lance Norskog wrote:
>>
>>> Spanning won't work- you would have to make overlapping mini-documents
>>> if you want to support this.
>>>
>>> I don't know how big the chunks should be- you'll have to experiment.
>>>
>>> Lance
>>>
>>> On Mon, Aug 2, 2010 at 10:01 AM, Peter Spam <pspam@mac.com> wrote:
>>>> What would happen if the search query phrase spanned separate document chunks?
>>>>
>>>> Also, what would the optimal size of chunks be?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>> -Peter
>>>>
>>>> On Aug 1, 2010, at 7:21 PM, Lance Norskog wrote:
>>>>
>>>>> Not that I know of.
>>>>>
>>>>> The DataImportHandler has the ability to create multiple documents
>>>>> from one input stream. It is possible to create a DIH file that reads
>>>>> large log files and splits each one into N documents, with the file
>>>>> name as a common field. The DIH wiki page tells you in general how to
>>>>> make a DIH file.
>>>>>
>>>>> http://wiki.apache.org/solr/DataImportHandler
>>>>>
>>>>> From this, you should be able to make a DIH file that puts log files
>>>>> in as separate documents. As to splitting files up into
>>>>> mini-documents, you might have to write a bit of Javascript to achieve
>>>>> this. There is no data structure or software that implements
>>>>> structured documents.
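To make Lance's suggestion concrete, a DIH file along these lines might put each chunk in as its own document. It is only a sketch: the baseDir and entity names are made up, and it relies on the hypothetical splitLog script function sketched in my reply at the top:

    <dataConfig>
      <dataSource type="FileDataSource" encoding="UTF-8"/>
      <!-- the <script> block defining splitLog (see the sketch at the top) would go here -->
      <document>
        <!-- outer entity lists the log files; inner entity reads each file and
             hands the text to the script transformer, which emits the chunks -->
        <entity name="files" processor="FileListEntityProcessor"
                baseDir="/var/logs" fileName=".*\.log$" rootEntity="false">
          <entity name="log" processor="PlainTextEntityProcessor"
                  url="${files.fileAbsolutePath}" transformer="script:splitLog"/>
        </entity>
      </document>
    </dataConfig>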
>>>>>
>>>>> On Sun, Aug 1, 2010 at 2:06 PM, Peter Spam <pspam@mac.com> wrote:
>>>>>> Thanks for the pointer, Lance! Is there an example of this somewhere?
>>>>>>
>>>>>>
>>>>>> -Peter
>>>>>>
>>>>>> On Jul 31, 2010, at 3:13 PM, Lance Norskog wrote:
>>>>>>
>>>>>>> Ah! You're not just highlighting, you're snippetizing. This makes it easier.
>>>>>>>
>>>>>>> Highlighting does not stream- it pulls the entire stored contents into
>>>>>>> one string and then pulls out the snippet. If you want this to be
>>>>>>> fast, you have to split up the text into small pieces and only
>>>>>>> snippetize from the most relevant text. So, separate documents with a
>>>>>>> common group id for the document it came from. You might have to do 2
>>>>>>> queries to achieve what you want, but the second query for the same
>>>>>>> query will be blindingly fast. Often <1ms.
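One way to read the two-query idea (my interpretation; the term and filename values below are made up): first run the query without highlighting, asking only for the matching chunk ids and their filename, then re-issue the same q with highlighting on, filtered to one file, so the snippets are built from a handful of small stored fields rather than one huge one:

    /solr/select?q=body:someterm&fl=id,score,filename&rows=20
    /solr/select?q=body:someterm&fq=filename:server1.log&hl=true&hl.fl=body&hl.snippets=3&rows=10

Since the second request repeats the same q, it should largely come out of the caches, which is presumably why it returns in around a millisecond.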
>>>>>>>
>>>>>>> Good luck!
>>>>>>>
>>>>>>> Lance
>>>>>>>
>>>>>>> On Sat, Jul 31, 2010 at 1:12 PM, Peter Spam <pspam@mac.com> wrote:
>>>>>>>> However, I do need to search the entire document, or else the highlighting will sometimes be blank :-(
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> - Peter
>>>>>>>>
>>>>>>>> ps. sorry for the many responses - I'm rushing around trying to get this working.
>>>>>>>>
>>>>>>>> On Jul 31, 2010, at 1:11 PM, Peter Spam wrote:
>>>>>>>>
>>>>>>>>> Correction - it went from 17 seconds to 10 seconds - I was changing the hl.regex.maxAnalyzedChars the first time.
>>>>>>>>> Thanks!
>>>>>>>>>
>>>>>>>>> -Peter
>>>>>>>>>
>>>>>>>>> On Jul 31, 2010, at 1:06 PM, Peter Spam wrote:
>>>>>>>>>
>>>>>>>>>> On Jul 30, 2010, at 1:16 PM, Peter Karich wrote:
>>>>>>>>>>
>>>>>>>>>>> did you already try other values for hl.maxAnalyzedChars=2147483647
>>>>>>>>>>
>>>>>>>>>> Yes, I tried dropping it down to 21, but it didn't have much of an impact
>>>>>>>>>> (one search I just tried went from 17 seconds to 15.8 seconds, and this is
>>>>>>>>>> an 8-core Mac Pro with 6GB RAM - 4GB for java).
>>>>>>>>>>
>>>>>>>>>>> ? Also regular expression highlighting is more expensive, I think.
>>>>>>>>>>> What does the 'fuzzy' variable mean? If you use this to query via
>>>>>>>>>>> "~someTerm" instead "someTerm"
>>>>>>>>>>> then you should try the trunk of solr which is a lot faster for fuzzy or
>>>>>>>>>>> other wildcard search.
>>>>>>>>>>
>>>>>>>>>> "fuzzy" could be set to "*" but isn't right now.
>>>>>>>>>>
>>>>>>>>>> Thanks for the tips, Peter - this has been very frustrating!
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> - Peter
>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Peter.
>>>>>>>>>>>
>>>>>>>>>>>> Data set: About 4,000 log files (will eventually grow to millions). Average
>>>>>>>>>>>> log file is 850k. Largest log file (so far) is about 70MB.
>>>>>>>>>>>>
>>>>>>>>>>>> Problem: When I search for common terms, the query time goes from under 2-3
>>>>>>>>>>>> seconds to about 60 seconds. TermVectors etc are enabled. When I disable
>>>>>>>>>>>> highlighting, performance improves a lot, but is still slow for some queries
>>>>>>>>>>>> (7 seconds). Thanks in advance for any ideas!
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Peter
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> 4GB RAM server
>>>>>>>>>>>> % java -Xms2048M -Xmx3072M -jar start.jar
>>>>>>>>>>>>
>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> schema.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> <fieldType name="text_pl" class="solr.TextField">
>>>>>>>>>>>> <analyzer>
>>>>>>>>>>>> <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>>>>>>>>>>>> <filter class="solr.LowerCaseFilterFactory"/>
>>>>>>>>>>>> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
>>>>>>>>>>>> </analyzer>
>>>>>>>>>>>> </fieldType>
>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> <field name="body" type="text_pl" indexed="true" stored="true" multiValued="false" termVectors="true" termPositions="true" termOffsets="true" />
>>>>>>>>>>>> <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
>>>>>>>>>>>> <field name="version" type="string" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="device" type="string" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="filename" type="string" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="filesize" type="long" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="pversion" type="int" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="first2md5" type="string" indexed="false" stored="true" multiValued="false"/>
>>>>>>>>>>>> <field name="ckey" type="string" indexed="true" stored="true" multiValued="false"/>
>>>>>>>>>>>>
>>>>>>>>>>>> ...
>>>>>>>>>>>>
>>>>>>>>>>>> <dynamicField name="*" type="ignored" multiValued="true" />
>>>>>>>>>>>> <defaultSearchField>body</defaultSearchField>
>>>>>>>>>>>> <solrQueryParser defaultOperator="AND"/>
>>>>>>>>>>>>
>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> solrconfig.xml changes:
>>>>>>>>>>>>
>>>>>>>>>>>> <maxFieldLength>2147483647</maxFieldLength>
>>>>>>>>>>>> <ramBufferSizeMB>128</ramBufferSizeMB>
>>>>>>>>>>>>
>>>>>>>>>>>> -------------------------------------------------------------------------------------------------------------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>> The query:
>>>>>>>>>>>>
>>>>>>>>>>>> rowStr = "&rows=10"
>>>>>>>>>>>> facet = "&facet=true&facet.limit=10&facet.field=device&facet.field=ckey&facet.field=version"
>>>>>>>>>>>> fields = "&fl=id,score,filename,version,device,first2md5,filesize,ckey"
>>>>>>>>>>>> termvectors = "&tv=true&qt=tvrh&tv.all=true"
>>>>>>>>>>>> hl = "&hl=true&hl.fl=body&hl.snippets=1&hl.fragsize=400"
>>>>>>>>>>>> regexv = "(?m)^.*\n.*\n.*$"
>>>>>>>>>>>> hl_regex = "&hl.regex.pattern=" + CGI::escape(regexv) + "&hl.regex.slop=1&hl.fragmenter=regex&hl.regex.maxAnalyzedChars=2147483647&hl.maxAnalyzedChars=2147483647"
>>>>>>>>>>>> justq = '&q=' + CGI::escape('body:' + fuzzy + p['q'].to_s.gsub(/\\/, '').gsub(/([:~!<>="])/,'\\\\\1') + fuzzy + minLogSizeStr)
>>>>>>>>>>>>
>>>>>>>>>>>> thequery = '/solr/select?timeAllowed=5000&wt=ruby' + (p['fq'].empty? ? '' : ('&fq='+p['fq'].to_s) ) + justq + rowStr + facet + fields + termvectors + hl + hl_regex
>>>>>>>>>>>>
>>>>>>>>>>>> baseurl = '/cgi-bin/search.rb?q=' + CGI::escape(p['q'].to_s) + '&rows=' + p['rows'].to_s + '&minLogSize=' + p['minLogSize'].to_s
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> http://karussell.wordpress.com/
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Lance Norskog
>>>>>>> goksron@gmail.com
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Lance Norskog
>>>>> goksron@gmail.com
>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> Lance Norskog
>>> goksron@gmail.com
>>
>>
>
>
>
> --
> Lance Norskog
> goksron@gmail.com