manifoldcf-user mailing list archives

From Ameya Aware <ameya.aw...@gmail.com>
Subject Re: Crawling and indexing very slow
Date Thu, 31 Jul 2014 20:17:06 GMT
I could not locate exactly where in the connector code the document is fetched.

Can you please help me out with the changes?

Thanks,
Ameya


On Thu, Jul 31, 2014 at 4:10 PM, Karl Wright <daddywri@gmail.com> wrote:

> Hi Ameya,
>
> Since you are already modifying the connector for your purposes, nothing
> is stopping you from modifying it further to not fetch the document and
> instead substitute an empty input stream.
>
> Karl
>
>
>
> On Thu, Jul 31, 2014 at 3:03 PM, Ameya Aware <ameya.aware@gmail.com>
> wrote:
>
>> Hi,
>>
>> I have modified the code a little to add different metadata fields, as
>> shown below (FileConnector.java):
>>
>>     data.addField("created", new Date(attr.creationTime().toMillis()));
>>     data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
>>     data.addField("last_modified", new Date(file.lastModified()));
>>     data.addField("size", file.length());
>>
>>
>> These fields are being passed to Solr.
>>
>> Now, can I stop MCF from reading each file and sending its content, and
>> just pass the above information to Solr?
>>
>>
>> Thanks,
>> Ameya
>>
>>
>> On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <daddywri@gmail.com> wrote:
>>
>>> Hi Ameya,
>>>
>>> The file system connector does not retrieve any metadata for a document
>>> at all.  So I'm not sure what metadata you are talking about.
>>>
>>> Karl
>>>
>>>
>>>
>>> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <ameya.aware@gmail.com>
>>> wrote:
>>>
>>>> The thing here is that I am not looking for the data or content of any
>>>> of the files. I am only interested in the metadata of each file.
>>>>
>>>> So I thought it should be possible to not read any file, and just get
>>>> the metadata of each file and give that to Solr.
>>>>
>>>> This should save lots of time.
>>>>
>>>> Is it possible to do this?
>>>>
>>>> Thanks,
>>>> Ameya
>>>>
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Ameya,
>>>>>
>>>>> (1) Please look at the Simple History report.  Note what kinds of
>>>>> documents are being fetched, what kinds are being indexed, and how long
>>>>> it is taking.  I have noted from your previous posts that you seem to be
>>>>> indexing a lot of very large EXE files.  This is useless and you should
>>>>> be excluding them.
>>>>>
>>>>> (2) Please look in the manifoldcf.log file for evidence that fetches
>>>>> and/or Solr indexing requests are being retried due to errors.  It doesn't
>>>>> take many documents being chronically retried before forward progress
>>>>> drops to near zero.
>>>>>
>>>>> (3) If you look into (1) & (2) and everything seems fine, it may be a
>>>>> misalignment between availability of several kinds of resources that
>>>>> is the problem.  Please get a thread dump of the agents process while it
>>>>> is crawling, using jstack.  Post that thread dump and we can tell you
>>>>> what to look at next.
>>>>>
>>>>> Karl
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> I am using the filesystem connector to index my entire C drive, with
>>>>>> Solr as the output connector.
>>>>>>
>>>>>> The initial 100000 documents were crawled and indexed successfully in
>>>>>> a couple of hours, but after that indexing slowed down badly (to around
>>>>>> 15-20 documents per minute).
>>>>>>
>>>>>>
>>>>>> I am not able to figure out whether the issue is with MCF or Solr.
>>>>>>
>>>>>>
>>>>>> Can you advise me on how to proceed with this?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Ameya
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
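
For anyone following the thread, Karl's suggestion (do not fetch the document, substitute an empty input stream) might look roughly like the sketch below. It is not the actual FileConnector code. The class and method names (MetadataOnlySketch, readMetadata, emptyContent) are made up for illustration, and the sketch uses only the JDK so that it runs on its own. It reads the same four fields Ameya adds, using Files.readAttributes, which does not open the file's content, and it builds a zero-length stream to stand in for the content.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of ManifoldCF.
class MetadataOnlySketch {

    // Collects created / last_accessed / last_modified / size from the
    // file system attributes only; the file content is never read.
    static Map<String, Object> readMetadata(Path path) throws IOException {
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("created", new Date(attr.creationTime().toMillis()));
        fields.put("last_accessed", new Date(attr.lastAccessTime().toMillis()));
        fields.put("last_modified", new Date(attr.lastModifiedTime().toMillis()));
        fields.put("size", attr.size());
        return fields;
    }

    // A zero-length stream to hand over in place of the real file content.
    static InputStream emptyContent() {
        return new ByteArrayInputStream(new byte[0]);
    }

    public static void main(String[] args) throws IOException {
        // Temporary file used only to demonstrate the two helpers.
        Path path = Files.createTempFile("mcf-sketch", ".txt");
        Files.write(path, "hello".getBytes());
        try (InputStream is = emptyContent()) {
            Map<String, Object> fields = readMetadata(path);
            // Assumption: in the modified connector, the fields would go to
            // data.addField(...) as in the snippet quoted above, and the empty
            // stream would replace the FileInputStream that is currently
            // passed to the document, with a declared length of 0.
            System.out.println(fields.keySet() + " contentBytes=" + is.available());
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

In the connector itself, the place to change would presumably be wherever the file's input stream is opened and handed to the document object; that location has to be confirmed against the FileConnector.java source for the MCF version in use.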
