manifoldcf-user mailing list archives

From Ameya Aware <ameya.aw...@gmail.com>
Subject Re: Crawling and indexing very slow
Date Thu, 31 Jul 2014 20:22:10 GMT
>>>>>>>>>>>>>>>>>>>>>>>>>>
                    long fileBytes = file.length();
                    RepositoryDocument data = new RepositoryDocument();
                    data.setBinary(is,fileBytes);
                    String fileName = file.getName();
                    data.setFileName(fileName);
                    data.setMimeType(mapExtensionToMimeType(fileName));

<<<<<<<<<<<<<<<<<<<<<<<<<<<


Do I just need to comment out the third line, i.e. data.setBinary(is,fileBytes); ?


Thanks,
Ameya
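
The suggestion quoted below is to substitute an empty input stream rather than to drop the setBinary call. The thread does not say whether the connector accepts a document with no binary set at all, so passing a zero-length stream with a length of 0 is the more conservative reading. A minimal, self-contained sketch of that idea follows. DocStub and buildMetadataOnlyDoc are hypothetical stand-ins invented for illustration; DocStub is not ManifoldCF's RepositoryDocument and only mirrors the two setters used in the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class EmptyStreamSketch {
    /** Hypothetical stand-in for the connector's document object (not the ManifoldCF class). */
    static final class DocStub {
        InputStream binary;
        long binaryLength;
        String fileName;

        void setBinary(InputStream is, long length) {
            this.binary = is;
            this.binaryLength = length;
        }

        void setFileName(String name) {
            this.fileName = name;
        }
    }

    /** Builds a document that carries a file name but no file content. */
    static DocStub buildMetadataOnlyDoc(String fileName) {
        DocStub data = new DocStub();
        // The file is never opened: hand over a zero-length stream and report length 0.
        data.setBinary(new ByteArrayInputStream(new byte[0]), 0L);
        data.setFileName(fileName);
        return data;
    }

    public static void main(String[] args) throws IOException {
        DocStub doc = buildMetadataOnlyDoc("example.txt");
        // Reading from the empty stream immediately signals end of stream (-1).
        System.out.println(doc.fileName + " length=" + doc.binaryLength
                + " firstRead=" + doc.binary.read());
    }
}
```

In the real connector the equivalent change would presumably be to replace the arguments of data.setBinary(is,fileBytes) and to skip opening the file's input stream, while keeping the addField calls for the metadata.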


On Thu, Jul 31, 2014 at 4:17 PM, Ameya Aware <ameya.aware@gmail.com> wrote:

> I could not exactly locate the position where this is happening.
>
> Can you please help me out with the changes?
>
> Thanks,
> Ameya
>
>
>
> On Thu, Jul 31, 2014 at 4:10 PM, Karl Wright <daddywri@gmail.com> wrote:
>
>> Hi Ameya,
>>
>> Since you are already modifying the connector for your purposes, nothing
>> is stopping you from modifying it further to not fetch the document and
>> instead substitute an empty input stream.
>>
>> Karl
>>
>>
>>
>> On Thu, Jul 31, 2014 at 3:03 PM, Ameya Aware <ameya.aware@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I have modified the code a little to add different metadata fields such as
>>> below (FileConnector.java):
>>>
>>>                     data.addField("created", new Date(attr.creationTime().toMillis()));
>>>                     data.addField("last_accessed", new Date(attr.lastAccessTime().toMillis()));
>>>                     data.addField("last_modified", new Date(file.lastModified()));
>>>                     data.addField("size", file.length());
>>>
>>>
>>> which are being passed to Solr.
>>>
>>> Now, can I stop MCF from reading the file and sending its content, and just
>>> pass the above information to Solr?
>>>
>>>
>>> Thanks,
>>> Ameya
>>>
>>>
>>> On Thu, Jul 31, 2014 at 2:57 PM, Karl Wright <daddywri@gmail.com> wrote:
>>>
>>>> Hi Ameya,
>>>>
>>>> The file system connector does not retrieve any metadata for a document
>>>> at all.  So I'm not sure what metadata you are talking about.
>>>>
>>>> Karl
>>>>
>>>>
>>>>
>>>> On Thu, Jul 31, 2014 at 2:44 PM, Ameya Aware <ameya.aware@gmail.com>
>>>> wrote:
>>>>
>>>>> The thing is, I am not looking for the data or content of any of the
>>>>> files. I am only interested in the files' metadata.
>>>>>
>>>>> So I thought it should be possible to not read any file and just get
>>>>> its metadata and give that to Solr.
>>>>>
>>>>> This should save lots of time.
>>>>>
>>>>> Is it possible to do this?
>>>>>
>>>>> Thanks,
>>>>> Ameya
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 31, 2014 at 2:13 PM, Karl Wright <daddywri@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ameya,
>>>>>>
>>>>>> (1) Please look at the Simple History report.  Note what kinds of
>>>>>> documents are being fetched, what kinds are being indexed, and how long it
>>>>>> is taking.  I have noted from your previous posts that you seem to be
>>>>>> indexing a lot of very large EXE files.  This is useless and you should be
>>>>>> excluding them.
>>>>>>
>>>>>> (2) Please look in the manifoldcf.log file for evidence that fetches
>>>>>> and/or Solr indexing requests are being retried due to errors.  It doesn't
>>>>>> take many documents being chronically retried before forward progress drops
>>>>>> to near zero.
>>>>>>
>>>>>> (3) If you look into (1) & (2) and everything seems fine, it may be a
>>>>>> misalignment between availability of several kinds of resources that is the
>>>>>> problem.  Please get a thread dump of the agents process while it is
>>>>>> crawling, using jstack.  Post that thread dump and we can tell you what to
>>>>>> look at next.
>>>>>>
>>>>>> Karl
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 31, 2014 at 2:07 PM, Ameya Aware <ameya.aware@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>> I am using the filesystem connector to index my entire C drive, with
>>>>>>> Solr as the output connector.
>>>>>>>
>>>>>>> Initial 100000 documents were crawled and indexed successfully in
>>>>>>> couple of hours but after that indexing slowed down badly (around 15-20
>>>>>>> documents per min).
>>>>>>>
>>>>>>>
>>>>>>> I am not able to figure out whether there is issue with MCF or Solr.
>>>>>>>
>>>>>>>
>>>>>>> Can you advise me how to proceed with this?
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Ameya
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
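
The metadata fields quoted in the thread (created, last_accessed, last_modified, size) can all be obtained from the file system without opening the file for reading, which is why a metadata-only crawl should avoid most of the I/O cost. A minimal sketch using java.nio, assuming the field names from the thread; the class FileMetadataSketch and the method readMetadata are hypothetical names for illustration and are not part of ManifoldCF.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.Map;

public class FileMetadataSketch {
    /** Collects basic file attributes in one call, without reading the file's content. */
    static Map<String, Object> readMetadata(Path path) throws IOException {
        BasicFileAttributes attr = Files.readAttributes(path, BasicFileAttributes.class);
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("created", new Date(attr.creationTime().toMillis()));
        fields.put("last_accessed", new Date(attr.lastAccessTime().toMillis()));
        fields.put("last_modified", new Date(attr.lastModifiedTime().toMillis()));
        fields.put("size", attr.size());
        return fields;
    }

    public static void main(String[] args) throws IOException {
        // Demonstrate on a temporary file so the sketch runs anywhere.
        Path tmp = Files.createTempFile("mcf-sketch", ".txt");
        try {
            Files.write(tmp, new byte[] {1, 2, 3});
            System.out.println(readMetadata(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Note that creation and last-access times depend on what the underlying file system records; on some platforms they may simply mirror the last-modified time.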
