manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Need Help on setting up ManifoldCF
Date Thu, 23 Feb 2012 19:46:35 GMT
Hi Anupam,

The Documentum Connector indexes binary documents, as well as the
metadata you select.  If you are not seeing the binary documents get
indexed, you will need to determine whether the problem is in Solr or
in ManifoldCF.  The best way to do that is to look at the Simple
History report in the ManifoldCF UI.  Look for the "document ingest"
event for one or more documents you have crawled.  If the size
reported is greater than zero, then the document was sent to Solr.
You should then look at the Solr standard output to see whether the
Extracting Update Handler has noted that a document was received.  I
believe that it also logs its size.
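
To rule Solr in or out independently of ManifoldCF, you could also send
a test document straight to the extracting handler. Below is a minimal
sketch that only builds the request URL. It assumes Solr is at
http://localhost:8983/solr and that the handler is mapped to
/update/extract as in the stock example solrconfig.xml; the helper name
extract_url and the document id are hypothetical, so adjust all of these
for your installation.

```python
from urllib.parse import urlencode


def extract_url(solr_base, doc_id, extract_only=True):
    """Build a URL for Solr's ExtractingRequestHandler (Solr Cell).

    solr_base and the /update/extract path are assumptions about the
    local Solr configuration, not something ManifoldCF dictates.
    """
    params = {"literal.id": doc_id}
    if extract_only:
        # extractOnly=true asks Solr Cell to return the extracted text
        # rather than adding the document to the index.
        params["extractOnly"] = "true"
    else:
        # Otherwise index the document and commit so it is searchable.
        params["commit"] = "true"
    return solr_base.rstrip("/") + "/update/extract?" + urlencode(params)


if __name__ == "__main__":
    print(extract_url("http://localhost:8983/solr", "test-doc-1"))
```

Posting a sample PDF to the resulting URL as a multipart form upload
(for example with curl -F "myfile=@sample.pdf") should show whether Tika
extracts text on the Solr side; if it does, the next place to look is
the ManifoldCF job.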

Thanks,
Karl


On Thu, Feb 23, 2012 at 2:41 PM, Anupam Bhattacharya
<anupamb82@gmail.com> wrote:
> Thanks Karl,
>
> I was just curious: can the Documentum Connector in ManifoldCF also
> index binary documents, in addition to the content-model-defined
> document types and their metadata?
>
> Otherwise, configuring the Documentum repository connection in
> ManifoldCF for the crawler, and then again in SOLR to fetch the actual
> document, would repeat the work of fetching one document's metadata.
>
> Regards
> Anupam
>
>
> On Fri, Feb 24, 2012 at 12:44 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> Glad it is working for you!
>>
>> Solr is almost infinitely flexible, so you have many options.
>>
>> In my opinion the best way to convert binary documents to indexable
>> text is indeed to use Solr Cell.  Solr Cell is built on Tika, so you
>> won't need to bring in Tika separately; it should already be there.
>> Tika has a pipeline architecture which should suit your use case
>> well.  It should thus be possible to configure the existing update
>> handler to use Solr Cell, and configure Solr Cell's Tika instance to
>> perform whatever transformations you need.
>>
>> Hope this helps.  For further Solr questions, you can always ask on
>> the Solr user list.  A Tika user list is also available.
>>
>> Thanks,
>> Karl
>>
>> On Thu, Feb 23, 2012 at 2:04 PM, Anupam Bhattacharya
>> <anupamb82@gmail.com> wrote:
>> > Hello Karl,
>> >
>> > Finally, I was able to index all the metadata for the defined
>> > document types with different content types. Everything went well.
>> > However, I was not able to index the full-text content of the files
>> > (PDF, XML, etc.). I read about SOLR Cell, where documents can be
>> > uploaded using CURL, but our XML files contain tags and values that
>> > also need to be indexed.
>> > e.g, some XML structure..
>> >
>> > <doc>
>> > <object_id>111</object_id>
>> > <abstract>Abstract Text</abstract>
>> > <citation>Citation Text</citation>
>> > <publication>News Source</publication>
>> > </doc>
>> >
>> > I found that in SOLR, if we add a new request handler extending
>> > ExtractingRequestHandler, we can parse the documents, fetch
>> > information, and add it as fields in the SOLR index.
>> >
>> > What is the ideal approach for indexing tag values from XML in
>> > Lucene, going from ManifoldCF to SOLR? Is it necessary to integrate
>> > TIKA for this?
>> > I found a good post over here.. https://community.emc.com/docs/DOC-6520
>> >
>> > Appreciate your advice on this.
>> >
>> > Regards
>> > Anupam
>> >
>> >
>> >
>> >
>> > On Thu, Feb 16, 2012 at 12:17 AM, Karl Wright <daddywri@gmail.com>
>> > wrote:
>> >>
>> >> On Wed, Feb 15, 2012 at 1:13 PM, Anupam Bhattacharya
>> >> <anupamb82@gmail.com> wrote:
>> >> > Hello Karl,
>> >> >
>> >> > Thanks for adding this to the JIRA system.
>> >> >
>> >> > The dfc.properties was introduced from Documentum 6.0 version
>> >> > onwards & as per manifoldcf connector documentation
>> >> > (http://incubator.apache.org/connectors/en_US/included-connectors.html)
>> >> > the out-of the box connector classes were tested against DFC 5.3
>> >> > SP5 which needed the dmcl.ini file. Thus run.bat must have been
>> >> > configured properly for that dmcl.ini.
>> >>
>> >> Right - so does DFC 6.0 on Windows require the DOCUMENTUM environment
>> >> variable to be set to point at the directory where dfc.properties is
>> >> found?  Or perhaps it doesn't require the DOCUMENTUM environment
>> >> variable at all anymore?
>> >>
>> >> >
>> >> > As I am trying to connect to DFC 6.5 SP3 version i need to look
>> >> > for dfc.properties file. I hope the out-of the box documentum
>> >> > connector will work with 6.5 version.
>> >>
>> >> It was tried and worked.  The script was developed later with only the
>> >> 5.3 version available.
>> >>
>> >> >
>> >> > I am confused, why for all connector we have Client & Server
>> >> > version ? Can you please explain.
>> >> >
>> >>
>> >> Do you mean "why is there a documentum-connector-server" process?  If
>> >> that's the question, it was created for two reasons:
>> >> (1) We had problems with stability of DFC.  It segfaults occasionally,
>> >> somewhere in its native code.  We did not want that to bring down
>> >> ManifoldCF, and we wanted to be able to restart the part of the
>> >> connector that depended on DFC transparently when it crashed.
>> >> (2) DFC has dependencies on many older open-source jars that conflict
>> >> with the rest of ManifoldCF.  If (1) was not a problem we might have
>> >> used a classloader to fix this, but since we had to fix both we
>> >> created a separate process.
>> >>
>> >> FWIW, we do the same thing for FileNet because of its dependency on
>> >> Wasp.
>> >>
>> >> Karl
>> >>
>> >> > Again, Thanks for all the help.
>> >> >
>> >> > Regards
>> >> > Anupam
>> >> >
>> >> >
>> >> > On Wed, Feb 15, 2012 at 8:42 PM, Karl Wright <daddywri@gmail.com>
>> >> > wrote:
>> >> >>
>> >> >> Hi Anupam,
>> >> >>
>> >> >> I did not see a ticket from you about the DOCUMENTUM environment
>> >> >> variable and the dmcl.ini vs. dfc.properties file.  I've created an
>> >> >> issue at https://issues.apache.org/jira/browse/CONNECTORS-410 to
>> >> >> track this problem.  It would be great if you could confirm that:
>> >> >> (a) the DOCUMENTUM environment variable is still needed at all by
>> >> >> DFC, and (b) that when it is set properly, the file dfc.properties
>> >> >> can be found at $DOCUMENTUM\dfc.properties (on Windows, at least).
>> >> >>
>> >> >> Thanks,
>> >> >> Karl
>> >> >>
>> >> >> On Tue, Feb 14, 2012 at 3:23 PM, Karl Wright <daddywri@gmail.com>
>> >> >> wrote:
>> >> >> > Hi Anupam,
>> >> >> >
>> >> >> > Please post emails like this directly to
>> >> >> > connectors-user@incubator.apache.org.  See below for responses.
>> >> >> >
>> >> >> > On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya
>> >> >> > <anupamb82@gmail.com> wrote:
>> >> >> >>
>> >> >> >> Hello Karl,
>> >> >> >>
>> >> >> >> I am a software programmer in DuPont, Gurgaon, India. Recently,
>> >> >> >> due to the economic instability all over the world the company
>> >> >> >> has decided to go for cheaper Search Engine Applications. Thus
>> >> >> >> we are getting rid of many costly proprietary Search
>> >> >> >> Applications and will be replacing them with FAST.
>> >> >> >>
>> >> >> >> Although, I recently came across SOLR search engine & ManifoldCF
>> >> >> >> Connector framework. Thus, I am currently driving this effort
>> >> >> >> within my company as I am a big supporter of open source
>> >> >> >> technologies. I started my career in Alfresco CMS and am now
>> >> >> >> working on Search Technologies.
>> >> >> >>
>> >> >> >> Currently I am facing lots of initial
>> >> >> >> building/deploying/installing issues. I have already referred
>> >> >> >> to the url
>> >> >> >>
>> >> >> >> http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html
>> >> >> >>
>> >> >> >> Read it multiple times but still face many issues. I downloaded
>> >> >> >> the latest 0.4 version and it seems the documentation is not up
>> >> >> >> to date on the above link.
>> >> >> >>
>> >> >> >
>> >> >> > The online documentation is pertinent to trunk.  The
>> >> >> > documentation you want to use is contained within the
>> >> >> > 0.4-incubating release.  Go to dist/doc and you will see it
>> >> >> > there.
>> >> >> >
>> >> >> >> Few issues which took me a long time to resolve which can be
>> >> >> >> added in ManifoldCF wiki as learnings for others are listed
>> >> >> >> below:
>> >> >> >> a. No single example is given for running the
>> >> >> >> executecommand.bat with proper arguments. Only list of commands
>> >> >> >> given with parameter defined.
>> >> >> >
>> >> >> > I'm not entirely sure I get this.  Do you just want an example
>> >> >> > in the documentation?
>> >> >> >
>> >> >> >> b. Setting where and which file for the property
>> >> >> >> manifoldcf.configfile
>> >> >> >> for deploying the war on tomcat with Postgresql database.
>> >> >> >
>> >> >> > The documentation already tells you that you need to add an
>> >> >> > appropriate -D to your tomcat invocation to point to your
>> >> >> > properties.xml file.  Tomcat documentation differs from version
>> >> >> > to version and platform to platform on how best to do that, and
>> >> >> > if you run under Windows there's even a service wrapper with a
>> >> >> > configuration UI that allows you to set these parameters.  So
>> >> >> > it's way beyond ManifoldCF's mission to describe all that, I
>> >> >> > think.
>> >> >> >
>> >> >> >> c. I am trying to build the Documentum Connector but came to
>> >> >> >> know that some additional environment variables need to be
>> >> >> >> added for "DOCUMENTUM". Additionally the latest version of
>> >> >> >> documentum uses dfc.properties file while run.bat looks for
>> >> >> >> dmcl.ini file.
>> >> >> >
>> >> >> > Could you open a ticket in Jira for this issue?
>> >> >> > https://issues.apache.org/jira. It should not be a problem if
>> >> >> > you modify the script temporarily, but we can readily make the
>> >> >> > script look for either of these.
>> >> >> >
>> >> >> >> d. postgresql driver is jdbc3 thus it creates problem with JVM6
>> >> >> >> or above.
>> >> >> >
>> >> >> > We use JDK 6 all the time without problems, so I don't know
>> >> >> > what you are talking about here.
>> >> >> >
>> >> >> >> e. I was getting errors during the ant build which tries to
>> >> >> >> delete jar files from lib directory. Don't have the source code
>> >> >> >> right now with me thus cant provide the full path.
>> >> >> >
>> >> >> > It sounds like you were trying to run ant while you still had
>> >> >> > ManifoldCF processes running from the same tree.
>> >> >> >
>> >> >> >> f. It was advised in the documentation to set MCF_Home for
>> >> >> >> example_multiprocess project but it seems the build of
>> >> >> >> documentum connector refers to this property differently from
>> >> >> >> run.bat.
>> >> >> >
>> >> >> > Yes, this was noticed and fixed on trunk recently.
>> >> >> >
>> >> >> >>
>> >> >> >> Can you please update the Apache ManifoldCF website with the
>> >> >> >> latest installation procedures. Also, It will be very kind of
>> >> >> >> you in the meanwhile if you can send few notes for me to head
>> >> >> >> start the configuration of ManifoldCF, with SOLR & Documentum
>> >> >> >> connector.
>> >> >> >>
>> >> >> >
>> >> >> > The documentation online has been updated to be consistent with
>> >> >> > trunk, so if you want to use the trunk version this might be a
>> >> >> > good opportunity to help clarify the documentation.  Either
>> >> >> > that or you will need to stick with the 0.4-incubating release
>> >> >> > and the 0.4-incubating documentation that is part of it; we
>> >> >> > cannot at this time update documentation that has already been
>> >> >> > released.
>> >> >> >
>> >> >> > Thanks,
>> >> >> > Karl
>> >> >> >
>> >> >> >> Looking forward for your help.
>> >> >> >>
>> >> >> >> Thanks & Regards
>> >> >> >> Anupam Bhattacharya
>> >> >> >>
>> >> >> >>
>> >> >> >>
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > --
>> >> > Thanks & Regards
>> >> > Anupam Bhattacharya
>> >> >
>> >> >
>> >
>> >
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Anupam Bhattacharya
>> >
>> >
>
>
>
>
> --
> Thanks & Regards
> Anupam Bhattacharya
>
>
