manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: Need Help on setting up ManifoldCF
Date Thu, 23 Feb 2012 19:14:22 GMT
Glad it is working for you!

Solr is almost infinitely flexible, so you have many options.

In my opinion, the best way to convert binary documents to indexable
text is indeed to use Solr Cell.  Solr Cell is built on Tika, so
you won't need to bring in Tika separately; it should already be
there.  Tika has a pipeline architecture, which should suit your use
case well.  It should thus be possible to configure the existing
update handler to use Solr Cell, and to configure Solr Cell's Tika
instance to perform whatever transformations you need.
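
To make the goal concrete, here is a minimal Python sketch of what
"indexing tag values as fields" amounts to for the sample XML quoted
below.  It is only an illustration using the standard library, not
Solr Cell or Tika configuration; the function name xml_tags_to_fields
is hypothetical, and the field names simply mirror the tags in the
sample.

```python
import xml.etree.ElementTree as ET

# Sample document copied from the quoted message in this thread.
SAMPLE = """<doc>
<object_id>111</object_id>
<abstract>Abstract Text</abstract>
<citation>Citation Text</citation>
<publication>News Source</publication>
</doc>"""


def xml_tags_to_fields(xml_text):
    """Map each direct child tag of the root element to its trimmed text.

    Hypothetical helper for illustration only; a real setup would do
    the equivalent mapping inside Solr (or in Tika) rather than here.
    """
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # child.text is None for empty elements, so fall back to "".
        fields[child.tag] = (child.text or "").strip()
    return fields


fields = xml_tags_to_fields(SAMPLE)
for name, value in fields.items():
    print(name, "=", value)
```

Each tag name becomes a candidate index field and its text becomes the
field value, which is the mapping the index side would need to
reproduce however it ends up being configured.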

Hope this helps.  For further Solr questions, you can always ask on
the Solr user list.  A Tika user list is also available.

Thanks,
Karl

On Thu, Feb 23, 2012 at 2:04 PM, Anupam Bhattacharya
<anupamb82@gmail.com> wrote:
> Hello Karl,
>
> Finally, I was able to index all the metadata for the defined document types
> with different content types. Everything went well.
> However, I was not able to index the full-text content of the files (such
> as PDF and XML). I read about SOLR Cell, where documents can be uploaded
> using CURL, but our XML files contain tags and values which also need to be
> indexed.
> e.g, some XML structure..
>
> <doc>
> <object_id>111</object_id>
> <abstract>Abstract Text</abstract>
> <citation>Citation Text</citation>
> <publication>News Source</publication>
> </doc>
>
> I found that in SOLR, if we add a new request handler extending the
> ExtractingRequestHandler, we can parse the documents, fetch information, and
> add it as index fields in the SOLR index.
>
> What is the ideal approach for indexing tag values from XML in Lucene when
> going from ManifoldCF to SOLR? Is it necessary to integrate TIKA for this?
> I found a good post over here.. https://community.emc.com/docs/DOC-6520
>
> Appreciate your advice on this.
>
> Regards
> Anupam
>
>
>
>
> On Thu, Feb 16, 2012 at 12:17 AM, Karl Wright <daddywri@gmail.com> wrote:
>>
>> On Wed, Feb 15, 2012 at 1:13 PM, Anupam Bhattacharya
>> <anupamb82@gmail.com> wrote:
>> > Hello Karl,
>> >
>> > Thanks for adding this to the JIRA system.
>> >
>> > The dfc.properties file was introduced from Documentum 6.0 onwards, and
>> > per the ManifoldCF connector documentation
>> > (http://incubator.apache.org/connectors/en_US/included-connectors.html)
>> > the out-of-the-box connector classes were tested against DFC 5.3 SP5,
>> > which needed the dmcl.ini file. Thus run.bat must have been configured
>> > properly for that dmcl.ini.
>>
>> Right - so does DFC 6.0 on Windows require the DOCUMENTUM environment
>> variable to be set to point at the directory where dfc.properties is
>> found?  Or perhaps it doesn't require the DOCUMENTUM environment
>> variable at all anymore?
>>
>> >
>> > As I am trying to connect to DFC 6.5 SP3, I need to look for the
>> > dfc.properties file. I hope the out-of-the-box Documentum connector will
>> > work with the 6.5 version.
>>
>> It was tried and worked.  The script was developed later with only the
>> 5.3 version available.
>>
>> >
>> > I am confused about why every connector has a Client and a Server
>> > version. Can you please explain?
>> >
>>
>> Do you mean "why is there a documentum-connector-server" process?  If
>> that's the question, it was created for two reasons:
>> (1) We had problems with stability of DFC.  It segfaults occasionally,
>> somewhere in its native code.  We did not want that to bring down
>> ManifoldCF, and we wanted to be able to restart the part of the
>> connector that depended on DFC transparently when it crashed.
>> (2) DFC has dependencies on many older open-source jars that conflict
>> with the rest of ManifoldCF.  If (1) was not a problem we might have
>> used a classloader to fix this, but since we had to fix both we
>> created a separate process.
>>
>> FWIW, we do the same thing for FileNet because of its dependency on Wasp.
>>
>> Karl
>>
>> > Again, Thanks for all the help.
>> >
>> > Regards
>> > Anupam
>> >
>> >
>> > On Wed, Feb 15, 2012 at 8:42 PM, Karl Wright <daddywri@gmail.com> wrote:
>> >>
>> >> Hi Anupam,
>> >>
>> >> I did not see a ticket from you about the DOCUMENTUM environment
>> >> variable and the dmcl.ini vs. dfc.properties file.  I've created an
>> >> issue at https://issues.apache.org/jira/browse/CONNECTORS-410 to track
>> >> this problem.  It would be great if you could confirm that: (a) the
>> >> DOCUMENTUM environment variable is still needed at all by DFC, and (b)
>> >> that when it is set properly, the file dfc.properties can be found at
>> >> $DOCUMENTUM\dfc.properties (on Windows, at least).
>> >>
>> >> Thanks,
>> >> Karl
>> >>
>> >> On Tue, Feb 14, 2012 at 3:23 PM, Karl Wright <daddywri@gmail.com>
>> >> wrote:
>> >> > Hi Anupam,
>> >> >
>> >> > Please post emails like this directly to
>> >> > connectors-user@incubator.apache.org.  See below for responses.
>> >> >
>> >> > On Tue, Feb 14, 2012 at 3:07 PM, Anupam Bhattacharya
>> >> > <anupamb82@gmail.com> wrote:
>> >> >>
>> >> >> Hello Karl,
>> >> >>
>> >> >> I am a software programmer in DuPont, Gurgaon, India. Recently, due
>> >> >> to the economic instability all over the world the company has
>> >> >> decided to go for cheaper Search Engine Applications. Thus we are
>> >> >> getting rid of many costly proprietary Search Applications and will
>> >> >> be replacing with FAST.
>> >> >> However, I recently came across the SOLR search engine and the
>> >> >> ManifoldCF connector framework. Thus, I am currently driving this
>> >> >> effort within my company, as I am a big supporter of open source
>> >> >> technologies. I started my career in Alfresco CMS and am now working
>> >> >> on Search Technologies.
>> >> >>
>> >> >> Currently I am facing lots of initial building/deploying/installing
>> >> >> issues. I have already referred to the URL
>> >> >> http://incubator.apache.org/connectors/en_US/how-to-build-and-deploy.html
>> >> >> and read it multiple times, but I still face many issues. I
>> >> >> downloaded the latest 0.4 version and it seems the documentation at
>> >> >> the above link is not up to date.
>> >> >>
>> >> >
>> >> > The online documentation is pertinent to trunk.  The documentation
>> >> > you
>> >> > want to use is contained within the 0.4-incubating release.  Go to
>> >> > dist/doc and you will see it there.
>> >> >
>> >> >> A few issues which took me a long time to resolve, and which could
>> >> >> be added to the ManifoldCF wiki as learnings for others, are listed
>> >> >> below:
>> >> >> a. Not a single example is given for running executecommand.bat
>> >> >> with proper arguments. Only a list of commands is given, with their
>> >> >> parameters defined.
>> >> >
>> >> > I'm not entirely sure I get this.  Do you just want an example in
>> >> > the documentation?
>> >> >
>> >> >> b. Where to set the property manifoldcf.configfile, and which file
>> >> >> it should point to, when deploying the war on Tomcat with a
>> >> >> PostgreSQL database.
>> >> >
>> >> > The documentation already tells you that you need to add an
>> >> > appropriate -D to your tomcat invocation to point to your
>> >> > properties.xml file.  Tomcat documentation differs from version to
>> >> > version and platform to platform on how best to do that, and if you
>> >> > run under Windows there's even a service wrapper with a configuration
>> >> > UI that allows you to set these parameters.  So it's way beyond
>> >> > ManifoldCF's mission to describe all that, I think.
>> >> >
>> >> >> c. I am trying to build the Documentum Connector but came to know
>> >> >> that some additional environment variables need to be added for
>> >> >> "DOCUMENTUM". Additionally, the latest version of Documentum uses
>> >> >> the dfc.properties file while run.bat looks for the dmcl.ini file.
>> >> >
>> >> > Could you open a ticket in Jira for this issue?
>> >> > https://issues.apache.org/jira. It should not be a problem if you
>> >> > modify the script temporarily, but we can readily make the script
>> >> > look
>> >> > for either of these.
>> >> >
>> >> >> d. The PostgreSQL driver is jdbc3, thus it creates problems with
>> >> >> JVM 6 or above.
>> >> >
>> >> > We use JDK 6 all the time without problems, so I don't know what you
>> >> > are talking about here.
>> >> >
>> >> >> e. I was getting errors during the ant build, which tries to delete
>> >> >> jar files from the lib directory. I don't have the source code with
>> >> >> me right now, thus I can't provide the full path.
>> >> >
>> >> > It sounds like you were trying to run ant while you still had
>> >> > ManifoldCF processes running from the same tree.
>> >> >
>> >> >> f. It was advised in the documentation to set MCF_Home for the
>> >> >> example_multiprocess project, but it seems the build of the
>> >> >> Documentum connector refers to this property differently from
>> >> >> run.bat.
>> >> >
>> >> > Yes, this was noticed and fixed on trunk recently.
>> >> >
>> >> >>
>> >> >> Can you please update the Apache ManifoldCF website with the latest
>> >> >> installation procedures? Also, it would be very kind of you if, in
>> >> >> the meanwhile, you could send a few notes to give me a head start on
>> >> >> the configuration of ManifoldCF with SOLR and the Documentum
>> >> >> connector.
>> >> >>
>> >> >
>> >> > The documentation online has been updated to be consistent with
>> >> > trunk,
>> >> > so if you want to use the trunk version this might be a good
>> >> > opportunity to help clarify the documentation.  Either that or you
>> >> > will need to stick with the 0.4-incubating release and the
>> >> > 0.4-incubating documentation that is part of it; we cannot at this
>> >> > time update documentation that has already been released.
>> >> >
>> >> > Thanks,
>> >> > Karl
>> >> >
>> >> >> Looking forward to your help.
>> >> >>
>> >> >> Thanks & Regards
>> >> >> Anupam Bhattacharya
>> >> >>
>> >> >>
>> >> >>
>> >
>> >
>> >
>> >
>> > --
>> > Thanks & Regards
>> > Anupam Bhattacharya
>> >
>> >
>
>
>
>
> --
> Thanks & Regards
> Anupam Bhattacharya
>
>
