manifoldcf-user mailing list archives

From Kayak28 <>
Subject Re:
Date Thu, 21 Feb 2019 02:46:31 GMT
Hello, Mr. Karl Wright:

Thank you for the quick response.
As you mentioned, yes, I am writing my own Repository Connector to access the
REST API I want to use.

If I need to do more scraping than the html-extractor provides, then I should
write a transformer connector that works the way I want.
Is that right? And it is not a good idea to do the scraping in my
Repository Connector, is it?

Again, I appreciate your replies to these basic questions.


On Thu, Feb 21, 2019 at 11:26, Karl Wright <> wrote:

> Hi Kaya,
> You should be able to use the existing Solr connector to index documents
> into Solr.
> You will probably need to write a Repository connector to access the REST
> api you describe.
> If the kind of scraping you need to do can be covered by the
> html-extractor transformer in its current form, then you can insert it into
> the pipeline between the other two connections and you should be all set.
> Karl
> On Wed, Feb 20, 2019 at 9:17 PM Kayak28 <> wrote:
>> Hello, folks:
>> I have a question about crawling and scraping in ManifoldCF.
>> I want to perform the following sequence of tasks using MCF:
>> 1. crawl data from a RESTful API
>> 2. scrape the data
>> 3. insert the data into Apache Solr
>> In this case, I think I need to set up ManifoldCF as follows:
>> 1. define an output connector to access the RESTful API (using the Web
>> crawler connector or the Generic connector?)
>> 2. define a transformer connector to scrape the HTML (using the
>> html-extractor transformer connector?)
>> 3. define the output connector to be Solr
>> Or do I have to use other software, such as Apache NiFi, to control the
>> sequence of these tasks?
>> I appreciate any comments and replies.
>> Sincerely,
>> Kaya
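
The three-stage flow discussed in this thread (fetch records from a REST API, reduce the HTML to text, hand the result to an index) can be sketched in plain Python as below. This is only a conceptual illustration and not ManifoldCF code: the function names, the record layout, and the sample data are all hypothetical, the fetch stage returns canned records instead of calling a real API, and the last stage only builds the document rather than posting it to Solr.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def fetch_records():
    """Stage 1 (repository side): hypothetical stand-in for a REST API call."""
    return [
        {"id": "doc-1",
         "html": "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"},
        {"id": "doc-2",
         "html": "<p>Second</p><script>var x = 1;</script>"},
    ]


def extract_text(html):
    """Stage 2 (transformation): reduce HTML to plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def to_index_doc(record):
    """Stage 3 (output side): build the document that would be indexed."""
    return {"id": record["id"], "content": extract_text(record["html"])}


def run_pipeline():
    """Run the three stages in order and return the documents to index."""
    return [to_index_doc(record) for record in fetch_records()]
```

Keeping each stage as a separate function mirrors the advice in the thread: the fetching side only retrieves content, and any scraping beyond basic text extraction lives in its own transformation step between the source and the index.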
