manifoldcf-user mailing list archives

From Kayak28 <kaya.ota....@gmail.com>
Subject Re:
Date Thu, 21 Feb 2019 02:46:31 GMT
Hello, Mr. Karl Wright:

Thank you for the quick response.
As you mentioned, yes, I am writing my own Repository Connector to access the
REST API I want to use.

If I need to do more scraping than the provided html-extractor supports, then
I should write a transformation connector that does what I want.
Is that right? And it is not a good idea to do the scraping in my
Repository Connector, is it?

Again, I appreciate your replies to these basic questions.

Sincerely,
Kaya
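
The separation of concerns discussed above (fetch in the repository stage, scrape in a transformation stage, index in the output stage) can be sketched roughly as follows. This is only an illustration in plain Python, not the ManifoldCF connector API, which is Java. The names `fetch`, `scrape`, `index`, `run_pipeline`, and `TextExtractor` are hypothetical, and the REST call and the Solr update are replaced by stand-ins.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def fetch(doc_id):
    # Stand-in for the repository stage (assumption: a real connector
    # would call the REST API here and return the raw response body).
    body = "<html><body><h1>Title</h1><script>x=1</script><p>Hello</p></body></html>"
    return {"id": doc_id, "body": body}


def scrape(doc):
    # Stand-in for the transformation stage: HTML in, plain text out.
    parser = TextExtractor()
    parser.feed(doc["body"])
    parser.close()
    return {"id": doc["id"], "text": " ".join(parser.chunks)}


def index(doc, store):
    # Stand-in for the output stage (assumption: a real job would send
    # the document to Solr instead of writing to a dict).
    store[doc["id"]] = doc["text"]


def run_pipeline(doc_ids, store):
    # Each document passes through the three stages in order.
    for doc_id in doc_ids:
        index(scrape(fetch(doc_id)), store)
    return store
```

Keeping the scraping in its own stage means the fetching code does not need to change when the extraction rules do, which is the same reason for putting it in a transformation connector and not in the Repository Connector.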


On Thu, Feb 21, 2019 at 11:26 AM Karl Wright <daddywri@gmail.com> wrote:

> Hi Kaya,
>
> You should be able to use the existing Solr connector to index documents
> into Solr.
> You will probably need to write a Repository connector to access the REST
> API you describe.
> If the kind of scraping you need to do can be covered by the
> html-extractor transformer in its current form, then you can insert it into
> the pipeline between the other two connections and you should be all set.
>
> Karl
>
>
> On Wed, Feb 20, 2019 at 9:17 PM Kayak28 <kaya.ota.oss@gmail.com> wrote:
>
>> Hello, folks:
>>
>> I have a question about crawling and scraping in ManifoldCF.
>> I want to perform the following sequence of tasks using MCF:
>>
>> 1. crawl data from a RESTful API
>> 2. scrape the data
>> 3. insert the data into Apache Solr
>>
>> In this case, is this how I need to set up ManifoldCF?
>> 1. define an output connector to access the RESTful API (by using the Web
>> crawler connector or the Generic connector?)
>>
>> 2. define a transformation connector to scrape the HTML (by using the
>> html-extractor transformation connector...?)
>> 3. define the output connector to be Solr
>>
>>
>> Or do I have to use other software, such as Apache NiFi, to control the
>> sequence of these tasks?
>>
>> I would appreciate any comments and replies.
>>
>> Sincerely,
>> Kaya
>>
>>
>>
