manifoldcf-user mailing list archives

From ritika jain <>
Subject Re: Extraction and storing parent URL while crawling
Date Mon, 11 May 2020 13:35:17 GMT
Hello Users,

Could anybody please respond to this? It would be highly appreciated.

On Fri, Apr 3, 2020 at 2:28 PM ritika jain <> wrote:

> Hi All,
> I am using ManifoldCF 2.14 to crawl data from a website, with the Web
> repository connector and Elasticsearch as the output connector.
> I would like to understand the crawling framework/hierarchy used by the
> web crawler.
> As far as I understand, URLs are crawled in a tree-structured manner.
> I want to know whether ManifoldCF currently supports storing the parent
> URL of a document.
> For example, the seed URL is: and at the 80th position in the document
> queue our document identifier is
> Is there any way ManifoldCF stores the back-traced URLs, that is, the
> hierarchy levels through which the 80th document was reached?
> For instance, storing the 79th, 78th, and 77th levels of the crawl that
> lead from the seed document to the 80th document.
> Is this crawling hierarchy (even a single level) stored anywhere in the
> ManifoldCF code yet? If so, is this framework code available as a jar?
> If not in a jar, any clue as to which Java file implements this logic
> would be really helpful.
> Any kind of clue or help would be really appreciated.
> Many Thanks
> Ritika
