manifoldcf-user mailing list archives

From Karl Wright <daddy...@gmail.com>
Subject Re: web crawler not sharing cookies
Date Thu, 26 Jul 2018 07:19:26 GMT
Ok, so the database for your site crawl contains both z.com and x.y.z.com
cookies?  And your site pages from domain a.y.z.com receive no cookies at
all when fetched?  Is that a correct description of the situation?

Please verify that the a.y.z.com pages are part of the protected part of
your "site".  The regular expression that describes site membership for the
login sequence you are trying to set up must include them or they will not
receive any cookies no matter what we do.
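
To make the site-membership check concrete, here is a minimal sketch of a
regular expression that covers z.com and every sub-domain under it, at any
depth. The pattern, the class name and the sample URLs are assumptions for
illustration only; your actual expression, and the way the connector applies
it (full match versus partial match), may differ.

```java
import java.util.regex.Pattern;

public class SiteMembershipCheck {
    // Hypothetical site pattern: z.com itself or any sub-domain of z.com,
    // over http or https, with an optional port and path.
    static final Pattern SITE = Pattern.compile(
        "^https?://([^/:]+\\.)?z\\.com(:\\d+)?(/.*)?$",
        Pattern.CASE_INSENSITIVE);

    // True when the whole URL matches the site pattern.
    static boolean inSite(String url) {
        return SITE.matcher(url).matches();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://z.com/",
            "http://x.y.z.com/login.jsp",
            "http://a.y.z.com/page.jsp",
            "http://notz.com/",
            "http://x.y.z.com.evil.org/"
        };
        for (String url : urls) {
            System.out.println(url + " -> " + inSite(url));
        }
    }
}
```

If a.y.z.com URLs fail a check like this against your real expression, they
are outside the "site" and will not be sent the login cookies.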

If this is set up correctly, then the only explanation is the HttpClient
cookie policy in effect for site fetches.  It does not look like we
override the cookie policy anywhere when setting up the client:

        PoolingHttpClientConnectionManager poolingConnManager = new PoolingHttpClientConnectionManager(RegistryBuilder.<ConnectionSocketFactory>create()
          .register("http", PlainConnectionSocketFactory.getSocketFactory())
          .register("https", myFactory)
          .build());
        poolingConnManager.setDefaultMaxPerRoute(1);
        poolingConnManager.setValidateAfterInactivity(2000);
        poolingConnManager.setDefaultSocketConfig(SocketConfig.custom()
          .setTcpNoDelay(true)
          .setSoTimeout(socketTimeoutMilliseconds)
          .build());
        connManager = poolingConnManager;
      }


HttpClient tends to default to "strict" when no cookie policy is specified.
I'll see if I can find out what the behavior is.
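
For reference, here is a rough sketch of the domain-matching rule from RFC
6265, section 5.1.3, under which a cookie stored for z.com would be sent to
a.y.z.com while a cookie stored for x.y.z.com would not. This is not
HttpClient's code and says nothing definite about which policy is in effect;
the class and method names are hypothetical, and the RFC's exclusion of IP
address hosts is left out for brevity.

```java
import java.util.Locale;

public class CookieDomainMatch {
    // Sketch of RFC 6265 section 5.1.3: the host matches when it equals the
    // cookie domain, or when it ends with "." followed by the cookie domain.
    // (The RFC also excludes IP address hosts; that check is omitted here.)
    static boolean domainMatches(String cookieDomain, String host) {
        String domain = cookieDomain.toLowerCase(Locale.ROOT);
        if (domain.startsWith(".")) {
            domain = domain.substring(1); // a leading dot is ignored
        }
        String h = host.toLowerCase(Locale.ROOT);
        if (h.equals(domain)) {
            return true;
        }
        return !domain.isEmpty() && h.endsWith("." + domain);
    }

    public static void main(String[] args) {
        System.out.println("z.com     vs a.y.z.com: " + domainMatches("z.com", "a.y.z.com"));
        System.out.println("x.y.z.com vs a.y.z.com: " + domainMatches("x.y.z.com", "a.y.z.com"));
        System.out.println("z.com     vs notz.com:  " + domainMatches("z.com", "notz.com"));
    }
}
```

If the z.com cookie is in the database but is still not sent to a.y.z.com,
the policy in effect is presumably stricter than this rule.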

Karl


On Thu, Jul 26, 2018 at 2:29 AM Gustavo Beneitez <gustavo.beneitez@gmail.com>
wrote:

> Hi,
>
> The database may contain Z.com and X.Y.Z.com if created automatically
> through a JSP, but not the intermediate one, Y.Z.com.
>
> If the crawler goes to A.Y.Z.com and finds Z.com in the database, it
> still doesn't work (it should, since A.Y.Z.com is a sub-domain of
> Z.com).
>
> Only after making that change by hand (replacing the domain with the
> sub-domain in the database) and restarting ManifoldCF does it begin to work.
>
> There might be security constraints somehow; I will analyze this
> further.
>
> Regards.
>
>
> On Thu, Jul 26, 2018 at 0:06, Karl Wright (<daddywri@gmail.com>)
> wrote:
>
>> The web connector, though, does not filter any cookies.  It takes them
>> all -- whatever cookies HttpClient is storing at that point.  So you should
>> see all the cookies in the database table, regardless of their site
>> affinity, unless HttpClient is refusing to accept a cookie for security
>> reasons.
>>
>> It's also possible that HttpClient is selective about which cookies to
>> transmit on a page fetch.
>>
>> Can you look in the database and tell me whether your cookie gets stored,
>> or not?  If not, then HttpClient's cookie acceptance policy is not lenient
>> enough.  If it is in the database, then it's the transmission policy that
>> is too strict.
>>
>> Thanks,
>> Karl
>>
>>
>> On Wed, Jul 25, 2018 at 4:36 PM Gustavo Beneitez <
>> gustavo.beneitez@gmail.com> wrote:
>>
>>> I agree, but the fact is that if my "login sequence" defines a login
>>> credential for domain "Z.com" and the crawler reaches "Y.Z.com" or
>>> "X.Y.Z.com", none of the sub-sites receives that cookie. I need to write
>>> the same cookie for every sub-domain, which solves the situation (and
>>> thankfully it is a language cookie and not a dynamic one).
>>>
>>> Regards.
>>>
>>> On Wed, Jul 25, 2018 at 19:17, Karl Wright (<daddywri@gmail.com>)
>>> wrote:
>>>
>>>> You should not need to fill the database by hand.  Your login sequence
>>>> should include whatever redirection etc is used to set the cookies though.
>>>>
>>>> Karl
>>>>
>>>>
>>>> On Wed, Jul 25, 2018 at 1:06 PM Gustavo Beneitez <
>>>> gustavo.beneitez@gmail.com> wrote:
>>>>
>>>>> Hi again,
>>>>>
>>>>> Thanks Karl, I was able to do that after defining a "login
>>>>> sequence", but only after also filling the database (cookiedata table)
>>>>> with certain values because of "domain restrictions".
>>>>> I suspect that before every web call ManifoldCF only takes cookies for
>>>>> the URL's exact subdomain (i.e. x.y.z.com), so if you define your cookie
>>>>> as "z.com" it won't be sent. I added every subdomain by hand and it
>>>>> started to work.
>>>>>
>>>>> Regards.
>>>>>
>>>>>
>>>>> On Fri, Jul 20, 2018 at 8:12, Gustavo Beneitez (<
>>>>> gustavo.beneitez@gmail.com>) wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Thanks a lot. Let me check the documentation for an
>>>>>> example of that.
>>>>>>
>>>>>> Regards!
>>>>>>
>>>>>> On Thu, Jul 19, 2018 at 21:54, Karl Wright (<daddywri@gmail.com>)
>>>>>> wrote:
>>>>>>
>>>>>>> You are correct that cookies are not shared among threads.  That is
>>>>>>> by design.
>>>>>>>
>>>>>>> The only way to set cookies for the WebConnector is to have there be
>>>>>>> a "login sequence".  The login sequence sets cookies that are then
>>>>>>> used by all subsequent fetches.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 19, 2018 at 3:38 PM Gustavo Beneitez <
>>>>>>> gustavo.beneitez@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi everyone,
>>>>>>>>
>>>>>>>> I have tried to look for an answer before writing this email, with
>>>>>>>> no luck. Sorry for the inconvenience if it has already been answered.
>>>>>>>>
>>>>>>>> I need to set a cookie at the beginning of the web crawl. The
>>>>>>>> cookie determines the language in which content is served, and while
>>>>>>>> there are several choices, if no cookie is found there will be a
>>>>>>>> "default language".
>>>>>>>>
>>>>>>>> I made a JSP which sets the cookie and contains several links
>>>>>>>> (href), and pointed ManifoldCF to this page as the repository seed. I
>>>>>>>> expected the crawling engine to fetch the links in the language
>>>>>>>> indicated by the cookie, but what I actually got was a lot of content
>>>>>>>> shown in the default language.
>>>>>>>>
>>>>>>>> My guess is that cookies are not shared between spider threads, so
>>>>>>>> cookies do not persist between links.
>>>>>>>> The cookie domain is correct, and so is the cookie expiration.
>>>>>>>>
>>>>>>>> I would very much appreciate any help with this.
>>>>>>>>
>>>>>>>> Thanks in advance!
>>>>>>>>
>>>>>>>>
>>>>>>>>
