Enqueue strategy check after redirects is not working with adaptive crawler #2525

Open
1 task done
B4nan opened this issue Jun 7, 2024 · 3 comments
Assignees
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@B4nan
Member

B4nan commented Jun 7, 2024

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

Use enqueueLinks() without any parameters in the request handler when crawling https://crawlee.dev/. At some point the crawler escapes the domain and starts scraping everything.

https://console.apify.com/actors/PFaajt3k6oOp1YRAU/runs/0SfY5Ocr1dgQjhSIS#log

Code sample

import { PlaywrightCrawler } from 'crawlee';
import { Actor } from 'apify';

await Actor.init();

const crawler = new PlaywrightCrawler({
    proxyConfiguration: await Actor.createProxyConfiguration(),
});
crawler.router.addDefaultHandler(async (ctx) => {
    const $ = await ctx.parseWithCheerio();
    const title = $('html title').text();
    const h1 = $('body h1').text();
    const proxy = ctx.proxyInfo?.username;
    ctx.log.info(`processing ${ctx.request.url}`, { title, h1, proxy });
    await ctx.pushData({ url: ctx.request.url, title, h1 });
    await ctx.enqueueLinks();
});
await crawler.run(['https://crawlee.dev/']);
await Actor.exit();

Package version

3.10.3 beta

Node.js version

20

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

No response

@B4nan B4nan added the bug Something isn't working. label Jun 7, 2024
@B4nan B4nan added the t-tooling Issues with this label are in the ownership of the tooling team. label Jun 7, 2024
@janbuchar
Contributor

Thanks for the report! Do you know whether some page in the Crawlee docs redirects elsewhere, or is the actual enqueueStrategy check failing (and not the post-redirect check)?
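One way to tell the two cases apart is to compare the URL that was enqueued with the URL that was actually loaded (Crawlee exposes the latter as request.loadedUrl). The sketch below is only an illustration: classifyRequest is a hypothetical helper, not part of Crawlee, and the exact hostname comparison is a simplified stand-in for the same-hostname strategy. The example URLs are made up.

```javascript
// Hypothetical helper (not a Crawlee API): reports where an off-domain
// request slipped through, based on the enqueued URL and the loaded URL.
function classifyRequest(startUrl, request) {
    const allowedHost = new URL(startUrl).hostname;
    const enqueuedHost = new URL(request.url).hostname;
    // loadedUrl is only known after navigation; fall back to the enqueued URL.
    const loadedHost = new URL(request.loadedUrl ?? request.url).hostname;

    // The link was already off-domain when it was enqueued,
    // so the strategy check itself let it through.
    if (enqueuedHost !== allowedHost) return 'enqueued-off-domain';
    // The link was fine when enqueued, but navigation ended up elsewhere,
    // so only the post-redirect check could have caught it.
    if (loadedHost !== allowedHost) return 'redirected-off-domain';
    return 'in-domain';
}

// Illustrative inputs only.
const examples = [
    { url: 'https://crawlee.dev/docs', loadedUrl: 'https://crawlee.dev/docs' },
    { url: 'https://crawlee.dev/out', loadedUrl: 'https://example.com/landing' },
    { url: 'https://example.com/edit/page.md' },
];
const verdicts = examples.map((r) => classifyRequest('https://crawlee.dev/', r));
```

If most off-domain entries come out as enqueued-off-domain, the post-redirect check is probably not the culprit.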

@B4nan
Member Author

B4nan commented Jun 7, 2024

Looking at the storage, it feels like it's not about redirects; we have the "Edit this page" links in there too.

image

A few more links here; I don't think they come from a redirect either.

image
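To make this kind of storage inspection repeatable, the stored URLs could be tallied by hostname. This is a rough sketch under assumptions: countOffDomainHosts is a hypothetical helper, the URL list is assumed to have been exported from the request queue or dataset by some other means, and the sample URLs are invented for illustration.

```javascript
// Hypothetical helper: counts stored URLs per hostname, skipping the
// hostname of the start URL, to show where the crawl leaked to.
function countOffDomainHosts(startUrl, storedUrls) {
    const allowedHost = new URL(startUrl).hostname;
    const counts = new Map();
    for (const url of storedUrls) {
        const host = new URL(url).hostname;
        if (host === allowedHost) continue; // in-domain, not interesting here
        counts.set(host, (counts.get(host) ?? 0) + 1);
    }
    return counts;
}

// Illustrative data only, not taken from the actual run.
const leaks = countOffDomainHosts('https://crawlee.dev/', [
    'https://crawlee.dev/docs',
    'https://example.com/edit/a.md',
    'https://example.com/edit/b.md',
    'https://example.org/',
]);
```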

@B4nan
Member Author

B4nan commented Jun 7, 2024

It almost feels like the adaptive enqueueLinks is not checking the strategies at all; maybe it's not about the post-redirect check at all.
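Until the cause is found, an explicit guard could be tried as a stopgap. The sketch below assumes the guard is passed as transformRequestFunction to enqueueLinks(), which skips a request when the function returns a falsy value. Whether the affected code path honors that option is not confirmed in this thread. makeHostnameGuard is a hypothetical name.

```javascript
// Hypothetical helper: builds a filter that keeps only requests whose
// hostname exactly matches the hostname of the start URL.
function makeHostnameGuard(startUrl) {
    const allowedHost = new URL(startUrl).hostname;
    return (requestOptions) => {
        try {
            return new URL(requestOptions.url).hostname === allowedHost
                ? requestOptions // keep the request unchanged
                : false; // falsy result means "skip this request"
        } catch {
            return false; // unparsable URL, skip it as well
        }
    };
}

const guard = makeHostnameGuard('https://crawlee.dev/');

// Assumed usage inside the request handler from the code sample:
//   await ctx.enqueueLinks({ transformRequestFunction: guard });
```

This only masks the symptom; the strategy check itself would still need fixing.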
