The request queue scans all 450k (99.999% of which are done towards the end) requests for each iteration #2406
Comments
Uhh, I think I got it. On a whim I decided to monitor file accesses. N is some internal number around 2, meaning it takes a good million file operations before Crawlee can do the next iteration at my scale. I see that a V2 request queue is slated for a future release. Is this new request queue moving the done requests to a less I/O-intensive place, e.g. an in-memory hashmap (set) of the uniqueKeys? Storing 400k keys in a hashmap is peanuts. That would help tremendously with performance (and disk wear!).
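The idea proposed above — keeping the handled uniqueKeys in memory — can be sketched roughly like this. This is a hypothetical helper, not Crawlee's actual API; `HandledKeyIndex` and its methods are invented names for illustration:

```javascript
// Hypothetical sketch (not Crawlee's actual implementation): keep the
// uniqueKeys of completed requests in an in-memory Set so that "was this
// URL already handled?" is an O(1) lookup with no disk I/O, instead of a
// scan over the on-disk queue directory.
class HandledKeyIndex {
  constructor() {
    this.handled = new Set();
  }

  // Record a request as done.
  markHandled(uniqueKey) {
    this.handled.add(uniqueKey);
  }

  // O(1) membership check; no file operations involved.
  isHandled(uniqueKey) {
    return this.handled.has(uniqueKey);
  }
}

// Even 400k keys are cheap to hold in memory (tens of megabytes at most).
const index = new HandledKeyIndex();
for (let i = 0; i < 400_000; i++) {
  index.markHandled(`https://example.com/page/${i}`);
}
console.log(index.isHandled('https://example.com/page/123'));    // true
console.log(index.isHandled('https://example.com/page/999999')); // false
```

The set would still need to be persisted periodically for crash recovery, but lookups during crawling would never touch the disk.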
I was hoping I could solve this with:

```js
const memoryStorage = new MemoryStorage({ persistStorage: true, writeMetadata: false })
const requestQueue = await RequestQueue.open(null, { storageClient: memoryStorage })
const crawler = new PlaywrightCrawler({
    requestQueue,
})
```

but sadly that does not help in my case. At this stage of my scrape (towards the very end) I obviously cannot afford to start from scratch and never persist the state: the list of already-scraped URLs is very important.
Same problem, after 12 hours of running (VPS, 8 GB RAM, 30 GB SSD) in a Docker container with PlaywrightCrawler:

```
$ find request_queues -type f | wc -l
```

```
INFO PlaywrightCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":15,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":1,"actualRatio":0},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.6,"actualRatio":0},"cpuInfo":{"isOverloaded":false,"limitRatio":4,"actualRatio":0},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":0}}}
```

Sometimes currentConcurrency is 0, sometimes it's 1 or more... In my case I don't need the history... do you have any trick? I tried to create a bash script that deletes all the JSON files containing "handledAt", but then the crawler freezes, and I can't stop it because it's running under Docker.
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler) but the request queue is generic. Request queue V1.
Issue description
With a config like this:
And late into the crawling:
but with very few remaining pending requests (<10), Crawlee is behaving weirdly. It idles for minutes at a time, just outputting the occasional AutoscaledPool metrics with `{"currentConcurrency":0,"desiredConcurrency":6}`. But none of the limits are reached:

Then after a while, it actually starts doing the few requests that are actually pending, then quickly goes back to idling, and the cycle continues, making very slow progress. None of the requests are failing. The limit per minute is also not reached:
What I have observed that might help debugging this:
This behaves the same whether run under `tsx` or compiled first with `tsc`.

Any idea why this is happening? How can I force concurrency to not be terrible through configs? `minConcurrency` is seemingly being ignored or overridden by another "internal" mechanism – or, more likely given the CPU usage, by some very CPU-intense processing that is O(n²) or worse, where n is the number of queued requests, including done ones, therefore making Crawlee slower and slower as scraping progresses. Thanks!

Per my following comment, this is happening because each iteration (after it's done with the pending requests) needs to scan the entire `request_queues` directory, which involves creating the `lock`, reading each file and/or writing them back. In a large crawl like mine (450k), that's 1M disk file operations just to collect the 1 to 10 newly queued requests, which completely defeats the concurrency. I would suggest making at least the done (non-failed, non-pending) uniqueKeys into an in-memory hashmap so they do not have to be scanned over and over, as in general these requests constitute the largest share of the queue.
Code sample
No response
Package version
[email protected]
Node.js version
v20.11.1
Operating system
No response
Apify platform
I have tested this on the `next` release

No response
Other context
No response