When does the crawler stop #182
This is my code; it is an extension.

```python
from scrapy.exceptions import NotConfigured
from twisted.internet import task
from scrapy import signals


class AutoCloseSpider(object):
    """scrapy_redis extension plugin.

    Parameters
    ----------
    CLOSE_SPIDER_INTERVAL : float
        How often, in seconds, to check the request counter.
    ZERO_THRESHOLD : int
        Consecutive zero-increment checks allowed before closing the spider.
    """

    def __init__(self, crawler, stats, interval=60.0, threshold=3):
        self.crawler = crawler
        self.stats = stats
        self.interval = interval
        self.threshold = threshold
        self.task = None

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat('CLOSE_SPIDER_INTERVAL')
        threshold = crawler.settings.getint('ZERO_THRESHOLD')
        if not interval and not threshold:
            raise NotConfigured
        stats = crawler.stats
        o = cls(crawler, stats, interval, threshold)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
        return o

    def spider_opened(self, spider):
        # Request count seen at the previous check.
        self.request_count_prev = 0
        # Number of consecutive checks that saw no new requests.
        self.zero_count = 0
        self.task = task.LoopingCall(self.increment, spider)
        self.task.start(self.interval)

    def increment(self, spider):
        # Check whether the crawler has issued any new requests.
        request_count = self.stats.get_value('downloader/request_count', 0)
        # This check's count minus the previous check's count.
        inc = request_count - self.request_count_prev
        self.request_count_prev = request_count
        if inc == 0:
            self.zero_count += 1
        else:
            self.zero_count = 0
        # If the increment has been zero for `threshold` consecutive checks,
        # close the spider proactively.
        if self.zero_count >= self.threshold:
            self.crawler.engine.close_spider(spider, 'closespider_zerocount')

    def spider_closed(self, spider, reason):
        if self.task and self.task.running:
            self.task.stop()
```
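For reference, a minimal sketch of how this extension could be wired up in the project settings. The module path `myproject.extensions` is an assumption; only `CLOSE_SPIDER_INTERVAL` and `ZERO_THRESHOLD` come from the code above.

```python
# settings.py -- hypothetical configuration for the extension above.
# "myproject.extensions" is a placeholder module path.
EXTENSIONS = {
    'myproject.extensions.AutoCloseSpider': 500,
}

# Check the downloader/request_count stat every 60 seconds.
CLOSE_SPIDER_INTERVAL = 60.0
# Close the spider after 3 consecutive checks with no new requests.
ZERO_THRESHOLD = 3
```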
How do I restart it after it has stopped, if I want to work on a new target?
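One possible approach, assuming a scrapy_redis `RedisSpider` that reads start URLs from the default `<spider name>:start_urls` Redis list: push the new target onto that list, then launch a fresh crawler process (the old one has already exited). The spider name `myspider` and the local Redis connection are placeholders.

```python
import subprocess
import redis

# Assumptions: the spider is a scrapy_redis RedisSpider named "myspider",
# its redis_key is the default "myspider:start_urls", and Redis runs locally.
r = redis.StrictRedis(host='localhost', port=6379, db=0)

# Queue the new target first so the spider has work as soon as it starts.
r.lpush('myspider:start_urls', 'http://example.com/new-target')

# Launch a fresh crawler process; the previous one has already exited.
subprocess.Popen(['scrapy', 'crawl', 'myspider'])
```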
@liuyuer could you expand on your use case? I like to recycle processes so memory doesn't pile up over time. You could make your crawler close after being idle for some time or after reaching a certain threshold (e.g. domains scraped, memory usage, etc.), and have an external process that monitors that at least X crawlers are running.
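A minimal sketch of the external watchdog idea described above: keep N crawler processes alive and replace any that exit (for example, after the idle-close extension fires), so memory is reclaimed with each process restart. `myspider`, `N`, and the poll interval are placeholders.

```python
import subprocess
import time

N = 2          # desired number of concurrent crawler processes (placeholder)
procs = []

while True:
    # Drop processes that have already finished.
    procs = [p for p in procs if p.poll() is None]
    # Top back up to N running crawlers.
    while len(procs) < N:
        procs.append(subprocess.Popen(['scrapy', 'crawl', 'myspider']))
    time.sleep(30)
```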
My use case is:
My problem was:
What I did:
@rmax I am not sure if that is the correct way to handle it. I also worry about the memory issue. If you can share more about how you recycle processes and clean up, that would be very helpful.
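For the memory concern specifically, Scrapy's built-in MemoryUsage and CloseSpider extensions can stop a crawl at a memory or page threshold, which pairs with an external watchdog that restarts exited processes. The numbers below are placeholders, not recommendations.

```python
# settings.py -- placeholder limits; tune to your environment.
# MemoryUsage extension (Unix only): stop the spider if the process
# exceeds the limit, so a watchdog can start a fresh one.
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 1024

# CloseSpider extension: hard caps on runtime / pages as an extra safety net.
CLOSESPIDER_TIMEOUT = 3600
CLOSESPIDER_PAGECOUNT = 100000
```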