When does the crawler stop #182

Open
forgeries opened this issue Dec 25, 2020 · 4 comments

Comments

@forgeries

When does the crawler stop?

@mkdir700

This is my code; it's an extension.

from scrapy.exceptions import NotConfigured
from twisted.internet import task
from scrapy import signals


class AutoCloseSpider(object):
    """
    scrapy_redis extension: closes the spider when no new requests have
    been downloaded for several consecutive check intervals.

    Settings
    --------
    CLOSE_SPIDER_INTERVAL : float
        Seconds between checks of the downloader request count.
    ZERO_THRESHOLD : int
        Number of consecutive checks with zero new requests before the
        spider is closed.
    """

    def __init__(self, crawler, stats, interval=60.0, threshold=3):
        self.crawler = crawler
        self.stats = stats
        self.interval = interval
        self.threshold = threshold
        self.task = None

    @classmethod
    def from_crawler(cls, crawler):
        interval = crawler.settings.getfloat('CLOSE_SPIDER_INTERVAL')
        threshold = crawler.settings.getint('ZERO_THRESHOLD')

        if not interval and not threshold:
            raise NotConfigured

        stats = crawler.stats
        o = cls(crawler, stats, interval, threshold)
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(o.spider_closed, signal=signals.spider_closed)
        return o

    def spider_opened(self, spider):
        # Request count observed at the previous check.
        self.request_count_prev = 0
        # Number of consecutive checks with no new requests.
        self.zero_count = 0
        self.task = task.LoopingCall(self.increment, spider)
        self.task.start(self.interval)

    def increment(self, spider):
        # How many requests the downloader has made so far.
        request_count = self.stats.get_value('downloader/request_count', 0)
        # Delta between this check and the previous one.
        inc = request_count - self.request_count_prev
        self.request_count_prev = request_count

        if inc == 0:
            self.zero_count += 1
        else:
            self.zero_count = 0

        # If the delta has been zero for threshold consecutive checks,
        # close the spider.
        if self.zero_count >= self.threshold:
            self.crawler.engine.close_spider(spider, 'closespider_zerocount')

    def spider_closed(self, spider, reason):
        if self.task and self.task.running:
            self.task.stop()
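
To enable it, the extension has to be registered in the project settings. A minimal sketch, assuming the class lives at myproject.extensions.AutoCloseSpider (adjust the path to your own project):

# settings.py
EXTENSIONS = {
    'myproject.extensions.AutoCloseSpider': 500,
}

# How often (in seconds) to check the downloader request count.
CLOSE_SPIDER_INTERVAL = 60.0
# How many consecutive zero-delta checks before the spider is closed.
ZERO_THRESHOLD = 3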

@liuyuer

liuyuer commented Apr 20, 2021


How do I restart it after it has stopped, if I want to work on a new target?

@rmax
Owner

rmax commented Apr 20, 2021

@liuyuer could you expand on your use case?

I like to recycle processes so memory doesn't pile up over time. You could make your crawler close after being idle for some time or after reaching a certain threshold (e.g. domains scraped, memory usage, etc.), and have an external process that makes sure at least X crawlers are running.
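
A rough sketch of that external-process pattern, assuming each crawl is launched with "scrapy crawl myspider" (the spider name and process count below are placeholders):

# watchdog.py - keeps N crawler processes alive, restarting one whenever
# it exits (for example after the extension above closes the spider).
import subprocess
import time

N_CRAWLERS = 2                         # placeholder: desired process count
CMD = ['scrapy', 'crawl', 'myspider']  # placeholder spider name

procs = []
while True:
    # Drop processes that have already exited.
    procs = [p for p in procs if p.poll() is None]
    # Spawn new ones until the desired count is reached again.
    while len(procs) < N_CRAWLERS:
        procs.append(subprocess.Popen(CMD))
    time.sleep(5)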

@liuyuer

liuyuer commented Apr 20, 2021

My use case is:

  1. The crawler runs as a service; when it reaches a threshold, it stops itself with self.crawler.engine.close_spider.
  2. The crawler should restart (or resume) when it receives a new target to work on; see the sketch after this list.
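
With scrapy-redis a full restart may not be necessary for this: a RedisSpider sits idle polling its Redis key and resumes as soon as a new start URL is pushed. A rough sketch, assuming the default key pattern and a local Redis instance (the spider name and URL are placeholders):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)
# Default key for a RedisSpider named "myspider" is "myspider:start_urls".
r.lpush('myspider:start_urls', 'https://example.com/new-target')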

My problem was:

  1. The crawler could not restart after it was stopped by self.crawler.engine.close_spider.
  2. I need to clean up the Redis keys so that results from different runs do not get mixed up (see the cleanup sketch at the end of this comment).

What I did:

  1. I am using scrapydo to take care of spawning a new process, so I can restart Scrapy (not scrapy-redis) in a new process.

@rmax I am not sure if that is the correct way to handle it. I also worry about the memory issue. If you can share more about how you recycle processes and clean up, that would be very helpful.
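
For the cleanup part, a rough sketch of flushing the scrapy-redis keys between runs, assuming the default key patterns and a local Redis instance (the spider name is a placeholder). scrapy-redis also has a SCHEDULER_FLUSH_ON_START setting that flushes the queue and dupefilter on startup, which may cover the same need:

import redis

SPIDER_NAME = 'myspider'  # placeholder

r = redis.Redis(host='localhost', port=6379, db=0)
r.delete(
    f'{SPIDER_NAME}:requests',    # pending request queue
    f'{SPIDER_NAME}:dupefilter',  # seen-request fingerprints
    f'{SPIDER_NAME}:start_urls',  # queued start URLs
)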
