Cookiejars exposed #6218
base: master
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #6218      +/-   ##
==========================================
+ Coverage   88.48%   88.67%   +0.18%
==========================================
  Files         160      161       +1
  Lines       11607    11792     +185
  Branches     1883     1912      +29
==========================================
+ Hits        10271    10457     +186
+ Misses       1009     1007       -2
- Partials      327      328       +1
```
I haven't read the original ticket recently, but why is this feature optional?
This is how it works on the current version of the PR.

script_sample.py

```python
import scrapy
from scrapy.crawler import CrawlerProcess


class Quotes(scrapy.Spider):
    name = "quotes"
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def start_requests(self):
        yield scrapy.Request(url="https://quotes.toscrape.com/login", callback=self.login)

    def login(self, response):
        self.logger.info(self.cookie_jars[None])      # scrapy.http.cookies.CookieJar object
        self.logger.info(self.cookie_jars[None].jar)  # http.cookiejar.CookieJar object
        locale_cookie = self.cookie_jars[None]._cookies["quotes.toscrape.com"]["/"].get("session")
        locale_cookie.value = locale_cookie.value.upper()
        self.logger.info(self.cookie_jars[None].jar)


if __name__ == "__main__":
    p = CrawlerProcess()
    p.crawl(Quotes)
    p.start()
```

log_output (fragment)
I'm slightly hesitant about setting a spider attribute from a middleware, and I wonder if maybe it should be set from a different place or in a different way (e.g. the crawler), but in general I'm fine with the approach. @kmike Any thoughts on the general approach? Should @GeorgeA92 go on with tests and docs?
Hey! My main worry is the obscure API, which we'd need to document & support in the future. It'd require good documentation to explain a line like
It also needs access to a private property (._cookies).
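For context on why that access pattern feels obscure: the underlying stdlib http.cookiejar.CookieJar keeps its cookies in exactly such a nested private dict, keyed domain → path → name. A minimal stdlib-only sketch (the domain and cookie value below are made up for illustration):

```python
# Stdlib-only illustration of the nested layout that ._cookies exposes:
#   jar._cookies[domain][path][name] -> http.cookiejar.Cookie
from http.cookiejar import Cookie, CookieJar


def make_cookie(name: str, value: str, domain: str, path: str = "/") -> Cookie:
    # Cookie() takes a long positional signature; fill the minimum sensibly.
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={}, rfc2109=False,
    )


jar = CookieJar()
jar.set_cookie(make_cookie("session", "abc123", "quotes.toscrape.com"))
# The access pattern the PR sample relies on:
print(jar._cookies["quotes.toscrape.com"]["/"]["session"].value)  # abc123
```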
Another option is to update Lines 18 to 30 in 1d11ea3
```
@@ -293,6 +293,52 @@ Here's an example of a log with :setting:`COOKIES_DEBUG` enabled::

     [...]

+.. _cookiejars:
+
+Direct access to cookiejars from spider
```
Maybe this section could go before COOKIES_ENABLED?
I am not sure about this. Obviously COOKIES_ENABLED is more important than this, so I think this section should come after it, as it is placed now.
@GeorgeA92 @kmike @wRAR What about an API like this?

```python
# cookie middleware code
from scrapy import Request
from http.cookies import SimpleCookie

_UNSET = object()


# We define a Cookie class that we can extend in the future, based on the
# Python API for server-side cookie handling.
class Cookie(SimpleCookie):
    pass


# We define functions, in line with the get_retry_request approach, that can be
# used to easily interact with the cookies of a request. We extract the cookie
# jar ID and domain from the request, the user indicates the key and optionally
# the path.
def get_cookie(request: Request, key: str, path="/") -> Cookie:
    ...


def pop_cookie(request: Request, key: str, path="/", default=_UNSET) -> Cookie:
    ...


def set_cookie(request: Request, cookie: Cookie) -> None:
    ...
```
Co-authored-by: Adrián Chaves <[email protected]>
Isn't it like this?

```python
def get_cookie(self, request: Request, key: str, path="/") -> Cookie:
    jar = self.jars[request.meta.get("cookiejar")]
    cookie = jar._cookies[key][path]
    return cookie
```

This looks clean enough and doesn't require users to touch any other APIs, or am I missing something?
Aimed to fix #1878
based on suggestion from #1878 (comment)