Library for Rapid (Web) Crawler and Scraper Development

This library provides a framework and many ready-to-use "steps" that serve as building blocks for your own crawlers and scrapers.

To give you an overview, here's a list of things that it helps you with:

Crawler Politeness 😇 (respecting robots.txt, throttling, ...)
Load URLs using
- a (PSR-18) HTTP client (Guzzle by default)
- or a headless browser (Chrome) to get the source after JavaScript execution
Get absolute links from HTML documents 🔗
Get sitemaps from robots.txt and get all URLs from those sitemaps
Crawl (load) all pages of a website 🕷
Use cookies (or don't) 🍪
Use any HTTP method (GET, POST, ...) and send any headers or body
Iterate over paginated list pages 🔁
Extract data from:
- HTML and also XML (using CSS selectors or XPath queries)
- JSON (using dot notation)
- CSV (map columns)
Extract schema.org structured data in JSON-LD format from HTML documents
Keep memory usage low by using PHP Generators 💪
Cache HTTP responses during development, so you don't have to load pages again and again after every code change
Get logs about what your crawler is doing (accepts any PSR-3 LoggerInterface)
And a lot more...
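
To illustrate the ideas of steps as building blocks and of PHP Generators keeping memory usage low, here is a minimal sketch in plain PHP. It does not use this library's API: the functions paginatedUrls, describeUrls and runPipeline, as well as the example.com URL, are hypothetical names made up for this illustration. For the real classes and methods, see the documentation.

```php
<?php

declare(strict_types=1);

// Hypothetical illustration only: none of these functions belong to the library.

/** First step: yield one URL per page number instead of building a full array. */
function paginatedUrls(string $baseUrl, int $pages): Generator
{
    for ($page = 1; $page <= $pages; $page++) {
        yield $baseUrl . '?page=' . $page;
    }
}

/** Second step: consume URLs lazily and yield one result per URL. */
function describeUrls(iterable $urls): Generator
{
    foreach ($urls as $url) {
        // A real step would load and parse the page here.
        yield ['url' => $url, 'query' => parse_url($url, PHP_URL_QUERY)];
    }
}

/** Chain steps so that each item flows through the whole pipeline one at a time. */
function runPipeline(iterable $input, callable ...$steps): Generator
{
    foreach ($steps as $step) {
        $input = $step($input);
    }

    yield from $input;
}

$pipeline = runPipeline(paginatedUrls('https://www.example.com/list', 3), 'describeUrls');

foreach ($pipeline as $result) {
    echo $result['url'], ' => ', $result['query'], PHP_EOL;
}
```

Because every step is a generator, only one item is held in memory at a time, no matter how many pages the first step produces.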

Documentation

You can find the documentation at crwlr.software.

Contributing

If you are considering contributing to this package, read the contribution guide (CONTRIBUTING.md).
