warc-content

A simple WARC archive content browser. This tool takes WARC archives as input, indexes them, and creates a simple web page where you can browse the crawled URLs in a tree grid.
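
To illustrate the tree-grid idea, here is a minimal sketch of how crawled URLs could be grouped into a tree by host and path segment. It is not taken from warccontent.py; the function names `build_url_tree` and `print_tree` and the sample URLs are assumptions made up for this example.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into nested dicts: host first, then one level per path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        node = tree.setdefault(parts.netloc, {})
        # Skip empty segments produced by leading, trailing, or doubled slashes.
        for segment in filter(None, parts.path.split("/")):
            node = node.setdefault(segment, {})
    return tree


def print_tree(tree, indent=0):
    """Print the tree with two spaces of indentation per level."""
    for name in sorted(tree):
        print(" " * indent + name)
        print_tree(tree[name], indent + 2)


if __name__ == "__main__":
    # Hypothetical sample data, not real crawl output.
    sample = [
        "http://example.com/a/b",
        "http://example.com/a/c",
        "http://example.com/print/1",
    ]
    print_tree(build_url_tree(sample))
```

Grouping by path segment is what makes whole branches of unwanted links (for example everything under a print path) easy to spot at a glance.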

I personally will use this tool to locate useless links in crawled pages, for example calendars, print pages, and image generators.

This is how the webpage looks:

Usage

./warccontent.py ~/warcs/*.warc.gz

Wait until the data is indexed, then open http://localhost:8080/ in your browser.

Features to add in the future

content size counter
regex tool to test against URLs
multi-core support for the gzipped archives
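
The regex tool listed above does not exist yet. As a rough sketch of what such a check might look like, the snippet below tests one pattern against a list of URLs. The function name `match_urls`, the pattern, and the sample URLs are all assumptions for illustration.

```python
import re


def match_urls(pattern, urls):
    """Return the URLs in which the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


if __name__ == "__main__":
    # Hypothetical pattern for calendar and print pages.
    pattern = r"/(calendar|print)/|[?&]print="
    urls = [
        "http://example.com/calendar/2013/04",
        "http://example.com/article/1",
        "http://example.com/article/1?print=1",
    ]
    for url in match_urls(pattern, urls):
        print(url)
```

Using `search` rather than `match` lets the pattern hit anywhere in the URL, which suits path fragments and query parameters.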

Known issues

The warc-tools library doesn't handle large files within archives well. Large files can cause a MemoryError.
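
One general way to keep memory use bounded with large gzipped data is to read it in fixed-size chunks instead of loading it all at once. The sketch below shows that technique using only the Python standard library. It does not use warc-tools, does not parse WARC records, and is not a fix for the issue above; the function name `count_bytes_in_chunks` and the chunk size are assumptions.

```python
import gzip
import io


def count_bytes_in_chunks(fileobj, chunk_size=64 * 1024):
    """Count the bytes in a binary stream, holding at most one chunk in memory."""
    total = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:  # an empty bytes object signals end of stream
            break
        total += len(chunk)
    return total


if __name__ == "__main__":
    # In-memory gzip data stands in for a real .warc.gz file here.
    compressed = gzip.compress(b"x" * 200000)
    with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as stream:
        print(count_bytes_in_chunks(stream))
```

For a file on disk, `gzip.open(path, "rb")` returns a stream that can be passed to the same function.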

License

GPLv3

