martinsbalodis/warc-content


simple warc archive content browser

License: Other

Language: Python


warc-content

Simple WARC archive content browser. This tool takes WARC archives as input, indexes them, and creates a simple web page where you can browse the crawled URLs in a tree grid.
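To illustrate the tree-grid idea, here is a minimal sketch of how crawled URLs could be grouped into a tree by host and path segment. It is not taken from warccontent.py; the names `build_url_tree` and `print_tree` and the example URLs are assumptions made for illustration only.

```python
from urllib.parse import urlsplit


def build_url_tree(urls):
    """Group URLs into a nested dict keyed by host, then by path segment."""
    tree = {}
    for url in urls:
        parts = urlsplit(url)
        # The host is the top-level node; each non-empty path segment is a child.
        segments = [parts.netloc] + [s for s in parts.path.split("/") if s]
        node = tree
        for segment in segments:
            node = node.setdefault(segment, {})
    return tree


def print_tree(tree, depth=0):
    """Print the tree with two spaces of indentation per level."""
    for name in sorted(tree):
        print("  " * depth + name)
        print_tree(tree[name], depth + 1)


# Hypothetical crawled URLs, used only for this example.
urls = [
    "http://example.com/news/2016/a.html",
    "http://example.com/news/2016/b.html",
    "http://example.com/print/a.html",
]
tree = build_url_tree(urls)
print_tree(tree)
```

Grouping by path segment makes it easy to spot a whole branch (such as `print/`) that is not worth crawling.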

Personally, I will use this tool to locate useless links in crawled pages, for example calendars, print pages, and image generators.

This is how the webpage looks:

[Image: warc content webpage example]

Usage

./warccontent.py ~/warcs/*.warc.gz

Wait until the data is indexed, then open http://localhost:8080/ in your browser.

Features to add in the future

  • content size counter
  • regex tool to test against URLs
  • multiple core support for those gzipped archives
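The planned regex tool is not implemented yet, but the idea can be sketched with the standard `re` module. The function name `match_urls`, the pattern, and the example URLs below are hypothetical and only show how a pattern could be tested against a list of crawled URLs.

```python
import re


def match_urls(pattern, urls):
    """Return the URLs that the regular expression matches anywhere."""
    regex = re.compile(pattern)
    return [url for url in urls if regex.search(url)]


# Hypothetical URLs; the pattern targets calendar and print pages.
urls = [
    "http://example.com/calendar/2016-09",
    "http://example.com/print/article-1",
    "http://example.com/article-1",
]
useless = match_urls(r"/(calendar|print)/", urls)
print(useless)
```

A pattern that matches only the unwanted branches could then serve as an exclusion rule for the crawler.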

Known issues

  • The warc-tools library doesn't handle large files within archives well. Large files can cause a MemoryError.
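One general way to avoid this kind of MemoryError is to read compressed data in fixed-size chunks instead of loading a whole file into memory. The sketch below uses only the standard `gzip` module, not warc-tools, and does not parse WARC records. The function name `count_bytes_streaming` and the chunk size are assumptions for illustration.

```python
import gzip
import io


def count_bytes_streaming(fileobj, chunk_size=64 * 1024):
    """Count decompressed bytes while holding at most one chunk in memory."""
    total = 0
    with gzip.GzipFile(fileobj=fileobj) as gz:
        while True:
            chunk = gz.read(chunk_size)
            if not chunk:  # an empty read means end of stream
                break
            total += len(chunk)
    return total


# In-memory stand-in for a large gzipped file (hypothetical data).
payload = b"x" * 1_000_000
compressed = io.BytesIO(gzip.compress(payload))
size = count_bytes_streaming(compressed)
print(size)
```

The same chunked loop could also back the planned content size counter, since it only needs the running total and never the full content.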

License

GPLv3

Project Statistics

Sourcerank 4
Repository Size 172 KB
Stars 5
Forks 1
Watchers 2
Open issues 0
Dependencies 0
Contributors 1
Tags 0

Top Contributors

Mārtiņš Balodis

Last synced: 2016-09-27 02:29:55 UTC