MediaCAT is an open-source, web-based application with a curated search engine. It crawls designated news websites and Twitter accounts for citations of, or hyperlinks to, a list of source sites. MediaCAT then archives all referring stories and source stories in preparation for an advanced analysis of the relations across the digital news-scape.

Voyage currently has 2 components:

  • Web Server lets you edit and display all the stored data, as well as the scopes you provide to Explorer, through your favorite browser.
  • Explorer searches the web using the scopes given through the Web Server and explores on your behalf. It automatically stores all relevant information found along the way, so that you can view the results through the Web Server.


Before installation, verify that you meet the following requirements:

The required Python version should already be installed on Debian Jessie (and up), as well as Ubuntu 14.04 LTS (and up). You can check your current version by running python --version

Note: The project is currently supported up to Python 3.5.2 on Ubuntu 16.04. Work is in progress to make the project compatible with Python 3.6.9 on Ubuntu 18.04.

If your Python version differs from Python 3.5, we highly recommend using virtual environment tools (such as pyenv) to help manage multiple Python versions.

Typically, python runs Python 2 and python3 runs Python 3. Inside a Python virtual environment, python runs whatever version that environment is set to.
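As a rough sketch of the version check described above, the snippet below compares the running interpreter against the 3.5 series this README names. The helper name is_supported_python and the exact accepted range are assumptions made for illustration; they are not part of the project.

```python
import sys

# Assumption: "supported" means the Python 3.5 series named in this README.
SUPPORTED_SERIES = (3, 5)


def is_supported_python(version_info=None):
    """Return True if the (major, minor) pair matches the supported series."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) == SUPPORTED_SERIES


if __name__ == "__main__":
    current = "{}.{}.{}".format(*sys.version_info[:3])
    if is_supported_python():
        print("Python {} matches the supported 3.5 series".format(current))
    else:
        print("Python {} differs from 3.5; consider a tool such as pyenv".format(current))
```

Running the file with the interpreter you intend to use (python or python3) shows which series that command actually resolves to.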

You can check your current wget version by running wget --version
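If you prefer to check for the tool from Python, a minimal sketch using the standard library is shown here. The helper name find_missing_tools is hypothetical, and the check only confirms that a command is on PATH, not which version it is.

```python
import shutil


def find_missing_tools(tools):
    """Return the subset of command names that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # "wget" is the tool this README asks for; the list can be extended.
    missing = find_missing_tools(["wget"])
    if missing:
        print("Missing required tools: {}".format(", ".join(missing)))
    else:
        print("All required tools were found on PATH")
```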


  • Clone the repo
  • Go to the main folder
  • Run the install script:
sudo -i
sudo ./

Set Up Database

Log into admin account

In order to use Postgres, we'll need to log into that account. You can do that by typing:

sudo -i -u postgres

You will be asked for your normal user password and then will be given a shell prompt for the postgres user.

Get a Postgres Prompt

You can get a Postgres prompt immediately by typing:


Add a password for the user:

By default, when you create a PostgreSQL cluster, password authentication for the database superuser (“postgres”) is disabled. To give Django access to this user, you will need to set a password for it.

In the Postgres prompt:

postgres=# \password
Enter new password: password
Enter it again: password

Create Database

In the Postgres prompt:

postgres=# create database mediacat;
postgres=# create database crawler;

You may now exit the Postgres prompt.

To integrate this database with Django:

Please configure the database settings in Frontend/Frontend/ For example:

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql_psycopg2',
        'NAME': 'mediacat',      # database created in the previous step
        'USER': 'postgres',
        'PASSWORD': 'password',  # the password set with \password
        'HOST': 'localhost',
        'PORT': '5432',
    }
}
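Hard-coding the password is fine for a local test, but on a shared machine you may prefer to read it from the environment. The sketch below builds the same default entry from environment variables. The variable names (MEDIACAT_DB_NAME and so on) and the helper build_database_settings are hypothetical and are not defined by the project.

```python
import os


def build_database_settings(env=None):
    """Build a Django-style DATABASES dict, falling back to the README defaults."""
    if env is None:
        env = os.environ
    return {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',
            # Hypothetical variable names; rename them to suit your deployment.
            'NAME': env.get('MEDIACAT_DB_NAME', 'mediacat'),
            'USER': env.get('MEDIACAT_DB_USER', 'postgres'),
            'PASSWORD': env.get('MEDIACAT_DB_PASSWORD', 'password'),
            'HOST': env.get('MEDIACAT_DB_HOST', 'localhost'),
            'PORT': env.get('MEDIACAT_DB_PORT', '5432'),
        }
    }


DATABASES = build_database_settings()
```

With no variables set, the result matches the example values shown above, so the sketch can be adopted gradually.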


You can edit the config.yaml file to adjust personal settings.


For production instances, be sure to use a new randomized SECRET_KEY in Frontend/Frontend/ A new SECRET_KEY can be generated with the following Python script:

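import random
# SystemRandom draws from the operating system's secure random source; print shows the key when run as a script.
print(''.join(random.SystemRandom().choice('abcdefghijklmnopqrstuvwxyz0123456789!@#$%^&*(-_= )') for _ in range(50)))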

Twitter Crawler

Please configure your Twitter credentials in config.yaml before using the Twitter crawler. You can get Twitter credentials from
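Before starting the crawler, it can help to confirm that no credential was left blank. The sketch below works on an already-parsed configuration dictionary. The section name twitter and the four key names are assumptions based on the usual Twitter API credential set; the actual layout of config.yaml may differ.

```python
# Assumption: these four names follow the usual Twitter API credential set;
# the keys actually expected in config.yaml may differ.
REQUIRED_KEYS = ("consumer_key", "consumer_secret", "access_token", "access_token_secret")


def missing_twitter_credentials(config, section="twitter"):
    """Return the required credential names that are absent or empty."""
    credentials = config.get(section) or {}
    return [key for key in REQUIRED_KEYS if not credentials.get(key)]


# Example with a hypothetical, partially filled configuration.
example_config = {"twitter": {"consumer_key": "abc", "consumer_secret": ""}}
print(missing_twitter_credentials(example_config))
```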

Usage: Web Server

If this is your FIRST time running the server, make sure to apply migrations under the Frontend folder:

python makemigrations
python migrate

Then create an admin user:

python createsuperuser

Otherwise start/stop the server:

  • To start python run

(Note: if using port 80, then sudo is needed to run/stop the server.)

By default, this Django app is set to listen on all public IPs (port 80).

You can now access the server through http://IP:PORT/admin

The default is



Here, you can view your action history and quick navigation links to the database.


Here, you can view and edit the requirements for exploration:

  • Referring Sites: The sites that Explorer will look into. Each site is automatically validated when added.
  • Referring Twitter Accounts: The Twitter accounts that Explorer will look into. Each account is automatically validated when added.
  • Source Sites: The sites that Explorer looks for in the articles/tweets, to see whether they are used as a source.
  • Source Twitter Accounts: The Twitter accounts that Explorer looks for in the articles/tweets, to see whether they are used as a source.
  • Keywords: The words that Explorer looks for in the articles/tweets.
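To make the relationship between referring and source sites concrete, here is a minimal sketch of how links found in a referring article could be compared against a list of Source Sites. It illustrates the idea only and is not the project's matching code; the function names and the rule of treating subdomains as matches are assumptions.

```python
from urllib.parse import urlparse


def normalize_host(url_or_host):
    """Lower-case the host and drop a leading 'www.' for comparison."""
    candidate = url_or_host if "//" in url_or_host else "//" + url_or_host
    host = (urlparse(candidate).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def matched_sources(article_links, source_sites):
    """Return the sorted source sites that at least one article link points to."""
    sources = {normalize_host(site) for site in source_sites}
    sources.discard("")  # Ignore entries that could not be parsed.
    found = set()
    for link in article_links:
        host = normalize_host(link)
        for source in sources:
            # Match the site itself or any of its subdomains.
            if host == source or host.endswith("." + source):
                found.add(source)
    return sorted(found)
```

For example, a link to https://www.example.com/story/1 would count as a citation of the source site example.com, while a link to notexample.com would not.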


Here, you can view the data collected by the explorer. Furthermore, you can download each archived entry as a Web Archive. For the demo, it is filled with pre-explored entries.


Here, you can download all the data stored in the database in JSON format.


Here, you can view the statistics among the collected entries.

For example, you can view how many articles were collected per day as an Annotation Chart.
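The per-day count behind such a chart can be sketched in a few lines. The field name date_added and the ISO 8601 timestamp format are hypothetical; adjust them to match the entries you actually export.

```python
from collections import Counter
from datetime import datetime


def articles_per_day(entries, date_field="date_added"):
    """Count entries per calendar day from ISO 8601 timestamps."""
    counts = Counter()
    for entry in entries:
        # Hypothetical field name and format; adjust to the exported data.
        day = datetime.strptime(entry[date_field][:10], "%Y-%m-%d").date()
        counts[day.isoformat()] += 1
    return dict(counts)
```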


Here, you can view the relations between each of the 4 scopes, based on the exploration.


Here, you can manage the users and groups used for login. Users can also be given different permissions.


Once your scope is ready, you may use the following explorers under the src folder to crawl news and Tweets:

  • Article Explorer explores the Referring Sites for articles
  • Twitter Crawler explores the posts of the Referring Twitter Accounts

Article Explorer

The article explorer explores each site under a given domain. Once it has finished crawling the entire domain, the shallow crawler activates. From that point on, the article explorer only goes N levels down from the domain's homepage. A visual prompt indicating shallow crawling is shown in the Scope/Referring Sites tab. The level value defaults to 3, but can be changed in the config.yaml file.
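The "N levels down" behaviour can be pictured as a depth-limited, breadth-first walk. The sketch below runs over an in-memory link graph rather than real pages. The function shallow_crawl and the get_links callback are hypothetical names used for illustration and do not reflect the project's crawler code.

```python
from collections import deque


def shallow_crawl(homepage, get_links, max_level=3):
    """Breadth-first walk that stops max_level links away from the homepage."""
    seen = {homepage}
    queue = deque([(homepage, 0)])
    visited_order = []
    while queue:
        page, level = queue.popleft()
        visited_order.append(page)
        if level == max_level:
            continue  # Do not follow links beyond the configured depth.
        for link in get_links(page):
            if link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited_order


# A toy site: each key lists the pages it links to.
site = {"/": ["/a", "/b"], "/a": ["/a/1"], "/a/1": ["/a/1/x"], "/a/1/x": ["/deep"], "/b": []}
print(shallow_crawl("/", lambda page: site.get(page, [])))
```

In the toy site, /deep sits four links away from the homepage, so it is skipped when the level is 3.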

‼️ NOTE: The article crawler can be quite taxing in terms of resources. In initial tests with the shallow crawler, the article crawler would freeze after a certain amount of time (freezing occurred on a server instance with the following specs: 1 vCPU and 2 GB RAM, with the crawler having 77 domains in its referring scope). Once we began testing on a more powerful server instance (10 vCPU and 32 GB RAM), the freezing issues stopped. If you do run into freezing issues, the file found under the src/ folder contains some lines of code that will automatically restart the crawler after a certain period of time.
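For readers who want to see what a periodic restart can look like in general, here is a minimal sketch built on the standard subprocess module. It is not the project's restart code; the function name run_with_restarts and its parameters are assumptions for illustration.

```python
import subprocess
import sys


def run_with_restarts(command, period_seconds, max_restarts):
    """Run command; if it outlives period_seconds, kill it and start it again.

    Returns the number of restarts performed. Stops early if the process
    exits on its own within the period.
    """
    restarts = 0
    while True:
        process = subprocess.Popen(command)
        try:
            process.wait(timeout=period_seconds)
            return restarts  # Finished on its own; nothing to restart.
        except subprocess.TimeoutExpired:
            process.kill()
            process.wait()  # Reap the killed process.
            if restarts == max_restarts:
                return restarts
            restarts += 1


if __name__ == "__main__":
    # Stand-in for a long-running crawler: a process that sleeps for 5 seconds.
    stand_in = [sys.executable, "-c", "import time; time.sleep(5)"]
    print(run_with_restarts(stand_in, period_seconds=0.5, max_restarts=2))
```

A real deployment would use a much longer period and would likely log each restart; both choices are left open here.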

Running the Explorer

To run the crawler, you must first run the warc queue so that the WARC files are created as the crawler runs. Note: we create a screen so that the warc queue can operate in the background.

screen -S warc
python src/

(Press Ctrl A followed by Ctrl D to get back to the original screen.) After this, run the actual crawler.

screen -S article
python src/

Twitter Crawler

The Twitter crawler has three modes of crawling: timeline, streaming and history. The timeline and streaming modes are based on twarc, and the history mode is based on GetOldTweets-python.

  • timeline mode crawls the timeline of each Referring Twitter Account, retrieving up to 3200 of a user's most recent Tweets (a Twitter API constraint). You can set the frequency of timeline re-crawling in config.yaml (by default, timelines are re-crawled every 30 days).
python timeline
  • streaming mode will crawl Tweets of Referring Twitter Accounts on a real-time basis.
python streaming
  • history mode will collect all Tweets posted by Referring Twitter Accounts.
python history
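The re-crawl frequency for timeline mode amounts to a simple date comparison. The sketch below shows one way to decide whether an account is due; the function name timeline_due and the idea of storing a last-crawled timestamp per account are assumptions, not a description of the project's internals.

```python
from datetime import datetime, timedelta

DEFAULT_FREQUENCY_DAYS = 30  # Default stated in this README.


def timeline_due(last_crawled, now, frequency_days=DEFAULT_FREQUENCY_DAYS):
    """Return True if an account was never crawled or its interval has elapsed."""
    if last_crawled is None:
        return True
    return now - last_crawled >= timedelta(days=frequency_days)


now = datetime(2020, 3, 1)
print(timeline_due(datetime(2020, 2, 20), now))  # Crawled 10 days earlier.
print(timeline_due(datetime(2020, 1, 1), now))   # Crawled 60 days earlier.
```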

Running the Twitter crawler with no parameters runs all three modes together by default.
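The "no parameter means all modes" rule can be sketched as a small argument-handling helper. The name select_modes and the choice to reject unknown mode names are assumptions for illustration; the crawler's real argument handling may differ.

```python
ALL_MODES = ("timeline", "streaming", "history")


def select_modes(args):
    """Return the modes to run; an empty argument list selects all three."""
    if not args:
        return list(ALL_MODES)
    unknown = [arg for arg in args if arg not in ALL_MODES]
    if unknown:
        raise ValueError("Unknown mode(s): {}".format(", ".join(unknown)))
    return list(args)


print(select_modes([]))
print(select_modes(["history"]))
```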



Unit test files are located under src/unit_tests.