This task is to scrape HTML from our old ticket system https://rt.wikimedia.org and can be broken down into these steps:
- Go to https://rt.wikimedia.org/ and log in (two logins: IDP and local).
- Look at the oldest ticket; this is id 2 (compare https://rt.wikimedia.org/Ticket/Display.html?id=2 to https://rt.wikimedia.org/Ticket/Display.html?id=1 to see why it's not 1).
- Look at the newest ticket; this should be id 11829 (https://rt.wikimedia.org/Ticket/Display.html?id=11829).
- Find or make a tool/plugin/script to save the raw HTML of all the tickets, but in a way that keeps the pages rendering properly offline.
- The paths to images/CSS etc. need to keep working locally.
You get this desired behaviour if, for example, you use Firefox and manually save the page for offline use, but you will not get it with a simple wget or curl; see the sketch below.
Also keep in mind you need to be a logged-in user with permissions to view all tickets.
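One possible approach, roughly what Firefox does when saving a complete page, is to fetch each ticket with an authenticated session, download its page requisites (CSS, JavaScript, images) and rewrite the references to local paths. Below is a minimal Python sketch of that step only; the cookie name "RT_SID_rt.wikimedia.org", the save_ticket helper and the assets/ layout are illustrative assumptions, not the actual tooling, and the sketch sidesteps the two-login flow by reusing a session cookie exported from a browser that is already logged in.

```python
# Minimal sketch, not the actual tooling: fetch one RT ticket and localize its
# page requisites so the saved copy renders offline. Assumes a valid session
# cookie copied from an already logged-in browser (the cookie name below is a
# guess; check the browser's cookie storage for the real one).
import os
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

BASE = "https://rt.wikimedia.org"
COOKIES = {"RT_SID_rt.wikimedia.org": "paste-session-id-here"}  # assumed cookie name


def save_ticket(ticket_id: int, out_dir: str) -> str:
    """Save one ticket's HTML plus its CSS/JS/image requisites under out_dir."""
    session = requests.Session()
    session.cookies.update(COOKIES)

    url = f"{BASE}/Ticket/Display.html?id={ticket_id}"
    resp = session.get(url)
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    assets_dir = os.path.join(out_dir, "assets")
    os.makedirs(assets_dir, exist_ok=True)

    # Download every page requisite once and point the tag at the local copy.
    for tag_name, attr in (("link", "href"), ("script", "src"), ("img", "src")):
        for node in soup.find_all(tag_name):
            ref = node.get(attr)
            if not ref or ref.startswith("data:"):
                continue
            asset_url = urljoin(url, ref)
            filename = os.path.basename(urlparse(asset_url).path) or "asset"
            local_path = os.path.join(assets_dir, filename)
            if not os.path.exists(local_path):
                asset = session.get(asset_url)
                if asset.status_code == 200:
                    with open(local_path, "wb") as fh:
                        fh.write(asset.content)
            # Only rewrite the reference if the local copy actually exists.
            if os.path.exists(local_path):
                node[attr] = f"assets/{filename}"

    out_path = os.path.join(out_dir, f"ticket-{ticket_id}.html")
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(str(soup))
    return out_path
```

A real run would loop over ids 2 through 11829 and skip ids that return an error or permission-denied page.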
Extra requirement:
- Look at the "Queue:" field in tickets and save the HTML into a separate directory for each queue.
So for example https://rt.wikimedia.org/Ticket/Display.html?id=4802 should be in a directory called "ops-requests", while https://rt.wikimedia.org/Ticket/Display.html?id=2 should be in a directory called "pmtpa", and so on.
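One way to get that directory name is to parse the "Queue:" row out of the fetched HTML and normalise it to lowercase with dashes. The sketch below assumes the value sits in the table cell following the "Queue:" label, which is a guess about RT's markup, and queue_directory is a hypothetical helper rather than existing tooling.

```python
# Hedged sketch: derive a per-queue directory name ("ops-requests", "pmtpa", ...)
# from a ticket page. The "Queue:" label sitting in a <td> next to its value
# cell is an assumption about RT's markup; adjust the lookup to the real
# structure of the page.
import re

from bs4 import BeautifulSoup


def queue_directory(ticket_html: str) -> str:
    """Return a filesystem-friendly directory name for the ticket's queue."""
    soup = BeautifulSoup(ticket_html, "html.parser")
    queue_name = "unknown-queue"

    label = soup.find(string=re.compile(r"Queue:"))
    if label is not None:
        label_cell = label.find_parent("td")
        if label_cell is not None:
            value_cell = label_cell.find_next_sibling("td")
            if value_cell is not None:
                queue_name = value_cell.get_text(strip=True)

    # Lowercase, whitespace to dashes: "ops-requests" style.
    return re.sub(r"\s+", "-", queue_name.strip().lower())
```

A driver script could then do something like save_ticket(ticket_id, queue_directory(html)) for each id from 2 to 11829, using the two sketched helpers above.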
Some tickets can be public, but definitely not all of them, and specifically not those in the "procurement" queue. Some have been imported into Phabricator and later made public there, some have been imported but not made public, and some have not been imported at all.
Keep the result in a private location but with access for SRE, for now.
Once we have those files the ticket is resolved. Later tickets will be about where we put them and how we shut down the actual RT app.
This task is limited to producing these "static dumps".