@CristianCantoro has proposed introducing a Tor hidden service for reading and editing Wikipedia. This task is for the technical details of such a gateway.
Service or integrated
The basic implementation options are:
1. Make a frontend or proxy which rewrites URLs and makes any necessary skin modifications to indicate to the user that they are on the .onion site.
2. Reconfigure or hook MediaWiki to make it generate the right HTML in the first place.
EOTK is an example of option 1. It uses ~500 lines of nginx configuration and embedded Lua to do fairly naïve URL rewriting, and it doesn't attempt to properly parse the JS, CSS and HTML that it rewrites.
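To illustrate what "naïve URL rewriting" means here, the following is a minimal Python sketch of the general technique, not EOTK's actual nginx/Lua code. The hostname `wikipediaXXXXXXX.onion` is the placeholder used in this task, and `rewrite_body` is a hypothetical helper name.

```python
import re

# Placeholder onion hostname from this task; not a real address.
ONION_2LD = "wikipediaXXXXXXX.onion"

# Matches hostnames such as "en.wikipedia.org" anywhere in the text.
# No attempt is made to parse HTML, CSS or JS.
CLEARNET_HOST = re.compile(r"\b([a-z0-9-]+)\.wikipedia\.org\b")


def rewrite_body(text: str) -> str:
    """Blindly replace clearnet hostnames with their onion equivalents."""
    return CLEARNET_HOST.sub(lambda m: f"{m.group(1)}.{ONION_2LD}", text)


if __name__ == "__main__":
    print(rewrite_body('<a href="//en.wikipedia.org/wiki/Foo">Foo</a>'))
    # Visible prose is rewritten too, which is one reason this is naive.
    print(rewrite_body("The article is hosted on de.wikipedia.org."))
```

Because the substitution is purely textual, it also changes hostnames that appear in article text, comments or string literals, which is the trade-off of not parsing the content.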
MobileFrontend shows approximately what it would take to do option 2. It uses a BeforePageRedirect hook to modify 30x responses. By relying on a bit of knowledge of MediaWiki, it avoids much of the HTML rewriting that EOTK attempts: MediaWiki uses host-relative URLs in internal links and in CSS and JS references, so as long as the path structure is the same, there's no need to rewrite them. T156847 is a proposal to make MediaWiki aware of the domain it is being viewed under, to reduce the need for these assumptions and hacks.
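The redirect handling could look roughly like the sketch below. It is written in Python purely to show the logic and is not MediaWiki's PHP hook API. The function name `rewrite_location`, the suffix constants and the placeholder onion hostname are all assumptions for illustration.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder suffixes; the onion name is the placeholder used in this task.
ONION_SUFFIX = ".wikipediaXXXXXXX.onion"
CLEARNET_SUFFIX = ".wikipedia.org"


def rewrite_location(location: str, request_host: str) -> str:
    """Rewrite the Location header of a 30x response for onion visitors."""
    if not request_host.endswith(".onion"):
        return location  # clearnet request: leave the redirect alone

    parts = urlsplit(location)
    host = parts.hostname or ""
    if not host.endswith(CLEARNET_SUFFIX):
        # Host-relative redirects ("/wiki/Foo") and foreign hosts need no change.
        return location

    # Swap only the domain suffix, keeping the language subdomain.
    # Any explicit port in the original URL is dropped in this sketch.
    onion_host = host[: -len(CLEARNET_SUFFIX)] + ONION_SUFFIX
    return urlunsplit(
        (parts.scheme, onion_host, parts.path, parts.query, parts.fragment)
    )


if __name__ == "__main__":
    onion = "en.wikipediaXXXXXXX.onion"
    print(rewrite_location("https://en.wikipedia.org/wiki/Foo?x=1", onion))
    print(rewrite_location("/wiki/Foo", onion))
```

Only absolute redirects are touched. Host-relative URLs pass through unchanged, which is why the same path structure on both domains removes most of the rewriting work.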
The old secure.wikimedia.org gateway was along the lines of option 2, but with a single domain name and path rewriting. It reconfigured MediaWiki on startup and fragmented the parser cache.
The Meta-Wiki page suggests iaproxy and @csteipp's mediawiki-proxy as possible off-the-shelf service-based implementations.
The hostname and path
It would be easy to use scallion to brute-force the first 9 characters of the key hash to obtain wikipediaXXXXXXX.onion, where XXXXXXX is 7 random characters. It is far less trivial to brute-force 11 or 12 characters, hundreds of times over, in order to include the language code in the second-level domain label. However, it is reportedly possible to have subdomains of .onion domains. This is not mentioned by the Tor design paper, which proposes a different interpretation of the third-level domain name label, but it appears to be common practice. The Tor client apparently strips out the third-level domain when establishing the circuit, and then the browser sends it in the Host header as normal.
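The difference in brute-force cost can be estimated with simple arithmetic. The sketch below assumes the 16-character base32 addresses implied above, where each character encodes 5 bits of the key hash, so a chosen prefix of k characters takes about 32^k key generations on average. The search rate used in the example is a made-up figure for illustration, not a scallion benchmark.

```python
BITS_PER_BASE32_CHAR = 5


def expected_attempts(prefix_len: int) -> int:
    """Average number of keys to generate before a given prefix appears."""
    return 2 ** (BITS_PER_BASE32_CHAR * prefix_len)


def expected_seconds(prefix_len: int, keys_per_second: float) -> float:
    """Average search time at a given (assumed) key generation rate."""
    return expected_attempts(prefix_len) / keys_per_second


if __name__ == "__main__":
    HYPOTHETICAL_RATE = 1e9  # keys per second; an assumption, not a measurement
    for k in (9, 11, 12):
        days = expected_seconds(k, HYPOTHETICAL_RATE) / 86400
        print(f"{k} chars: 2^{5 * k} attempts, about {days:.3g} days per name")
```

Going from 9 to 11 characters multiplies the work by 2^10 = 1024, and going to 12 characters multiplies it by 2^15 = 32768, before repeating the search once per language.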
So we can have en.wikipediaXXXXXXX.onion/wiki/Foo, or we can have wikipediaXXXXXXX.onion/en/wiki/Foo, if we allow path rewriting similar to what was done in secure.wikimedia.org.
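The two URL layouts carry the same information, so a gateway could translate between them mechanically. A minimal sketch, assuming the placeholder hostname from this task and hypothetical helper names `to_path_style` and `to_subdomain_style`:

```python
from typing import Tuple

# Placeholder onion hostname from this task; not a real address.
ONION_2LD = "wikipediaXXXXXXX.onion"


def to_path_style(host: str, path: str) -> Tuple[str, str]:
    """("en.<onion>", "/wiki/Foo") -> ("<onion>", "/en/wiki/Foo")"""
    lang, sep, base = host.partition(".")
    if not sep or base != ONION_2LD:
        raise ValueError(f"not a language subdomain of {ONION_2LD}: {host}")
    return ONION_2LD, f"/{lang}{path}"


def to_subdomain_style(host: str, path: str) -> Tuple[str, str]:
    """("<onion>", "/en/wiki/Foo") -> ("en.<onion>", "/wiki/Foo")"""
    if host != ONION_2LD:
        raise ValueError(f"unexpected host: {host}")
    # The first path segment is taken to be the language code.
    lang, _, rest = path.lstrip("/").partition("/")
    if not lang:
        raise ValueError("path has no language prefix")
    return f"{lang}.{ONION_2LD}", f"/{rest}"


if __name__ == "__main__":
    print(to_path_style("en.wikipediaXXXXXXX.onion", "/wiki/Foo"))
    print(to_subdomain_style("wikipediaXXXXXXX.onion", "/en/wiki/Foo"))
```

The path-style layout needs this translation on every request and on every generated link, whereas the subdomain layout keeps the same path structure as the clearnet site.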
Abuse control considerations
It's proposed that it won't be possible to edit via Tor unless logged in, so we won't have the issue of MediaWiki attributing hidden service edits to an internal IP address: there will always be a username for attribution. However, the CheckUser extension may need modification to tag users who are using the hidden service in a human-readable way.