KeptPages

MeatballWiki | RecentChanges | Random Page | Indices | Categories

Motivation

Several wikis have found it useful to keep a VersionHistory of each wiki page. These older versions are especially useful to correct destructive mistakes or mild vandalism (cf. ReversibleChange). Some wikis keep all changes (FullHistory), while others keep only a single old copy (often the previous author's version). Each kind of versioning has its advantages and disadvantages. For instance, permanent history means that vandalism can almost always be reversed, but some people are concerned about their mistakes becoming a permanent part of a page's history, thwarting ForgiveAndForget. On the other hand, a system that only tracks a fixed number of copies is easily defeated by making one more fresh edit than the fixed number, whether by an attacker or through the actions of many innocent authors.

The original idea of KeptPages was to find a middle ground between keeping all versions and keeping a small fixed number of versions.

Solution

We chose to version everything, but only keep revisions around for a limited timespan, say a week or a month. That is, destroy all revisions older than the timespan. The timespan will naturally depend on the amount of traffic to a site: the slower the traffic, the longer the timespan.

This scheme is not sufficient, though, as an attacker can destroy a page that hasn't been touched in several months by merely editing it. After all, all its prior versions have been automatically erased, leaving you with only the latest, vandalized version.

Consequently, we add a little twist: timestamp the revisions not when they are created but when they are replaced. So, the current version of the page is not in the revision history, but when I make a change, the current version gets timestamped then and entered into the history. Then the new edit becomes the current version. See KeptPagesExample.

Equivalently, and more simply, you keep all previous revisions created during the timespan, plus one more.
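The twist above can be sketched in a few lines of Python. This is only an illustration, not the UseModWiki code: the class name KeptPage, its fields, and the 14-day window used in the example are assumptions made for the sketch.

```python
from dataclasses import dataclass, field

DAY = 86400  # seconds in a day


@dataclass
class KeptPage:
    """One wiki page: the current text plus its kept revisions (hypothetical structure)."""
    text: str
    keep_seconds: float
    kept: list = field(default_factory=list)  # (replaced_at, old_text), oldest first

    def edit(self, new_text: str, now: float) -> None:
        # Stamp the outgoing version with the time it is replaced, not created.
        self.kept.append((now, self.text))
        self.text = new_text
        self.expire(now)

    def expire(self, now: float) -> None:
        # Drop every kept revision that was replaced before the cutoff.
        cutoff = now - self.keep_seconds
        self.kept = [(t, old) for (t, old) in self.kept if t >= cutoff]


# A page untouched for 200 days is vandalized; the old text survives the full window.
page = KeptPage("original", keep_seconds=14 * DAY)
page.edit("vandalized", now=200 * DAY)
```

Because the old text is stamped at day 200 rather than at its creation, it stays in the history until day 214, however old the page was when the vandal arrived.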

Basically, instead of keeping a fixed number of backups, you keep a fixed time of backups. (It will probably still be useful to keep one backup version of each page anyway for comparisons in the far future.)

This way, because you keep everything within the timespan, you protect against any form of vandal attack, yet you also ForgiveAndForget mistakes. Moreover, if a vandal wants to destroy an old page, the old page will still be around for a week. This reduces the PerceptionOfVulnerability and is an example of DelayAction.

See also PageDeletion and FileReplacement for technology inspired by KeptPages.

For a different explanation of the same thing, see the Kept Pages section of the Oddmuse manual. [1]

CategoryWikiTechnology


Advantages.

One positive effect is that this kind of KeptPages could encourage reworking by minimizing the impact of mistakes. People could be more free to remove content from the current version, because the old versions will remain available for a decent time.

This idea allows any number of people to participate in editing a page, without anyone's current contributions being lost. On the other hand, kept versions of old or embarrassing content will expire reasonably soon. More copies are saved for more popular pages (edited by more people). Pages that haven't been edited recently will have only the current version.

Although disk space wasn't a major concern, by our estimates the KeptPages would require much less storage than the current multi-copy system (which has potentially many copies for pages which haven't been edited in months). Even the worst cases (like a SandBox page or someone's diary) will have a reasonable number of copies--it will keep one page for each distinct author in the past N days (where N is the expiration interval). Contrast with a FullHistory wiki containing a SandBox of hundreds of edits (like Wiki:WikiWikiSandbox).


Disadvantages.

Additional care needed.

KeptPages does make it more important not to make lax edits (which is probably a good thing). Now, the entire history of a page is kept for two weeks, including every one of your little tweaks. It is possible now to be hung out to dry on trumped up charges based on poorly written text, text that didn't properly explain your intent so you changed it. Consequently, because we AssumeGoodFaith, the best thing to do as a reader is to assume the current version of the page properly reflects the attitudes of all parties involved. The history of the site isn't really that important because visitors don't care.

Secondly, KeptPages also protects an attacker's work. Thus, the attacker can restore the damaged page contents just as easily as a defender can. This becomes a problem during flame wars (though those require a CommunitySolution, not a TechnologySolution, to fix; see ConflictResolution) and when vandals seek attention.

In the latter case, it's normally better to laugh with the vandals and then fix the site (vandalism and pretense both). Taking stuff too seriously invites vandalism, after all, maybe with some justification. Compare the SlashDot trolls on SlashDot vs. KuroShin to see this in action (see TrollTalk). RobMalda is harsh with the trolls, inviting flamewars, whereas RustyFoster hangs out on TrollTalk and has a sense of humour. Not that trolls are malicious vandals, but the approach is the same.

Disk storage.

I was thinking about the problem of the version history consuming excessive disk space. While I agree that holding every version is a good ideal, one idea is to compress the changes after a disk usage (or version count) threshold is exceeded. That is, starting from the second oldest change, for each consecutive run of edits by the same author, replace the entire run by the last edit by that author. Repeat until the kept pages fall under the threshold or you run through the entire history. Then, if you're adamant about pulling the versions under the threshold, you can start dropping the oldest versions until you're underneath it. -- SunirShah
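One way to read the compression proposal above, as a hedged Python sketch. It assumes the threshold is a version count (max_versions) and that the history is a list of (author, text) pairs, oldest first; the function name and both parameters are invented for the illustration.

```python
def compress_history(revisions, max_versions):
    """Collapse same-author runs, then drop oldest versions, until under the threshold.

    revisions: list of (author, text) pairs, oldest first. Returns a new list.
    """
    revs = list(revisions)
    i = 1  # start from the second-oldest revision; the oldest is left alone
    while len(revs) > max_versions and i < len(revs):
        j = i
        # Find the end of the run of consecutive edits by the same author.
        while j + 1 < len(revs) and revs[j + 1][0] == revs[i][0]:
            j += 1
        # Replace the whole run by its last edit.
        revs[i:j + 1] = [revs[j]]
        i += 1
    # Still over the threshold after one pass: drop the oldest versions.
    if len(revs) > max_versions:
        revs = revs[len(revs) - max_versions:]
    return revs


history = [("a", "1"), ("b", "2"), ("b", "3"), ("b", "4"), ("c", "5"), ("c", "6")]
```

With a threshold of 4, only the run of three edits by "b" is collapsed and the pass stops early. With a threshold of 2, both runs are collapsed and the oldest version is then dropped as a last resort.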

A good version control system should only store differences, and I'd expect that the differences between successive edits by the same author would be small. Or even between different authors. E.g., for this edit I am making, the storage should increase by only a single paragraph. My current hard disk is 40 Gigabytes. It will take a lot of 1-paragraph edits to fill that up. -- DaveHarris

Disk space isn't really a problem, but there are still tradeoffs to be made. Usemod.com is currently (December 2000) on a shared server (which is far cheaper than a dedicated server, and much less hassle than trying to run a 24X7 server at home). I have paid for 75 MB of storage, and extra storage is $5/20MB/month. (This isn't the cheapest provider, but they have a good record of actually giving people their full allotments rather than shutting down full-usage accounts.)

Since the disk space is mine, but I share the CPU with many other accounts, I decided to minimize CPU usage at the cost of extra disk space. Each edit stores a full copy of the previous version in the kept-pages database. While it is possible to reconstruct versions from a string of diffs, it is much more CPU-intensive. I can buy more disk space easily, but more CPU will require a dedicated server. (About $250/month more expensive.)

MeatballWiki currently has about 50 megabytes of space free for expansion. This is enough room for 1000 edits of 50K pages within the expiration time (currently at the default value of 14 days). For comparison, MeatballWiki has had about 5000 edits in the past 6 months, and the vast majority of pages are well under 50K. (The per-page average storage was about 10K before the conversion, including the page text, up to two edit-copies (major author), and up to three stored diffs.) I'm willing to buy more space if it is being used well. --CliffordAdams
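The capacity estimate above can be checked with a little arithmetic. The sketch below assumes 1 MB = 1000 KB, a six-month period of 182 days, and an even spread of edits over that period; none of these assumptions comes from the original figures.

```python
# Worst-case capacity: how many full 50K copies fit in the free space.
free_kb = 50 * 1000          # about 50 MB free (assuming 1 MB = 1000 KB)
page_kb = 50                 # worst-case page size
capacity = free_kb // page_kb

# Observed traffic, scaled to one expiration window.
edits_six_months = 5000
days_six_months = 182        # assumption: six months is roughly 182 days
window_days = 14             # default expiration time
edits_per_window = edits_six_months * window_days / days_six_months

print(capacity)                  # 1000
print(round(edits_per_window))   # 385
```

So at the observed rate a 14-day window holds roughly 385 kept copies, well under the 1000-copy worst-case capacity, even if every page were a full 50K.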

Did you consider storing the deltas to minimise space, and then lazily caching versions of pages to minimise time? This would allow you to only store full pages for versions which were actually being used. -- DaveHarris

Did you ever consider storing reverse deltas, à la [RCS]? That way you conserve CPU by storing a full copy of the most current version, which is most likely to be accessed, but you reduce space usage for historical versions, as you store them as deltas from the most recent version. --anon.
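For readers unfamiliar with reverse deltas, here is a minimal Python sketch of the idea using the standard difflib module. It is not how RCS or UseModWiki store anything: the delta format, the function names and the ReverseDeltaStore class are all made up for the illustration.

```python
import difflib


def make_reverse_delta(newer, older):
    """Describe how to rebuild `older` (a list of lines) from `newer`."""
    delta = []
    matcher = difflib.SequenceMatcher(a=newer, b=older, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2))           # reuse newer[i1:i2]
        elif j2 > j1:
            delta.append(("insert", older[j1:j2]))   # lines found only in the older text
    return delta


def apply_reverse_delta(newer, delta):
    """Rebuild the older list of lines from the newer one and a delta."""
    older = []
    for op in delta:
        if op[0] == "copy":
            older.extend(newer[op[1]:op[2]])
        else:
            older.extend(op[1])
    return older


class ReverseDeltaStore:
    """Full copy of the current text; each older version is a delta from its successor."""

    def __init__(self, text):
        self.current = text.splitlines()
        self.deltas = []  # deltas[-1] rebuilds the previous version from the current one

    def save(self, new_text):
        new_lines = new_text.splitlines()
        self.deltas.append(make_reverse_delta(new_lines, self.current))
        self.current = new_lines

    def version(self, steps_back):
        if not 0 <= steps_back <= len(self.deltas):
            raise IndexError("no such version")
        lines = self.current
        # Walk backwards from the current text, one delta per step.
        for delta in reversed(self.deltas[len(self.deltas) - steps_back:]):
            lines = apply_reverse_delta(lines, delta)
        return "\n".join(lines)


store = ReverseDeltaStore("a\nb\nc")
store.save("a\nb2\nc\nd")
store.save("a\nd")
```

Reading the current version costs nothing extra, while version(2) has to apply two deltas in turn. That is the CPU-versus-disk tradeoff discussed above, tilted in favour of the version that is read most often.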


Most major wikis now implement Sunir's excellent proposal for versioning.


Another minor problem I can see with this:

Loss of Original Author

Perhaps the very first version of a page should be kept, regardless of the passage of time. When we lose the very first incarnation of a page, not only do we lose the chance to see how it has changed (sometimes this is for a lark, but it could be useful on many pages to trace a changing idea), but we also lose the creator's identity. When we don't know the originator of the page, we don't necessarily know who to contact for more information, or who to cite in a report. I'm writing a report on wikis right now, and unfortunately most of the information posted on various wikis about wiki concepts and history was written months or years ago; the original versions have expired, leaving me to cite pages anonymously. Keeping the original page doesn't solve all of these cases, but it may be a piece of history that a wiki community is eager to hold on to.

-- Josh Hoey, 2003-05-10

True, but I think if you wanted to keep around authorship information, then it would be just as important to keep info about later contributors as about the original page author. I think this may be worthwhile eventually. See AuthorshipCredit.

-- BayleShanks

The [Internet Archive] is a possible solution. It seems like the desire articulated above is not for perfect versioning, but rather the ability to steal glimpses of occasional moments in a site's past, whether it be for first versions or occasional intermediate versions. See, for example, what our fearless leader was up to in [December 2000]. Wipe a tear at the sight of that familiar logo! ;-) Or witness [ZenWindow] (cf. SacredSite) in its full 1995 glory. -- anon.


You may notice occasional unexpected loss of versions on MeatballWiki. The script has access to a limited amount of RAM, and so it may prematurely expire old versions if it cannot manipulate the entire version history. It is believed that OddMuse does not have this problem, since it retains individual versions in separate files.


A WritersLog. In order to support OpenMeatballWiki, the KeptPages should also now store the list of contributors, but only a list of IPs/domains/UserNames (not the timestamp nor the summary nor the sequence) who have edited the page as well as any of the links made on the page that happen to be CategoryHomePages. Bonus points for correlating the CategoryHomePages from the diff with the IPs/domains (in a separate list). From the UserInterface point of view, you may only want to list a Contributors: line in the history. The IPs/domains may be listed collectively as 'N anonymous' where N is the number of anonymous authors. This text would be a link that would generate the "hypothesized" list of correlations between IPs/domains and CategoryHomePages. -- SunirShah
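A rough Python sketch of the Contributors: line described above. It assumes each edit is recorded as a (username, host) pair with the username set to None for anonymous edits; that record shape and the function name are guesses made for the example, and the CategoryHomePages correlation is left out.

```python
def contributors_line(edits):
    """Build a 'Contributors:' line from (username_or_None, host) pairs."""
    names = []     # named contributors, in order of first appearance
    hosts = set()  # distinct IPs/domains of anonymous contributors
    for username, host in edits:
        if username:
            if username not in names:
                names.append(username)
        else:
            hosts.add(host)
    parts = list(names)
    if hosts:
        # Anonymous authors are listed collectively as 'N anonymous'.
        parts.append(f"{len(hosts)} anonymous")
    return "Contributors: " + ", ".join(parts)


edits = [
    ("SunirShah", "a.example"),
    (None, "10.0.0.1"),
    ("SunirShah", "b.example"),
    (None, "10.0.0.2"),
    (None, "10.0.0.1"),
]
print(contributors_line(edits))  # Contributors: SunirShah, 2 anonymous
```

Only the set of names and hosts is kept, with no timestamps, summaries or sequence, in line with the proposal.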


permanent history means that vandalism can almost always be reversed, but some people are concerned about their mistakes becoming a permanent part of a page's history, thwarting ForgiveAndForget.

One can also let site admins delete individual history revisions (for nuking WikiSpam). If your admin's nice, I'm sure they might consider doing it to spare a user's embarrassment about something particularly silly they wrote. Of course, you'd better hope you have a good GodKing. Overall, I think that the benefits of a FullHistory outweigh the disadvantages. Yes, the wiki equivalent of Wiki:GhostOfUsenetPostingsPast could be bad, but make the historical versions NotIndexed and you can minimise its effect for your users. (Maybe every wiki should have a sign on the edit page: ThinkBeforeHittingSave?.) -- EarleMartin


I'm very frustrated about the KeptPages implementation problems here. I lost some hours of work on ProblemSolving and BrainStorming because of spam / despam actions and the large gaps that exist in the page history. There is a 10-day-gap between rev.39 and rev.102 of ProblemSolving. That's really demotivating. -- HelmutLeitner

Hopefully the recent move to a database backup - with no KeptPages filesize limits - will solve this problem once and for all. -- ChrisPurcell

Discussion
