Op-ed

Automated copy-and-paste detection under trial

One of the problems Wikipedia faces is users who add content copied and pasted verbatim from sources. When we follow up on a person's work, we often don't check for this, and a few editors have managed to make thousands of edits over many years before concerns are detected. In the past year, I've picked up three or four editors who have made many thousands of edits to medical topics in which their additions contain word-for-word copy from elsewhere. Most of those who only make a few edits of this nature are usually never detected.

After a user detects this kind of editing, clean-up involves going through all their edits and occasionally reverting dozens of articles. Unfortunately, sometimes it means going back to how an article was years back, resulting in the loss of the efforts of the many editors who came after them. Contingency reverts can end up harming overall article quality and frustrate the core editing community. What is the point of contributing to Wikipedia if it's simply a collection of copyright-infringed text cobbled together, and even your own original contributions disappear in the cleanup? Worse, the fallout can cause editors to retire. If we could have caught them early and explained the issues to them, we'd not only save a huge amount of work later on, but might retain editors who are willing to put in a great deal of time.

So what is the solution? In my opinion, we need near real-time automated analysis and detection of copyright concerns. I'd been trying to find someone to develop such a tool for more than two years; then, at Wikimania in London, I managed to corner a pywikibot programmer, ValHallASW, and convinced him to do a little work. This was followed by meeting a wonderful Israeli instructor from the Sackler School of Medicine Shani Evenstein who knew two incredibly able programmers, User:Eran and User:Ravid ziv. By the end of Wikimania our impromptu team had produced a basic bot – User:EranBot – that does what I'd envisioned. It works by taking all edits over a certain size and running them through Turnitin / iThenticate. Edits that come back positive are listed for human follow-up. Development of this idea began back in March of 2012 by User:Ocaasi and can be seen here.

Why near real time?

Determining copy-and-paste issues becomes more difficult the longer one waits between the initial edit and the checking, as one then has to deal with mirroring of Wikipedia content across the Internet. As well, many reliable sources – including peer-reviewed journals and textbooks – have begun borrowing liberally from Wikipedia without attribution. So if we're looking at copyright issues six months or a year down the road, we need to look at publication dates and go back in the article history to determine who is copying from whom.

In short, it's far more difficult for both humans and machines.

Why Turnitin?

Turnitin is an Internet-based plagiarism-prevention service created by iParadigms, LLC, first launched in 1997; it is one of the strategies used by some universities and schools to minimise plagiarism in student writing. The company that developed and owns the product has agreed to give us free access to their tools and API. Even though it's a for-profit company, there won't be obtrusive links from Wikipedia to their site, and no advertising for them will ever appear on Wikipedia.

Why would they want to be involved with us? Letting us use their tools doesn't cost them anything and is no disadvantage to shareholders. Some companies are willing to help us just because they like what we do. We've had a number of publishers donate large numbers of accounts to Wikipedians for similar reasons. They have extra capacity just sitting there, so why not give it away? They also know we're volunteers and are not going to buy their capacity anyway. Other options could include Google, but they don't allow their services to be used in this way, and it appears that Yahoo is currently charging for use by User:CorenSearchBot, which checks new articles for issues.

Benefits

How many edits are we looking at? Currently the bot is running only on the English Wikipedia's medical articles. In 2013, there were 400,000 edits to medical content – around 1,100 edits per day. Of these only about 10% are of significant size and not a revert, so we're looking at an average of around maybe 100 edits per day. If we assume a 10% rate of copyright concerns and three times as many false positives as true positives, we're looking at 40 edits per day at most. Who would follow-up? With the number of concerning edits in the range of 40 per day, members of WikiProject Medicine will be able to handle the load. This is much easier than catching 30,000 edits of copyright infringement after the fact, with clean-up taking many of us away from writing content for many days.

The Wiki Education Foundation has expressed interest in the development of this tool, since edits by students have previously contained significant amounts of plagiarism, kindling much discontent with Wiki Education's predecessor. The Hebrew Wikipedia is also currently working with this bot, and we'd be happy to see other topic areas and WMF language sites use it.

There are still a few rough aspects to iron out. The parsing out of the new text added by an edit is not as good as it could be. Reverts should be ignored. These issues are fairly minor to address, and a number have already been dealt with. While there were initially about three false positives for every true positive, we should have this down to a more even 50–50 split by the end of the week. Already in its early stages, this has turned out to be an exceedingly useful tool.

The views expressed in this opinion piece are those of the author only; responses and critical commentary are invited in the comments section. Editors wishing to propose their own Signpost contribution should email the Signpost's editor in chief.