User:Cacycle/diff
wikEd diff is a free JavaScript visual diff library for inline text comparisons. It is the only available JavaScript diff library that detects and highlights block moves and that works on the word and character level. While wikEd diff has been developed and optimized for comparing Wikipedia source texts, it works great for any type of text, including program code. The library is customizable, has Unicode and multilingual support, is fully commented and documented, and is free (public domain). The script is used by the Wikipedia/MediaWiki in-browser editor wikEd and by the gadget wikEdDiff. You can test the library by checking any of these gadgets in your English Wikipedia preferences. For easy text comparisons by copy-and-pasting you can also use the wikEd diff online tool and demo. The library has also been ported to PHP in the form of a MediaWiki extension.
Features
[edit]- Visual inline style, changes are shown in a single output text
- Block move detection and highlighting
- Resolution down to characters level
- Unicode and multilingual support
- Stepwise split (paragraphs, sentences, words, characters)
- Recursive diff
- Optimized code for resolving unmatched sequences
- Minimization of length of moved blocks
- Alignment of ambiguous unmatched sequences to next line break or word border
- Clipping of unchanged irrelevant parts from the output (optional)
- Fully customizable
- Text split optimized for MediaWiki source texts
- Well commented and documented code
Code
[edit]The JavaScript code can be found under User:Cacycle/diff.js (raw code).
Supported browsers
[edit]The code works for all browsers, including Internet Explorer 8 under Windows XP.
Usage
[edit]To call this library, please use the following code:
var wikEdDiff = new WikEdDiff();
var diffHtml = wikEdDiff.diff( oldVersionString, newVersionString );
To raw code of the library is available under:
https://en.wikipedia.org/w/index.php?title=User:Cacycle/diff.js&action=raw&ctype=text/javascript
To use the library in other JavaScript programs, add the following line to the HTML <head> tag:
<script src="//en.wikipedia.org/w/index.php?title=User:Cacycle/diff.js&action=raw&ctype=text/javascript"></script>
Customization
[edit]The library is fully customizable. The following settings can be changed on your User:YourUsername/common.js page. Please check the top of the diff.js program code for additional settings.Please see wikEdDiff for wikEdDiff-specific settings.
Colors can be customized by adding CSS code to your User:YourUsername/common.css page. Please check the code for available CSS classes.
Block mark symbols:
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.cssMarkLeft = '◀';
wikEdDiffConfig.cssMarkRight = '▶';
Show complete un-clipped diff text:
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.fullDiff = true;
Enable block move layout with highlighted blocks and marks at their original positions (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.showBlockMoves = true;
Reject blocks if they are too short and their words are not unique, prevents fragmentated diffs for very different versions (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.unlinkBlocks = true;
Reject blocks if shorter than this number of real words (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.blockMinLength = 3;
Enable character-refined diff (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.charDiff = true;
Enable recursive diff to resolve problematic sequences (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.recursiveDiff = true;
Maximum recursion depth (default):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.recursionMax = 10;
Display blocks in differing colors (rainbow color scheme):
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.coloredBlocks = false;
Show debug infos and stats (block and group data object) in browser console:
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.debug = true; }
Show timing results in browser console:
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.debugTime = true; }
Run unit tests to prove correct working, display results in browser console:
var wikEdDiffConfig; if (wikEdDiffConfig === undefined) { wikEdDiffConfig = {}; }
wikEdDiffConfig.unitTesting = true;
Text type customization
[edit]The regular expressions for splitting the texts are optimized for natural language containing wiki markup. While this default also works great for other text types such as source code, it can be customized:
wikEdDiffConfig.regExp.split = {
paragraph: new RegExp('…', 'g'),
line: new RegExp('…', 'g'),
sentence: new RegExp('…', 'g'),
word: new RegExp('…', 'g'),
character: new RegExp('…', 'g')
};
Implementation
[edit]The code is an implementation and extension of the following basic algorithm:
- Paul Heckel: A technique for isolating differences between files
- Communications of the ACM 21(4):264 (1978)
- http://doi.acm.org/10.1145/386360.386367 (access restricted)
- Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)
This method is word-based and uses unique words as anchor points to identify matching text and moved blocks.
wikEd diff has additional code for resolving unmatched islands that are caused by common tokens at the border of sequences of common tokens or by sequence crossing-overs. From these matching data, the program extracts moved blocks and compiles a new text with added insertion, deletion, and moved blocks, and block marks.
Details
[edit]- Matching tokens (Heckel method):
- A list of real words and chunks is compiled from the old and the new text for later "unlinking" step
wordParse()
- Old and new text versions are split into tokens ("symbols", "words"), creating two token lists
splitText()
.tokens
- The number of token occurrences in each version are counted in the symbol tables
calculateDiff()
symbol
- Tokens unique in both texts serve as anchors or seed crystals and are connected ("linked", "matched") between old and new version
calculateDiff()
.tokens.link
- Starting from these connected pairs, adjacent identical pairs are connected outwards
- In addition to the classical Heckel method, the following improvements have been implemented:
- The process is repeated ("levels") while decreasing the size of unmatched tokens from paragraphs to lines, chunks (i.e. html and link markup), words, and single characters
- Words are split into characters only in corresponding unmatched regions ("gaps") of identical token number and strong similarity in all tokens:
splitRefineChars()
- Identical tokens (i.e. space separators) will be linked, resulting in word-wise character-level linking
- One token became connected or separated by space or dash (or any token) (axb → a b)
- Addition or deletion of flanking strings in tokens (xabx → ab)
- Addition or deletion of internal string in tokens (axb → ab)
- Same length and at least 50 % identity (abc → abx)
- Same start or end, same text longer than different text (abcd → abcxx)
- Same length and at least 50 % identity (abcd → abcxx)
- Words are split into characters only in corresponding unmatched regions ("gaps") of identical token number and strong similarity in all tokens:
- Gaps caused by "crossing-overs" are resolved by repeating the procedure at each level with a fresh symbol table for unlinked tokens
- Corresponding gaps caused by non-unique tokens on both sides of the gap borders are resolved by repeating the procedure at each level for each gap pair recursively
- Ambiguous gaps with the same front and back tokens are aligned with newlines or word borders if possible ("sliding")
slideGaps()
- The process is repeated ("levels") while decreasing the size of unmatched tokens from paragraphs to lines, chunks (i.e. html and link markup), words, and single characters
- A list of real words and chunks is compiled from the old and the new text for later "unlinking" step
- Extracting block information:
detectBlocks()
- Unchanged text blocks ('same' blocks, sequences of matched tokens) are added to the blocks table in new text order
getSameBlocks()
blocks
- To set blocks as fixed vs. moved, the longest sequences of blocks in increasing old text order are identified (longest increasing subsequence problem for moved blocks):
- Independent block sections with no block moves outside the sections are identified
getSections()
sections
- Continuous blocks in old text order are identified as groups
getGroups()
groups
- The longest subsequence of groups in sections in old text order is identified using a cached recursive method
findMaxPath()
and set fixedsetFixed()
- Blocks outside sections are set fixed
- Independent block sections with no block moves outside the sections are identified
- Unchanged text blocks ('same' blocks, sequences of matched tokens) are added to the blocks table in new text order
- Preventing fragmented outputs caused by false positive matchings in texts with many changes:
unlinkBlocks()
- Unchanged blocks are converted into insertion/deletion pairs in token lists ("unlinking"):
- Unlink whole group if no block is shorter than three real words
blockMinLength
and contains no word unique to the whole text - Unlink group blocks stepwise from the front and the end until a block contains more than one real word or a word unique to the whole text
- Unlink whole group if no block is shorter than three real words
- After unlinking, the whole process of extracting block information from the token lists has to be repeated
- Unlinking is repeated as long as possible
- Unchanged blocks are converted into insertion/deletion pairs in token lists ("unlinking"):
- Deletion blocks ('del' blocks, unmatched sequences of text blocks from the old text token list) are added to the blocks table
getDelBlocks()
- Deletion blocks are sorted-in in between the unchanged blocks at the correct position in new text order:
positionDelBlocks()
- Deletion blocks that are next to a fixed neighbor block in old text order are positioned next to that block
- Deletion blocks without a fixed neighbor in old text order are positioned next to a neighbor if that block is not the start or end of a group
- Alternatively, the deletion block is positioned next to the closest fixed neighbor to the left
- Deletion blocks around unchanged blocks will be arranged in order of their old text positions
- Insertion blocks ('ins' blocks, unmatched text blocks from the new text token list) are added to the blocks table
getInsBlocks()
- Insertion block group numbers and fixed status are set:
setInsGroups()
- If the insertion is inside a group, that group number and fixed status is used
- For insertions outside existing groups a new single-block group is created
- Adding marks at the original position of moved groups ('mark' blocks) to the blocks table:
insertMarks()
- This is similar to positioning deletion blocks
- Marks of moved goups that are next to a fixed group in old text order are positioned next to that group
- Alternatively, the mark is positioned next to the closest fixed group to the left
- Mark blocks and deletion blocks around unchanged blocks will be arranged in order of their old text positions
- Moved group colors are set in order of the marks in the new text
- Collect diff fragment list for markup (abstraction layer for customized diffs):
getDiffFragments()
- Groups are processed in new text order
- If block moves are not displayed as per configuration
showBlockMoves
, then this text is inserted as a deletion - Otherwise, a mark is added with this text as a popup title
- Clip unchanged sections from unmoved block text:
clipDiffFragments()
- Unmoved block text is clipped at the closest feature that is detected from the borders (lines, heading, paragraph, line break, blank, fixed number of chars)
- Create html formatted diff code from:
getDiffHtml()
See also
[edit]- wikEd, a full-featured MediaWiki-integrated text editor that adds enhanced text processing functions to edit pages. wikEd provides syntax highlighting, search and replace functions, and on-page previews and diff views
- wikEdDiff, uses this library to provides an improved and easier to read diff view for comparing article versions
- wikEd diff online tool and demo
- wikEdDiff extension
License
[edit]The code has been released into the public domain.
I, the copyright holder of this work, hereby release it into the public domain. This applies worldwide. If this is not legally possible: |