Skip to content

Fix diffWords handling of whitespace #497

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 56 commits into from
Mar 11, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
56 commits
Select commit Hold shift + click to select a range
527d64b
Add comment outlining the new logic I intend to implement
ExplodingCabbage Feb 19, 2024
ac8dc4a
Merge remote-tracking branch 'origin/master' into fix-diffWords
ExplodingCabbage Feb 21, 2024
3eb79a2
Implement the plan
ExplodingCabbage Feb 21, 2024
b01be29
More progress
ExplodingCabbage Feb 21, 2024
e039ae9
Further WIP
ExplodingCabbage Feb 22, 2024
77d6741
Add utils file I forgot to commit
ExplodingCabbage Feb 22, 2024
9208174
Fix tokenize test
ExplodingCabbage Feb 22, 2024
a5e67fe
Add some extra tests
ExplodingCabbage Feb 22, 2024
b72b4cc
Fix some bugs
ExplodingCabbage Mar 5, 2024
2f58b02
Fix more bugs
ExplodingCabbage Mar 5, 2024
2a16465
Remove a test by rolling it into an earlier one
ExplodingCabbage Mar 5, 2024
577b25d
Fix another test
ExplodingCabbage Mar 5, 2024
6ef6821
Update another test
ExplodingCabbage Mar 5, 2024
77b1efe
Eliminate test duplication
ExplodingCabbage Mar 5, 2024
ea2d181
Move anchor test to be for diffWordsWithSpace since it no longer make…
ExplodingCabbage Mar 5, 2024
513e865
Merge branch 'master' into fix-diffWords
ExplodingCabbage Mar 5, 2024
1caa0d5
Fix tests of counts
ExplodingCabbage Mar 5, 2024
68c6505
Update another test
ExplodingCabbage Mar 5, 2024
e170ade
Nuke test of removed ignoreWhitespace flag
ExplodingCabbage Mar 5, 2024
221365e
Add TODO
ExplodingCabbage Mar 5, 2024
144a26d
Support async
ExplodingCabbage Mar 6, 2024
130fc56
Add TODO
ExplodingCabbage Mar 6, 2024
45968a9
Fix newline handling in diffWordsWithSpace
ExplodingCabbage Mar 6, 2024
7387d1f
Merge remote-tracking branch 'origin/master' into fix-diffWords
ExplodingCabbage Mar 6, 2024
ebcbb6d
Fix a minor oopsie (but it doesn't fix the test failure...)
ExplodingCabbage Mar 6, 2024
1b5595c
Fix test failure about Windows newlines
ExplodingCabbage Mar 6, 2024
76ce2a3
Document postProcess
ExplodingCabbage Mar 6, 2024
51f7cf5
Remove done TODO
ExplodingCabbage Mar 6, 2024
8f02ce9
Add (failing) test of one-change-per-token mode
ExplodingCabbage Mar 6, 2024
aaa865c
Get test passing
ExplodingCabbage Mar 6, 2024
0fd6c09
Fix typo
ExplodingCabbage Mar 6, 2024
3108dfa
Tweak code to improve coverage stat
ExplodingCabbage Mar 6, 2024
1a26257
Remove unreachable code to placate istanbul
ExplodingCabbage Mar 6, 2024
2c3ccd8
Add an extra test to improve coverage
ExplodingCabbage Mar 6, 2024
29eff79
Replace for...of loops with .forEach to placate istanbul
ExplodingCabbage Mar 6, 2024
5cbb17a
Begin testing string utils
ExplodingCabbage Mar 6, 2024
d09ace3
Add another test; fix serious bugs
ExplodingCabbage Mar 6, 2024
493fbd2
Test another util
ExplodingCabbage Mar 6, 2024
c36e533
Test more
ExplodingCabbage Mar 6, 2024
42da26e
Add more tests
ExplodingCabbage Mar 6, 2024
5bda111
Test more
ExplodingCabbage Mar 7, 2024
549ed5f
Remove TODOs
ExplodingCabbage Mar 7, 2024
f1398fa
Update docs
ExplodingCabbage Mar 7, 2024
f892d72
Remove unneeded generateOptions calls (and improve test coverage as a…
ExplodingCabbage Mar 7, 2024
3da0adb
Purge now-dead code
ExplodingCabbage Mar 7, 2024
c2e9484
Add test to placate coverage metric
ExplodingCabbage Mar 7, 2024
dd43b1f
Add yet another test to fix a util coverage gap I'd introduced
ExplodingCabbage Mar 7, 2024
1827219
Bold changes in release notes
ExplodingCabbage Mar 8, 2024
eab93c6
Add test of ignoreWhitespace option, and restore it
ExplodingCabbage Mar 8, 2024
3f8e9b7
Add release notes
ExplodingCabbage Mar 8, 2024
94f5958
Remove unfinished sentence
ExplodingCabbage Mar 8, 2024
11237c4
Test and document inadvertent (but welcome) fix I accidentally made t…
ExplodingCabbage Mar 8, 2024
75d99ab
Fix typo
ExplodingCabbage Mar 8, 2024
a8bc543
Add more tests and fix more bugs about whitespace
ExplodingCabbage Mar 8, 2024
4304c28
Tweak a lying comment
ExplodingCabbage Mar 8, 2024
05c4775
Merge remote-tracking branch 'origin/master' into fix-diffWords
ExplodingCabbage Mar 11, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,16 +36,14 @@ Broadly, jsdiff's diff functions all take an old text and a new text and perform
Options
* `ignoreCase`: If `true`, the uppercase and lowercase forms of a character are considered equal. Defaults to `false`.

* `Diff.diffWords(oldStr, newStr[, options])` - diffs two blocks of text, treating each word and each word separator (punctuation, newline, or run of whitespace) as a token.

(Whitespace-only tokens are automatically treated as equal to each other, so changes like changing a space to a newline or a run of multiple spaces will be ignored.)
* `Diff.diffWords(oldStr, newStr[, options])` - diffs two blocks of text, treating each word and each punctuation mark as a token. Whitespace is ignored when computing the diff (but preserved as far as possible in the final change objects).

Returns a list of [change objects](#change-objects).

Options
* `ignoreCase`: Same as in `diffChars`. Defaults to false.

* `Diff.diffWordsWithSpace(oldStr, newStr[, options])` - same as `diffWords`, except whitespace-only tokens are not automatically considered equal, so e.g. changing a space to a tab is considered a change.
* `Diff.diffWordsWithSpace(oldStr, newStr[, options])` - diffs two blocks of text, treating each word, punctuation mark, newline, or run of (non-newline) whitespace as a token.

* `Diff.diffLines(oldStr, newStr[, options])` - diffs two blocks of text, treating each line as a token.

Expand Down Expand Up @@ -184,6 +182,7 @@ For even more customisation of the diffing behavior, you can create a `new Diff.
* `removeEmpty(array)`: called on the arrays of tokens returned by `tokenize` and can be used to modify them. Defaults to stripping out falsey tokens, such as empty strings. `diffArrays` overrides this to simply return the `array`, which means that falsey values like empty strings can be handled like any other token by `diffArrays`.
* `equals(left, right, options)`: called to determine if two tokens (one from the old string, one from the new string) should be considered equal. Defaults to comparing them with `===`.
* `join(tokens)`: gets called with an array of consecutive tokens that have either all been added, all been removed, or are all common. Needs to join them into a single value that can be used as the `value` property of the [change object](#change-objects) for these tokens. Defaults to simply returning `tokens.join('')`.
* `postProcess(changeObjects)`: gets called at the end of the algorithm with the [change objects](#change-objects) produced, and can do final cleanups on them. Defaults to simply returning `changeObjects` unchanged.

### Change Objects
Many of the methods above return change objects. These objects consist of the following fields:
Expand Down
26 changes: 17 additions & 9 deletions release-notes.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,17 +4,25 @@

[Commits](https://github.com/kpdecker/jsdiff/compare/master...v6.0.0-staging)

- [#435](https://github.com/kpdecker/jsdiff/pull/435) Fix `parsePatch` handling of control characters. `parsePatch` used to interpret various unusual control characters - namely vertical tabs, form feeds, lone carriage returns without a line feed, and EBCDIC NELs - as line breaks when parsing a patch file. This was inconsistent with the behavior of both JsDiff's own `diffLines` method and also the Unix `diff` and `patch` utils, which all simply treat those control characters as ordinary characters. The result of this discrepancy was that some well-formed patches - produced either by `diff` or by JsDiff itself and handled properly by the `patch` util - would be wrongly parsed by `parsePatch`, with the effect that it would disregard the remainder of a hunk after encountering one of these control characters.
- [#497](https://github.com/kpdecker/jsdiff/pull/497) **`diffWords` behavior has been radically changed.** Previously, even with `ignoreWhitespace: true`, runs of whitespace were tokens, which led to unhelpful and unintuitive diffing behavior in typical texts. Specifically, even when two texts contained overlapping passages, `diffWords` would sometimes choose to delete all the words from the old text and insert them anew in their new positions in order to avoid having to delete or insert whitespace tokens. Whitespace sequences are no longer tokens as of this release, which affects both the generated diffs and the `count`s.

Runs of whitespace are still tokens in `diffWordsWithSpace`.

As part of the changes to `diffWords`, **a new `.postProcess` method has been added on the base `Diff` type**, which can be overridden in custom `Diff` implementations.

**`diffLines` with `ignoreWhitespace: true` will no longer ignore the insertion or deletion of entire extra lines of whitespace at the end of the text**. Previously, these would not show up as insertions or deletions, as a side effect of a hack in the base diffing algorithm meant to help ignore whitespace in `diffWords`. More generally, **the undocumented special handling in the core algorithm for ignored terminals has been removed entirely.** (This special case behavior used to rewrite the final two change objects in a scenario where the final change object was an addition or deletion and its `value` was treated as equal to the empty string when compared using the diff object's `.equals` method.)

- [#500](https://github.com/kpdecker/jsdiff/pull/500) **`diffChars` now diffs Unicode code points** instead of UTF-16 code units.
- [#439](https://github.com/kpdecker/jsdiff/pull/439) Prefer diffs that order deletions before insertions. When faced with a choice between two diffs with an equal total edit distance, the Myers diff algorithm generally prefers one that does deletions before insertions rather than insertions before deletions. For instance, when diffing `abcd` against `acbd`, it will prefer a diff that says to delete the `b` and then insert a new `b` after the `c`, over a diff that says to insert a `c` before the `b` and then delete the existing `c`. JsDiff deviated from the published Myers algorithm in a way that led to it having the opposite preference in many cases, including that example. This is now fixed, meaning diffs output by JsDiff will more accurately reflect what the published Myers diff algorithm would output.
- [#455](https://github.com/kpdecker/jsdiff/pull/455) The `added` and `removed` properties of change objects are now guaranteed to be set to a boolean value. (Previously, they would be set to `undefined` or omitted entirely instead of setting them to false.)
- [#435](https://github.com/kpdecker/jsdiff/pull/435) **Fix `parsePatch` handling of control characters.** `parsePatch` used to interpret various unusual control characters - namely vertical tabs, form feeds, lone carriage returns without a line feed, and EBCDIC NELs - as line breaks when parsing a patch file. This was inconsistent with the behavior of both JsDiff's own `diffLines` method and also the Unix `diff` and `patch` utils, which all simply treat those control characters as ordinary characters. The result of this discrepancy was that some well-formed patches - produced either by `diff` or by JsDiff itself and handled properly by the `patch` util - would be wrongly parsed by `parsePatch`, with the effect that it would disregard the remainder of a hunk after encountering one of these control characters.
- [#439](https://github.com/kpdecker/jsdiff/pull/439) **Prefer diffs that order deletions before insertions.** When faced with a choice between two diffs with an equal total edit distance, the Myers diff algorithm generally prefers one that does deletions before insertions rather than insertions before deletions. For instance, when diffing `abcd` against `acbd`, it will prefer a diff that says to delete the `b` and then insert a new `b` after the `c`, over a diff that says to insert a `c` before the `b` and then delete the existing `c`. JsDiff deviated from the published Myers algorithm in a way that led to it having the opposite preference in many cases, including that example. This is now fixed, meaning diffs output by JsDiff will more accurately reflect what the published Myers diff algorithm would output.
- [#455](https://github.com/kpdecker/jsdiff/pull/455) **The `added` and `removed` properties of change objects are now guaranteed to be set to a boolean value.** (Previously, they would be set to `undefined` or omitted entirely instead of setting them to false.)
- [#464](https://github.com/kpdecker/jsdiff/pull/464) Specifying `{maxEditLength: 0}` now sets a max edit length of 0 instead of no maximum.
- [#460](https://github.com/kpdecker/jsdiff/pull/460) Added `oneChangePerToken` option.
- [#467](https://github.com/kpdecker/jsdiff/pull/467) When passing a `comparator(left, right)` to `diffArrays`, values from the old array will now consistently be passed as the first argument (`left`) and values from the new array as the second argument (`right`). Previously this was almost (but not quite) always the other way round.
- [#480](https://github.com/kpdecker/jsdiff/pull/480) Passing `maxEditLength` to `createPatch` & `createTwoFilesPatch` now works properly (i.e. returns undefined if the max edit distance is exceeded; previous behavior was to crash with a `TypeError` if the edit distance was exceeded).
- [#486](https://github.com/kpdecker/jsdiff/pull/486) The `ignoreWhitespace` option of `diffLines` behaves more sensibly now. `value`s in returned change objects now include leading/trailing whitespace even when `ignoreWhitespace` is used, just like how with `ignoreCase` the `value`s still reflect the case of one of the original texts instead of being all-lowercase. `ignoreWhitespace` is also now compatible with `newlineIsToken`. Finally, `diffTrimmedLines` is deprecated (and removed from the docs) in favour of using `diffLines` with `ignoreWhitespace: true`; the two are, and always have been, equivalent.
- [#490](https://github.com/kpdecker/jsdiff/pull/490) When calling diffing functions in async mode by passing a `callback` option, the diff result will now be passed as the *first* argument to the callback instead of the second. (Previously, the first argument was never used at all and would always have value `undefined`.)
- [#489](github.com/kpdecker/jsdiff/pull/489) `this.options` no longer exists on `Diff` objects. Instead, `options` is now passed as an argument to methods that rely on options, like `equals(left, right, options)`. This fixes a race condition in async mode, where diffing behaviour could be changed mid-execution if a concurrent usage of the same `Diff` instances overwrote its `options`.
- [#460](https://github.com/kpdecker/jsdiff/pull/460) **Added `oneChangePerToken` option.**
- [#467](https://github.com/kpdecker/jsdiff/pull/467) **Consistent ordering of arguments to `comparator(left, right)`.** Values from the old array will now consistently be passed as the first argument (`left`) and values from the new array as the second argument (`right`). Previously this was almost (but not quite) always the other way round.
- [#480](https://github.com/kpdecker/jsdiff/pull/480) **Passing `maxEditLength` to `createPatch` & `createTwoFilesPatch` now works properly** (i.e. returns undefined if the max edit distance is exceeded; previous behavior was to crash with a `TypeError` if the edit distance was exceeded).
- [#486](https://github.com/kpdecker/jsdiff/pull/486) **The `ignoreWhitespace` option of `diffLines` behaves more sensibly now.** `value`s in returned change objects now include leading/trailing whitespace even when `ignoreWhitespace` is used, just like how with `ignoreCase` the `value`s still reflect the case of one of the original texts instead of being all-lowercase. `ignoreWhitespace` is also now compatible with `newlineIsToken`. Finally, **`diffTrimmedLines` is deprecated** (and removed from the docs) in favour of using `diffLines` with `ignoreWhitespace: true`; the two are, and always have been, equivalent.
- [#490](https://github.com/kpdecker/jsdiff/pull/490) **When calling diffing functions in async mode by passing a `callback` option, the diff result will now be passed as the *first* argument to the callback instead of the second.** (Previously, the first argument was never used at all and would always have value `undefined`.)
- [#489](github.com/kpdecker/jsdiff/pull/489) **`this.options` no longer exists on `Diff` objects.** Instead, `options` is now passed as an argument to methods that rely on options, like `equals(left, right, options)`. This fixes a race condition in async mode, where diffing behaviour could be changed mid-execution if a concurrent usage of the same `Diff` instances overwrote its `options`.

## v5.2.0

Expand Down
27 changes: 7 additions & 20 deletions src/diff/base.js
Original file line number Diff line number Diff line change
Expand Up @@ -11,6 +11,7 @@ Diff.prototype = {
let self = this;

function done(value) {
value = self.postProcess(value, options);
if (callback) {
setTimeout(function() { callback(value); }, 0);
return true;
Expand Down Expand Up @@ -41,7 +42,7 @@ Diff.prototype = {
let newPos = this.extractCommon(bestPath[0], newString, oldString, 0, options);
if (bestPath[0].oldPos + 1 >= oldLen && newPos + 1 >= newLen) {
// Identity per the equality and tokenizer
return done(buildValues(self, bestPath[0].lastComponent, newString, oldString, self.useLongestToken, options));
return done(buildValues(self, bestPath[0].lastComponent, newString, oldString, self.useLongestToken));
}

// Once we hit the right edge of the edit graph on some diagonal k, we can
Expand Down Expand Up @@ -105,7 +106,7 @@ Diff.prototype = {

if (basePath.oldPos + 1 >= oldLen && newPos + 1 >= newLen) {
// If we have hit the end of both strings, then we are done
return done(buildValues(self, basePath.lastComponent, newString, oldString, self.useLongestToken, options));
return done(buildValues(self, basePath.lastComponent, newString, oldString, self.useLongestToken));
} else {
bestPath[diagonalPath] = basePath;
if (basePath.oldPos + 1 >= oldLen) {
Expand Down Expand Up @@ -209,10 +210,13 @@ Diff.prototype = {
},
join(chars) {
return chars.join('');
},
postProcess(changeObjects) {
return changeObjects;
}
};

function buildValues(diff, lastComponent, newString, oldString, useLongestToken, options) {
function buildValues(diff, lastComponent, newString, oldString, useLongestToken) {
// First we convert our linked list of components in reverse order to an
// array in the right order:
const components = [];
Expand Down Expand Up @@ -256,22 +260,5 @@ function buildValues(diff, lastComponent, newString, oldString, useLongestToken,
}
}

// Special case handle for when one terminal is ignored (i.e. whitespace).
// For this case we merge the terminal into the prior string and drop the change.
// This is only available for string mode.
let finalComponent = components[componentLen - 1];
if (
componentLen > 1
&& typeof finalComponent.value === 'string'
&& (
(finalComponent.added && diff.equals('', finalComponent.value, options))
||
(finalComponent.removed && diff.equals(finalComponent.value, '', options))
)
) {
components[componentLen - 2].value += finalComponent.value;
components.pop();
}

return components;
}
Loading