Skip to content

Fix the regressions introduced in the fix for #89 #120

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
127 changes: 71 additions & 56 deletions diffmatchpatch/diff.go
Original file line number Diff line number Diff line change
Expand Up @@ -34,8 +34,6 @@ const (
DiffInsert Operation = 1
// DiffEqual item represents an equal diff.
DiffEqual Operation = 0
//IndexSeparator is used to seperate the array indexes in an index string
IndexSeparator = ","
)

// Diff represents one diff operation
Expand Down Expand Up @@ -195,7 +193,7 @@ func (dmp *DiffMatchPatch) diffCompute(text1, text2 []rune, checklines bool, dea
// diffLineMode does a quick line-level diff on both []runes, then rediff the parts for greater accuracy. This speedup can produce non-minimal diffs.
func (dmp *DiffMatchPatch) diffLineMode(text1, text2 []rune, deadline time.Time) []Diff {
// Scan the text on a line-by-line basis first.
text1, text2, linearray := dmp.DiffLinesToRunes(string(text1), string(text2))
text1, text2, linearray := dmp.diffLinesToRunes(text1, text2)

diffs := dmp.diffMainRunes(text1, text2, false, deadline)

Expand Down Expand Up @@ -392,28 +390,88 @@ func (dmp *DiffMatchPatch) diffBisectSplit(runes1, runes2 []rune, x, y int,
// DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/educes/reduces

// It's slightly faster to call DiffLinesToRunes first, followed by DiffMainRunes.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this comment still accurate?

func (dmp *DiffMatchPatch) DiffLinesToChars(text1, text2 string) (string, string, []string) {
chars1, chars2, lineArray := dmp.diffLinesToStrings(text1, text2)
return chars1, chars2, lineArray
chars1, chars2, lineArray := dmp.DiffLinesToRunes(text1, text2)
return string(chars1), string(chars2), lineArray
}

// DiffLinesToRunes splits two texts into a list of runes.
// DiffLinesToRunes splits two texts into a list of runes. Each rune represents one line.
func (dmp *DiffMatchPatch) DiffLinesToRunes(text1, text2 string) ([]rune, []rune, []string) {
chars1, chars2, lineArray := dmp.diffLinesToStrings(text1, text2)
return []rune(chars1), []rune(chars2), lineArray
// '\x00' is a valid character, but various debuggers don't like it. So we'll insert a junk entry to avoid generating a null character.
lineArray := []string{""} // e.g. lineArray[4] == 'Hello\n'
lineHash := map[string]int{} // e.g. lineHash['Hello\n'] == 4

chars1 := dmp.diffLinesToRunesMunge(text1, &lineArray, lineHash)
chars2 := dmp.diffLinesToRunesMunge(text2, &lineArray, lineHash)

return chars1, chars2, lineArray
}

func (dmp *DiffMatchPatch) diffLinesToRunes(text1, text2 []rune) ([]rune, []rune, []string) {
return dmp.DiffLinesToRunes(string(text1), string(text2))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about all this unnecessary allocation converting to string, when it really shouldn't be necessary.

Did you compare benchmark performance?

}

// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The words 'array', 'list', and 'slice' are used interchangeably throughout, but array is really incorrect in this context.

// We use strings instead of []runes as input mainly because you can't use []rune as a map key.
func (dmp *DiffMatchPatch) diffLinesToRunesMunge(text string, lineArray *[]string, lineHash map[string]int) []rune {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm staring at this, and I don't see how the previous algorithm worked without passing in lineHash. Is that the root cause of the bug?

Can you explain somewhere the exact nature of the bug in the fix for #89?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the root cause of the bug is described in the PR description:

The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

...so array indexes could be corrupted/split in the diff output, right?

It's worth explaining in a comment.

// Walk the text, pulling out a substring for each line. text.split('\n') would would temporarily double our memory footprint. Modifying text would create many large strings to garbage collect.
lineStart := 0
lineEnd := -1
runes := []rune{}

for lineEnd < len(text)-1 {
lineEnd = indexOf(text, "\n", lineStart)

if lineEnd == -1 {
lineEnd = len(text) - 1
}

line := text[lineStart : lineEnd+1]
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'd be interested to know how performance would be affected by using an actual hash to look up the rune value in the index, rather than a string. Then we could convert to []byte at API boundaries and avoid allocating again.

But that doesn't need to be done for this PR.

lineStart = lineEnd + 1
lineValue, ok := lineHash[line]
if !ok {
checkLineArray(lineArray)

*lineArray = append(*lineArray, line)
lineValue = len(*lineArray) - 1
lineHash[line] = lineValue
}
runes = append(runes, rune(lineValue))
}

return runes
}

// checkLineArray checks the size of the slice and ensures that the index of the next element
// will be the valid rune.
func checkLineArray(a *[]string) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have functions

// lineRune returns a valid rune value for the line at the given index
func lineRune(idx int) rune

// runeLine looks up the given rune hash in the lines slice.
func runeLine(idxRune rune, lines []string) string

Thereby avoiding the padding here. It could also be used to avoid the prepended "".

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it can break the user code because DiffLinesToRunes returns the lineArray. If a user has something like DiffCharsToLines, it can panic with index out of range.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that DiffLinesToRunes could use lineRune to convert to rune values, and DiffCharsToLines can use runeLine to convert to line indexes. These functions would be inverses of eachother.

// Runes in this range are invalid, utf8.ValidRune() returns false.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is very subtle. Please explain further in this comment why this is a problem.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment describes why runes must be valid: #89 (comment)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Understood. I think it's worth explaining here (you can reference #89).

const (
surrogateMin = 0xD800
surrogateMax = 0xDFFF
)

// Check the index of the next element
switch len(*a) {
case surrogateMin:
// Skip invalid runes.
padding := [surrogateMax - surrogateMin + 1]string{}
*a = append(*a, padding[:]...)

case utf8.MaxRune + 1:
// We can't do anything about it.
panic(fmt.Sprintf("rune can't be more than %d", utf8.MaxRune))
}
}

// DiffCharsToLines rehydrates the text in a diff from a string of line hashes to real lines of text.
func (dmp *DiffMatchPatch) DiffCharsToLines(diffs []Diff, lineArray []string) []Diff {
hydrated := make([]Diff, 0, len(diffs))
for _, aDiff := range diffs {
chars := strings.Split(aDiff.Text, IndexSeparator)
chars := aDiff.Text
text := make([]string, len(chars))

for i, r := range chars {
i1, err := strconv.Atoi(r)
if err == nil {
text[i] = lineArray[i1]
}
text[i] = lineArray[r]
}

aDiff.Text = strings.Join(text, "")
Expand Down Expand Up @@ -1307,46 +1365,3 @@ func (dmp *DiffMatchPatch) DiffFromDelta(text1 string, delta string) (diffs []Di

return diffs, nil
}

// diffLinesToStrings splits two texts into a list of strings. Each string represents one line.
func (dmp *DiffMatchPatch) diffLinesToStrings(text1, text2 string) (string, string, []string) {
// '\x00' is a valid character, but various debuggers don't like it. So we'll insert a junk entry to avoid generating a null character.
lineArray := []string{""} // e.g. lineArray[4] == 'Hello\n'

//Each string has the index of lineArray which it points to
strIndexArray1 := dmp.diffLinesToStringsMunge(text1, &lineArray)
strIndexArray2 := dmp.diffLinesToStringsMunge(text2, &lineArray)

return intArrayToString(strIndexArray1), intArrayToString(strIndexArray2), lineArray
}

// diffLinesToStringsMunge splits a text into an array of strings, and reduces the texts to a []string.
func (dmp *DiffMatchPatch) diffLinesToStringsMunge(text string, lineArray *[]string) []uint32 {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reorganize to minimize the diff with this refactored method.

// Walk the text, pulling out a substring for each line. text.split('\n') would would temporarily double our memory footprint. Modifying text would create many large strings to garbage collect.
lineHash := map[string]int{} // e.g. lineHash['Hello\n'] == 4
lineStart := 0
lineEnd := -1
strs := []uint32{}

for lineEnd < len(text)-1 {
lineEnd = indexOf(text, "\n", lineStart)

if lineEnd == -1 {
lineEnd = len(text) - 1
}

line := text[lineStart : lineEnd+1]
lineStart = lineEnd + 1
lineValue, ok := lineHash[line]

if ok {
strs = append(strs, uint32(lineValue))
} else {
*lineArray = append(*lineArray, line)
lineHash[line] = len(*lineArray) - 1
strs = append(strs, uint32(len(*lineArray)-1))
}
}

return strs
}
67 changes: 47 additions & 20 deletions diffmatchpatch/diff_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -314,10 +314,10 @@ func TestDiffLinesToChars(t *testing.T) {
dmp := New()

for i, tc := range []TestCase{
{"", "alpha\r\nbeta\r\n\r\n\r\n", "", "1,2,3,3", []string{"", "alpha\r\n", "beta\r\n", "\r\n"}},
{"a", "b", "1", "2", []string{"", "a", "b"}},
{"", "alpha\r\nbeta\r\n\r\n\r\n", "", "\u0001\u0002\u0003\u0003", []string{"", "alpha\r\n", "beta\r\n", "\r\n"}},
{"a", "b", "\u0001", "\u0002", []string{"", "a", "b"}},
// Omit final newline.
{"alpha\nbeta\nalpha", "", "1,2,3", "", []string{"", "alpha\n", "beta\n", "alpha"}},
{"alpha\nbeta\nalpha", "", "\u0001\u0002\u0003", "", []string{"", "alpha\n", "beta\n", "alpha"}},
} {
actualChars1, actualChars2, actualLines := dmp.DiffLinesToChars(tc.Text1, tc.Text2)
assert.Equal(t, tc.ExpectedChars1, actualChars1, fmt.Sprintf("Test case #%d, %#v", i, tc))
Expand All @@ -330,14 +330,14 @@ func TestDiffLinesToChars(t *testing.T) {
lineList := []string{
"", // Account for the initial empty element of the lines array.
}
var charList []string
var charList []rune
for x := 1; x < n+1; x++ {
lineList = append(lineList, strconv.Itoa(x)+"\n")
charList = append(charList, strconv.Itoa(x))
charList = append(charList, rune(x))
}
lines := strings.Join(lineList, "")
chars := strings.Join(charList[:], ",")
assert.Equal(t, n, len(strings.Split(chars, ",")))
chars := string(charList)
assert.Equal(t, n, utf8.RuneCountInString(chars))

actualChars1, actualChars2, actualLines := dmp.DiffLinesToChars(lines, "")
assert.Equal(t, chars, actualChars1)
Expand All @@ -358,8 +358,8 @@ func TestDiffCharsToLines(t *testing.T) {
for i, tc := range []TestCase{
{
Diffs: []Diff{
{DiffEqual, "1,2,1"},
{DiffInsert, "2,1,2"},
{DiffEqual, "\u0001\u0002\u0001"},
{DiffInsert, "\u0002\u0001\u0002"},
},
Lines: []string{"", "alpha\n", "beta\n"},

Expand All @@ -378,15 +378,14 @@ func TestDiffCharsToLines(t *testing.T) {
lineList := []string{
"", // Account for the initial empty element of the lines array.
}
charList := []string{}
charList := []rune{}
for x := 1; x <= n; x++ {
lineList = append(lineList, strconv.Itoa(x)+"\n")
charList = append(charList, strconv.Itoa(x))
charList = append(charList, rune(x))
}
assert.Equal(t, n, len(charList))
chars := strings.Join(charList[:], ",")

actual := dmp.DiffCharsToLines([]Diff{Diff{DiffDelete, chars}}, lineList)
actual := dmp.DiffCharsToLines([]Diff{Diff{DiffDelete, string(charList)}}, lineList)
assert.Equal(t, []Diff{Diff{DiffDelete, strings.Join(lineList, "")}}, actual)
}

Expand Down Expand Up @@ -1419,16 +1418,12 @@ func TestDiffMainWithTimeout(t *testing.T) {
}

func TestDiffMainWithCheckLines(t *testing.T) {
// Test cases must be at least 100 chars long to pass the cutoff.
type TestCase struct {
Text1 string
Text2 string
}

dmp := New()
dmp.DiffTimeout = 0

// Test cases must be at least 100 chars long to pass the cutoff.
for i, tc := range []TestCase{
tests := []TestCase{
{
"1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n",
"abcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\nabcdefghij\n",
Expand All @@ -1441,7 +1436,39 @@ func TestDiffMainWithCheckLines(t *testing.T) {
"1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n1234567890\n",
"abcdefghij\n1234567890\n1234567890\n1234567890\nabcdefghij\n1234567890\n1234567890\n1234567890\nabcdefghij\n1234567890\n1234567890\n1234567890\nabcdefghij\n",
},
} {
}
// Add test case for the issue #115
func() {
const before = `package main

import (
"fmt"
)

/*
func upload(c echo.Context) error {
if err := r.ParseForm(); err != nil {
fmt.Fprintf(w, "ParseForm() err: %v", err)
return
}
fmt.Fprintf(w, "POST request successful")
path_ver := r.FormValue("path_ver")
ukclin_ver := r.FormValue("ukclin_ver")

fmt.Fprintf(w, "Name = %s\n", path_ver)
fmt.Fprintf(w, "Address = %s\n", ukclin_ver)
}
*/
`
after := strings.ReplaceAll(before, "\n", "\r\n")

tests = append(tests, TestCase{Text1: before, Text2: after})
}()

dmp := New()
dmp.DiffTimeout = 0

for i, tc := range tests {
resultWithoutCheckLines := dmp.DiffMain(tc.Text1, tc.Text2, false)
resultWithCheckLines := dmp.DiffMain(tc.Text1, tc.Text2, true)

Expand Down
18 changes: 0 additions & 18 deletions diffmatchpatch/stringutil.go
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,6 @@
package diffmatchpatch

import (
"strconv"
"strings"
"unicode/utf8"
)
Expand Down Expand Up @@ -87,20 +86,3 @@ func runesIndex(r1, r2 []rune) int {
}
return -1
}

func intArrayToString(ns []uint32) string {
if len(ns) == 0 {
return ""
}

indexSeparator := IndexSeparator[0]

// Appr. 3 chars per num plus the comma.
b := []byte{}
for _, n := range ns {
b = strconv.AppendInt(b, int64(n), 10)
b = append(b, indexSeparator)
}
b = b[:len(b)-1]
return string(b)
}