scripts: add cherry-pick verification tool with fuzzy matching #10034

Open
wants to merge 1 commit into
base: master
223 changes: 223 additions & 0 deletions scripts/fuzzy-match-release-branch.sh
@@ -0,0 +1,223 @@
#!/usr/bin/env bash

set -euo pipefail

SRC_BRANCH=""
RELEASE_BRANCH=""
SRC_SCAN_LIMIT=1000
RELEASE_LIMIT=0

show_help() {
  echo "Usage: $0 --source <branch> --release <branch> [--scan-limit N] [--limit N]"
  echo ""
  echo " --source Branch where cherry-picks originated (e.g. master)"
  echo " --release Branch where cherry-picks landed (e.g. release-rc1)"
  echo " --scan-limit Max commits to scan in source branch (default: 1000)"
  echo " --limit Number of release commits to compare (default: all)"
  exit 1
}

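normalize_patch() {
  # Drop "index <old>..<new> [mode]" lines. The mode is absent when a file is
  # added, deleted, or changes mode, so it is matched optionally.
  sed '/^index [0-9a-f]\{7,\}\.\.[0-9a-f]\{7,\}\( [0-9]\{6\}\)\{0,1\}$/d'
}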

# Parse args
while [[ $# -gt 0 ]]; do
  case "$1" in
    --source|--release|--scan-limit|--limit)
      if [[ -z "${2:-}" || "$2" =~ ^- ]]; then
        echo "Error: Missing value for argument $1" >&2
        show_help
      fi
      case "$1" in
        --source) SRC_BRANCH="$2" ;;
        --release) RELEASE_BRANCH="$2" ;;
        --scan-limit) SRC_SCAN_LIMIT="$2" ;;
        --limit) RELEASE_LIMIT="$2" ;;
      esac
      shift 2
      ;;
    -h|--help) show_help ;;
    *) echo "Unknown argument: $1"; show_help ;;
  esac
done

if [[ -z "$SRC_BRANCH" || -z "$RELEASE_BRANCH" ]]; then
  echo "❌ Missing required arguments."; show_help
fi

# Cross-platform hashing
hash_patch() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum | awk '{print $1}'
  else
    md5 | awk '{print $NF}'
  fi
}
Comment on lines +51 to +56

high

The script uses md5sum or md5 for hashing. MD5 is considered cryptographically broken and should not be used for security-sensitive applications. While this script isn't directly security-sensitive, consider preferring SHA-256 (sha256sum) where it is available for better collision resistance, falling back to MD5 with a warning that it may be less reliable for large numbers of commits. If no hashing program is found at all, the script should exit with an error message.

Suggested change
- if command -v md5sum >/dev/null 2>&1; then
-   md5sum | awk '{print $1}'
- else
-   md5 | awk '{print $NF}'
- fi
- }
+ if command -v sha256sum >/dev/null 2>&1; then
+   sha256sum | awk '{print $1}'
+ elif command -v md5sum >/dev/null 2>&1; then
+   echo "WARNING: Using md5sum for hashing. Consider using sha256sum if available." >&2
+   md5sum | awk '{print $1}'
+ elif command -v md5 >/dev/null 2>&1; then
+   echo "WARNING: Using md5 for hashing. Consider using sha256sum if available." >&2
+   md5 | awk '{print $NF}'
+ else
+   echo "ERROR: No hashing program found (sha256sum, md5sum, or md5)." >&2
+   exit 1
+ fi
+ }

Collaborator Author

For this specific use case md5 is fine.
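
A possible middle ground for this thread, as a minimal sketch: pick the hashing tool once at startup rather than inside every `hash_patch` call, so that any warning is printed a single time instead of once per commit. `pick_hasher` and `HASHER` are hypothetical names that are not part of the PR, and `shasum -a 256` is assumed as the SHA-256 fallback on systems without `sha256sum`.

```bash
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the name of the first available hashing tool.
pick_hasher() {
  local tool
  for tool in sha256sum shasum md5sum md5; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "$tool"
      return 0
    fi
  done
  return 1
}

HASHER=$(pick_hasher) || { echo "ERROR: no hashing tool found" >&2; exit 1; }
case "$HASHER" in
  md5sum|md5) echo "WARNING: falling back to $HASHER" >&2 ;;
esac

# Read a patch on stdin and print only its digest.
hash_patch() {
  case "$HASHER" in
    sha256sum) sha256sum | awk '{print $1}' ;;
    shasum)    shasum -a 256 | awk '{print $1}' ;;
    md5sum)    md5sum | awk '{print $1}' ;;
    md5)       md5 | awk '{print $NF}' ;;
  esac
}

printf 'example patch\n' | hash_patch
```

Since the digests are only compared with each other inside one run, it does not matter which tool is selected, as long as the same one is used for both branches.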


echo "🔍 Preparing comparison:"
echo " Source branch : $SRC_BRANCH"
echo " Release branch : $RELEASE_BRANCH"
echo " Max source scan: $SRC_SCAN_LIMIT"
echo " Max release compare: $([[ $RELEASE_LIMIT -gt 0 ]] && echo "$RELEASE_LIMIT" || echo "ALL")"
echo ""

echo "🔄 Fetching latest refs..."
git fetch --all --quiet || true

medium

Using || true after git fetch will suppress any errors. If the fetch fails (e.g., due to network issues), the script will continue with potentially stale data, leading to incorrect results. It's safer to remove || true and let the script exit if the fetch fails, ensuring it always operates on complete data.

Suggested change
- git fetch --all --quiet || true
+ git fetch --all --quiet
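
If silently continuing is undesirable but a hard failure is too strict for offline use, one option is to make the choice explicit. The sketch below wraps the fetch in a hypothetical `fetch_refs` helper that reports a failure on stderr and returns a non-zero status, leaving the decision to the caller. The helper name and the `ALLOW_STALE` variable are assumptions for illustration only.

```bash
#!/usr/bin/env bash

# Hypothetical helper: run the given fetch command and report a failure
# instead of hiding it behind "|| true".
fetch_refs() {
  if "$@"; then
    return 0
  fi
  echo "WARNING: '$*' failed; refs may be stale." >&2
  return 1
}

# In the script this would wrap the real fetch, for example:
#   fetch_refs git fetch --all --quiet || [[ "${ALLOW_STALE:-0}" == 1 ]] || exit 1
# Stand-in commands are used here so the sketch runs without a repository.
fetch_refs true && echo "fetch ok"
fetch_refs false || echo "fetch failed, caller decides what to do"
```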


echo "📥 Collecting release commits..."
RELEASE_COMMITS=$(git rev-list --no-merges "$RELEASE_BRANCH" ^"$SRC_BRANCH")
if [[ "$RELEASE_LIMIT" -gt 0 ]]; then
  RELEASE_COMMITS=$(echo "$RELEASE_COMMITS" | head -n "$RELEASE_LIMIT")
fi
RELEASE_COMMITS=$(echo "$RELEASE_COMMITS" | awk '{ lines[NR] = $0 } END { for (i = NR; i > 0; i--) print lines[i] }')
RELEASE_COMMITS_ARRAY=()
while IFS= read -r line; do
  [[ -n "$line" ]] && RELEASE_COMMITS_ARRAY+=("$line")
done <<< "$RELEASE_COMMITS"
echo " → Found ${#RELEASE_COMMITS_ARRAY[@]} release commits."

if [[ "${#RELEASE_COMMITS_ARRAY[@]}" -eq 0 ]]; then
  echo "❌ No release commits found. Exiting."
  exit 1
fi

echo "📥 Collecting source commits..."
SRC_COMMITS=$(git rev-list --no-merges --max-count="$SRC_SCAN_LIMIT" "$SRC_BRANCH")
SRC_COMMITS_ARRAY=()
while IFS= read -r line; do
  [[ -n "$line" ]] && SRC_COMMITS_ARRAY+=("$line")
done <<< "$SRC_COMMITS"
echo " → Found ${#SRC_COMMITS_ARRAY[@]} source commits to scan."
echo ""

echo "⚙️ Indexing source commit metadata..."
echo " → Processing ${#SRC_COMMITS_ARRAY[@]} commits from $SRC_BRANCH..."
SRC_COMMIT_META=()
SRC_PATCH_HASHES=()
SRC_PATCHES=()

progress=0
for commit in "${SRC_COMMITS_ARRAY[@]}"; do
  progress=$((progress + 1))
  echo -ne "\r [$progress/${#SRC_COMMITS_ARRAY[@]}] Indexing $commit"
  author=$(git log -1 --pretty=format:"%an <%ae>" "$commit" 2>/dev/null) || continue
  subject=$(git log -1 --pretty=format:"%s" "$commit" 2>/dev/null) || continue
  authordate=$(git log -1 --pretty=format:"%ai" "$commit" 2>/dev/null) || continue
  meta_key="${subject}__${author}__${authordate}"
Comment on lines +104 to +107

high

Spawning three separate git log processes for each commit is inefficient; they can be combined into a single git log call. Using __ as a delimiter is also not robust: commit subjects or author names may contain this sequence, which breaks the parsing later on. Consider a single git log call with a more robust delimiter such as the unit separator (%x1f). The release-side meta_key would need to be built in the same format so that the two keys remain comparable.

Suggested change
- author=$(git log -1 --pretty=format:"%an <%ae>" "$commit" 2>/dev/null) || continue
- subject=$(git log -1 --pretty=format:"%s" "$commit" 2>/dev/null) || continue
- authordate=$(git log -1 --pretty=format:"%ai" "$commit" 2>/dev/null) || continue
- meta_key="${subject}__${author}__${authordate}"
+ info=$(git log -1 --pretty=format:"%s%x1f%an <%ae>%x1f%ai" "$commit" 2>/dev/null) || continue
+ IFS=$'\x1f' read -r subject author authordate <<< "$info"
+ meta_key="$info" # Use the raw, delimited output as the meta key
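
To illustrate why the delimiter matters, here is a small sketch that parses one metadata record both ways. The record is built with `printf` as a stand-in for the output of the `git log` format string in the suggestion; the sample subject, author, and date are made up.

```bash
#!/usr/bin/env bash

# Stand-in for: git log -1 --pretty=format:"%s%x1f%an <%ae>%x1f%ai" "$commit"
info=$(printf '%s\037%s\037%s' \
  'fix: handle __init__ edge case' \
  'Jane Doe <jane@example.com>' \
  '2024-05-01 12:00:00 +0000')

# Split on the unit separator only; spaces and underscores inside fields survive.
IFS=$'\x1f' read -r subject author authordate <<< "$info"
echo "subject: $subject"
echo "author : $author"
echo "date   : $authordate"

# The same subject stored with the "__" delimiter is truncated when split.
old_key="${subject}__${author}__${authordate}"
echo "old-style subject: ${old_key%%__*}"
```

The last line prints only `fix: handle `, because the first `__` in the key belongs to the subject rather than to the field boundary.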

  patch=$(git show --format= --unified=3 "$commit" | normalize_patch | sed 's/^[[:space:]]*//')
  patch_hash=$(echo "$patch" | hash_patch)

  SRC_COMMIT_META+=("$meta_key")
  SRC_PATCH_HASHES+=("$patch_hash")
  SRC_PATCHES+=("$patch")
done

echo -e "\n → Completed source indexing."

TOTAL=${#RELEASE_COMMITS_ARRAY[@]}
MATCHED=0
UNMATCHED=0

for i in "${!RELEASE_COMMITS_ARRAY[@]}"; do
  rc_commit="${RELEASE_COMMITS_ARRAY[$i]}"
  rc_author=$(git log -1 --pretty=format:"%an <%ae>" "$rc_commit" 2>/dev/null) || continue
  rc_subject=$(git log -1 --pretty=format:"%s" "$rc_commit" 2>/dev/null) || continue
  rc_authordate=$(git log -1 --pretty=format:"%ai" "$rc_commit" 2>/dev/null) || continue
  meta_key="${rc_subject}__${rc_author}__${rc_authordate}"

  echo -ne "[$((i + 1))/$TOTAL] Checking ${rc_commit:0:7}... "

  rc_patch=$(git show --format= --unified=3 "$rc_commit" | normalize_patch | sed 's/^[[:space:]]*//')
  rc_patch_hash=$(echo "$rc_patch" | hash_patch)

  found_exact_index=-1
  for j in "${!SRC_PATCH_HASHES[@]}"; do
    if [[ "${SRC_PATCH_HASHES[$j]}" == "$rc_patch_hash" ]]; then
      found_exact_index=$j
      break
    fi
  done

  if [[ $found_exact_index -ne -1 ]]; then
    found_exact="${SRC_COMMITS_ARRAY[$found_exact_index]}"
    meta_info="${SRC_COMMIT_META[$found_exact_index]}"
    src_subject="${meta_info%%__*}"
    rest="${meta_info#*__}"
    src_author="${rest%%__*}"
    src_authordate="${rest##*__}"
Comment on lines +145 to +148

high

The metadata parsing logic splits on __, which is fragile and will fail if the delimiter appears in the commit subject or author name. If you adopt the suggestion of using a robust delimiter like \x1f (unit separator), update the parsing logic here to match.

Suggested change
- src_subject="${meta_info%%__*}"
- rest="${meta_info#*__}"
- src_author="${rest%%__*}"
- src_authordate="${rest##*__}"
+ IFS=$'\x1f' read -r src_subject src_author src_authordate <<< "$meta_info" # Parse metadata using the robust delimiter

    echo "✅ MATCHES ${found_exact:0:7}"
    echo " ↪ RELEASE: $rc_commit"
    echo " Author : $rc_author"
    echo " Date : $rc_authordate"
    echo " Subject: \"$rc_subject\""
    echo " ↪ SOURCE : $found_exact"
    echo " Author : $src_author"
    echo " Date : $src_authordate"
    echo " Subject: \"$src_subject\""
    echo ""
    MATCHED=$((MATCHED + 1))
    continue
  fi

  echo "❌ NO MATCH"
  UNMATCHED=$((UNMATCHED + 1))

  echo "🔍 Unmatched Commit:"
  echo " ↪ Commit : $rc_commit"
  echo " ↪ Author : $rc_author"
  echo " ↪ Subject: \"$rc_subject\""
  echo ""

  best_score=99999
  best_index=""
  fuzzy_candidates=0

  for j in "${!SRC_COMMIT_META[@]}"; do
    if [[ "${SRC_COMMIT_META[$j]}" == "$meta_key" ]]; then

medium

The fuzzy matching logic requires the author, subject, and author date to be identical to find a candidate. This may be stricter than necessary: cherry-pick and rebase normally preserve the author date, but it changes when a commit is re-created or amended with --reset-author or an explicit --date. Consider making the match less strict by not comparing the date.

medium

The script compares the entire SRC_COMMIT_META[$j] with $meta_key. This is prone to errors if there are slight variations in the author or date format. It's better to compare the individual components (subject, author, date) separately to allow for minor discrepancies.

Suggested change
- if [[ "${SRC_COMMIT_META[$j]}" == "$meta_key" ]]; then
+ rc_subject_base="${meta_key%%__*}"
+ src_subject_base="${SRC_COMMIT_META[$j]%%__*}"
+ if [[ "$rc_subject_base" == "$src_subject_base" ]]; then
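
A sketch of the looser comparison discussed in the two comments above, keeping the script's `subject__author__authordate` key layout. It strips only the trailing date field with parameter expansion, so subject and author are still compared as a whole. `strip_date` and `same_change_ignoring_date` are hypothetical names, and the sample keys are made up.

```bash
#!/usr/bin/env bash

# Remove the last "__<authordate>" field from a metadata key.
# Assumes the date itself contains no "__", which holds for git's %ai format.
strip_date() {
  printf '%s\n' "${1%__*}"
}

# Succeed when two keys share subject and author, regardless of date.
same_change_ignoring_date() {
  [[ "$(strip_date "$1")" == "$(strip_date "$2")" ]]
}

rc_key='fix: retry on timeout__Jane Doe <jane@example.com>__2024-05-01 12:00:00 +0000'
src_key='fix: retry on timeout__Jane Doe <jane@example.com>__2024-04-28 09:30:00 +0000'

if same_change_ignoring_date "$rc_key" "$src_key"; then
  echo "candidate"
else
  echo "not a candidate"
fi
```

Dropping the date widens the candidate set, so the patch-distance score would carry more of the weight in picking the closest match.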

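      # Plain assignment: ((fuzzy_candidates++)) returns status 1 when the old
      # value is 0, which would abort the script under set -e.
      fuzzy_candidates=$((fuzzy_candidates + 1))
      diff=$(diff -u <(echo "$rc_patch") <(echo "${SRC_PATCHES[$j]}") || true)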

medium

The diff command is run without the -a option. If either patch contains bytes that diff treats as binary (for example, because the source and release branches use different encodings), diff reports only that the inputs differ, and the line-based score becomes meaningless. Note that -a does not select an encoding; it forces diff to treat all input as text and compare it line by line.

Suggested change
- diff=$(diff -u <(echo "$rc_patch") <(echo "${SRC_PATCHES[$j]}") || true)
+ diff=$(diff -au <(echo "$rc_patch") <(echo "${SRC_PATCHES[$j]}") || true)

Collaborator Author

Encodings are expected to be the same.
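
For reference, the scoring step this thread is about can be isolated into a small function. This is a sketch with a hypothetical name, `patch_distance`; it uses `diff -au` as suggested and counts added or removed lines while skipping the two file-header lines.

```bash
#!/usr/bin/env bash

# Hypothetical helper: number of added/removed lines between two patch texts.
patch_distance() {
  local d
  # diff exits with 1 when the inputs differ, so its status is ignored.
  d=$(diff -au <(printf '%s\n' "$1") <(printf '%s\n' "$2") || true)
  # Drop the "---"/"+++" header lines, then count lines starting with + or -.
  # grep -c exits with 1 on a zero count, hence the trailing "|| true".
  printf '%s\n' "$d" | grep -vE '^(--- |\+\+\+ )' | grep -c '^[-+]' || true
}

patch_distance $'a\nb\nc' $'a\nx\nc'   # one line replaced: prints 2
patch_distance $'a\nb\nc' $'a\nb\nc'   # identical input: prints 0
```

One limitation carried over from the original expression: a changed line whose content itself starts with `-- ` or `++ ` looks like a header after diff adds its prefix, so it is not counted.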

      score=$(echo "$diff" | grep -vE '^(--- |\+\+\+ )' | grep -c '^[-+]')
      if [[ "$score" -lt "$best_score" ]]; then
        best_score=$score
        best_index=$j
      fi
Comment on lines +177 to +184

medium

The fuzzy matching logic compares commit metadata directly. This can lead to false positives if commit messages are duplicated across different commits. Consider incorporating the commit hash itself into the comparison to reduce the likelihood of false positives. This would involve adding the commit hash to the meta_key and using it in the comparison.

Suggested change
- if [[ "${SRC_COMMIT_META[$j]}" == "$meta_key" ]]; then
-   ((fuzzy_candidates++))
-   diff=$(diff -u <(echo "$rc_patch") <(echo "${SRC_PATCHES[$j]}") || true)
-   score=$(echo "$diff" | grep -vE '^(--- |\+\+\+ )' | grep -c '^[-+]')
-   if [[ "$score" -lt "$best_score" ]]; then
-     best_score=$score
-     best_index=$j
-   fi
+ meta_key="${subject}__${author}__${authordate}__${commit}"

Collaborator Author

Commit hash is not expected to be the same.
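
The reply above reflects how cherry-picks work: the picked commit gets a new hash, so the script compares patch content instead. A minimal sketch of that idea follows, using the `normalize_patch` filter as submitted in this PR on two made-up patch texts that differ only in their `index` line. `make_patch` is a hypothetical helper used to generate the samples.

```bash
#!/usr/bin/env bash

# Filter taken from the PR as submitted: drop "index <old>..<new> <mode>" lines.
normalize_patch() {
  sed '/^index [0-9a-f]\{7,\}\.\.[0-9a-f]\{7,\} [0-9]\{6\}$/d'
}

# Hypothetical helper: emit a made-up patch with the given blob id pair.
make_patch() {
  printf '%s\n' \
    'diff --git a/f.txt b/f.txt' \
    "index $1 100644" \
    '--- a/f.txt' \
    '+++ b/f.txt' \
    '@@ -1 +1 @@' \
    '-old' \
    '+new'
}

src_patch=$(make_patch '1111111..2222222' | normalize_patch)
rc_patch=$(make_patch '3333333..4444444' | normalize_patch)

if [[ "$src_patch" == "$rc_patch" ]]; then
  echo "patches match after normalization"
else
  echo "patches differ"
fi
```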

    fi
  done

  if [[ "$fuzzy_candidates" -eq 0 ]]; then
    echo "⚠️ No commits with matching author + subject + date in source branch."
  else
    match_commit="${SRC_COMMITS_ARRAY[$best_index]}"
    match_author=$(git log -1 --pretty=format:"%an <%ae>" "$match_commit")
    match_subject=$(git log -1 --pretty=format:"%s" "$match_commit")
Comment on lines +192 to +193

medium

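These git log calls are redundant. The author and subject for match_commit were already fetched during indexing and are stored in ${SRC_COMMIT_META[$best_index]}. Parse them from the metadata array instead. The suggestion below assumes the \x1f delimiter proposed earlier has been adopted; with the current __ delimiter, the parameter-expansion parsing used for exact matches would apply instead.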

Suggested change
- match_author=$(git log -1 --pretty=format:"%an <%ae>" "$match_commit")
- match_subject=$(git log -1 --pretty=format:"%s" "$match_commit")
+ meta_info="${SRC_COMMIT_META[$best_index]}"
+ IFS=$'\x1f' read -r match_subject match_author _ <<< "$meta_info" # Parse metadata; discard date


    changed_files=$(git show --pretty="" --name-only "$rc_commit")

    echo "🤔 Closest fuzzy match: $match_commit ($best_score changed lines from $fuzzy_candidates candidates)"
    echo " ↪ Author : $match_author"
    echo " ↪ Subject: \"$match_subject\""
    echo " ↪ Files Changed:"
    echo "$changed_files" | sed 's/^/ - /'
    echo ""

    echo "🔧 Check it manually (patch diff):"
    echo " git diff $match_commit $rc_commit -- \$(git show --pretty=\"\" --name-only $rc_commit)"

medium

The git diff command that prints the diff between the release commit and its closest match ends with || true. This keeps the script running if git diff fails, but it also masks errors that could indicate a problem with the comparison. Consider removing || true and handling the error explicitly, or logging the error message for debugging purposes.

Suggested change
- echo " git diff $match_commit $rc_commit -- \$(git show --pretty=\"\" --name-only $rc_commit)"
+ git diff "$match_commit" "$rc_commit" -- $changed_files | sed 's/^/ /'
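
One further detail in the `git diff` call: `$changed_files` is expanded unquoted, so a path containing spaces would be split into several arguments. A sketch of a safer alternative follows, reusing the `while read` pattern the script already uses to build its commit arrays. The `files` array and the sample paths are hypothetical.

```bash
#!/usr/bin/env bash

# Stand-in for: changed_files=$(git show --pretty="" --name-only "$rc_commit")
changed_files=$'docs/release notes.md\nscripts/fuzzy-match-release-branch.sh'

# Collect one array element per line so paths with spaces stay intact.
files=()
while IFS= read -r f; do
  if [[ -n "$f" ]]; then
    files+=("$f")
  fi
done <<< "$changed_files"

# The diff would then be invoked as:
#   git diff "$match_commit" "$rc_commit" -- "${files[@]}"
printf 'path: %s\n' "${files[@]}"
```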

    echo ""

    echo "🔍 Diff between release and closest match:"
    echo "---------------------------------------------"
    git diff "$match_commit" "$rc_commit" -- $changed_files | sed 's/^/ /' || true
    echo "---------------------------------------------"
    echo ""
  fi

done

# Summary
echo ""
echo "🔎 Summary:"
echo " ✅ Matched : $MATCHED"
echo " ❌ Unmatched : $UNMATCHED"
echo " 📦 Total : $TOTAL"
