Claudiu edited this page Jul 25, 2016 · 7 revisions

SIFT

SIFT, or Scale-Invariant Feature Transform, is a way to detect and describe local image features that is invariant with respect to:

  • Scale
  • Orientation
  • Illumination
  • Viewpoint (partially, up to small affine changes)

The Problem

The problem it's solving is: given that we found interest points, how do we match them across images? We need a descriptor that is invariant, so the same keypoint produces the same descriptor in different images, and yet distinctive, so each keypoint's descriptor is different from the next.

This comes from having just used the Harris operator to find corners.

Harris no good

Harris corners alone are no good for matching: the raw window around a corner is neither rotation invariant nor illumination invariant. We could correlate the window around every corner with every other window, but that would be slow (O(n^2)) and still not invariant, as above. Normalized correlation is only invariant to linear illumination changes, and it's very sensitive to geometric changes, so it's no good either.
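A minimal NumPy sketch of the normalized-correlation point above (the function name is mine): subtracting the mean and dividing by the norm makes the score invariant to linear illumination changes, but nothing protects it from geometric changes.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equal-size patches.

    Subtracting the mean and dividing by the norm makes the score
    invariant to linear illumination changes (I -> a*I + b, a > 0).
    """
    a = a.astype(float).ravel()
    b = b.astype(float).ravel()
    a = a - a.mean()
    b = b - b.mean()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A brightened, contrast-stretched copy of a patch still correlates perfectly;
# a rotated or shifted copy would not.
rng = np.random.default_rng(0)
patch = rng.random((16, 16))
relit = 2.0 * patch + 10.0          # linear illumination change
print(round(ncc(patch, relit), 6))  # -> 1.0
```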

The Method

  • Use some method - be it original SIFT's detector, or Harris-Laplace, or something else - to find keypoints. This gives you the scale and localizes the keypoint precisely. This is where the Difference of Gaussians (DoG) approach comes in.

  • Assign orientations to the keypoints

    • Do this by finding the gradient and computing the dominant direction of the gradient

    • Build a histogram of gradient orientations with 36 bins (10° each)

    • Take the orientation at the peak of the histogram - the dominant gradient direction:

      (image: orientation histogram with its peak marked)

    • Now rotate so that the peak orientation is the new "north". Any descriptor we build is then invariant to x/y translation, to scale (because of the previous steps), and now to rotation.

    • NOTE: The histogram may have near-ties; in practice you smooth the histogram and pick the peak (Lowe's paper also creates an extra keypoint for any other peak above 80% of the maximum):

      (image: smoothed orientation histogram)
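The orientation-assignment step above can be sketched in NumPy (function name and smoothing kernel are my choices, not from the original notes): build a 36-bin magnitude-weighted histogram of gradient angles, smooth it circularly, and take the peak.

```python
import numpy as np

def dominant_orientation(patch, n_bins=36):
    """Dominant gradient orientation of a grayscale patch, in degrees.

    Sketch of SIFT's orientation assignment: a 36-bin (10 degrees each)
    histogram of gradient angles weighted by gradient magnitude,
    smoothed circularly, with the peak bin's center returned.
    """
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0

    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 360.0), weights=mag)

    # Circular [0.25, 0.5, 0.25] smoothing so near-ties resolve to one peak.
    padded = np.r_[hist[-1], hist, hist[0]]
    smooth = np.convolve(padded, [0.25, 0.5, 0.25], mode="valid")

    peak = int(np.argmax(smooth))
    return (peak + 0.5) * (360.0 / n_bins)  # center of the peak bin

# An intensity ramp increasing along +x has its gradient pointing at 0 degrees.
ramp = np.tile(np.arange(32, dtype=float), (32, 1))
print(dominant_orientation(ramp))  # -> 5.0 (center of the 0-10 degree bin)
```

A real implementation also interpolates the peak position between bins for sub-bin accuracy; that detail is omitted here.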

  • Create a descriptor from the oriented keypoints

    • Normalize: take the window of whatever scale, rotate it to the common orientation, scale it to a uniform size (small scale --> blow it up, large scale --> shrink it down).
    • Take the 16x16 surrounding pixels
    • Compute gradient orientations at each pixel
    • Smooth them by a 2-d Gaussian (so the center orientations matter more)
      • After smoothing, clip the magnitude to some maximum, so any one vote doesn't count too much.
    • (Detail: Trilinear interpolation, so each pixel is actually counted in multiple bins a little. This gives you a very slow change if you move the descriptor just a little, which is what you want.)
    • Group into 4x4 boxes (16 boxes total)
    • In each box, build an 8-bin histogram of the gradient orientations, weighted by magnitude
    • So you have 128 numbers (16 boxes x 8 bins = 128) for each descriptor
    • Stack these together
    • Normalize the vector to magnitude 1.0, which makes it more illumination invariant. (Lowe's paper also clamps each entry to 0.2 and renormalizes, so a few large gradients don't dominate.)
    • Because the histograms are stacked in a fixed order, the descriptor is somewhat spatially sensitive - the top-left box of one descriptor is compared against the top-left box of another.
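The descriptor steps above can be sketched end to end (assuming the 16x16 patch has already been rotated to its dominant orientation; trilinear interpolation is omitted, and the Gaussian sigma is my choice):

```python
import numpy as np

def sift_descriptor(patch):
    """Sketch of the SIFT descriptor step, using the parameters above.

    Takes a 16x16 grayscale patch and returns 128 numbers: a 4x4 grid
    of boxes, each an 8-bin orientation histogram weighted by
    Gaussian-smoothed gradient magnitude.
    """
    assert patch.shape == (16, 16)
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 360.0

    # Gaussian weighting centered on the patch: central orientations matter more.
    ys, xs = np.mgrid[0:16, 0:16] - 7.5
    mag = mag * np.exp(-(xs**2 + ys**2) / (2 * 8.0**2))

    desc = []
    for by in range(4):                 # 4x4 grid of 4x4-pixel boxes
        for bx in range(4):
            m = mag[4*by:4*by+4, 4*bx:4*bx+4].ravel()
            a = ang[4*by:4*by+4, 4*bx:4*bx+4].ravel()
            hist, _ = np.histogram(a, bins=8, range=(0.0, 360.0), weights=m)
            desc.append(hist)
    desc = np.concatenate(desc)         # 16 boxes x 8 bins = 128 numbers

    # Normalize, clamp at 0.2 (as in Lowe's paper), renormalize.
    desc /= max(np.linalg.norm(desc), 1e-12)
    desc = np.minimum(desc, 0.2)
    desc /= max(np.linalg.norm(desc), 1e-12)
    return desc

rng = np.random.default_rng(1)
d = sift_descriptor(rng.random((16, 16)))
print(d.shape)  # -> (128,)
```

Two such descriptors are then compared by Euclidean distance, which works because both are unit length and stacked in the same spatial order.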

Empirically Motivated Choices

The above numbers - 8 histogram bins, 4x4 descriptor size - were empirically motivated by running on a test set of images which they subjected to scaling, illumination, noise, rotation, affine transforms, etc. They picked those parameters that did best.
