Detecting duplicate images - Part 2: The naive way

Let’s start by defining some data types we can use for output. Go has some great standard library support for JSON encoding which makes this pretty painless, and if we optimize for this encoding, most other encoding packages will be easy to implement.

In order to achieve our goal, the minimum we need to know is:

What the two images we’re comparing are, and
How different they are.

Let’s define a Subject type and a Comparison type.

type Subject struct {
	Filename string
}

type Comparison struct {
	Distance uint
	SubjectA *Subject
	SubjectB *Subject
}

To start out, we’ll assume all Subjects are local files. It’s useful to abstract this away now at this step, though, since we can keep the notion of a Subject later as we add other sources or interfaces.

What does the Distance member of Comparison mean? How different is a value of 1 versus a value of 65536? For now, we won’t make any concrete assertions, except to say that a larger value indicates a larger difference, and relatively small values indicates “similar”. The only value we can define concretely is to say that 0 means “identical”.

Of course we will need to track all of our Subjects, so let’s make a type for that.

type SubjectList struct {
	list []*Subject
}

func NewSubjectList() *SubjectList {
	// Assume for now we'll have the same number of subjects as arguments
	return &SubjectList{list: make([]Subject, 0, flag.NArgs())}
}

func (s *SubjectList) AddFile(path string) error {
	s.list = append(s.list, &Subject{Filename: path})
	return nil
}

How do we specify Subjects? Let’s start by passing filenames to an invocation of our program. Once again the standard library makes our lives easier by providing the flag package. Using flag now lets us add command-line options later. Also, for the sake of convenience, let’s allow the user to pass in a list of files, or directories, or a combination of both. I like to handle this with the path/filepath package.

func main() {
	flag.Parse()
	sl := NewSubjectList()
	for _, fn := range flag.Args() {
		filepath.Walk(fn, sl.filewalkfunc())
	}
}

func (s *SubjectList) filewalkfunc() func(p string, info os.FileInfo, err error) error {
	return func(p string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.IsDir() {
			return nil
		}
		return s.AddFile(p)
	}
}

If filewalkfunc looks weird to you, don’t worry, it is a little weird. Because filepath.Walk expects a function, we need some way to pass our list object along with that function. We could create a global variable and reference it from our function, but some smart Go folks believe this is harmful. Besides, writing tests with globals (Go calls them Package Scoped Variables, because that’s precisely what they are) is pretty painful. Anyway, filewalkfunc returns a function that closes around a pointer to our SubjectList. That’s the call to s.AddFile. The outer function definition does look pretty weird, but it makes sense when you remember Go funcs are a first-class type.

Now that we have gathered up all of our Subjects, it’s time to actually do something with them. Our strategy for this first (naive!) version will be to iterate through our list and compare each image with every other image. We’ll first define the nice abstract way we want this to happen:

func (s *SubjectList) CompareAll() []Comparison {
	// we know that a list of size N will have ((N-1)*N)/2 comparisons. this
	// relationship is the "Triangular numbers"! https://oeis.org/A000217
	r := make([]Comparison, 0, ((len(s.list)-1)*len(s.list))/2)
	for len(s.list) > 0 {
		// take the last subject
		subjecta := s.list[len(s.list)-1]
		// modify the slice to exclude it
		s.list = s.list[:len(s.list)-1]
		// now, compare all the remaining subjects
		for _, subjectb := range s.list {
			r = append(r, subjecta.CompareTo(subjectb))
		}
	}
	return r
}

In the next article, we will implement Subject.CompareTo(). We will need to address two core problems: how we compare two pictures that might not be the same size, and how we quantify the difference between pixels in two images.

See the complete code for this article on GitHub.