flak rss random

removing array duplicates

I had an array with some duplicates. I wanted to remove them. I know how to do this, but I searched for solutions anyway to make sure I wasn’t missing some trick. The results were disappointing, very language specific, and rarely discussed run time. And if we’re working with an unsorted array, the provided answers are even worse. Just sort the array first. Well, duh; any problem with unsorted data can be transformed into a problem with sorted data by sorting first. That’s not very interesting, though, and maybe there’s a reason the data is unsorted. Here’s a few solutions I worked through, but no stunning algorithmic breakthroughs.

Let’s start with problem description and clarification. We have a list of restaurant patrons, in the order they’ve visited. Some patrons have visited more than once, however, and we’d like to dedupe the data. However, we can’t simply sort the list by name, because we don’t want to forget the fact that Mr. Zamboni was our very first customer. To clarify the problem description, the array is ordered, but not according to the same criteria we will use for duplicate elimination. Ordered, but not sorted? Note that it is a list of patrons, not of visits, so we can’t necessarily sort by name, dedupe, then resort by time, because there is no time field. But we’ll see.

setup

We’d like solutions that conserve time and space. All solutions here work in place, using the tombstone technique to create a sparse array, then compacting it later.

Our input data. I’m using capitalization to help demonstrate that we are correctly selecting the first entry. Comparisons will be done using case folding.

    array := []string {
        "apple", "cAt", "cat", "Dog", "apple", "dog", "Cat",
        "Apple", "dOg", "banana", "cat", "dog", "apple",
    }

The expected output:

apple
cAt
Dog
banana

These examples are written in neutral pseudocode that should be adaptable to most any imperative language. Any resemblance to a particular existent language is purely coincidental.

one

This is the simple approach. Walk the array, then scan the remainder for duplicates.

func dedupe1(array []string) {
    for n, s := range array {
        for i := n + 1; i < len(array); i++ {
            if strings.EqualFold(s, array[i]) {
                array[i] = ""
            }
        }
    }
}

No space overhead. n^2 time. It’s simple, and probably works well enough for some cases.

two

There’s one practical optimization we can make. Don’t scan forward if we’ve already tombstoned this entry. This will be a fair bit faster if there are lots of duplicates.

func dedupe2(array []string) {
    for n, s := range array {
        if s != "" {
            for i := n + 1; i < len(array); i++ {
                if strings.EqualFold(s, array[i]) {
                    array[i] = ""
                }
            }
        }
    }
}

Still n^2 though. We can further refine this by counting known tombstones, and stopping the scan when we know there’s no more interesting data to be found, but diminishing returns for increasing effort.

three

Find a way to sort the data, but without actually sorting it and losing the initial order. Create a new array of decorated elements, sort that, and then use the sorted indices to walk the original array.

type IndexedString struct {
    i int
    s string
}
type IndexedArray []IndexedString
func (aa IndexedArray) Len() int { return len(aa) }
func (aa IndexedArray) Less(i, j int) bool {
    if strings.EqualFold(aa[i].s, aa[j].s) {
        return aa[i].i < aa[j].i
    }
    return strings.ToLower(aa[i].s) < strings.ToLower(aa[j].s)
}
func (aa IndexedArray) Swap(i, j int) { aa[i], aa[j] = aa[j], aa[i] }
func dedupe3(array []string) {
    decorated := make(IndexedArray, len(array))
    for i, s := range array {
        decorated[i] = IndexedString { i, s }
    }
    sort.Sort(decorated)
    var x string
    for _, s := range decorated {
        if strings.EqualFold(s.s, x) {
            array[s.i] = ""
        } else {
            x = s.s
        }
    }
}

Run time is n lg n, but we require additional space proportional to the input. This technique will work in just about any language.

four

If we are using a language which allows sorting with closures, we can simplify things by only using an additional array of indices.

func dedupe4(array []string) {
    indices := make([]int, len(array))
    for i := range array {
        indices[i] = i
    }
    sort.Slice(indices, func(i, j int) bool {
        x, y := indices[i], indices[j]
        if strings.EqualFold(array[x], array[y]) {
            return x < y
        }
        return strings.ToLower(array[x]) < strings.ToLower(array[y])
    })
    var x string
    for _, i := range indices {
        if strings.EqualFold(array[i], x) {
            array[i] = ""
        } else {
            x = array[i]
        }
    }
}

Same performance as before, but I find this version a little easier to work with.

alt

The alternative which shall not be named is using a searchable data structure to record seen entries. For trivial examples this works well, but for more complicated types with complex keys, it may require creating new types (see version three). The memory bound is harder to ascertain, but it’s hard to beat one int per element (see version four).

Posted 11 Apr 2019 19:30 by tedu Updated: 11 Apr 2019 19:53
Tagged: programming