The original implementation wrote to different locations in a shared slice.
While this is theoretically okay, we end up thrashing the cpu cache since
multiple slice members may be on the same cache line. So, even though each
thread has its own memory location, there may be contention over the cache
line. This changes the code to aggregate to a slice in a single goroutine.
In reality, this change likely won't have any performance impact. The theory
proposed above hasn't really even been tested. Either way, we can consider it
and possibly go forward.
Signed-off-by: Stephen J Day <stephen.day@docker.com>