If you can do a good job predicting monkey visual cortex responses with a linear combination of units from a convolutional neural network, that implies that the network and the monkey compute similar nonlinear functions of visual inputs. If that's not evidence for similar representations, I'm curious what you think would be.
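To make the linear-readout test concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption for illustration: the "network" is a random projection followed by a ReLU, the "neurons" are simulated as a noisy linear readout of those same units, and `fit_ridge` and `heldout_correlation` are hypothetical helper names, not anything from the paper. It only shows the shape of the analysis: fit a regularized linear map on some images, then score predictions on held-out images.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha*I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def heldout_correlation(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train images, return Pearson r per neuron on the rest."""
    X_tr, X_te = features[:n_train], features[n_train:]
    Y_tr, Y_te = responses[:n_train], responses[n_train:]
    # Center with training statistics only, so the test set stays untouched.
    x_mean, y_mean = X_tr.mean(axis=0), Y_tr.mean(axis=0)
    W = fit_ridge(X_tr - x_mean, Y_tr - y_mean, alpha)
    pred = (X_te - x_mean) @ W + y_mean
    pc = pred - pred.mean(axis=0)
    yc = Y_te - Y_te.mean(axis=0)
    return (pc * yc).sum(axis=0) / (
        np.linalg.norm(pc, axis=0) * np.linalg.norm(yc, axis=0)
    )

rng = np.random.default_rng(0)
n_images, n_pixels, n_units, n_neurons = 400, 20, 50, 5

images = rng.normal(size=(n_images, n_pixels))
# Toy stand-in for a network layer (assumption): random projection + ReLU.
proj = rng.normal(size=(n_pixels, n_units))
units = np.maximum(images @ proj, 0.0)
# Toy stand-in for recorded neurons (assumption): noisy linear readout of the units.
true_w = rng.normal(size=(n_units, n_neurons))
neurons = units @ true_w + 0.5 * rng.normal(size=(n_images, n_neurons))

r_units = heldout_correlation(units, neurons, n_train=300)
r_pixels = heldout_correlation(images, neurons, n_train=300)
print("mean held-out r from network units:", r_units.mean())
print("mean held-out r from raw pixels:   ", r_pixels.mean())
```

The point of the comparison is that a linear map can only succeed when the nonlinear part of the computation is already present in the features, which is why the raw-pixel baseline does worse here.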
http://www.jneurosci.org/content/35/35/12127/F3.expansion.ht... also shows that confusion patterns are similar across convolutional neural networks, humans, and monkeys, but that low-level visual models do not share them.
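One common way to compare confusion patterns, sketched below, is to correlate the off-diagonal entries of two confusion matrices. This is an assumption about how such a comparison could be done, not necessarily the method behind the linked figure. The counts are made-up toy numbers and `confusion_similarity` is a hypothetical helper.

```python
import numpy as np

def confusion_similarity(conf_a, conf_b):
    """Pearson correlation between the off-diagonal (error) entries of two
    row-normalized confusion matrices."""
    a = np.asarray(conf_a, dtype=float)
    b = np.asarray(conf_b, dtype=float)
    # Convert counts to per-class rates so overall accuracy differences matter less.
    a = a / a.sum(axis=1, keepdims=True)
    b = b / b.sum(axis=1, keepdims=True)
    # Keep only the errors: which wrong class gets chosen, and how often.
    off_diag = ~np.eye(a.shape[0], dtype=bool)
    return np.corrcoef(a[off_diag], b[off_diag])[0, 1]

# Toy counts (invented for illustration): rows are true classes, columns are responses.
system_a = [[80, 15, 5], [12, 82, 6], [4, 8, 88]]
system_b = [[75, 20, 5], [15, 78, 7], [5, 10, 85]]   # errors resemble system_a
system_c = [[80, 5, 15], [6, 82, 12], [9, 3, 88]]    # similar accuracy, different errors

print("a vs b:", confusion_similarity(system_a, system_b))
print("a vs c:", confusion_similarity(system_a, system_c))
```

Two systems can have nearly the same accuracy and still make different mistakes, so matching error patterns is a stronger claim than matching performance.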