Both sides have merit. The trick is to find a point in between that works for you. After having had to optimize after the fact on numerous projects, what I tend to do amounts to:
- write for clarity with an architecture that doesn't greatly impede performance
- have good habits: always use data structures that work well at both small and large scales whenever they're readily available (e.g. hash tables, preallocating when the size is known)
- think longer about larger decisions (e.g. choice of datastore and schema, communication between major parts)
- have some plans in mind if performance becomes an issue (e.g. upgrading instance sizes or the number of instances), and be aware of whether you're currently at a limit where there isn't a quick throw-money-at-the-problem next level
- measure and rewrite code only as necessary, taking every opportunity to share both the why and the how with as many team members as feasible
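The "good habits" bullet can be made concrete with a minimal sketch; the function names here are illustrative, not from the comment itself:

```python
def index_by_id(records):
    """Build a hash table once so later lookups are O(1),
    instead of scanning a list (O(n)) on every query."""
    return {r["id"]: r for r in records}

def squares(n):
    """Preallocate when the final size is known, rather than
    growing a list with repeated append() calls."""
    out = [0] * n
    for i in range(n):
        out[i] = i * i
    return out

records = [{"id": i, "value": i * 10} for i in range(1000)]
by_id = index_by_id(records)
print(by_id[42]["value"])  # constant-time lookup, prints 420
print(squares(5))          # prints [0, 1, 4, 9, 16]
```

Neither habit costs anything at small scale, which is exactly why they're worth applying by default.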
"write for clarity with an architecture that doesn't greatly impede performance"
I came here basically to say something similar to this. The most important thing is to have a design at the beginning that attempts to identify where the problems (critical paths, at a high level) are going to be, and avoids them. That doesn't necessarily mean the initial versions are written to the final architecture, but that there is a plan for mutating the design along the way to minimize the overhead for the critical portions.
Nearly every application I've ever worked on has had some portion that we knew upfront was going to be the performance bottleneck. So we wrote those pieces in low-level C/C++, generally really near (or in) the OS/kernel, and then all the non-performance-critical portions in a higher-level language/toolkit. This avoided many of the pitfalls I see in other projects that wrote everything in Java (or whatever), where the overhead of all the non-critical portions was interfering with the critical portions. In networking/storage it's the split between the "data path" and the "control path"; some other products I worked on had a "transactional path" and a control/reporting path.
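The shape of that data-path/control-path split can be sketched roughly as below. Both paths are Python here purely for illustration (in the projects described, the data path would be C/C++ near the kernel); the names are hypothetical. The point is that the hot path is a small, dependency-free function over flat buffers, so it can be measured, and later rewritten, in isolation:

```python
import array

def data_path_checksum(buf):
    """Data path: a tight, allocation-free loop over a flat byte
    buffer. Sticks to primitive types so a later C rewrite of just
    this function would be mechanical."""
    s = 0
    for b in buf:
        s = (s + b) & 0xFFFF
    return s

class ControlPath:
    """Control path: configuration, counters, reporting. Written for
    convenience, not speed; it only hands flat buffers down to the
    data path and never lets its own state leak into the hot loop."""
    def __init__(self):
        self.packets = 0

    def process(self, payload):
        buf = array.array("B", payload)
        self.packets += 1
        return data_path_checksum(buf)

ctl = ControlPath()
print(ctl.process(b"hello"))  # prints 532 (sum of the payload bytes)
```

Keeping the boundary this narrow is what lets the non-critical code stay in whatever language is most productive without it interfering with the critical portion.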
Combined with unit tests that validate the algorithmic sections, the behavior of the critical path could frequently be predicted by a couple of the unit tests, or by other simple metrics (inter-cluster message latency, etc.).
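A hedged sketch of that idea: a unit test that both validates an algorithmic section and checks a cheap metric (here, a comparison count) that predicts how the critical path scales. `merge_sorted` and its test are hypothetical examples, not from the thread:

```python
def merge_sorted(a, b):
    """Algorithmic section: merge two sorted lists. Also returns a
    comparison count, a simple metric that predicts critical-path
    cost without a full benchmark."""
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out, comparisons

def test_merge_sorted():
    merged, comps = merge_sorted([1, 3, 5], [2, 4, 6])
    assert merged == [1, 2, 3, 4, 5, 6]   # correctness
    assert comps <= len(merged) - 1       # linear-work bound holds

test_merge_sorted()
```

If the bound in the test ever regresses, you find out from the unit suite rather than from production latency.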