It certainly wasn't the fastest general-purpose way to copy for the decades of CPUs between the 8086 (late 1970s) and the introduction of enhanced rep movsb (ERMSB) in Ivy Bridge (2012). For post-Ivy Bridge CPUs, I believe it was non-optimal for some usage patterns for a while, too. It may be optimal in very recent Intel CPUs, but that's a far cry from saying it's been the fastest since its introduction in the 8086.
Right. The follow-on to ERMSB, FSRM ("fast short rep movs"), which first appeared in Ice Lake, finally makes it consistently competitive with SW sequences¹.
¹ but you still want to branch around it when the length is zero ("when the length is zero?!" I hear you cry; it turns out that if you instrument actual systems, you will find that this happens shockingly often).
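To make the footnote concrete, here is a minimal sketch of branching around the instruction, assuming GCC or Clang inline asm on x86 (it falls back to memcpy elsewhere). The name copy_rep_movsb is made up for illustration; this is not how any particular libc does it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy n bytes with `rep movsb`, but skip the
 * instruction entirely when n == 0, as the footnote suggests. */
static inline void copy_rep_movsb(void *dst, const void *src, size_t n)
{
    if (n == 0)
        return; /* the zero-length branch around the string instruction */

    assert(dst != NULL && src != NULL);

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    /* rep movsb copies (R|E)CX bytes from [(R|E)SI] to [(R|E)DI].
     * The ABI guarantees the direction flag is clear on function entry,
     * so the copy runs forward. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Assumption: on non-x86 targets just defer to the library. */
    memcpy(dst, src, n);
#endif
}
```

Whether the early return pays for itself depends on how often zero lengths actually show up in your workload, so measure before copying this pattern.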
It depends on the exact distribution of sizes, and especially on whether the size[s] are statically knowable: if you are copying exactly 31 bytes, or something like an unknown size between 48 and 62 bytes, a SW sequence will still win. But it is now _competitive_, if not actually as fast; previously it was often 2-3x slower in that range, even when the length was not fixed.
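For anyone wondering what a "SW sequence" looks like for those two cases, a rough sketch in portable C follows. The helper names copy31 and copy48_to_64 are hypothetical, and I'm assuming the compiler lowers each fixed 16-byte memcpy to a single vector load or store, which mainstream compilers typically do at -O2 but which is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Exactly 31 bytes: two overlapping 16-byte moves.
 * Byte 15 is written twice, which is harmless. */
static inline void copy31(void *dst, const void *src)
{
    unsigned char head[16], tail[16];
    memcpy(head, src, 16);
    memcpy(tail, (const unsigned char *)src + 15, 16);
    memcpy(dst, head, 16);
    memcpy((unsigned char *)dst + 15, tail, 16);
}

/* Unknown size in [48, 64], which covers the 48..62 case: three fixed
 * 16-byte blocks from the front plus one block that ends exactly at n.
 * There is no loop and no branch on n. */
static inline void copy48_to_64(void *dst, const void *src, size_t n)
{
    assert(n >= 48 && n <= 64);

    const unsigned char *s = (const unsigned char *)src;
    unsigned char *d = (unsigned char *)dst;
    unsigned char b0[16], b1[16], b2[16], last[16];

    /* Do all loads first, then all stores. */
    memcpy(b0, s, 16);
    memcpy(b1, s + 16, 16);
    memcpy(b2, s + 32, 16);
    memcpy(last, s + n - 16, 16); /* overlaps b2 when n < 64 */

    memcpy(d, b0, 16);
    memcpy(d + 16, b1, 16);
    memcpy(d + 32, b2, 16);
    memcpy(d + n - 16, last, 16);
}
```

The point is that once the size (or a tight range for it) is known, the whole copy is a handful of straight-line moves, which is the kind of sequence rep movsb has to compete with.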
E.g., Agner: https://www.agner.org/optimize/blog/read.php?i=372
E.g., Stackoverflow: https://stackoverflow.com/questions/43343231/enhanced-rep-mo...