This misses the point of my comment. When you put faith in malloc, you're putting hope in a lot of heuristics that may or may not degenerate for your particular workload. Windows is an outlier with how bad it is, but that should largely be irrelevant because the code should have already been insulated from the system allocator anyway.
An over-dependence on malloc is one of the first places I look when optimizing old C++ codebases, even on Linux and Darwin. The degradation is still there on Linux + macOS, but it's more insidious because the defaults are so good that simple apps never notice it.
Except that I'd guess there is no "good" case for MSVCRT's malloc. You shouldn't assume malloc is free, but you should be able to assume it won't be horrifyingly slow. Just as much as you should be able to rely on "x*y" not compiling to an addition loop over 0..y (which might indeed be very fast when y is 0).
Yes, this unfortunately isn't the reality MSVCRT is in, but it is quite a reasonable expectation.
It's unreasonable to assume that a stdlib must be designed around performance to any capacity. For most software, the priorities for the stdlib are 1) existing, 2) being bug/vulnerability free, and likely, in the Windows case given Microsoft's tradition, 3) being functionally identical to the version they shipped originally. Linux and macOS have much more flexibility to choose a different set of priorities (the former through ecosystem competition, the latter through a willingness to break applications and a dependence on malloc for objc), so it's not at all a fair comparison. The fact that malloc doesn't return null all the time is miracle enough on many embedded platforms, for example, so this isn't exclusively a Windows concern. Environments emphasizing security in particular might be even slower.
Multiplication is not a great argument... There's a long history of hardware that doesn't have multipliers. Would I complain about that hardware being bad? No, because I'd take a step back and ask what their priorities were and accept that different hardware has different priorities so I should be prepared to not depend on them. Same thing with standard libraries. You can't always assume the default allocator smiles kindly on your application.
I don't see a reason for the stdlib to be considered in a different way from the base language is all I'm saying. For most C programmers, the distinction between the stdlib and the base language isn't even a consideration. Thinking most software doesn't heavily rely on malloc (and the rest of the stdlib) being fast is stupid.
Even on hardware without a multiplier you'd do a shift-based version, with log_2(max_value) iterations. What's unreasonable is "for (int i = 0; i < y; i++) res += x;". If there truly were no way to do a shift, then, sure, I'd accept the loop; but I definitely would be pretty mad at a compiler if it generated a loop for multiplication on x86_64. And I think it's reasonable to be mad at a stdlib being purposefully outdated too (even if there is a (bad) reason for it).
C and C++ are some of the few languages where the spec goes out of its way to not depend on an allocator, for good reason, and this is well after you've accounted for the majority of the code that, hopefully, doesn't need to do memory allocation at all. The fact that many programmers don't care is an indication that most code in most C or C++ software is not written with performance in mind. And that's (sometimes) fine. LLVM has a good ecosystem reason to use C++, for example, and it's well known in the compiler space that LLVM is not fast. Less recently, for a long time C and C++ were considered high-level languages, meaning lots of software was written in them without consideration of performance. But criticizing the default implementation's performance absent a discussion of its priorities, when you have all the power to not be bottlenecked by it anyway, is just silly.
The fact that you should avoid allocation when possible has absolutely nothing to do with how fast allocation should be when you need it. And code not written with performance in mind should still be as fast as reasonably possible by default.
I would assume that quite a few people actually trying to write fast code would just assume that malloc, being provided to you by your OS, would be in the best position to know how to be fast. Certainly Microsoft has the resources to optimize the two most frequently invoked functions in most C/C++ codebases, at least more than you yourself would.
MSVCRT being stuck with the current extremely slow thing, even if there are genuinely good reasons for it, is still a horrible situation to be in.
There isn't really a "system malloc on Linux". Many distributions come with the GNU allocator based on ptmalloc2, but there is no particular reason that a distro could not come out of the box with any other allocator. The world's most widespread Linux distribution uses LLVM's Scudo allocator. Alpine Linux comes with musl's (unbelievably slow) allocator, although it is possible to rebuild it with mimalloc.