Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.
Anecdote: almost a decade ago, I was responsible for an NVMe-like implementation (hardware and software). The 3rd version of the firmware modeled the various components as threads, but there was no need for preemption (which would require expensive locking). Traditional scheduling would work, but you actually know exactly which thread should execute next (hardware will signal done), so an explicit yield_to() was far cheaper than general scheduling and only slightly more expensive than a function call.
> Co-routines are very useful and likely underused, but sometimes you are actually better off being able to pass control to a given thread directly, rather than having a scheduler involved.
That's almost the very definition of a coroutine--explicit transfer of control. In symmetric coroutines you must specify a coroutine for both yield and resume; in asymmetric coroutines you specify what to resume to but yield implicitly returns to whatever resumed the current coroutine. In either case the actual control flow transfer is explicitly invoked.
The term thread is more ambiguous, but it almost always implies control transfers--both the timing and target of control transfer--are implicit and not directly exposed to application logic. (Automagic control transfer might be hidden within commonly used functions (e.g. read and write), injected by the compiler (Go does this), or triggered by hardware.)
You can synthesize a threading framework with both asymmetric and symmetric stackful[1] coroutines by simply overloading the resume and yield operations to transfer control to a scheduler, and then hiding implicit resume/yield points within commonly used functions or by machine translation of the code. In languages where "yield" and "resume" are exposed as regular functions this is especially trivial. Stackful coroutines (as opposed to stackless, which are the most commonly provided type of coroutine) are a powerful enough primitive that building threads is relatively trivial, which is why the concepts are easy to conflate, but they shouldn't be confused.
LISP-y languages blur some of these distinctions as libraries can easily rewrite code; they can inject implicit control transfer and stack management in unobtrusive ways.[2] This isn't possible to the same extent in languages like C, C++, or Rust; lacking a proper control flow primitive (i.e. stackful coroutine) their "threading" frameworks[3] are both syntactically and semantically leaky.
[1] By definition a thread preserves stack state--recursive function state--and this usually implies that stack management occurs at a very low-level in the execution environment, but in any case largely hidden from the logical application code.
[2] OTOH, this is usually inefficient--stack management is a very performance critical aspect of the runtime. For example, Guile, a Scheme implementation, now provides a stackful coroutine primitive. For a good discussion of some of these issues, see https://wingolog.org/archives/2017/06/27/growing-fibers
[3] Specifically the frameworks that attempt to make asynchronous I/O network programming simple and efficient. So-called native threads are a different matter as both stack management and control transfer are largely implemented outside the purview of those languages, very much like how native processes are implemented. If you go back far enough in the literature, especially before virtual memory, the distinctions between process and thread fall away. Nowadays threads are differentiated from processes by sharing the same memory/object space.
Yes, I agree, but that's not what the OP does. Also, quite often co-routines get conflated with cooperative threads/scheduling.
For some color to your other points: the previous version was actually using continuation passing style which worked and was very fast (faster than coroutines), but challenging to understand without a good background in FP and FP implementations.
Funny enough, we actually started prototyping with Erlang for a subsequent project (which was cancelled before it went far). Unfortunately I don't know enough to say what's special about the Erlang scheduler (if anything), but as I understand the Erlang concurrency model, it's mostly about not sharing memory (forcing explicit communication). That's obviously going to eliminate a host of bugs, but it would have been way too expensive for the mentioned firmware.
Once we got started with Erlang I was pretty turned off. The pretty examples you see in tutorials aren't what you'll be using. Instead it's framework upon framework, far from elegance IMO. I was happy to not have to deal with that again. Today I'd probably choose Rust for the same task (static types FTW).
Given that Erlang is sufficiently different from anything else I've seen, it doesn't surprise me that trying to be productive in Erlang before you knew it well enough was a suboptimal experience. I like it, and more specifically Elixir, quite a bit, but the learning curve was steep.
It's fair to say Erlang's strength or appeal is not the language itself but the platform you program with it. That's also where the learning curve is.