There are still blatant failure modes where models engage in clear sycophancy rather than just expressing enthusiasm.

I'd guess that, in practice, a benchmark (like this vibesbench) that catches unhelpful and blatant sycophancy failures may help.