My understanding is that the Winograd minimal filtering algorithms used in cuDNN are different from the O(n^2.37) Coppersmith-Winograd-descended matrix multiplication algorithms. But I acknowledge that these can be considered cousins, produced by the same line of research.
I'm pretty sure there's no difference. It does seem to be pretty hard to turn the theoretical win into a practical one: the GPU kernel needs to be coded extremely efficiently to match the underlying hardware. AFAIK it's only a win for 3x3 filters, and maybe one other size (5x5?). Originally cuDNN's Winograd kernels ran on CUDA cores rather than on NVIDIA's Tensor Cores (the matmul-specific hardware on more recent GPUs), but a Google search suggests the two can be combined - not sure whether that has actually landed in cuDNN, though.
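For anyone curious where the 3x3 win comes from: here's a sketch of the 1-D building block, Winograd's F(2,3) minimal filtering algorithm, which computes two outputs of a 3-tap filter with 4 multiplications instead of the direct method's 6. The 2-D F(2x2,3x3) version that convolution libraries use is the nested form of this (function names here are mine, not from any library):

```python
# Winograd F(2,3): two outputs of a 3-tap FIR filter using
# 4 multiplications instead of the 6 a direct computation needs.
# This is the 1-D building block behind 2-D Winograd convolution.

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (in a conv layer this is precomputed once
    # per filter, so its cost amortizes away).
    G0 = g0
    G1 = (g0 + g1 + g2) / 2
    G2 = (g0 - g1 + g2) / 2
    G3 = g2
    # The 4 multiplications in the transform domain.
    m1 = (d0 - d2) * G0
    m2 = (d1 + d2) * G1
    m3 = (d2 - d1) * G2
    m4 = (d1 - d3) * G3
    # Inverse transform: additions only.
    return (m1 + m2 + m3, m2 - m3 - m4)

def direct(d, g):
    """Reference: direct sliding window, 6 multiplications."""
    return (d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
            d[1]*g[0] + d[2]*g[1] + d[3]*g[2])

d = [1.0, 2.0, -1.0, 3.0]
g = [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct(d, g)  # both give (-3.5, 8.0)
```

Note the catch the parent alludes to: the savings come from trading multiplications for additions and transforms, so you only win when the multiply count dominates and the transforms can be fused efficiently - which is why it pays off for small filters like 3x3 and quickly stops being worth it for larger ones.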