
No. The experts are not separately trained, and while they may end up storing different concepts, they are not designed to be experts in specific human-legible domains. That said, there are separate techniques for routing requests to different domain-expert LLMs, or even to fine-tuned adapters, such as RouteLLM.
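For a concrete picture of what request-level routing means, here is a minimal sketch where a classifier picks a whole model per prompt. The model names and keyword rules are made up for illustration; RouteLLM itself uses a trained router, not keyword matching.

    # Sketch of request-level LLM routing: pick one whole model per prompt.
    # Model names and keyword heuristics are hypothetical.
    DOMAIN_MODELS = {
        "code": "qwen-coder",
        "math": "deepseek-r1",
        "general": "llama-70b",
    }

    def route(prompt: str) -> str:
        """Guess the prompt's domain and return a model name for it."""
        p = prompt.lower()
        if any(k in p for k in ("def ", "class ", "bug", "compile")):
            return DOMAIN_MODELS["code"]
        if any(k in p for k in ("prove", "integral", "equation")):
            return DOMAIN_MODELS["math"]
        return DOMAIN_MODELS["general"]

    print(route("Why does this compile error happen?"))  # -> qwen-coder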


Why do you think hand-configured routing between "different domains" would be better than the learned, training-based routing in an MoE?


First off, they are fundamentally different technologies, so it would be disingenuous to treat it as an apples-to-apples comparison.

But a simple way to see it is this: when you pick between multiple large models with different strengths, you have a larger total pool of parameters to work with (e.g. DeepSeek R1 + V3 + Qwen + LLaMA adds up to roughly 2 trillion parameters to pick from), whereas "picking" the experts inside a single MoE gives you a smaller total pool (e.g. R1 alone is 671 billion parameters, Qwen is 235 billion).
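For contrast with the router above, here is a toy sketch of expert selection inside a single MoE layer: the "experts" are just feed-forward blocks within one network, and a learned gate, trained end to end with everything else, picks the top-k per token. This is a simplified illustration, not how R1 or Qwen actually implement their MoE layers.

    import torch
    import torch.nn as nn

    class TinyMoE(nn.Module):
        """Toy top-2 mixture-of-experts layer."""
        def __init__(self, d_model=64, n_experts=8, top_k=2):
            super().__init__()
            # Each "expert" is a plain FFN; they share the same model.
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                              nn.Linear(4 * d_model, d_model))
                for _ in range(n_experts))
            # The gate is learned during training, not hand-configured.
            self.gate = nn.Linear(d_model, n_experts)
            self.top_k = top_k

        def forward(self, x):                       # x: (tokens, d_model)
            scores = self.gate(x)                   # (tokens, n_experts)
            weights, idx = scores.topk(self.top_k, dim=-1)
            weights = weights.softmax(dim=-1)
            out = torch.zeros_like(x)
            # Run only the selected experts on each token's hidden state.
            for k in range(self.top_k):
                for e in range(len(self.experts)):
                    mask = idx[:, k] == e
                    if mask.any():
                        out[mask] += weights[mask, k, None] * self.experts[e](x[mask])
            return out

    x = torch.randn(5, 64)
    print(TinyMoE()(x).shape)  # torch.Size([5, 64])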


That might already be happening behind the scenes as part of what they call test-time compute.


Many models that use test-time compute are MoEs, but "test-time compute" generally refers to reasoning about the prompt or problem the model is given, not to deciding which model to pick, and I don't think anyone has released an LLM router under that name.


We don't know what OpenAI does to find the best answer when reasoning, but I am pretty sure that having variations of the same model is part of it.
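For illustration, one common way to spend extra inference compute with variations of the same model is best-of-n sampling: draw several candidates at different temperatures and keep the one a scorer likes best. The functions below are stand-ins for a model and a verifier; this is a generic sketch, not what OpenAI actually does.

    import random

    def generate(prompt: str, temperature: float) -> str:
        """Stand-in for sampling one completion from a model (hypothetical)."""
        return f"answer@T={temperature:.1f}:{random.random():.3f}"

    def score(prompt: str, answer: str) -> float:
        """Stand-in for a verifier/reward model scoring an answer (hypothetical)."""
        return random.random()

    def best_of_n(prompt: str, n: int = 8) -> str:
        # Sample n candidates from the same model at varied temperatures,
        # then keep the highest-scoring one.
        candidates = [generate(prompt, 0.3 + 0.1 * i) for i in range(n)]
        return max(candidates, key=lambda a: score(prompt, a))

    print(best_of_n("Prove that sqrt(2) is irrational."))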



