
The reason for exp(x) is that its derivative is exp(x), which makes it possible to express the gradient of s(x) in terms of s(x), or both in terms of exp(x). This simplifies the computation of the backward pass.
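To make that concrete, here is a rough sketch, assuming s(x) is the usual softmax, s_i = exp(x_i) / sum_k exp(x_k). Under that assumption the Jacobian is ds_i/dx_j = s_i * (delta_ij - s_j), so the backward pass can reuse the forward output and needs no further exp() calls. The function names are mine, picked for illustration, and the finite-difference helper is only there as a sanity check.

```python
import math


def softmax(x):
    """Softmax of a list of floats, shifted by max(x) for numerical stability."""
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_jacobian(x):
    """Analytic Jacobian J[i][j] = ds_i/dx_j = s_i * (delta_ij - s_j).

    Built only from the forward output s: no extra exp() evaluations.
    """
    s = softmax(x)
    n = len(s)
    return [
        [s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(n)]
        for i in range(n)
    ]


def numerical_jacobian(x, eps=1e-6):
    """Central finite-difference Jacobian, used only to check the formula."""
    n = len(x)
    jac = [[0.0] * n for _ in range(n)]
    for j in range(n):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[j] += eps
        x_minus[j] -= eps
        s_plus = softmax(x_plus)
        s_minus = softmax(x_minus)
        for i in range(n):
            jac[i][j] = (s_plus[i] - s_minus[i]) / (2.0 * eps)
    return jac


if __name__ == "__main__":
    x = [1.0, 2.0, 0.5]
    analytic = softmax_jacobian(x)
    numeric = numerical_jacobian(x)
    max_err = max(
        abs(analytic[i][j] - numeric[i][j])
        for i in range(len(x))
        for j in range(len(x))
    )
    print("max abs difference:", max_err)
```

Each row of the Jacobian sums to zero, which is what you would expect given that the outputs always sum to 1.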


I agree that "it has nice derivatives" is a great pragmatic reason to use a specific function in ML, but it doesn't prove that it's the best function to use. And even if a derivative looks more complex on paper, that doesn't necessarily mean it is more expensive to compute, so it can't be the only criterion for selecting a function.

Luckily, there are more axiomatic reasons for why softmax is the preferred way to map inputs to a probability distribution.



