The softmax is the probability of the next token being whatever in the training...

canjobear · 2026-05-01T15:39:18 1777649958

The softmax, after the network has been trained, yields an estimate of the probability in the training data, but it is not that probability itself.

jmalicki · 2026-05-01T15:57:11 1777651031

Which models are not trained with the log softmax as the loss function?

canjobear · 2026-05-01T16:09:57 1777651797

Softmax isn't a loss function. It transforms model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities; those numbers are then passed into (typically) the cross entropy loss function. I think you mean: which models are trained using some function other than softmax to transform the model outputs? There are a number of alternatives to softmax, such as the ones described here: https://www.emergentmind.com/topics/sparsemax
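
To make the two steps concrete, here is a minimal sketch in plain Python. The function names and the example logits are made up for illustration, and the loss is computed for a single example with a one-hot target.

```python
import math

def softmax(logits):
    # Subtract the max before exponentiating for numerical stability;
    # this does not change the result.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, target_index):
    # Single example, one-hot target: the loss is -log q[target].
    return -math.log(probs[target_index])

logits = [2.0, 1.0, 0.1]        # hypothetical model outputs
probs = softmax(logits)         # step 1: outputs -> probabilities
loss = cross_entropy(probs, 0)  # step 2: probabilities -> loss
```

Note that `cross_entropy` only ever sees probabilities; it has no idea they came from a softmax.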

jmalicki · 2026-05-01T16:20:19 1777652419

The cross entropy loss function is softmax. They are one and the same.

canjobear · 2026-05-01T16:29:48 1777652988

They’re not. Cross entropy loss is E[-log q], where q is the probability the model assigns to the observed outcome. You could convert the model outputs x into probabilities using some other function, like q_i = x_i^2 / Z with Z = sum_j x_j^2, and compute cross entropy loss just fine.
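
A rough sketch of that, assuming a single example with a one-hot target. The names `square_normalize` and `cross_entropy` and the example outputs are hypothetical, and I'm not claiming anyone trains models this way (an output of exactly 0 on the target would give an infinite loss, for one thing).

```python
import math

def square_normalize(outputs):
    # q_i = x_i^2 / Z, with Z = sum_j x_j^2. No softmax involved.
    squares = [x * x for x in outputs]
    z = sum(squares)
    return [s / z for s in squares]

def cross_entropy(probs, target_index):
    # Single example, one-hot target: the loss is -log q[target].
    return -math.log(probs[target_index])

outputs = [2.0, 1.0, -1.0]     # hypothetical model outputs
q = square_normalize(outputs)  # squares are 4, 1, 1, so q = [4/6, 1/6, 1/6]
loss = cross_entropy(q, 0)     # -log(4/6)
```

The loss is still a perfectly well-defined cross entropy; only the map from outputs to probabilities changed.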

jmalicki · 2026-05-01T16:40:23 1777653623

Behold the softmax: https://docs.pytorch.org/docs/2.11/generated/torch.nn.CrossE...

canjobear · 2026-05-01T16:53:50 1777654430

Behold the actual definition of cross entropy: https://en.wikipedia.org/wiki/Cross-entropy

It's true that the PyTorch API conflates cross entropy and softmax, but they are separate concepts.
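
For what it's worth, here is a sketch of why a fused "logits in, loss out" function and the two-step version give the same number. This is plain Python with hypothetical function names, not PyTorch's actual implementation; the fused form is just log-sum-exp of the logits minus the target logit.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def cross_entropy(probs, target_index):
    # The concept on its own: -log q[target] for a one-hot target.
    return -math.log(probs[target_index])

def fused_softmax_cross_entropy(logits, target_index):
    # Softmax and cross entropy folded into one expression:
    # log(sum_j exp(x_j)) - x_target, computed stably via the max trick.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]

logits = [2.0, 1.0, 0.1]  # hypothetical model outputs
two_step = cross_entropy(softmax(logits), 1)
one_step = fused_softmax_cross_entropy(logits, 1)
```

The two values agree up to floating-point rounding, which is presumably why it is convenient for an API to bundle them, but the bundling is an implementation choice rather than a definition.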