The soft max is the probability of the next token being whatever in the training data conditioned on the inputs. The author just doesn't know that apparently and thinks it was an arbitrary choice.
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.
Softmax isn't a loss function. It is used to transform model outputs into positive numbers that sum to 1, so that they can be interpreted as probabilities, and then those numbers are passed into (typically) the cross entropy loss function. I think you mean, which models are trained using some function other than softmax to transform the model outputs. There are a number of alternatives to softmax, such as the ones described here https://www.emergentmind.com/topics/sparsemax
They’re not. Cross entropy loss is E[-log q] where q is a probability. You could convert the model outputs x into probabilities using some other function like q = 1/Z x^2, and compute cross entropy loss just fine.
The author's essay on the sigmoid similarly lacks the deep understanding that it comes from somewhere and isn't an arbitrary choice.