The original paper [1] used L-BFGS [2], a quasi-second-order optimization algorithm.
[1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are trained with LBFGS for 1800 steps in total."
[2] https://en.wikipedia.org/wiki/Limited-memory_BFGS
(Quasi-)Newton methods set the step size from local curvature, which plain gradient-based methods do not do.
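A toy 1-D quadratic (hypothetical numbers, just to illustrate the point) shows the difference: the gradient step needs a hand-tuned learning rate, while the Newton step divides by the curvature and lands on the minimum directly.

```python
# f(x) = 2*x^2, a convex 1-D quadratic with minimum at x = 0.
def f(x): return 2.0 * x ** 2
def grad(x): return 4.0 * x   # f'(x)
def curv(x): return 4.0       # f''(x), constant for a quadratic

x = 10.0

# Plain gradient step: the learning rate is a free hyperparameter.
lr = 0.1
x_gd = x - lr * grad(x)           # 10 - 0.1*40 = 6.0, still far from the minimum

# Newton step: the curvature fixes the step size automatically.
x_newton = x - grad(x) / curv(x)  # 10 - 40/4 = 0.0, exact minimum in one step

print(x_gd, x_newton)
```

On a non-quadratic loss the curvature varies, so Newton-type methods only approximate this behavior locally, but the step-size advantage is the same idea.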
The post relies on Tinygrad because it is familiar to the author, and the author tinkers with batch size and learning rate, but not with the optimizer itself.
I think that even a line search for the minimum along the direction of the batch gradient can provide most of the benefits of L-BFGS. It is easy to implement.
This is why I mentioned batch gradient line search. You can combine it with conjugate gradient.
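Here is a rough sketch of what I mean, on a stand-in 2-D quadratic loss (the matrix and starting point are made up for illustration): a backtracking (Armijo) line search along the batch-gradient direction, with a Polak-Ribiere update to turn successive gradients into conjugate directions.

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])  # SPD matrix -> convex quadratic loss
def loss(x): return 0.5 * x @ A @ x
def grad(x): return A @ x

def backtracking(x, d, g, t=1.0, beta=0.5, c=1e-4):
    # Shrink t until the Armijo sufficient-decrease condition holds.
    while loss(x + t * d) > loss(x) + c * t * (g @ d):
        t *= beta
    return t

x = np.array([4.0, -3.0])
g = grad(x)
d = -g                          # first direction: steepest descent
for _ in range(100):
    if g @ g < 1e-20:           # gradient vanished: done
        break
    if g @ d >= 0:              # safeguard: fall back to steepest descent
        d = -g
    t = backtracking(x, d, g)
    x = x + t * d
    g_new = grad(x)
    # Polak-Ribiere coefficient (clipped at 0, which acts as a restart).
    beta_pr = max(0.0, (g_new @ (g_new - g)) / (g @ g))
    d = -g_new + beta_pr * d
    g = g_new

print(loss(x))  # should be near 0
```

On a real network you would use the batch loss and batch gradient in place of `loss`/`grad`; the extra cost per step is a few forward passes for the line search.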
And a small LeNet (I think it was the first convolutional network to obtain a good score on MNIST) is orders of magnitude bigger than the KANs in the original paper. And it would stay that way, if we believe the scaling claims from the KAN paper.
What's your basis for claiming that Tinygrad can't compute 2nd-order partial derivatives (i.e. Hessians) needed for L-BFGS? Tinygrad, like PyTorch, uses automatic differentiation, which has no problem supporting nth-order derivatives.
OP does not (seemingly) claim that Tinygrad can't compute Hessians, only that a first-order optimization method was the only thing tried.
Also, as a quasi-Newton method, L-BFGS does not require explicit (pre-)computation of the Hessian: it implicitly estimates its inverse, iteratively and in an online manner.
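The core of that implicit estimate is the two-loop recursion (Nocedal & Wright, Alg. 7.4). A minimal sketch, with a made-up diagonal quadratic as the toy check: it multiplies the gradient by the inverse-Hessian estimate using only the stored (step, gradient-change) pairs, and no Hessian matrix is ever formed.

```python
import numpy as np

def two_loop(g, s_list, y_list):
    # L-BFGS two-loop recursion: returns an approximation of H^-1 @ g
    # built from the stored pairs (s, y) = (step, gradient change).
    q = g.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:  # standard initial scaling H0 = ((s.y)/(y.y)) * I
        q *= (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        q += (a - rho * (y @ q)) * s
    return q

# Toy check on a quadratic loss 0.5*x@A@x, whose true inverse Hessian
# is A^-1. Feeding in one curvature pair per axis (y = A @ s exactly),
# the recursion reproduces the exact Newton direction.
A = np.diag([1.0, 4.0, 9.0])
s_list = [np.eye(3)[i] for i in range(3)]   # steps along each axis
y_list = [A @ s for s in s_list]            # gradient changes y = A s
g = np.array([9.0, 8.0, 3.0])
print(two_loop(g, s_list, y_list))          # -> A^-1 @ g = [9, 2, 1/3]
```

Storage is just the last m pairs of vectors (m is typically 5-20), which is why it scales to large parameter counts where forming a Hessian would be hopeless.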