The original paper [1] used L-BFGS [2], a quasi-second-order optimization algorithm.
[1] https://arxiv.org/pdf/2404.19756 - "Both MLPs and KANs are trained with LBFGS for 1800 steps in total."
[2] https://en.wikipedia.org/wiki/Limited-memory_BFGS
(Quasi-)Newton methods set the step size from local curvature, which plain gradient-based methods do not do.
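A toy 1-D quadratic (hypothetical numbers, just to illustrate the point) shows the difference: the gradient step needs a hand-tuned learning rate, while the Newton step divides by the curvature and lands on the minimum directly.

```python
# f(x) = 2*x^2, a convex 1-D quadratic with minimum at x = 0.
def f(x): return 2.0 * x ** 2
def grad(x): return 4.0 * x   # f'(x)
def curv(x): return 4.0       # f''(x), constant for a quadratic

x = 10.0

# Plain gradient step: the learning rate is a free hyperparameter.
lr = 0.1
x_gd = x - lr * grad(x)           # 10 - 0.1*40 = 6.0, still far from the minimum

# Newton step: the curvature fixes the step size automatically.
x_newton = x - grad(x) / curv(x)  # 10 - 40/4 = 0.0, exact minimum in one step

print(x_gd, x_newton)
```

On a non-quadratic loss the curvature varies, so Newton-type methods only approximate this behavior locally, but the step-size advantage is the same idea.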
The post relies on Tinygrad because it is familiar to the author, and the author tinkers with batch size and learning rate, but not with the optimizer itself.
I think that even a line search for the minimum along the direction of the batch gradient can provide most of the benefits of L-BFGS. It is easy to implement.
This is why I mentioned batch gradient line search. You can combine it with conjugate gradient.
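Here is a rough sketch of what I mean, on a stand-in 2-D quadratic loss (the matrix and starting point are made up for illustration): a backtracking (Armijo) line search along the batch-gradient direction, with a Polak-Ribiere update to turn successive gradients into conjugate directions.

```python
import numpy as np

A = np.array([[3.0, 0.5], [0.5, 1.0]])  # SPD matrix -> convex quadratic loss
def loss(x): return 0.5 * x @ A @ x
def grad(x): return A @ x

def backtracking(x, d, g, t=1.0, beta=0.5, c=1e-4):
    # Shrink t until the Armijo sufficient-decrease condition holds.
    while loss(x + t * d) > loss(x) + c * t * (g @ d):
        t *= beta
    return t

x = np.array([4.0, -3.0])
g = grad(x)
d = -g                          # first direction: steepest descent
for _ in range(100):
    if g @ g < 1e-20:           # gradient vanished: done
        break
    if g @ d >= 0:              # safeguard: fall back to steepest descent
        d = -g
    t = backtracking(x, d, g)
    x = x + t * d
    g_new = grad(x)
    # Polak-Ribiere coefficient (clipped at 0, which acts as a restart).
    beta_pr = max(0.0, (g_new @ (g_new - g)) / (g @ g))
    d = -g_new + beta_pr * d
    g = g_new

print(loss(x))  # should be near 0
```

On a real network you would use the batch loss and batch gradient in place of `loss`/`grad`; the extra cost per step is a few forward passes for the line search.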
And a small LeNet (I think it was the first convolutional network to obtain a good score on MNIST) is orders of magnitude bigger than the KANs in the original paper. And it would stay that way, if we believe the scaling claims from the KAN paper.
What's your basis for claiming that Tinygrad can't compute 2nd-order partial derivatives (i.e. Hessians) needed for L-BFGS? Tinygrad, like PyTorch, uses automatic differentiation, which has no problem supporting nth-order derivatives.
OP does not (seemingly) claim that Tinygrad can't compute Hessians, only that a first-order optimization method was the only thing tried.
Also, as a quasi-Newton method, L-BFGS does not require explicit (pre-)computation of the Hessian: it implicitly estimates its inverse, iteratively and in an online manner.
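The core of that implicit estimate is the two-loop recursion (Nocedal & Wright, Alg. 7.4). A minimal sketch, with a made-up diagonal quadratic as the toy check: it multiplies the gradient by the inverse-Hessian estimate using only the stored (step, gradient-change) pairs, and no Hessian matrix is ever formed.

```python
import numpy as np

def two_loop(g, s_list, y_list):
    # L-BFGS two-loop recursion: returns an approximation of H^-1 @ g
    # built from the stored pairs (s, y) = (step, gradient change).
    q = g.astype(float).copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if s_list:  # standard initial scaling H0 = ((s.y)/(y.y)) * I
        q *= (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1])
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):
        rho = 1.0 / (y @ s)
        q += (a - rho * (y @ q)) * s
    return q

# Toy check on a quadratic loss 0.5*x@A@x, whose true inverse Hessian
# is A^-1. Feeding in one curvature pair per axis (y = A @ s exactly),
# the recursion reproduces the exact Newton direction.
A = np.diag([1.0, 4.0, 9.0])
s_list = [np.eye(3)[i] for i in range(3)]   # steps along each axis
y_list = [A @ s for s in s_list]            # gradient changes y = A s
g = np.array([9.0, 8.0, 3.0])
print(two_loop(g, s_list, y_list))          # -> A^-1 @ g = [9, 2, 1/3]
```

Storage is just the last m pairs of vectors (m is typically 5-20), which is why it scales to large parameter counts where forming a Hessian would be hopeless.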