Hacker News new | ask | show | jobs
by krasin 777 days ago
Update2: got it to 100% training accuracy, 99% test accuracy with (2, 2, 2) shape.

Changes:

1. Increased the training set from 1000 to 100k samples. This solved overfitting.

2. In the dataset generation, slightly reduced noise (0.1 -> 0.07) so that classes don't overlap. With an overlap, naturally, it's impossible to hit 100%.

3. Most important & specific to KANs: train for 30 steps with grid=5 (5 segments for each activation function), then 30 steps with grid=10 (and initializing from the previous model), and then 30 steps with grid=20. This is idiomatic to KANs and covered in the Example_1_function_fitting.ipynb: https://github.com/KindXiaoming/pykan/blob/master/tutorials/...

Overall, my impressions are:

- it works!

- the reference implementation is very slow. A GPU implementation is dearly needed.

- it feels like it's a bit too non-linear and training is not as stable as it's with MLP + ReLU.

- Scaling is not guaranteed to work well. Really need to see if MNIST is possible to solve with this approach.

I will definitely keep an eye on this development.

2 comments

This makes me wonder what you could achieve if instead of iteratively growing the grid, or worrying about pruning or regularization, you governed network topology with some sort of evolutionary algorithm.
You can do much better by growing an AST with memoization and non-linear regression. So much so, the EVO folks gave a best paper to a non-EVO, deterministic algorithm at their conference

https://seminars.math.binghamton.edu/ComboSem/worm-chiu.pge_... (author)

Code for the curious: https://github.com/verdverm/pypge
Interesting, the use of grammar production rules reminds me of Grammatical Evolution[0], which has shown some promise in constraining the search space when using EAs for e.g. symbolic regression.

[0]: https://en.wikipedia.org/wiki/Grammatical_evolution

Much of what I did in my work was to reduce or constrain the search space.

1. Don't evolve constants or coefficients, use regression to find

2. Leverage associativity and commutativity, simplify with SymPy, sort operands to add/mul

So much effort in GP for SR is spent evaluating models which are effectively the same, even though their "DNA" is different. Computational effort, and algorithmic effort (to deal with loss of population diversity, i.e. premature convergence)

I've seen a few papers since pick up on the idea of local search operators, the simplification, and regression, trying to maintain the evolution aspect. Every algo ends up in local optima and works of effectively the same form by adding useless "DNA". I could see the PGE algo doing this too, going down a branch of the search space that did not add meaningful improvement. With the recent (~5y) advancements in AI, there are some interesting things to try

Believe there is a Google paper out there that tried that
1000s, there is a whole field and set of conferences. You can find more by searching "Genetic Programming" or "Symbolic Regression"

KAN, with the library of variables and math operators, very much resembles this family of algos, problems, and limitations. The lowest hanging fruit they usually leave on the proverbial tree is that you can use fast regression techniques for the constants and coefficients. No need to leave it up to random perturbations or gradient descent. What you really need to figure out is the form or shape of the model, rather than leaving it up to the human (in KAN)

> Increased the training set from 1000 to 100k samples. This solved overfitting.

Solved over fitting or created more? Even if your sets are completely disjoint with something like two moons the more data you have the lower the variance.