# Novel Architecture Makes Neural Networks More Understandable

## Introduction

“Neural networks are currently the most powerful tools in artificial intelligence,” said Sebastian Wetzel, a researcher at the Perimeter Institute for Theoretical Physics. “When we scale them up to larger data sets, nothing can compete.”

And yet, all this time, neural networks have had a disadvantage. The basic building block of many of today’s successful networks is known as a multilayer perceptron, or MLP. But despite a string of successes, humans just can’t understand how networks built on these MLPs arrive at their conclusions, or whether there may be some underlying principle that explains those results. The amazing feats that neural networks perform, like those of a magician, are kept secret, hidden behind what’s commonly called a black box.

AI researchers have long wondered if it’s possible for a different kind of network to deliver similarly reliable results in a more transparent way.

An April 2024 study introduced an alternative neural network design, called a Kolmogorov-Arnold network (KAN), that is more transparent yet can also do almost everything a regular neural network can for a certain class of problems. It’s based on a mathematical idea from the mid-20th century that has been rediscovered and reconfigured for deployment in the deep learning era.

Although this innovation is just a few months old, the new design has already attracted widespread interest within research and coding communities. “KANs are more interpretable and may be particularly useful for scientific applications where they can extract scientific rules from data,” said Alan Yuille, a computer scientist at Johns Hopkins University. “[They’re] an exciting, novel alternative to the ubiquitous MLPs.” And researchers are already learning to make the most of their newfound powers.

**Fitting the Impossible**

A typical neural network works like this: Layers of artificial neurons (or nodes) connect to each other using artificial synapses (or edges). Information passes through each layer, where it is processed and transmitted to the next layer, until it eventually becomes an output. The edges are weighted, so that those with greater weights have more influence than others. During a period known as training, these weights are continually tweaked to get the network’s output closer and closer to the right answer.
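As a rough illustration, the loop described above — weighted edges, nonlinear nodes, and repeated weight tweaks during training — can be sketched in a few lines of NumPy. The sizes, learning rate, and toy target here are made up for the example, not taken from any paper:

```python
import numpy as np

# A minimal one-hidden-layer MLP: weighted edges, a nonlinear
# activation, and weight updates that shrink the prediction error.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8))   # edge weights: input -> hidden
W2 = rng.normal(size=(8, 1))   # edge weights: hidden -> output

def forward(x):
    h = np.tanh(x @ W1)        # each hidden node sums its weighted inputs
    return h @ W2              # output is a weighted sum of hidden activations

# Toy regression target (illustrative): fit y = x1 * x2 on random points.
X = rng.uniform(-1, 1, size=(64, 2))
y = (X[:, 0] * X[:, 1]).reshape(-1, 1)

def loss():
    return float(np.mean((forward(X) - y) ** 2))

before = loss()
lr = 0.1
for _ in range(200):           # "training": tweak weights toward the answer
    h = np.tanh(X @ W1)
    pred = h @ W2
    grad_out = 2 * (pred - y) / len(X)
    gW2 = h.T @ grad_out                              # gradient for W2
    gW1 = X.T @ ((grad_out @ W2.T) * (1 - h ** 2))    # gradient for W1
    W2 -= lr * gW2
    W1 -= lr * gW1
after = loss()
print(before, "->", after)     # the error falls as the weights are tuned
```

Each pass nudges the edge weights in the direction that most reduces the error — exactly the continual tweaking the paragraph describes.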

A common objective for neural networks is to find a mathematical function, or curve, that best connects certain data points. The closer the network can get to that function, the better its predictions and the more accurate its results. If your neural network models some physical process, the output function will ideally represent an equation describing the physics — the equivalent of a physical law.

For MLPs, a mathematical result known as the universal approximation theorem describes how closely a network can get to the best possible function. One consequence of this theorem is that a finite MLP can approach that function arbitrarily well, but in general cannot represent it perfectly.

But KANs, in the right circumstances, can.

KANs go about function fitting — connecting the dots of the network’s output — in a fundamentally different way than MLPs. Instead of relying on edges with numerical weights, KANs use functions. These edge functions are nonlinear, meaning they can represent more complicated curves. They’re also learnable, so they can be tweaked with far greater sensitivity than the simple numerical weights of MLPs.
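To see why a learnable function on an edge is so much more expressive than a single number, here is a sketch of one such edge in NumPy. This is purely illustrative — the grid, the Gaussian basis, and the least-squares fit are assumptions for the example (the original KAN paper uses B-splines), not the authors' implementation:

```python
import numpy as np

# One KAN-style edge: instead of a single scalar weight, the edge
# carries a 1-D curve built from basis functions with trainable
# coefficients (here Gaussian bumps; real KANs use splines).
grid = np.linspace(-1, 1, 8)                 # knot positions (assumed)

def basis(x):
    # Gaussian bumps centered on the grid points.
    return np.exp(-((x[:, None] - grid[None, :]) ** 2) / 0.1)

coeffs = np.zeros(len(grid))                 # the edge's learnable parameters

def edge(x):
    return basis(x) @ coeffs                 # the edge's nonlinear function

# "Training" this edge alone: fit it to a curvy target, something a
# single numerical weight (a straight line through the origin) never could.
x = np.linspace(-1, 1, 100)
target = np.sin(3 * x)
coeffs, *_ = np.linalg.lstsq(basis(x), target, rcond=None)

err = float(np.max(np.abs(edge(x) - target)))
print(err)                                   # the edge bends to match the curve
```

Tweaking `coeffs` reshapes the whole curve, which is the fine-grained sensitivity the paragraph refers to.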

Yet for the past 35 years, KANs were thought to be fundamentally impractical. A 1989 paper co-authored by Tomaso Poggio, a physicist turned computational neuroscientist at the Massachusetts Institute of Technology, explicitly stated that the mathematical idea at the heart of a KAN is “irrelevant in the context of networks for learning.”

One of Poggio’s concerns goes back to the mathematical concept at the heart of a KAN. In 1957, the mathematicians Andrey Kolmogorov and Vladimir Arnold showed — in separate though complementary papers — that if you have a single mathematical function that uses many variables, you can transform it into a combination of many functions that each have a single variable.
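In its standard form, the Kolmogorov–Arnold representation theorem says that any continuous function of $n$ variables on the unit cube can be written as

```latex
f(x_1, \dots, x_n) \;=\; \sum_{q=0}^{2n} \Phi_q\!\left( \sum_{p=1}^{n} \phi_{q,p}(x_p) \right)
```

where every inner function $\phi_{q,p}$ and every outer function $\Phi_q$ is a continuous function of a single variable.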

There’s an important catch, however. The single-variable functions the theorem spits out might not be “smooth,” meaning they can have sharp edges like the vertex of a V. That’s a problem for any network that tries to use the theorem to re-create the multivariable function. The simpler, single-variable pieces need to be smooth so that they can learn to bend the right way during training, in order to match the target values.

So KANs looked like a dim prospect — until a cold day this past January, when Ziming Liu, a physics graduate student at MIT, decided to revisit the subject. He and his adviser, the MIT physicist Max Tegmark, had been working on making neural networks more understandable for scientific applications — hoping to offer a peek inside the black box — but things weren’t panning out. In an act of desperation, Liu decided to look into the Kolmogorov-Arnold theorem. “Why not just try it and see how it works, even if people hadn’t given it much attention in the past?” he asked.

Tegmark was familiar with Poggio’s paper and thought the effort would lead to another dead end. But Liu was undeterred, and Tegmark soon came around. They recognized that even if the single-variable functions generated by the theorem were not smooth, the network could still approximate them with smooth functions. They further understood that most of the functions we come across in science are smooth, which would make perfect (rather than approximate) representations potentially attainable. Liu didn’t want to abandon the idea without first giving it a try, knowing that software and hardware had advanced dramatically since Poggio’s paper came out 35 years ago. Many things are possible in 2024, computationally speaking, that were not even conceivable in 1989.

Liu worked on the idea for about a week, during which he developed some prototype KAN systems, all with two layers — the simplest possible networks, and the type researchers had focused on over the decades. Two-layer KANs seemed like the obvious choice because the Kolmogorov-Arnold theorem essentially provides a blueprint for such a structure. The theorem specifically breaks down the multivariable function into distinct sets of inner functions and outer functions. (In a KAN, these become the learnable activation functions that sit on the edges, playing the role that numerical weights play in an MLP.) That arrangement lends itself naturally to a KAN structure with an inner and outer layer of neurons — a common arrangement for simple neural networks.
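Following the theorem's blueprint, a two-layer KAN's forward pass is just the inner-then-outer composition. The sketch below uses fixed, arbitrarily chosen single-variable functions purely to show the structure — in a real KAN these would be the learnable edge functions:

```python
import numpy as np

# Toy two-layer KAN forward pass following the Kolmogorov-Arnold blueprint.
# The specific inner/outer functions here are arbitrary stand-ins; a real
# KAN would learn them from data.
n = 2                                        # number of input variables

def inner(q, p, x):                          # phi_{q,p}: single-variable
    return np.sin((q + 1) * x + p)           # arbitrary smooth choice

def outer(q, s):                             # Phi_q: single-variable
    return np.tanh(s)                        # arbitrary smooth choice

def kan(x):                                  # x: length-n input vector
    # Sum over 2n + 1 outer functions, each applied to a sum of
    # single-variable inner functions -- the theorem's exact shape.
    return sum(outer(q, sum(inner(q, p, x[p]) for p in range(n)))
               for q in range(2 * n + 1))

print(kan(np.array([0.3, -0.7])))
```

The inner functions form the first layer of edges and the outer functions the second, which is why two layers looked like the natural starting point.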

But to Liu’s dismay, none of his prototypes performed well on the science-related chores he had in mind. Tegmark then made a key suggestion: Why not try a KAN with more than two layers, which might be able to handle more sophisticated tasks?

That outside-the-box idea was the breakthrough they needed. Liu’s fledgling networks started showing promise, so the pair soon reached out to colleagues at MIT, the California Institute of Technology and Northeastern University. They wanted mathematicians on their team, plus experts in the areas they planned to have their KAN analyze.

In their April paper, the group showed that KANs with three layers were indeed possible, providing an example of a three-layer KAN that could exactly represent a function (whereas a two-layer KAN could not). And they didn’t stop there. The group has since experimented with up to six layers, and with each one, the network is able to align with a more complicated output function. “We found that we could stack as many layers as we want, essentially,” said Yixuan Wang, one of the co-authors.

**Proven Improvements**

The authors also turned their networks loose on two real-world problems. The first relates to a branch of mathematics called knot theory. In 2021, a team from DeepMind announced they’d built an MLP that could predict a certain topological property for a given knot after being fed enough of the knot’s other properties. Three years later, the new KAN duplicated that feat. Then it went further and showed how the predicted property was related to all the others — something, Liu said, that “MLPs can’t do at all.”

The second problem involves a phenomenon in condensed matter physics called Anderson localization. The goal was to predict the boundary at which a particular phase transition will occur, and then to determine the mathematical formula that describes that process. No MLP has ever been able to do this. Their KAN did.

But the biggest advantage that KANs hold over other forms of neural networks, and the principal motivation behind their recent development, Tegmark said, lies in their interpretability. In both of those examples, the KAN didn’t just spit out an answer; it provided an explanation. “What does it mean for something to be interpretable?” he asked. “If you give me some data, I will give you a formula you can write down on a T-shirt.”

The ability of KANs to do this, limited though it’s been so far, suggests that these networks could theoretically teach us something new about the world, said Brice Ménard, a physicist at Johns Hopkins who studies machine learning. “If the problem is actually described by a simple equation, the KAN network is pretty good at finding it,” he said. But he cautioned that the domain in which KANs work best is likely to be restricted to problems — such as those found in physics — where the equations tend to have very few variables.

Liu and Tegmark agree, but don’t see it as a drawback. “Almost all of the famous scientific formulas” — such as *E* = *mc*² — “can be written in terms of functions of one or two variables,” Tegmark said. “The vast majority of calculations we do depend on one or two variables. KANs exploit that fact and look for solutions of that form.”

**The Ultimate Equations**

Liu and Tegmark’s KAN paper quickly caused a stir, garnering 75 citations within about three months. Soon other groups were working on their own KANs. A paper by Yizheng Wang of Tsinghua University and others that appeared online in June showed that their Kolmogorov-Arnold-informed neural network (KINN) “significantly outperforms” MLPs for solving partial differential equations (PDEs). That’s no small matter, Wang said: “PDEs are everywhere in science.”

A July paper by researchers at the National University of Singapore was more mixed. They concluded that KANs outperformed MLPs in tasks related to interpretability, but found that MLPs did better with computer vision and audio processing. The two networks were roughly equal at natural language processing and other machine learning tasks. For Liu, those results were not surprising, given that the original KAN group’s focus has always been on “science-related tasks,” where interpretability is the top priority.

Meanwhile, Liu is striving to make KANs more practical and easier to use. In August, he and his collaborators posted a new paper called “KAN 2.0,” which he described as “more like a user manual than a conventional paper.” This version is more user-friendly, Liu said, offering a tool for multiplication, among other features, that was lacking in the original model.

This type of network, he and his co-authors maintain, represents more than just a means to an end. KANs foster what the group calls “curiosity-driven science,” which complements the “application-driven science” that has long dominated machine learning. When observing the motion of celestial bodies, for example, application-driven researchers focus on predicting their future states, whereas curiosity-driven researchers hope to uncover the physics behind the motion. Through KANs, Liu hopes, researchers could get more out of neural networks than just help on an otherwise daunting computational problem. They might focus instead on simply gaining understanding for its own sake.