Hacker News
by bduerst 2617 days ago
>Deep learning and machine learning don’t work. Quantitative math will always prevail,

I have a neural net onboard my phone which automatically detects songs offline and tells me what they are. Is that semantically 'quantitative math' and not machine learning?


Which app is that? Most of the well-known music identification apps like Shazam use acoustic fingerprinting to identify songs. They work well without using neural nets or deep learning. What benefits does your neural-net-based app offer over this well-known approach?

https://en.m.wikipedia.org/wiki/Acoustic_fingerprint

It uses both - the neural network creates its own acoustic fingerprint database, which it then uses to perceive sound. This shrinks the traditional acoustic fingerprinting data enough to be stored on mobile devices, while allowing for low-power "always-on" identification offline.

It's analogous to a human being able to identify songs by remembering the chorus, just that the NN uses its own features for both the memory and offline perception.

>In 2017 we launched Now Playing on the Pixel 2, using deep neural networks to bring low-power, always-on music recognition to mobile devices. In developing Now Playing, our goal was to create a small, efficient music recognizer which requires a very small fingerprint for each track in the database, allowing music recognition to be run entirely on-device without an internet connection.

https://ai.googleblog.com/2018/09/googles-next-generation-mu...

The biggest issue I have is: "To do this we developed an entirely new system using convolutional neural networks to turn a few seconds of audio into a unique “fingerprint.”"

Why did you pick a neural network? What mathematical properties does a neural network have that make it appealing for this problem? How were the networks trained? Back propagation? It doesn't converge, and worse, learning weights for a new batch can cause you to forget previous batches. This isn't a desirable property of neural networks or back propagation. You probably had a lot of heuristics on top, fine. How do you know that the weights you ended up with will always work in practise? Given an arbitrary track, can you encode it? What about growing the database? Does the neural network get updated for new songs, or do you use the same neural network to fingerprint new songs and update the database?

Here's how I would have done it:

A song file is just a sequence of amplitudes. I would do some kind of interpolation with piecewise trig functions. Trig functions have very desirable properties: they are continuous everywhere, and infinitely differentiable. Moreover, a sine basis decomposition will be able to reconstruct the original signal very well. This is great, because now you can use theories from DSP and Fourier analysis. So we take the entire song, do a continuous time discrete cosine transform, in a block size of 32. Now you compute the square norm of all feature vectors, sort them, eliminate the vectors that are within a 1e-3 radius of each other (they are too similar, there's no point in keeping them) and only store the top 25% of feature vectors by the square norm. The 25% cutoff threshold and 1e-3 radius of similarity are heuristics, and adjustable parameters.

Now you have a database. For a new song, repeat the procedure, and get a feature vector for every block of 32 samples. There are probably theories in DSP you can use to get a better similarity measure, but for now, we'll just use the L2 norm of the difference. Do a nearest neighbour search in your database for all feature vectors, and rank the results based on hits. I can run all of this on a computer from the 2000s, which is crappier than a modern phone, and have the entire backend run on equally crappy hardware too. All parts of what I'm doing are fully deterministic, updating the DB is incredibly fast, CTDCT is super fast, there are no questions of convergence, no need for training. You can probably increase the accuracy and speed by doing some DSP and doing the nearest neighbour search based on different voice, bass, instrumental etc. features.
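For anyone who wants to benchmark the idea, here is a minimal sketch of the pipeline described above, under some loudly stated assumptions. It uses a plain orthonormal block DCT-II in place of the "continuous time" transform (which the comment does not define), treats "top 25%" as 25% of the vectors left after de-duplication, and uses a brute-force nearest neighbour search. The names `features`, `fingerprint` and `identify` are made up for illustration and do not come from any library.

```python
import math
from collections import Counter

BLOCK = 32      # block size from the comment
RADIUS = 1e-3   # similarity radius from the comment
KEEP = 0.25     # fraction of vectors kept, ranked by squared norm

# Orthonormal DCT-II basis for one block (assumption: stands in for "CTDCT").
_BASIS = [
    [math.sqrt((1 if k == 0 else 2) / BLOCK)
     * math.cos(math.pi * (n + 0.5) * k / BLOCK) for n in range(BLOCK)]
    for k in range(BLOCK)
]

def dct_block(block):
    """DCT-II coefficients of one block of BLOCK samples."""
    return [sum(b * x for b, x in zip(row, block)) for row in _BASIS]

def sq_dist(a, b):
    """Squared L2 distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def features(samples):
    """One feature vector per non-overlapping block; a trailing partial block is dropped."""
    usable = len(samples) - len(samples) % BLOCK
    return [dct_block(samples[i:i + BLOCK]) for i in range(0, usable, BLOCK)]

def fingerprint(samples):
    """Sort by squared norm, drop near-duplicates, keep the top KEEP fraction."""
    vecs = sorted(features(samples),
                  key=lambda v: sum(x * x for x in v), reverse=True)
    kept = []
    for v in vecs:  # greedy O(n^2) de-duplication, fine for a sketch
        if all(sq_dist(v, u) > RADIUS ** 2 for u in kept):
            kept.append(v)
    return kept[:max(1, int(len(kept) * KEEP))]

def identify(db, samples):
    """Each query block votes for the track owning its nearest stored vector."""
    votes = Counter()
    for q in features(samples):
        best = min(((sq_dist(q, v), name)
                    for name, vecs in db.items() for v in vecs),
                   default=None)
        if best is not None:
            votes[best[1]] += 1
    return votes.most_common()  # list of (track name, hits), best first
```

To use it, build `db = {name: fingerprint(samples)}` for each track and call `identify(db, excerpt)`. Note that this sketch only matches excerpts that happen to line up with the 32-sample block grid and makes no attempt at robustness to noise, so a fair benchmark would need to address both.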

In practise how would it compare to your neural network? No idea, but I imagine it should be very competitive. The big benefit is that you have only 3 parameters (radius of similarity, cutoff threshold and block size). This seems very easy to benchmark against, it should take like a week to implement. I'm not sure about the compression of the fingerprint however. Not sure how much space 1000000 songs will take (probably 25% since that was our cutoff). You can probably borrow psychoacoustics to make a better database, and get a better compressed representation. Another alternative would be to downsample the song to 64kbps beforehand.
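A back-of-envelope estimate of that storage question, as a rough sketch only. Everything except the block size and the 25% cutoff is an assumption made here for illustration: a 3-minute mono track at 44.1 kHz, 4-byte coefficients, and no savings from the similarity-radius pruning or any later compression.

```python
def estimate_bytes(sample_rate=44_100, seconds=180, block=32,
                   bytes_per_coeff=4, keep=0.25, songs=1_000_000):
    """Return (bytes per track, bytes for the whole catalogue)."""
    blocks = sample_rate * seconds // block           # feature vectors per track
    kept = int(blocks * keep)                         # vectors surviving the cutoff
    per_song = kept * block * bytes_per_coeff         # `block` coefficients per vector
    return per_song, per_song * songs

per_song, total = estimate_bytes()
print(f"{per_song / 1e6:.1f} MB per track, {total / 1e12:.1f} TB total")
```

Under those assumptions this comes to roughly 7.9 MB per track and about 7.9 TB for a million tracks, so the representation would need substantial further compression before it could live on a phone.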

I agree with spaced-out. A neural net can capture all those smaller eigenvectors in the signal that are routinely thrown away during traditional feature engineering, like what you describe. When the number of training samples grows big enough, those factors with marginal contribution become significant and allow higher levels of accuracy in prediction or classification than are possible when curating features manually.

Deep nets are here to stay. They're just not magic bullets that solve all problems equally well, especially those where training data is minimal.

> A neural net can capture all those smaller eigenvectors in the signal that are routinely thrown away during traditional feature engineering

What on earth are you talking about?

>Deep nets are here to stay.

Maybe in Silicon Valley for consumer products like Snapchat and Siri. They won't work for industrial problems.

You'll never be able to develop features with the heuristic methods you described that will work as well as the features learned by a neural net.

Huh, a quick Google search gave me: https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf

This was a paper from Shazam from 2003. This is essentially what I proposed, there is no training. Shazam works pretty well. It's not even going into the mathematical consideration I went into.

>You'll never be able to develop features with the heuristic methods you described that will work as well as the features learned by a neural net.

False.

Quantitative math, or applied math, isn't based on fitting data to an arbitrary mathematical structure. It's looking at real life, and deriving the mathematical laws that govern what you see. You could have a neural net predict planetary motion. However, it doesn't know jack shit about physics.

>I have a neural net onboard my phone which automatically detects songs offline and tells me what they are.

MP3 uses something called psychoacoustics, which is a quantitative model of human perception, which is used to eliminate frequencies that can't be heard based on this model.

Your neural network doesn't tell you what features make songs distinct, it's not a quantitative model at all, but a black box heuristic on what the important features are superficially. If actual mathematicians worked on this problem, I guarantee you they'd do a better job, and their models would work on a Commodore 64, with real time training. Moreover it would tell you things like who is singing, if it's a live performance, which concert it was.

" If actual mathematicians worked on this problem, I guarantee you they'd do a better job"

No, this is wrong.

Some of the most brilliant people in the world have been working on image recognition, voice recognition etc. and AI is crushing all of their work.

"Your neural network doesn't tell you what features make songs distinct, it's not a quantitative model at all" - it doesn't matter at all if our objective is detecting the song. Neither does the mp3 compression algorithm.

>Some of the most brilliant people in the world have been working on image recognition, voice recognition etc. and AI is crushing all of their work.

This is very true. I take my stronger statements back, MAINSTREAM mathematicians attempting this problem are all wrong, and have been wrong for 50 years. But you do need the right theory, and the right math that realizes this theory.

"AI" is superficially beating the work in computer vision. Computer vision is completely bogus. The Gabor filters, Fourier transforms etc. are all wrong conceptually. The known methods do abysmally on basic tasks like object recognition, texture segmentation etc. But they keep trying it.

I would take this one step further: computer vision, audio and NLP researchers have been stuck in a rut for the past 50 years. DL is beating THEIR math, but this is because of data and computation speed, not because of any insights. But DL is also wrong, and giving you an illusion of progress. Both of these things are doomed to go the way of GOFAI.

I can go into great detail and carefully explain why MAINSTREAM contemporary ideas in math for vision, audition and language are completely wrong, and have been wrong for 50 years. What is the right model? Like I mentioned before, the right ideas are emerging, neural networks will dominate, just not DL.

Ok so who are the real, non-mainstream mathematicians who would do better?

> It's looking at real life

Collecting observations aka data.

> deriving the mathematical laws that govern what you see

Fitting a model.

> Your neural network doesn't tell you what features make songs distinct

It literally learns better features than you could ever come up with by hand. This is why CNNs do better in computer vision than hand-engineered filters.

> I guarantee you they'd do a better job, and their models would work on a commadore64, with real time training.

LOL if you think that a room full of people can listen to TBs of audio data and decide which mathematical functions, when combined, describe that data better than the features a DL model learns.

You don't have the slightest clue what you're talking about.

>If actual mathematicians worked on this problem,

This is a No True Scotsman. Actual mathematicians did work on this problem, training the neural network to achieve its target task of identifying songs using minimal power and storage consumption - which works.