Towards Nyquist Learners

	Towards Nyquist Learners (gwern.net)
	66 points by sleepingreset 583 days ago

7 comments

ks2048 582 days ago

Title has a misleading domain name (gwern.net). Link is to a PhD thesis titled "Scaling Laws for Deep Learning" by Jonathan Rosenfeld. Not sure why it wasn't linked more directly:

https://arxiv.org/abs/2108.07686

https://arxiv.org/pdf/2108.07686#page=85

link

jll29 582 days ago

Generally, if you google a person's name as follows:

  "Jonathan S. Rosenfeld" +DBLP

you will get their computer science publication list.

From that, you can gather that the two main papers that form the core of Rosenfeld's thesis are these:

https://openreview.net/pdf?id=ryenvpEKDr

https://proceedings.mlr.press/v139/rosenfeld21a.html

(if you prefer to read the gist in fewer pages.)

link

telotortium 582 days ago

Gwern, have you considered hosting your archived docs on a different subdomain (e.g., doc.gwern.net) to make it clearer that they are not something you have authored yourself? Not sure what the best subdomain would be though.

link

gwern 582 days ago

I don't think that would make it any clearer. Why would 'doc.gwern.net' be more obviously just a random document than 'gwern.net/doc/www/'?

Regardless, I am puzzled how OP got this URL in the first place. He wasn't supposed to, he was supposed to get the canonical Arxiv PDF link. Because this is one of the cache mirrors/local archives†, rather than a regular hosted document. We block everything in /doc/www/ in robots.txt & HTTP no-archive/crawl/mirror/etc headers, and we use JS to swap out the local URL for the original URL whenever the reader clicks or mouse-overs or interacts with a link to the URL in a web page (and that is the only place they should be publicly listed or accessible). If OP read it on gwern.net by seeing a link to it, and he wanted to copy the URL elsewhere, he should have just gotten the canonical "https://arxiv.org/pdf/2108.07686#page=85"... But somehow he didn't.

OP, do you remember how exactly you grabbed this URL? Is this an old link from before our URL swapping was implemented, or did you deliberately work around it, or did you find some place we forgot to swap, or what?

(If anyone is wondering why I mirror Arxiv PDFs like this in the first place: it's for the PDF preview feature in the popups. Because Arxiv blocks itself from being loaded in iframes, we need local mirrors for PDF preview to work at all; local mirrors save a new domain lookup and speed up the PDF preview a lot because we compress the PDF more thoroughly and Arxiv servers are always overloaded; and because readers can potentially pop up many Arxiv PDFs easily, it saves Arxiv a lot of bandwidth and avoids burdening their servers further, so it's just the responsible thing to do.)

† https://gwern.net/archiving#preemptive-local-archiving

link

xelxebar 581 days ago

Not OP, but the HN crowd here often browses without JS. Quickly testing a no-JS session, I do see your archive URLs instead of arxiv ones.

link

gwern 581 days ago

Yes, without the swapping JS, you wouldn't get the canonical URL. But browsing Gwern.net these days without JS is pretty painful. And in this particular case, there is only one place on Gwern.net that the link exists where you could see it without JS; in the other 5 or 6 links, you could only get there via JS and thus the swapping should've happened. So it is not a safe assumption that OP simply browsed with NoScript.

link

sleepingreset 580 days ago

Hi Gwern, I'm honestly not sure. I have some Firefox extension that skips trackers and other redirects. I have like 100 Firefox extensions, actually. I'm not sure how most of them work nor what they do exactly, I just trust that they make my browser more "secure" and I tend to download things at random -- especially if I see ads or want certain features in my client (e.g. a browser that auto-rejects cookies).

Happy to try and help you figure this out but when I revisit this specific hyperlink I'm still getting the gwern url & not arxiv

link

littlestymaar 581 days ago

> Why would 'doc.gwern.net' be more obviously just a random document than 'gwern.net/doc/www/'?

HN only shows the domain next to the title. So now when browsing the front page we only see gwern.net as the source of the doc and initially assume it's some work from you.

link

jolmg 581 days ago

I don't think HN shows third-level domains, so the point is moot. There may be exceptions for web services that lend out subdomains like Github[1], but doc.gwern.net would probably still show as gwern.net[2]. If you're willing to see the URL in the browser statusbar or addressbar, then the URL path makes very clear that the actual source is arxiv.org.

[1] Example: gliimly.github.io -> gliimly.github.io https://news.ycombinator.com/item?id=42148808

[2] Example: www.researchgate.net -> researchgate.net https://news.ycombinator.com/item?id=42181345

link

littlestymaar 581 days ago

You're right, I didn't realize that the third-level domains that show up may be due to some kind of whitelisting.

The [2] was not a convincing example because www sounds like something that'd get special treatment, but then I found this one:

tech.marksblogg.com -> marksblogg.com (https://news.ycombinator.com/item?id=42182519)

which proves you right. TIL.
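
The three examples in this subthread are consistent with a rule like "show the last two labels of the hostname, unless the host is on a list of services that lend out subdomains". A minimal sketch of that rule follows. This is a guess at the behaviour, not HN's actual code; `SHARED_HOSTS` and `display_domain` are hypothetical names, and the two-label rule is naive (it would mishandle suffixes such as co.uk).

```python
# Hypothetical allowlist of hosts that hand out subdomains to users.
# Only github.io is taken from the examples in this thread.
SHARED_HOSTS = {"github.io"}


def display_domain(hostname: str) -> str:
    """Return the domain a link aggregator might show next to a title.

    Assumption: keep the last two labels, or the last three when the
    last two match an allowlisted shared host.
    """
    labels = hostname.lower().split(".")
    base = ".".join(labels[-2:])
    if base in SHARED_HOSTS and len(labels) >= 3:
        return ".".join(labels[-3:])
    return base


for host in ("gliimly.github.io", "www.researchgate.net",
             "tech.marksblogg.com", "doc.gwern.net"):
    print(host, "->", display_domain(host))
```

Under this assumed rule, doc.gwern.net would still be displayed as gwern.net, which matches jolmg's point.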

link

antasvara 581 days ago

That brings up the second question though, which is why someone would assume that doc.gwern.net links to a document not by Gwern.

link

telotortium 581 days ago

That's why I'm trying to think of a better subdomain.

- archive.gwern.net?

- static.gwern.net?

- thirdparty.gwern.net?

- localarchive.gwern.net?

link

svantana 582 days ago

I think the basic premise of this paper is wrong. Very few natural signals are bandlimited - if images were, there would be no need to store them in high resolution, you could just upsample. Natural spectra tend to be pink (decaying ~3dB/octave), which can be explained by the fractal nature of our world (zoom in on details and you find more detail).
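
For reference, the "~3dB/octave" figure follows from assuming a power spectrum proportional to 1/f: doubling the frequency halves the power, and 10·log10(2) is about 3.01 dB. A small sketch of that arithmetic, where `octave_drop_db` and the exponent `alpha` are illustrative names rather than anything from the thesis:

```python
import math


def octave_drop_db(alpha: float = 1.0) -> float:
    """Power drop in dB per doubling of frequency, assuming P(f) ~ 1 / f**alpha."""
    return 10.0 * math.log10(2.0 ** alpha)


pink_drop = octave_drop_db(1.0)   # 1/f spectrum, about 3.01 dB per octave
steep_drop = octave_drop_db(2.0)  # 1/f^2 spectrum, about 6.02 dB per octave
print(f"{pink_drop:.2f} dB/octave, {steep_drop:.2f} dB/octave")
```

A spectrum like this never reaches exactly zero at any finite frequency, which is the sense in which such signals are not strictly bandlimited.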

link

wbl 582 days ago

JPEG allocates very few bits to the higher frequency elements of the blocks, especially in chroma. https://vicente-gonzalez-ruiz.github.io/JPEG/#lossy-jpeg

link

vlovich123 582 days ago

Of course that says that our eyes (& more generally our sensory organs) are bandlimited which is what lossy signal compression algorithms exploit (similar to how MP3 throws away acoustic signals we can't hear or how even "lossless" is still only recorded at 44 kHz). And indeed any sensor has this problem and it's a physical limitation (e.g. there's only so much resolving power an optical sensor of a certain size can have for an object of a certain distance away which is why we can't see microscopic things and this is a limit from the physics of optics)

It says nothing about the underlying signal in nature. But of course we're building LLMs to interact with humans rather than to learn about signals in the true natural world that we might miss.

link

wbl 582 days ago

Any optical system will have a finite resolution.

link

astrange 582 days ago

That applies to individual samples. The eye gets around this by saccading (rapid movements) to get multiple samples. Also, you interact with your environment rather than passively sampling it, so if you want to look closer at something you can just do that.

Images aren't truly bandlimited because they contain sharp edges; if they were bandlimited you'd be happy to see an image upscaled with a Gaussian kernel, but instead it's obviously super blurry.

When we see an edge in a smaller image we "know" it's actually infinitely sharp. Another way to say this is that a single image of two people is fundamentally two "things", but we treat it as one unified "thing" mathematically. If all images came with segmentation data then we could do something smarter.

link

pvillano 581 days ago

"In optics, any optical instrument or system – a microscope, telescope, or camera – has a principal limit to its resolution due to the physics of diffraction." This might be what wbl is referring to.

link

pvillano 581 days ago

We've seen band limited CNNs https://nvlabs.github.io/stylegan3/

What would the implementation of a band limited LLM look like?

link

gwbas1c 582 days ago

> In particular, this minimal frequency is twice the bandwidth of the function.

Careful, this is misleading.

If the peaks of the frequency align with your samples, you'll get the full bandwidth.

If the 0-crossings align with your samples, you'll miss the frequency.

This is why people swear by things like HD audio and SACD/DSD, even though "you can't hear over 20 kHz".

link

luma 582 days ago

You've misunderstood something about Nyquist. A sample rate of, say, 44KHz, will capture ALL information below 22KHz and recreate it perfectly.

There are of course implementation details to consider, for example you probably want to have a steep filter so you don't wind up with aliasing artifacts from content above 22KHz. However it's important to understand: Nyquist isn't an approximation. If your signal is below one half the sample rate, it will be recreated with no signal lost.

link

GlenTheMachine 582 days ago

Nyquist is a mathematical statement. As such, it has two commonly overlooked requirements:

- the signal being sampled has to be stationary

- you have an infinite number of samples

In that case, a sampling frequency of 2N+epsilon will perfectly reproduce the signal. Otherwise there can be issues.

link

alanbernstein 582 days ago

I don't recall seeing Nyquist described with those requirements before. I think it is evident that in the real world, there are many practical signals which do not exactly meet those requirements, but which still yield nearly-exact reproduction.

I wonder, what are some examples of signals that fail to reproduce after sampling in a way that is "nearly Nyquist"?

link

GlenTheMachine 581 days ago

If you look at the Wikipedia entry on the Nyquist Sampling Theorem, you should note that the summations to reconstruct the original signal go from negative infinity to positive infinity. In other words, that sum requires an infinite number of samples.

There are many signals of practical interest that can be approximately reconstructed with a finite truncation of the series. Note, however, that any signal that has only a finite length, eg has a uniformly zero amplitude after some time t_final, does not have a finite bandwidth, and cannot be exactly reconstructed by any sampling scheme. This is the case whenever you stop sampling a signal, eg it is always the case whenever you step outside the mathematical abstraction and start running real code on a real computer. So any signal reconstructed from samples is always approximate, except for some relatively trivial special cases.
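
A rough numerical illustration of the truncation point, as a sketch under stated assumptions: a 3 Hz sinusoid sampled at 10 Hz, reconstructed halfway between two samples with the Whittaker-Shannon sum cut off at a finite number of samples on each side. The function names, the test signal, and the evaluation point are all chosen for illustration.

```python
import math


def sinc(x: float) -> float:
    """Normalized sinc: sin(pi x) / (pi x), with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)


def signal(t: float) -> float:
    """Illustrative test signal: a 3 Hz tone with an arbitrary phase."""
    return math.sin(2 * math.pi * 3.0 * t + 0.4)


def truncation_error(half_width: int, rate_hz: float = 10.0, t: float = 0.05) -> float:
    """Error at time t when the interpolation sum uses only samples
    k = -half_width .. +half_width instead of all integers k."""
    estimate = sum(
        signal(k / rate_hz) * sinc(rate_hz * t - k)
        for k in range(-half_width, half_width + 1)
    )
    return abs(estimate - signal(t))


for n in (5, 50, 500, 1000):
    print(n, truncation_error(n))
```

The sinc tails decay only like 1/k, so the bound on the error shrinks roughly in proportion to 1/N. The reconstruction gets closer as more samples are used, but a finite sum generally does not reproduce the signal exactly.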

link

drdeca 581 days ago

Hm, yes, a function cannot have bounded support in both the time domain and the frequency domain…

What if you take a function that has bounded support in the time domain, and then turn it into a periodic function? Might the resulting function have bounded support in the frequency domain even though the original function did not? I suppose doing this would force the Fourier transform to have discrete support? But under what conditions would it have bounded support?…

I guess technically a low-pass filter applied to a signal with finite support in the time domain, would result in a function which has infinite support in the time domain.

I suppose sinc(f t + c) doesn’t have bounded support, and it is unsurprising that a non-trivial linear combination of finitely many terms of this form would also not have finite support.

Still, such a linear combination could decay rather quickly, I imagine. (Idk if asymptotically faster than (1/t) , but (1/(f t)) is still pretty fast I think, for large f.)

Soon enough the decay should be enough that the amplitude should be smaller than the smallest that the speaker hardware is capable of producing, I suppose.

link

ImageXav 582 days ago

I think it is you who have misunderstood the Nyquist-Shannon theorem. Aliasing and noise are real concerns. Tim Wescott explains it very well [0] (Figures 3, 10 and 11). If your signal is below one half the sample rate but the noise isn't, you'll lose information about the signal. If your signal phase is shifted wrt. the sampling, you'll lose information. If your sampling period isn't representative, you'll lose information. These are not implementation details.

[0] https://www.wescottdesign.com/articles/Sampling/sampling.pdf

link

StrangeDoctor 582 days ago

I was just about to post something similar. If I had to guess,

>If the 0-crossings align with your samples, you'll miss the frequency.

This is where the issue is. This isn’t possible with more than double the sampling rate.

link

kevin_thibedeau 582 days ago

It can only happen with a source exactly at N/2 and correlated with your sampling clock. That doesn't happen in the real world for audio.
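
The edge case being discussed can be checked numerically. The sketch below samples a sine at exactly half the sample rate with two different phases, and then a 20 kHz tone at the same rate; the helper name and the specific rates are illustrative choices.

```python
import math


def sample_sine(freq_hz: float, rate_hz: float, n: int, phase: float = 0.0) -> list[float]:
    """Return n samples of sin(2*pi*freq*t + phase) taken at t = k / rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * k / rate_hz + phase) for k in range(n)]


rate = 44_100.0

# Tone at exactly rate/2, samples landing on the zero crossings: sin(pi*k) = 0.
at_zero_crossings = sample_sine(rate / 2, rate, 8)

# Same tone shifted by a quarter period: samples land on the peaks, +1/-1.
at_peaks = sample_sine(rate / 2, rate, 8, phase=math.pi / 2)

# Tone strictly below rate/2: samples drift through the waveform, so no
# fixed phase alignment can hide it.
below_nyquist = sample_sine(20_000.0, rate, 64)

print(max(abs(x) for x in at_zero_crossings))  # tiny, floating-point noise only
print(min(abs(x) for x in at_peaks))           # close to 1
print(max(abs(x) for x in below_nyquist))      # close to 1
```

The vanishing case needs the tone to sit exactly at half the sample rate and stay phase-locked to the sampling clock, which is why the theorem is stated for content strictly below that frequency.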

link

mlyle 582 days ago

Anything close to N/2 is going to have varying magnitude that requires filtering and likely oversampling to remove.

How close to the Nyquist bandwidth you can get depends upon the quality of your filtering.

44.1KHz is a reasonable compromise for a 20KHz passband. 48KHz is arguably better now that bits are cheap-- get a sliver more than 20KHz and be less demanding on your filter. Garbage has to be way up above 28KHz before it starts to fold over into the audible region, too.
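
The fold-over arithmetic can be written down directly. This is a sketch for a real-valued tone with ideal sampling and no anti-alias filter; `alias_hz` is an illustrative helper name.

```python
def alias_hz(freq_hz: float, rate_hz: float) -> float:
    """Frequency at which a real tone appears after sampling at rate_hz.

    The spectrum repeats every rate_hz and is mirrored around rate_hz / 2,
    so the result always lies in the range 0 .. rate_hz / 2.
    """
    folded = freq_hz % rate_hz
    return min(folded, rate_hz - folded)


# At 48 kHz, content has to reach 28 kHz before it folds down to 20 kHz.
print(alias_hz(28_000.0, 48_000.0))  # 20000.0
# At 44.1 kHz, the same thing already happens at 24.1 kHz.
print(alias_hz(24_100.0, 44_100.0))  # 20000.0
# 27 kHz at 48 kHz lands at 21 kHz, still above the 20 kHz passband.
print(alias_hz(27_000.0, 48_000.0))  # 21000.0
```

Seen this way, 48 kHz leaves an 8 kHz transition band (20 to 28 kHz) for the filter to work in, against 4.1 kHz (20 to 24.1 kHz) at 44.1 kHz.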

link

Sesse__ 582 days ago

> Garbage has to be way up above 28KHz before it starts to fold over into the audible region, too.

You brick-wall everything at 20 kHz (with an analogue filter) before you sample it; that's part of the CD standard, and generally also what all other digital CD-quality audio assumes. This ensures there simply is no 28 kHz garbage to fold. The stuff between 20 and 28 in your reconstructed signal then is a known-silent guard band, where your filter is free to do whatever it wants, which in turn means that you can design it only for maximum flatness (and ideally, zero phase) below 20 kHz and maximum dampening above 28 kHz (where you will be seeing the start of your signal's mirror image after digital-to-analog conversion), not worrying about the 20–28 kHz region.

link

marcosdumay 582 days ago

Yep, that's why people do things like 44kHz sampling instead of 40kHz.

link

Sesse__ 582 days ago

No, 44 kHz is because you want to reconstruct the (20 kHz) bandlimited signal and it's (much) easier to realize such a filter if you have a bit of a transition band.

link

gwbas1c 580 days ago

> You've misunderstood something about Nyquist. A sample rate of, say, 44KHz, will capture ALL information below 22KHz and recreate it perfectly.

Let's do a thought experiment. Imagine a digital image where the pixels are the exact minimum size that you can see.

If a line is exactly 1-pixel-wide, it'll display perfectly when it aligns perfectly with the pixels.

But, if the 1-pixel-wide image doesn't align with the pixels, what happens?

You can see this in practice when you have a large-screen TV and watch lower-resolution video. Smooth gradients look fine, but narrow lines have artifacts. E.g., I recently saw a 1024p movie in the theater and saw pixels occasionally.

The same thing happens in sound, but because a lot of us have trouble hearing high frequencies, we don't miss it as much.

link

01HNNWZ0MV43FF 582 days ago

How bad is it around the frequencies I can hear as a 30-something?

link

pvillano 582 days ago

Wasn't there a paper on band limiting generative CNNs, that fixed texture pinning? Basically by blurring the results of the kernel with neighbors, you get rid of all this aliasing?

link

soraki_soladead 582 days ago

Alias-Free GANs? https://nvlabs.github.io/stylegan3/

link

pvillano 581 days ago

Thanks. Is this not effectively an implementation of the Nyquist Learners idea?

link

woopwoop 582 days ago

I don't understand what their definition of a band limited function on a manifold is supposed to be.

link

esafak 581 days ago

Could it be something like the spectrum of the Laplace-Beltrami operator?

link

puttycat 582 days ago

Off topic: this thesis has one of the most concise and straightforward acknowledgments sections I have seen.

link