Hacker News
by alister 2613 days ago
Why does Google deeply index those useless telephone directory sites? Try searching for the impossible U.S. phone number "307-139-2345" and you'll see a bunch of "who called me?" or "reverse phone number lookup" sites. Virtually all of those sites are complete garbage. They make no attempt to collect numbers from telephone directories or from the web. They won't identify a number as being the main phone number for Disneyland for example.
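For anyone wondering why that number is "impossible": under the North American Numbering Plan a number has the shape NXX-NXX-XXXX, where N is a digit from 2 to 9, so an exchange code of 139 can never be assigned. A minimal sketch of that check follows. The function name `is_plausible_nanp` is made up for illustration, and the check ignores finer rules such as reserved N11 codes.

```python
import re

# NANP shape: NXX-NXX-XXXX, where N is 2-9 and X is 0-9.
# Illustrative only: ignores reserved codes such as N11.
_NANP = re.compile(r"^([2-9]\d{2})-([2-9]\d{2})-(\d{4})$")

def is_plausible_nanp(number: str) -> bool:
    """Return True if the string has a structurally valid NANP shape."""
    return _NANP.match(number) is not None

print(is_plausible_nanp("307-139-2345"))  # False: exchange code starts with 1
print(is_plausible_nanp("307-239-2345"))  # True: same number with a valid exchange
```

A page that claims to have information about a number failing this check is, almost by definition, generated rather than collected.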

It's odd that so many of those sites exist, that Google indexes them so deeply, and that they show up in searches so prominently. It's obvious that they are spam, scams, or worthless, but those same sites have been appearing prominently for years.

I agree with the author. My experience has also been that Google heavily prioritizes very large and frequently-updated sites over small static information-rich personal sites. I think it's a big flaw that needs to be fixed or for someone else to do better.

17 comments

I have long believed that the proliferation of phone-number-lookup sites was antisocial media: spammers and other bad actors creating these sites to prevent people from having a central place to talk about bad phone experiences, associated with specific numbers. Which one is the good one? You can't tell.
Isn't that exactly what PageRank was designed to handle? None of these spammer sites should have any actual positive reputation.
I wouldn't be surprised if there's a network of non-phone sites linking to them that juice their PR. Like I surmised: it's a broad-based strategy.
Google has cycle detection in PageRank specifically for this reason? They know which regions of the graph are good ones.
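To make the PageRank back-and-forth concrete, here is a toy sketch of the textbook power-iteration algorithm. It is not Google's production ranking, and the page names and link graph are invented for illustration. It shows both sides of the argument: a page with no inbound links sits at the minimum possible rank, and a small farm of pages linking to it raises that rank.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outlinks."""
    pages = sorted(set(links) | {t for outs in links.values() for t in outs})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for t in outs:
                    new[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new[t] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical graph: three reputable pages link to each other;
# "lookup" links out but nobody links to it.
plain_links = {
    "news": ["blog", "wiki"],
    "blog": ["news", "wiki"],
    "wiki": ["news", "blog"],
    "lookup": ["news"],
}
# Same graph plus a made-up link farm pointing at "lookup".
farm_links = dict(plain_links, farm1=["lookup"], farm2=["lookup"], farm3=["lookup"])

plain = pagerank(plain_links)
boosted = pagerank(farm_links)
print(plain["lookup"], boosted["lookup"])
```

In this toy model the only defense against the farm is that the farm pages themselves have minimal rank, which is why the "network of non-phone sites" theory needs those sites to have some reputation of their own.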
Some of these are worse than useless: they hijack real businesses' support phone numbers by presenting a high-cost number that gets routed to a call center, which then connects you to the business. It's a huge scam.
I've been wondering the same. I think most people google numbers that called them, so it has to be a lucrative business. There are a couple of sites for user reports that are actually nice, but the vast majority seem to be scammy and irrelevant (not even the right number), or full of completely fake information. Once I googled a number that returned a lot of results for a first initial, last name and address. It was my own mother's number of the last few years that I kept forgetting to save. The results were obviously either fake or many, many years outdated.
Thoughts after thinking about this comment and thread for a day:

Has the time come for a wiki directory of non-commercial (possibly: advertising-free, cookie-free) sites with robust, actually valuable information, and other sites that are doorways to them (think: topical forums, even revived webrings, etc)? Could this feasibly get enough action to be useful?

Think about the proliferation of various "awesome" lists on GitHub.

Some of them are curated and awesome. Some, less so. Likely some of them are even spammy.

The need is realized, but execution is hard.

Yes. I was looking for a modern _human_ curated directory of web content just the other day and found nothing usable. I don't think that excluding commercial entries would be necessary, but perhaps there could be some way to filter commercial entries out. Ad-free, JS-free, and cookie-free would be ideal.
I recommend: https://href.cool/ as a sick (though highly particular) example.
Although it's early days for this project, do check out: https://github.com/learn-awesome/learn-awesome
Just curious – what content exactly would you be interested in? If any other poster wants to chip in with suggestions, feel free.
I think you should talk to kicks @ https://kickscondor.com/.
https://www.google.com/search?hl=en&q=307%2D139%2D2345

Your comment now comes up first, but the rest of the results all try to contact googlesyndication.com, so ads? Google will not exclude sites that literally give them money.

I work on AdSense at Google. We need to crawl all the pages that serve our ads so that we can show ads that are relevant. We actually store those crawl results in a different place than search to prevent exactly that problem. We could probably save a lot of money if we consolidated those indices, but we don't do that, to prevent biasing search results. As a result a large percentage of pages that show Google Ads are not in the search index.

If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

EDIT: I realize I needed to explain this statement a little bit more. If we show ads on google.com we get 100% of the revenue. If we show ads on reversephonelookup.it they get the majority of the revenue. There is a limited amount of advertiser demand. Instead of manipulating the organic search results, it would be more profitable for Google to just show more ads on the search page or inflate the ad price or something.

> If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

I'm pretty sure you didn't mean to open this can of worms, did it have something else printed on its label?

Phone number lookup sites are almost certainly a) easily detectable; and b) low traffic. If Google only gets a percentage cut of this, why index them for fractions of pennies per year?

It's because Google's constraint is number of engineer-attention-hours. This kind of thing probably just isn't a priority, when they have a billion other things they could work on.

Keep in mind that Google is not a product company, they are a data company. They work on the biggest most impactful items, and don't have a lot of time for one-offs.

Finally, it might not be clear at all that these sites should be removed from the index.

Google has vast resources and could easily get this stuff right, but chooses not to. Their protests about not having the capacity can get pretty comical. Note also that the claim about "working on the things that have the most impact" is tautological.
There may be cases where we drop low traffic sites from our index, which is separate from the search index. I don't know how much of that is public information, so I can't go into detail.
Well but my point is why can't Google recognize these phone-number-lookup sites as the chaff that they are? Nothing should rank lower than them in any search that would return their pages. Said another way: it should be harder for them to clog the top of the SERPs (of course "them" could imply any number of topics).
> As a result a large percentage of pages that show Google Ads are not in the search index.

Those results are organic, at least they were not marked as ads. That's even worse, it means the regular crawler is preferring generated phone number sites over blogs. That's the real money waster right there.

> We need to crawl all the pages that serve our ads so that we can show ads that are relevant.

Why? I thought Google's business model was tracking users to show relevant ads to them. You say it's more like DuckDuckGo, or the old magazine ads: film magazines get blockbuster trailers, gardening magazines get compost ads.

Either the ads are personalized, so the surrounding content doesn't matter, or the ads are "static", so why track everyone all the time, then?

> If we show ads on google.com we get 100% of the revenue. If we show ads on reversephonelookup.it they get the majority of the revenue. There is a limited amount of advertiser demand. Instead of manipulating the organic search results, it would be more profitable for Google to just show more ads on the search page or inflate the ad price or something.

As your sibling comments imply, this works like a protection racket. The profit is closing the market to webmasters that don't allow ads at all, or that use non-Google networks.

Because if you show artificially irrelevant results too often (by biasing search results to prioritise AdSense pages), then users would stop using Google Search at some point. Eventually you would lose the revenue coming from the SERP, which is 10 times higher than AdSense :)

Would you risk losing 10 to get 1?

Let's say the search index is size N and the AdSense index is size M. If we were to join them, we would save the storage space for pages that are in both indices.

Also, the search index would gain all the sites that are in the AdSense index so search results would potentially improve. However, it would give an unfair advantage to google publishers.
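The N and M argument above is just set arithmetic, and a tiny sketch makes the trade-off visible. The URLs and sizes here are made up; real indices hold billions of pages.

```python
# Hypothetical toy indices standing in for the two crawl stores.
search_index = {"a.example", "b.example", "c.example", "d.example"}
adsense_index = {"c.example", "d.example", "e.example"}

separate = len(search_index) + len(adsense_index)  # N + M, stored today
merged = len(search_index | adsense_index)         # size of one joined index
saved = len(search_index & adsense_index)          # pages currently stored twice
gained = adsense_index - search_index              # pages search would newly see

print(separate, merged, saved, sorted(gained))
```

The storage saved is the overlap, and the "unfair advantage" is the `gained` set: pages that would enter search only because they run Google ads.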

> If Google was so evil, why would we purposefully send traffic to sites where we only get a percentage cut of the revenue?

Can you explain this? Not sure if it's your phrasing or what, but I'm not getting what you mean.

Because it's better to have 10% of all visits than 100% of only some visits. Google is not the only way people get to those pages. Yes, maybe for phone registries, but since the policy cannot be split, we are talking about the whole internet.
I've been saying this is inevitable for a long time. If you don't have Google ads they won't show you. YouTube ranks no ad vids lower.

Google is converting the world into a content production factory FOR Google and they pay literally pennies for the work.

If even pennies. Consider how much content Google search now includes from Web pages where you don't even need to click into the page. Weather. Answers to questions. Some links I click keep google.com in the URL and Google processes the page and shows me what Google wants.

I don't even know anymore how much of what I see is what the creator wanted me to see or what Google wants me to see or not see.

Imagine they remove competitors' ads with that. Who knows what they do in the name of making the Web better.

You can't go public, answer to people who are accountable to no one and think only of MONEY, and still do no evil.

If Google wants to do no MORE evil: take yourself private and live up to your credo.

It’s the only result I get for this search now. The ads are low-quality if I disable my adblocker however.
The flip side of trying to "solve" that problem is that you then penalise everyone searching for part numbers, which often do look very similar to phone numbers. I suspect they are making an effort to, because I've been bitten by the CAPTCHA-hellban when searching for part numbers. (Likewise, the results there are also often clogged by a bunch of useless sites claiming they have the datasheet or are selling the part, when all they do is try to show ads.)
>for someone else to do better.

It's upsetting to me that doing better than Google in search seems to be very close to an impossible feat of magic at this point.

I know that will change some day, but I can't see how.

Even if somebody gave you hundreds of millions of dollars to spend on infrastructure and employees, it would still be an insane risk.

Writing that out, it almost sounds like internet search engines should be as big and as important an operation as the TLD registrars. Funded by big governments in collaboration with each other.

Niche search verticals can still compete. Sourcegraph comes to mind, where you can tweak your search parameters to give superior results for more narrow use cases. I don't google for recipes anymore because the blogspam is that bad, or for images because I don't want some Pinterest or Reddit redirect carousel. I'm sure there are hundreds of examples where Google's "lowest common denominator" search does not cut it.
I think there is an easy heuristic, bias against ads.

Google can’t do this. Original sources very likely don’t have ads, scrapes will have tons. But good results are good results. Big g has done an admirable job. They can’t exploit the best metric for quality.

No they don't do an admirable job if they send you to scraped rather than original content. They're ruining the web. While Google still has fantastic, one-of-a-kind services such as Translate, Search isn't one of them anymore, and we should stop cheering at it and relying on it.
Google Search is polluting the internet by giving huge monetary incentives for the creation of all these copycat sites. Google could eliminate them overnight but they don't for one reason only: Google gets revenue from these sites and not from the original content, which is quite often ad-free.[1]

[1] Content scraped from GitHub, tech discussion forums, personal blogs, Unix man pages, etc.

I've switched to deepl.com recently for my translations.

It has far fewer languages, but the translations of those it has are far better in my opinion. Although I am also no linguist, just an average non-English speaker.

Hmm. I keep getting warnings about responding too quickly. Perhaps my account is under attack.

Setting that aside, google has done good things. Today, not so much. It’s never too late to turn the ship around. Google can still be awesome. Our opinions aren’t that different.

I don’t think big g can turn it around, but I’m rooting for them.

I'm struggling to think of a single historical example of a corporate entity the size of Google that has "turned it around" rather than abusing the good faith of a customer base for the duration of their race to the bottom.

Google can't still be awesome, as they're no longer seeking to disrupt an existing market and burning through venture capital while doing everything and anything (including providing superior search results, and making ethical business practices part of their brand) to attract users.

Rooting for a profit motivated transnational entity in the manner one would for a sports team exposes the insidious nature of brand narratives and the exploitable irrationality of our own interactions with them.

Microsoft is a good example
> I don’t think big g can turn it around, but I’m rooting for them.

I believe they can, but I don't know if they care to - they're highly profitable after all. What I really dislike about search is that, by treating links as currency, they have broken links. Lots of people won't put plain old links on their site because they fear it would hurt their rankings.

Their overreliance on links as the sole quality indicator has, at least in my country, led to the large media companies just renting out folders or subdomains, and whatever low quality content is published there ranks at the top.

They still build good stuff, and I'm sure their engineers could "solve" search again, but it appears that their management doesn't want to.

> Hmm. I keep getting warnings about responding too quickly. Perhaps my account is under attack.

I've been getting those too, lately. I thought maybe my touchpad finger is developing a tremor ;)

deepl.com supports fewer languages than Translate, but the quality of translation is so much higher.
Actually it does a good job. When there is some data (like crowd-sourced number reporting) on the internet about the specific phone number you searched for, Google will show it in the top results.

You get a full page of nonsense results and ads/spam when the phone number you searched for is not known to any website (I guess).

That is my experience as well.
I always assumed they collect phone numbers this way. People google a phone number and click on a search result. The site can - by looking at the referrer - extract the phone number and infer that it belongs to someone (and use it for any purpose).

It's a similar mechanism to the one some forums use to highlight the terms that were part of the Google query which led to the site.

You can’t extract the search query, either from the referrer or with analytics software. Google changed that starting around 2011 for logged-in users.
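For reference, this is roughly what the referrer trick looked like, and why it stops working once the query string is no longer passed along. It's a minimal sketch using Python's standard `urllib.parse`; the function name `query_from_referrer` is made up, and the second referrer value is an assumption about what a stripped referrer looks like.

```python
from typing import Optional
from urllib.parse import urlsplit, parse_qs

def query_from_referrer(referrer: str) -> Optional[str]:
    """Return the `q` search term carried in a referrer URL, if any."""
    params = parse_qs(urlsplit(referrer).query)
    values = params.get("q")
    return values[0] if values else None

# Old-style referrer that still carries the full search URL.
print(query_from_referrer("https://www.google.com/search?hl=en&q=307%2D139%2D2345"))
# Referrer reduced to the origin: there is nothing left to extract.
print(query_from_referrer("https://www.google.com/"))
```

The first call recovers the searched number and the second returns `None`, which is the situation the reply above describes.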
Aside from what others have said ITT, I have a personal hate for the general fact that we cannot look up a phone number on the internet with accuracy and ease.

FFS, 411 was amazing before the web.

Also, in about 1989 my friend and I used to have a contest between us: to call 411 and see who could keep the 411 operator on the phone the longest.

This was a fun social engineering exercise for 14-year-old nerds who liked the idea of being phreaks.

Our record was 45 minutes, and we got to know a lot about the 411 system, where call centers were located and how the 411 system worked.

This was right near the time that we ran the long distance bill up to $926 for one month of calling into a BBS in San Jose and PCLink to chat....

Got grounded for a month for that one...

> we cannot lookup a phone number on the internet with accuracy and ease

I wondered the same thing and just to speculate, here's my list of reasons on why phone number search is so awful:

- As far as I know, not a single cell phone carrier publishes a telephone directory (whether opt-in or opt-out). So there's no (public) data to index.

- Some landline carriers still publish telephone directories, but of course landlines are dying out. And I remember reading that 30-50% of landline subscribers choose to be unpublished or unlisted anyway. So that source of phone number data is drying up.

- Because international phone calls have become so cheap and caller ID is now easily spoofable, spam and scam calls have become huge. So no one wants their phone number to be publicly accessible these days.

- In the early web years, there were legitimate phone directory websites that appeared to have collected their data from landline telephone directories and "city directories" (if anyone still remembers those things). But I guess they didn't find a good way to monetize the service, so the honest phone lookup sites died off.

Google used to have a phonebook search feature, but it was retired years ago. I don't know the reasons, but it might have had something to do with privacy or legal actions.
There is a very simple answer. It's because they get clicked. They might be the only sites for a given search term.
Also, those people finder sites that scrape public records.

When I search for a name, usually their blog is listed below 10 creepy lookup sites that list their name, physical address history, phone numbers, relatives, etc.

Google should push that garbage to the bottom of the stack.

The first two hits for that number are this thread now...
When there are 0 good results for a query, Google and most users don't care which bad results are served
Is it some way to Google-bomb a site, because the algorithm that determines importance from links around phone numbers is still as subvertible as it was in 1999?
Maybe Google likes to increase the total number of actual search results without increasing the amount of useful content? They also fake the total number of search results, not sure why they would want to do both though...

But either way, it looks like a Google employee has seen your comment and fixed this particular search query.