Hacker News
by deliberateJack 1795 days ago
I am selling a database with ten billion phone numbers. It is a 1.25 GB file with each number compressed to a single bit. You can compare the Clubhouse database against mine to determine which numbers are not in their set.
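For anyone checking the arithmetic, here is a minimal sketch of the one-bit-per-number idea. It assumes the universe is every 10-digit number from 0 to 10**10 - 1 and that bit n stands for number n. The `Bitmap` class and `bitmap_size_bytes` helper are hypothetical names for illustration, not anything from the post.

```python
# Sketch only: one bit per possible 10-digit number.
# Assumption: the universe is 0 .. 10**10 - 1, and bit n represents number n.
UNIVERSE = 10**10


def bitmap_size_bytes(universe: int = UNIVERSE) -> int:
    """Bytes needed to store one bit per number in the universe."""
    return (universe + 7) // 8


class Bitmap:
    """A plain bit array backed by a bytearray (hypothetical helper)."""

    def __init__(self, universe: int):
        self.universe = universe
        self.bits = bytearray((universe + 7) // 8)

    def add(self, n: int) -> None:
        # Byte index is n // 8, bit position inside that byte is n % 8.
        self.bits[n >> 3] |= 1 << (n & 7)

    def __contains__(self, n: int) -> bool:
        return bool(self.bits[n >> 3] & (1 << (n & 7)))


if __name__ == "__main__":
    print(bitmap_size_bytes())  # 10**10 bits / 8 = 1,250,000,000 bytes

    # A tiny universe so the demo does not allocate 1.25 GB.
    small = Bitmap(1000)
    small.add(415)
    print(415 in small, 416 in small)
```

With every bit set, the file says nothing about any individual number, which is presumably the joke. Comparing another set against it just means testing each number's bit.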
5 comments

Knowing which numbers are capable of receiving SMS and which aren't has some value.

Especially in a world of number portability where you can't just say "oh, that's an old number, it must be POTS".

But I guess, here, if a number is from your contact list, it may still be POTS.

But at least you have higher assurance that it's an active user. If you wardial one day, you quickly find out how many numbers never lead to a human for various reasons. In theory, some of these are trap numbers and quickly flag the caller as suspicious, but I doubt it.

"Knowing which numbers are capable of receiving SMS and which aren't has some value."

This isn't difficult - I wrote a shell script named "lookup" that will give me background info for any phone number I feed it and tell me what kind of number it is, what carrier it is, who it belongs to, etc.:

  # lookup 415-333-2222

  {"caller_name": {"caller_name": "WIRELESS CALLER", "caller_type": null, "error_code": null}, "country_code": "US", "phone_number": "+14153332222", "national_format": "(415) 333-2222", "carrier": {"mobile_country_code": "311", "mobile_network_code": "489", "name": "Verizon Wireless", "type": "mobile", "error_code": null}, "add_ons": null, "url": "https://lookups.twilio.com/v1/PhoneNumbers/+14153332222?Type=carrier&Type=caller-name"}
... which is very useful since I often send (personal) SMS from the command line and sometimes I need to know if a number can receive it ...

I'm not going to paste the entire script here but the meat of it is:

  /usr/local/bin/curl -X GET "https://lookups.twilio.com/v1/PhoneNumbers/$number?Type=carrier&Type=caller-name" -u "$accountsid:$authtoken"
... and each lookup costs a penny or half a penny or something ... I forget ...
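A rough sketch of how the output above could be turned into a yes/no answer. The field names (`carrier`, `type`) are taken from the pasted response. Treating `"mobile"` as "can probably receive SMS" is an assumption made for illustration, not a rule the lookup service states, and the function names are hypothetical.

```python
import json


def carrier_type(lookup_response: str):
    """Return the carrier "type" field from a lookup response, or None."""
    data = json.loads(lookup_response)
    # "carrier" may be missing or null, so fall back to an empty dict.
    carrier = data.get("carrier") or {}
    return carrier.get("type")


def probably_accepts_sms(lookup_response: str) -> bool:
    """Heuristic (assumption): only numbers typed as "mobile" count."""
    return carrier_type(lookup_response) == "mobile"


if __name__ == "__main__":
    # Trimmed copy of the response shown earlier in this comment.
    sample = (
        '{"country_code": "US", "phone_number": "+14153332222", '
        '"carrier": {"name": "Verizon Wireless", "type": "mobile", '
        '"error_code": null}}'
    )
    print(carrier_type(sample))
    print(probably_accepts_sms(sample))
```

Piping the curl output into something like this would give a quick check before sending, with the caveat that some non-mobile numbers can accept SMS too.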
How would your script obtain this information though? Relying on Twilio?
In some countries mobile phone numbers have a prefix so you know by that.
Also, some POTS providers will accept an SMS and either read it to you or let you read it in a web portal (or possibly on the router).
The Local Routing Number provides this value in the USA, and multiple carriers (e.g. Twilio) offer daily deactivation reports from the cellular carriers so you can tell which numbers are unroutable.
Canada isn't as progressive. Only telecoms can see which telecom a number points to, and only for the purpose of call routing.
Great. It’s the weekend and I can theoretically now stop thinking about software, and yet here I am thinking of ways to efficiently compress lists of phone numbers.
There was a thread about that last month:

https://news.ycombinator.com/item?id=27549075 ("Sorted Integer Compression")
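For a taste of the problem, here is one standard trick for sorted integers, not necessarily the one that thread settles on: store the gap between consecutive numbers instead of the numbers themselves, and write each gap as a variable-length integer (7 payload bits per byte, high bit set when more bytes follow). This is a sketch that assumes the input is sorted and non-negative, and the function names are made up.

```python
def encode_sorted(numbers):
    """Delta + varint encode a sorted list of non-negative integers."""
    out = bytearray()
    prev = 0
    for n in numbers:
        gap = n - prev  # Small when the numbers are dense.
        prev = n
        # Emit 7 bits at a time, lowest bits first.
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)  # High bit: more bytes follow.
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_sorted(blob):
    """Inverse of encode_sorted."""
    numbers = []
    prev = 0
    gap = 0
    shift = 0
    for byte in blob:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # Continuation byte, keep accumulating.
        else:
            prev += gap  # Gap complete, rebuild the original number.
            numbers.append(prev)
            gap = 0
            shift = 0
    return numbers


if __name__ == "__main__":
    phone_numbers = [4153332222, 4153332223, 4153339999, 9999999999]
    blob = encode_sorted(phone_numbers)
    print(len(blob), "bytes for", len(phone_numbers), "numbers")
    print(decode_sorted(blob) == phone_numbers)
```

The denser the list, the smaller the gaps, so neighbouring numbers cost a single byte each. A general-purpose compressor on top of the gap stream would likely shrink it further.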

The rabbit hole deepens…
Just enumerate them all; if none are missing, it's fairly easy to compress. (And 1 bit per number is really inefficient.) ;-)

main = mapM_ print [1..99999999]

The Kolmogorov complexity of the set of all phone numbers is pretty low. All phone numbers with a few missing is also pretty low.

In fact, I now wonder if you can even compress the set of 3.8 billion phone numbers to less than 1 bit per phone number. It should be pretty doable since a significant chunk of the number space is not valid.
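A back-of-the-envelope check, under loud assumptions: the universe is the 10**10 possible 10-digit numbers from the original post, the set holds 3.8 billion of them, and membership is treated as structureless (every number independently in or out with the same probability). Under that model the binary entropy gives the floor for any lossless encoding. The function names are hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a yes/no event with probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def entropy_floor(universe: int, members: int):
    """Return (bits per possible number, bits per number in the set).

    Assumption: the set has no structure beyond its density.
    """
    p = members / universe
    per_slot = binary_entropy(p)
    per_member = per_slot * universe / members
    return per_slot, per_member


if __name__ == "__main__":
    per_slot, per_member = entropy_floor(10**10, 3_800_000_000)
    print(f"{per_slot:.3f} bits per possible number")
    print(f"{per_member:.3f} bits per number actually in the set")
```

With these inputs the floor comes out a little under 1 bit per possible number, but roughly 2.5 bits per number actually in the set. So getting below 1 bit per listed number would have to come from exploiting structure such as invalid ranges, which is exactly the point about the number space above.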

But not all numbers are valid, are they? 911, for example. And not all area codes exist.
What language is that?
Haskell
Presumably all non-American ones are not on your list?
How much?
I have even better - for every country, just covering all their operators' prefixes and then 99999-9999999 numbers in that range. Definitely the biggest dataset around, and bigger is always better, right?