by markovbling 3479 days ago
I went to school with Richard and so clearly remember him explaining this to me over coffee less than a year ago, when the website had an under-construction style landing page. It's been amazing to watch this grow so fast.

The platform is great and I'd strongly recommend anyone wanting to get machine learning experience or who has played with Kaggle to check out Numerai!

The homomorphic encryption piece is fascinating and I think it'll be an important piece in balancing the privacy vs. utility of personal data as machine learning seeps deeper into the fabric of our lives.

Do you have much context on how he got this off the ground in the first place? These sorts of businesses are always very interesting to me from a launch standpoint. What would an MVP look like? How did they get their first users? Etc.

> The trouble with homomorphic encryption is that it can significantly slow down data analysis tasks. “Homomorphic encryption requires a tremendous amount of computation time,” says Ameesh Divatia, the CEO of Baffle, a company that’s building encryption similar to what Craib describes.

> According to Raphael Bost, a visiting scientist at MIT’s Computer Science and Artificial Intelligence Laboratory who has explored the use of machine learning with encrypted data, Numerai is likely using a method similar to the one described by Microsoft, where the data is encrypted but not in a completely secure way.

Doesn't this imply that homomorphic encryption isn't being used, but something like it instead?

I am pretty sure homomorphic encryption is not being used. In fact, I suspect that no real encryption is being used.

Isn't it the case that if I just removed the labels, and renormalized all my data to fall in [0, 1], then what I end up with looks a lot like what Numer.ai gives you?
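
A minimal sketch of the "drop the labels and renormalize" idea in the question above. The data is synthetic and the three columns are hypothetical stand-ins for differently scaled features; nothing here is taken from Numerai's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw features on very different scales (synthetic data)
X_raw = np.column_stack([
    rng.normal(100.0, 15.0, 1000),
    rng.exponential(3.0, 1000),
    rng.integers(0, 60, 1000),
]).astype(float)

# Column names are simply discarded; each column is then
# min-max scaled independently into [0, 1]
col_min = X_raw.min(axis=0)
col_max = X_raw.max(axis=0)
X_scaled = (X_raw - col_min) / (col_max - col_min)

print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Each column keeps its shape and ordering; only the units and the names are gone.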

I'm not aware of any homomorphic encryption / structure-preserving schemes where homomorphic evaluation amounts to literally multiplying and adding the ciphertexts, and this seems to be what they want you to do to train your model (unless I'm misunderstanding how to interact with the "encrypted" dataset).
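
To illustrate the point about ciphertext operations, here is a toy version of the Paillier cryptosystem, used purely as a well-known additively homomorphic scheme (nothing in this thread suggests Numerai uses it). The primes are tiny and chosen only for illustration, so this is not secure. It assumes Python 3.8+ for `pow(x, -1, n)`.

```python
import math
import random

# Toy Paillier cryptosystem with tiny primes (illustration only, NOT secure)
p, q = 293, 433
n = p * q
n_sq = n * n
g = n + 1
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                               # modular inverse of lam mod n

def encrypt(m, rng):
    """Encrypt an integer 0 <= m < n with fresh randomness r."""
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    """Recover the plaintext from a ciphertext."""
    return ((pow(c, lam, n_sq) - 1) // n * mu) % n

rng = random.Random(0)
m1, m2 = 17, 25
c1, c2 = encrypt(m1, rng), encrypt(m2, rng)

# MULTIPLYING ciphertexts corresponds to ADDING plaintexts
sum_via_product = decrypt((c1 * c2) % n_sq)
print(sum_via_product)  # 42

# Literally adding the ciphertexts carries no such guarantee
sum_via_addition = decrypt((c1 + c2) % n_sq)
print(sum_via_addition)
```

The homomorphic operation on ciphertexts is not the same arithmetic as the plaintext operation, so an off-the-shelf model that adds and multiplies feature values directly would not be computing anything meaningful on data like this.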

EDIT: seems like most people think they are using Order Preserving Encryption, which allows one to compare ciphertexts with the "less than" predicate. This makes more sense looking at what they give, but I never saw anything where they say "only do comparisons on the encrypted data."
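
A toy sketch of what "order preserving" means, assuming a small integer domain. The lookup table built by the hypothetical helper `make_ope_table` is just a secret, strictly increasing map; it is not a real OPE construction and offers no security guarantees.

```python
import numpy as np

def make_ope_table(domain_size, seed):
    """Toy order-preserving map: a secret, strictly increasing lookup table."""
    rng = np.random.default_rng(seed)               # the seed plays the role of the key
    gaps = rng.integers(1, 1000, size=domain_size)  # strictly positive gaps
    return np.cumsum(gaps)

table = make_ope_table(domain_size=60, seed=1234)

plain = np.array([3, 41, 7, 41, 0, 59])
cipher = table[plain]

# "Less than" on ciphertexts agrees with "less than" on plaintexts
for i in range(len(plain)):
    for j in range(len(plain)):
        assert (plain[i] < plain[j]) == (cipher[i] < cipher[j])

# Sorting is therefore preserved as well
same_order = np.array_equal(
    np.argsort(plain, kind="stable"), np.argsort(cipher, kind="stable")
)
print(same_order)  # True
```

Sums and products of the ciphertexts mean nothing here; only comparisons survive, which is why such data suits models that split on thresholds better than models that do arithmetic on feature values.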

    """
    https://arxiv.org/abs/1508.06574
    "An encryption scheme is said to be homomorphic 
    if certain mathematical operations can be applied 
    directly to the cipher text in such a way that 
    decrypting the result renders the same answer as 
    applying the function to the original unencrypted 
    data."
    The function = GradientBoostingRegressor
    the cipher text = X_encrypted
    original data = X
    same answer = mean absolute error
    """
    import numpy as np
    from sklearn.metrics import mean_absolute_error
    from sklearn.ensemble import GradientBoostingRegressor

    # Replicability
    np.random.seed(0)

    # Create a data set with 1000 samples and 3 features
    X = np.random.randint(0, 60, (1000,3))

    # Create ground truth (the product of the three 
    # features - 100) / 11
    y = (np.prod(X, axis=1) - 100) / 11.
    
    # Encrypt y
    y_encrypted = y + 20

    # Encrypt X
    X_encrypted = X * -0.5

    # Init our model
    rgr = GradientBoostingRegressor(random_state=42)

    # Fit model on the first 500 unencrypted samples
    rgr.fit(X[:500], y[:500])

    # Predict the remaining 500 samples
    preds = rgr.predict(X[500:])

    # Fit model on the first 500 encrypted samples,
    # using the encrypted targets
    rgr.fit(X_encrypted[:500], y_encrypted[:500])

    # Predict the remaining encrypted samples and decrypt
    preds_decrypted = rgr.predict(X_encrypted[500:]) - 20

    # Evaluate both models against the unencrypted targets
    print(mean_absolute_error(y[500:], preds))
    print(mean_absolute_error(y[500:], preds_decrypted))

    # Output as originally posted (exact values depend on
    # the scikit-learn version):
    #>>> 323.09
    #>>> 323.72

The encryption here is being done by "adding 20" / "multiplying by -0.5"?

Given this "encrypted" X, y dataset, I could easily find the unencrypted version... (even if I don't know 20 or -0.5, this still reveals so much of the structure that I don't believe it provides any real protection against anything except the laziest attackers)
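
A rough sketch of how little it takes to undo a per-feature linear transform. It assumes a known-plaintext setting in which the attacker learns the true value of just two rows (here the rows holding the smallest and largest values); the secret scale and shift are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Secret parameters of the toy "encryption" (unknown to the attacker)
secret_scale, secret_shift = -0.5, 20.0

x = rng.integers(0, 60, size=1000).astype(float)
x_obfuscated = secret_scale * x + secret_shift

# Assumption: the attacker knows the true value for two rows
known_idx = np.array([np.argmin(x), np.argmax(x)])

# Fit the inverse map x = slope * x_obfuscated + intercept
slope, intercept = np.polyfit(x_obfuscated[known_idx], x[known_idx], 1)

# Apply it to the whole column
x_recovered = slope * x_obfuscated + intercept
print(np.allclose(x_recovered, x))  # True
```

Two known points per feature are enough to pin down a line. Without any known values the attacker still sees the full shape of the distribution, just flipped and rescaled.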

It is a toy example to show that a form of homomorphic encryption is possible without going all the way to Fully Homomorphic Encryption.

And simple linear transforms on already anonymized features are not as easy to reverse engineer as you may think. Just try it on a few datasets from UCI.

Ah ok, sure. I wouldn't call something like a linear transform on anonymized features "encryption" (more like obfuscation?), but I guess it's good marketing in that it lets them associate with the "recent advances in [real] homomorphic encryption"

If you desire something more one-way, consider PCA, random projections, feature expansions (with something like Random Bits Regression), hashing, or the last hidden layer activations of your best in-house neural net. Then combine these approaches for good measure.
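
As one example from that list, a minimal random projection sketch in plain NumPy. The data is synthetic, and the dimensions (50 features projected down to 10) are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.normal(size=(200, 50))  # synthetic 50-dimensional features
k = 10                          # projected dimension (assumed)

# Secret Gaussian projection matrix, scaled so that squared
# norms are preserved in expectation
R = rng.normal(size=(50, k)) / np.sqrt(k)
X_proj = X @ R

# The map is many-to-one (rank k < 50), so X cannot be recovered
# exactly from X_proj, even by someone who knows R
print(X_proj.shape, np.linalg.matrix_rank(R))

# Pairwise distances survive only approximately
d_orig = np.linalg.norm(X[0] - X[1])
d_proj = np.linalg.norm(X_proj[0] - X_proj[1])
print(d_orig, d_proj)
```

Information is genuinely thrown away here, unlike with an invertible linear transform, at the cost of some predictive signal.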

Agreed on the clever marketing, but at least they put their money (expensive dataset) where their mouth is (release it to reverse engineers the world over).

Fully Homomorphic Encryption challenges would be interesting, but it would disqualify our current state-of-the-art algorithms, and reduce the playing field to a handful of people who know how to write algorithms that work with Fully Homomorphic Encryption (if any competitor at all is allowed to work on this, and not too busy working for the NSA).