Hacker News
JavaScript for Data Science (js4ds.org)
146 points by mrmagoo17 1881 days ago
15 comments

I know that data science is a broad and somewhat vague term but this -

   We will cover:

    Core features of modern JavaScript

    Programming with callbacks and promises

    Creating objects and classes

    Writing HTML and CSS

    Creating interactive pages with React

    Building data services

    Testing

    Data visualization

    Combining everything to create a three-tier web application
- this isn't data science.
I get your point, but as someone doing data science and having no idea about JavaScript, this is actually precisely what I need.

Like, all the stuff "for my data science", such as making a visualization website etc.

out of context it's not data science
It's more like Presenting and Serving Models using Javascript for Data Science.
Nobody ever writes books assuming you know how to use the language; I suppose it decreases the customer base.
While understandable, I hate this. "Here's 100 pages of python before we get to the good stuff", which ends up not even being good.

Publishers should just offer a free e-book of said language, and make it a requirement.

It decreases the amount of boilerplate "how to program in X" text you have to write. Producing text, especially novel text, is expensive in a non-fiction book.
"JavaScript relies heavily on callback functions: Instead of a function giving us a result immediately, we give it another function that tells it what to do next. Many other languages use them as well, but JavaScript is often the first place that programmers with data science backgrounds encounter them."

That sentence from the book clarifies a lot for me. It is Javascript for Data Science People. Taken in that context, this is an excellent book written with empathy for the data science user who is usually making uneasy excursions into Javascript, which they hope and pray are only temporary, and running back to Python the first time they encounter a Promise or a callback.
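
For readers coming from Python, the callback style that quote describes, and the promise style usually layered on top of it, can be sketched roughly as follows. `fetchRows` and `fetchRowsAsPromise` are made-up names for illustration, and the timer merely stands in for a slow data source.

```javascript
// Hypothetical slow data source: it hands the result to a callback later
// instead of returning it immediately.
function fetchRows(callback) {
  setTimeout(() => callback(null, [1, 2, 3]), 10);
}

// Callback style: we pass in the function that says what to do next.
fetchRows((err, rows) => {
  if (err) throw err;
  console.log('callback total:', rows.reduce((a, b) => a + b, 0));
});

// Promise style: wrap the same call so the result can be chained or awaited.
function fetchRowsAsPromise() {
  return new Promise((resolve, reject) => {
    fetchRows((err, rows) => (err ? reject(err) : resolve(rows)));
  });
}

fetchRowsAsPromise()
  .then((rows) => rows.map((x) => x * 2))
  .then((doubled) => console.log('promise doubled:', doubled));
```

Both versions do the same work; the promise form mostly helps once several asynchronous steps have to be sequenced.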

The book does cover a lot of basic Javascript material, as its target is actual natural scientists who may not have much experience with the language, but towards the end it does cover things like Data-Forge (which is a data science library in Javascript)
The title is „JavaScript for Data Science“, not „Data Science for JavaScript“. It’s like... in a bar: they will serve a beer for you, so they have the beer and you have you. For a book called „JS for DS“, you should have the DS while they bring the JS.

Compare this with: „Data wrangling with JavaScript“ [1]

[1] https://www.amazon.de/Data-Wrangling-JavaScript-Ashley-Davis...

Well the problem with “data science” is that it costs a shit ton of money but rarely integrates into anything. A book about wiring data science models into real user facing application maybe isn’t data science, but sure is useful...
I'm glad more people are doing DS/ML/AI with JavaScript, thanks for this book and keep up the great work! -- We are also working in this space, would love to connect, you can find me in javier at hal9.ai
I assume it’s aimed at data scientists who want to learn Javascript? No point teaching DS concepts here.

A better name would be “JS for data scientists”

Presumably that's why the "JavaScript for" prefix precedes it?
I don’t want to repeat the old and tired JavaScript hate, but this just isn’t a great idea.

I’d suggest that there are 3 important primitives for data science: flexible numeric types, fast math/algorithm libraries, and data manipulation being easy.

JavaScript doesn’t really have any of these. Numbers are 64bit floats only - no integers, no big numbers. There aren’t equivalents to Numpy/Pandas/Scikit Learn, and the lack of standard library and expressiveness in data manipulation in the language makes basic tasks harder.

JavaScript has its uses, but there’s really no reason to force data science be one of them.

JavaScript has plenty of libraries covering the basics. Here are a few:

https://github.com/nicolaspanel/numjs

https://www.npmjs.com/package/fast-math

https://smartbear.com/de/blog/2013/four-serious-math-librari...

That's not the problem. The problem is mindshare and network effects. When analyzing why Python is used one way and JS another we're tempted to retroactively rationalize this with something fundamental about the language. There's nothing fundamental about it. It's just happenstance. Python was around longer as a general purpose script, and it filled that niche. JS is relatively new as a script outside the browser.

The first repo has one core contributor who hasn't been active since June 2018.

https://github.com/nicolaspanel/numjs/graphs/contributors

I sincerely believe it is possible for JavaScript to be a viable language ecosystem, but there is dire need for cohesion, collaboration, and longevity. As it stands, there are so many potentially viable projects strewn across the NPM landscape like old, discarded toys.

I'm not aware of an initiative, let alone ethos, in the JS community that comes anywhere close to something like NumFocus.

https://numfocus.org/

It is worth mentioning the Danfo project from a sibling comment: https://danfo.jsdata.org/
You can get a long way nowadays with Arquero[0] and Observable[1]. Arquero allows columnar based data storage and processing, with a grammar of data processing verbs similar to e.g. dplyr. Not as fast as vectorized computations in e.g. Python or R, but faster than has previously been possible.

I'm not suggesting these are the first tools you'd reach for for data science in production, but I've found them extremely useful for prototyping, experimenting with algorithms, and visualization. I think it's got to the stage they should be seriously considered for some types of relatively simple data processing work due to their ease of deployment.

[0]https://github.com/uwdata/arquero [1]https://observablehq.com/

Other than the fact we have BigInts now, we also have

* tensorflowjs, which runs on GPUs https://www.tensorflow.org/js and

* danfo, which aims to be a pandas equivalent for JS: https://danfo.jsdata.org/

Given the powerful interactive visualisation capabilities available in JS, it's only a matter of time until JS becomes a serious contender IMO.

> Other than the fact we have BigInts now

performance-wise, BigInts are terrible. Tried to use them, made things about a hundred times slower.

That's typical with most JS features, it takes some time for engine performance optimizations to catch up with them. In this particular case I suppose things are moving slower than expected, but with demand increasing prioritization will take place.
I don't see JS as less powerful than Python for data science, it's faster than Python, or can use bindings just like Python. JS is maybe less commonly used than Python in data science nowadays, but I wouldn't be surprised if this changes in next years. There are equivalent libs like tensorflow-core, there are native features like BigInt, and there are libs for 64bits floats (decimal.js, big.js). I'd be glad to spend some time converting Scikit-learn into JS and also show you how expressive JS actually is, if you show me some Python code, I'll translate it
The fact that the number support isn’t part of the language is kinda the problem though.

When you’re writing data science code, the value is in the answer more than the process of getting to that answer. Anything that complicates that gets in the way. This is why things like Pandas are so popular despite having some questionable engineering. Using a library for big number support, having to get that to play nicely with other libraries, it all goes against the aims.

Now for data engineering it’s very different. I wouldn’t choose JS myself, but it’s a much more reasonable choice. For engineering the process by which you get the answer matters far more - is it scalable, testable, repeatable, etc. Having to use a library for big number support is fine.

It’s two very different ways of working and I’m still fairly convinced that JS is not conducive to the former.

>it's faster than Python

Is that generally true for data science type tasks, though, where the "fast" in python is really numpy, pandas, etc?

>or can use bindings just like Python

But there's not really anything like numpy/pandas for it to bind to at the moment, is there? Meaning anything as broad in functionality, fast, mature, etc.

> libs for 64bits floats (decimal.js, big.js)

both of those libraries are for arbitrary precision decimals, not floats.

If it's arbitrary precision, what's the difference, besides slightly more bookkeeping on your end?
We're using NodeJS for data science with Tensorflow JS and it's excellent, at least for our use case, which is 90% discrete data, mostly classification tasks.

NodeJS evented architecture is great for multitasking training (and prediction) jobs. I use Node Streams to extract and process data flows out of several data sources on my Macbook Air M1 using the new Neural Engine 16-core chip to train CNN models with excellent results.

Data prediction then runs on a ReactJS app, which gives my users a way to model, transform and visualize data on a browser. Everything is in Typescript, which reduces cognitive overload for our programmers and produces good end-to-end duck-type coherence and easy, integrated testing.

Now, most data science libs for Javascript are either in their infancy, are proof-of-concept or just abandoned, but TFJS is solid and if you know what, why and to which extent you're using JS for data science, then it's absolutely fine.

You should absolutely read "JavaScript and the next decade of data programming" by Ben Schmidt [1] before outright saying that it wouldn't be a great idea.

JavaScript does have integers (e.g. `Uint8Array`) and it also has big numbers (e.g. `BigInt`). It's true that there's not yet an equivalent to Numpy/Pandas/Scikit, but POCs show that it will be possible to create such a thing and that we will be able to use the WebGPU API to access higher performance than is available using Python [2].

I'm not saying that it will definitely happen, but why not?

[1] http://benschmidt.org/post/2020-01-15/2020-01-15-webgpu/

[2] https://github.com/milhidaka/webgpu-blas

> Numbers are 64bit floats only - no integers, no big numbers.

That is not true. BigInt has been available for a bit already.

MDN: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...

Availability: https://caniuse.com/bigint

I don't want to argue for or against using JS for "data science" (I myself used R for that but I use JS a lot for other things), just a clarification on this one concrete claim.
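
A quick sketch of the difference, which should run in any recent browser console or Node.js. The variable names are only illustrative.

```javascript
// Numbers are IEEE 754 doubles: integers are exact only up to 2^53 - 1.
const maxSafe = Number.MAX_SAFE_INTEGER;
const numberCollides = maxSafe + 1 === maxSafe + 2; // precision is lost here

// BigInt literals (note the trailing n) stay exact at any size.
const big = BigInt(maxSafe);
const bigCollides = big + 1n === big + 2n;

// Mixing the two types throws a TypeError, so conversions must be explicit.
let mixingThrows = false;
try {
  big + 1;
} catch (e) {
  mixingThrows = e instanceof TypeError;
}

console.log({ numberCollides, bigCollides, mixingThrows });
```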

> That is not true. BigInt has been available for a bit already.

performance-wise, BigInts are terrible. Tried to use them, made things about a hundred times slower. What JS needs are 64 bit integer types, and some form of typing system that allows differentiating between various number types.

The JIT understands what number type you want and switches between 31 bit ints and doubles when assumptions are violated, without a big performance loss. Something similar is likely possible with bigints and 64bit ints.
Genuine question: I imagine most data science things involve arrays of numbers, not just single numbers. JS has Uint8Array, i.e. it does kinda have integers if you want them in an array anyway. Can that speed things up?
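
For context, a small sketch of what typed arrays do with integers. It only shows the fixed-width storage behaviour and says nothing about speed, which would need a real benchmark.

```javascript
// Each element of a Uint8Array is an unsigned 8-bit integer.
const bytes = new Uint8Array([250, 251, 252]);
bytes[0] += 10; // 260 does not fit, so the stored value wraps modulo 256

// Int32Array elements are signed 32-bit integers and wrap the same way.
const ints = new Int32Array(2);
ints[0] = 2 ** 31 - 1;
ints[0] += 1;

// Float64Array holds ordinary doubles in one contiguous buffer.
const total = Float64Array.from([0.5, 1.5, 2]).reduce((a, b) => a + b, 0);

console.log(bytes[0], ints[0], total);
```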
In my case, it did involve reading binary files from ArrayBuffers where some attributes were stored as int64. I've been using DataView to read BigInt64s and then used those frequently. Was terribly slow. I've then reverted to alternatives wherever possible, e.g. immediately converting the BigInt64s to Numbers/Doubles where 53 bit integers were sufficient, or splitting down to two 32 bit integers and do some more involved bit magic on them. Much faster, but overall I wasn't happy with the whole process and complexity. I'd rather have native 64bit integer values in JS.
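
The three approaches described above might look roughly like this. The 8-byte buffer and the `toSafeNumber` helper are invented for the example, and the two-halves variant is only valid for non-negative values below 2^53.

```javascript
// Hypothetical 8-byte record holding one little-endian int64.
const buffer = new ArrayBuffer(8);
const view = new DataView(buffer);
view.setBigInt64(0, 1234567890123n, true);

// Option 1: read as BigInt (exact, but arithmetic on it can be slow).
const asBigInt = view.getBigInt64(0, true);

// Option 2: convert to a Number once, when the value fits in 53 bits.
function toSafeNumber(value) {
  const n = Number(value);
  if (!Number.isSafeInteger(n)) {
    throw new RangeError('value does not fit in a 53-bit integer');
  }
  return n;
}
const asNumber = toSafeNumber(asBigInt);

// Option 3: skip BigInt entirely and combine two unsigned 32-bit halves.
const low = view.getUint32(0, true);
const high = view.getUint32(4, true);
const fromHalves = high * 2 ** 32 + low;

console.log(asBigInt, asNumber, fromHalves);
```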
> repeat the old and tired JavaScript hate, but this just isn’t a great idea.

There is absolutely nothing wrong with coders/analysts/scientists building solutions in any language. The "hate" that you mention -- and then proceed to echo -- is a narrow way of asserting the superiority of $mylanguage and the inferiority of $yourlanguage.

> flexible numeric types, fast math/algorithm libraries, and data manipulation

Your point b) is usually written in a performant, compiled language, and your point c) can be built from robust primitives in any language. However, I will add a point d) about speed and memory usage.

I do data analysis with the simplest set of performant tools: sqlite, bash-awk-sed-grep, Perl, Python, C++, SVG, and a browser to render. Any kind of glorified REPL beyond a terminal creates fragile complexity and dependency Hell.

My kit doesn't include Node.js or ECMAscript but I'm willing to open my mind enough to think it might, one day. The current tooling for data analysis (or "data science" if we want to be faddish) is a mess and I look forward to better tools in the future.

On point 3 - I had to implement a logistic regression model in js recently and implementing all of the required math methods (eg dot product, transpose, vectorized addition, etc.) was actually super easy with js’s functional array utilities.
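
To give a flavour of that, here is a minimal sketch of such helpers plus a plain batch gradient descent fit. It is not the code from that project; the function names, learning rate, step count, and toy data are all assumptions made for illustration.

```javascript
// Vector and matrix helpers built from plain array methods.
const dot = (a, b) => a.reduce((sum, x, i) => sum + x * b[i], 0);
const add = (a, b) => a.map((x, i) => x + b[i]);
const scale = (a, k) => a.map((x) => x * k);
const transpose = (m) => m[0].map((_, j) => m.map((row) => row[j]));
const sigmoid = (z) => 1 / (1 + Math.exp(-z));

// Batch gradient descent on the log-loss; X is an array of rows, y holds 0/1 labels.
function fitLogistic(X, y, { learningRate = 0.5, steps = 2000 } = {}) {
  let weights = new Array(X[0].length).fill(0);
  const columns = transpose(X);
  for (let step = 0; step < steps; step++) {
    const errors = X.map((row, i) => sigmoid(dot(row, weights)) - y[i]);
    const gradient = columns.map((col) => dot(col, errors) / X.length);
    weights = add(weights, scale(gradient, -learningRate));
  }
  return weights;
}

// Toy data: the first column is the intercept term; larger x values are labelled 1.
const X = [[1, 0], [1, 1], [1, 3], [1, 4]];
const y = [0, 0, 1, 1];
const weights = fitLogistic(X, y);
const predict = (row) => sigmoid(dot(row, weights));
console.log(predict([1, 0]), predict([1, 4]));
```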
js doesn't have a glm library?
js does have a glm library.
Which one are you using?
I’m not using one - as said, the project required I implement the model/all requisite math from scratch.
Well, nowadays you can use WASM with JS to access libraries at near native speed.
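
As a minimal sketch of what calling into WASM looks like from JS: the bytes below are a tiny hand-assembled module exporting a 32-bit integer `add`, chosen only to keep the example self-contained. A real numeric library would be compiled from C, C++, or Rust instead.

```javascript
// Bytes of a tiny WebAssembly module that exports add(a, b) on i32 values.
const wasmBytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00, // magic number and version
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00, // one function using that type
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00, 0x20, 0x00, 0x20, 0x01, 0x6a, 0x0b, // local.get 0, local.get 1, i32.add
]);

// Synchronous compilation is fine for a module this small.
const wasmModule = new WebAssembly.Module(wasmBytes);
const wasmInstance = new WebAssembly.Instance(wasmModule);
const wasmAdd = wasmInstance.exports.add;

console.log(wasmAdd(2, 3));
```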
The reason to force data science is the same as the reason to develop libraries in a language for tasks that it might otherwise seem poorly suited to: there is a large userbase who know the language and would like to explore using it for things other than what it is normally used for. You may of course suggest that they should just learn a new language, but the history of computing shows that ways of putting languages to new purposes they might not seem suited for appear whenever such a purpose arises.
There is decimal.js but yes it's not going to be fast.
the only thing that used to be a problem is the number type. Libraries are ecosystem problem, not inherent to the language.
To address some of the skepticism about when and where javascript would be appropriate in data science, would you want to fit a logistic regression model in javascript? Probably not, but to build a solver that takes model outputs and visualizes the changes in predicted probabilities based on different combinations of variables? This is definitely where javascript would make sense. Visualization, dashboards, reporting, and exploratory analysis are all ripe domains for developing rich responsive UIs. Basically, any layer where you have a data-to-human interface can be leveraged with javascript.

There is a lot of great work happening in this space already. In the R world for example, shiny makes heavy use of js to the point that you often can't tell where R code ends and javascript begins. Plotly's Dash provides bindings for R, Python, and Julia. Personally, as a data scientist, I have been excitedly learning React because it really rips the landscape wide open for all the use cases I mentioned above. It then makes sense to have libraries that give JS users a good data model and can do most of the same numerical computation that we'd be doing in other languages. Again, you probably don't want to do serious numerical work in js, but remember people said that about Python ten years ago too.

I love the framing of this book, because I want more data scientists to start thinking about the presentation of data and spark some bits of ingenuity to make datasets and model outputs accessible to non-data scientists. Data scientists should be the ones writing the tools that interface data with humans because of their domain knowledge. But this is a different skillset and usually the work of SW engineers. Of course engineers can also have great data intuition too, but I really do encourage data scientists to develop their front end skills, it's well worth it.

I don't see the point of this. You already have a ubiquitous, easy-to-learn, high-level language that's great for data science, it's called python. If you're a JavaScript developer who wants to get into data science but are too lazy to learn python, you probably weren't that interested in data science in the first place.

Python definitely has some problems, but if you were going to have a new lingua franca for data science, it would probably be something like Julia, certainly not JavaScript.

My hunch is that there has been 10X more investment in engineering for JavaScript: nodejs, webassembly, webgl, webgpu, react native, deno, typescript, electron, chrome, etc. That will be harder to rewrite in Python than to rewrite TensorFlow and a few math libraries in JavaScript.
Data science is not a standardized term, however I don't get what specifically makes this text relevant for the domain of data science... For some data science projects one could surely use javascript, however in many cases one misses important libraries, for purposes such as statistical analysis, data manipulation, machine learning, ...
I am a noob to Javascript, so if someone knows better, then please correct me about this, but arrow functions aren't meant to replace normal function syntax, right? From [1], it seems like the main point of arrow syntax is to allow you to inherit the "this" parameter if you are inside a method. Meanwhile, you need normal function syntax if you are creating a constructor, making a method function for a prototype, or making generator functions. (I didn't even know javascript had generator functions until just now :))

So it seems a bit weird to me that they advocate using arrow function syntax instead of the regular syntax. They seem to be advocating using the new class syntax instead, so I guess they don't need the constructor or method creation features of the normal syntax, but I still don't see why they would specifically advocate for arrow function syntax. Is it faster? They say it interferes with other features, but which features?

[1] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...
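
The differences listed above can be checked with a short sketch like this one. `Point`, `ArrowPoint`, `countTo`, and `counter` are names made up for the example.

```javascript
// Regular functions can be constructors; arrow functions cannot.
function Point(x) {
  this.x = x;
}
const ArrowPoint = (x) => {};

let arrowIsConstructor = true;
try {
  new ArrowPoint(1);
} catch (e) {
  arrowIsConstructor = false; // TypeError: not a constructor
}

// Generators need the function* form; there is no arrow equivalent.
function* countTo(n) {
  for (let i = 1; i <= n; i++) yield i;
}

// An arrow function inside a method reuses the method's `this`.
const counter = {
  step: 2,
  scaled(values) {
    return values.map((v) => v * this.step);
  },
};

console.log(new Point(1).x, arrowIsConstructor, [...countTo(3)], counter.scaled([1, 2]));
```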

I've seen a majority of sources abandon the function keyword entirely in favor of const arrow declarations (and shorthand method syntax).

FWIW I personally like the function keyword, since it's clear what it is to non-JS readers, but primarily because it hoists to the top of its file, so unimportant utility functions can sit unobtrusively at the end of the file, thereby letting readers encounter more important logic earlier in the file.
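
A small sketch of the hoisting difference, with invented function names: the declaration at the bottom is callable from the top of the file, while the `const` arrow function is not usable until its own line has run.

```javascript
// Works: function declarations are hoisted together with their bodies.
const report = summarize([3, 1, 2]);

// A const arrow function is in its temporal dead zone until its declaration runs.
let earlyAccessError = null;
try {
  formatTotal(6);
} catch (e) {
  earlyAccessError = e; // ReferenceError
}

console.log(report, earlyAccessError instanceof ReferenceError);

// Utility functions can sit here, below the logic that uses them.
function summarize(values) {
  return `n=${values.length}, max=${Math.max(...values)}`;
}

const formatTotal = (total) => `total=${total}`;
```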

Interesting to know that what the article recommends is indeed the industry standard. I'd forgotten about hoisting until you brought it up!
Not changing `this` is a huge benefit that shouldn't be ignored. Especially when you're programming in a more functional style, it makes sense to default to arrow functions because you never want to engage in `this` shenanigans anyway. So, yes, I'd say it's a pretty common idiom in the JS community to replace "normal" function declarations.
I agree that inheriting the `this` for arrow functions is beneficial. To me it seems like you would want to use the normal syntax for global functions for hoisting and to prevent unintentional re-definitions, the arrow functions where you would use lambda functions in other languages, and the class method syntax for methods.

side-note: Most of my JS experience is writing userscripts for myself, so I definitely do my share of 'this' shenanigans.

As a heads up since you mentioned "class method syntax", methods are one of the most important places to have lexical `this` binding in many scenarios.

Take the following example, which is a normal class method:

> alertSum() { alert(this.a + this.b); }

And here we have an arrow function used to create an instance method (just an arrow function assigned to a property on the instance):

> alertSum = () => { alert(this.a + this.b); }

Then let's say we want to pass the method directly as callback:

> this.button.addEventListener('click', this.alertSum)

The first example (class method syntax) won't have the necessary `this` context unless it has its context bound to the instance through `Function.prototype.bind`. There are other patterns to avoid this (e.g. wrapping all callbacks in arrow functions when passing them), but it's useful to consider that class methods can easily create confusion because that's _exactly where_ someone more used to a different language may assume the `this` context is bound lexically.
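
Here is a runnable sketch of the same three options without a DOM. `invokeDetached` is a hypothetical stand-in for `addEventListener`: it calls the function with no receiver, whereas a real listener would be called with `this` set to the element, so the failure there would be a wrong result rather than an exception.

```javascript
class Adder {
  constructor(a, b) {
    this.a = a;
    this.b = b;
    // Workaround for the prototype method: bind it once per instance.
    this.boundSum = this.methodSum.bind(this);
  }

  // Ordinary class method: `this` depends on how it is called.
  methodSum() {
    return this.a + this.b;
  }

  // Arrow function in a class field: `this` is always the instance.
  arrowSum = () => this.a + this.b;
}

// Stand-in for addEventListener: calls the function with no receiver.
const invokeDetached = (callback) => callback();

const adder = new Adder(1, 2);
let detachedMethodFailed = false;
try {
  invokeDetached(adder.methodSum);
} catch (e) {
  detachedMethodFailed = true; // `this` is undefined because class bodies are strict
}

console.log(detachedMethodFailed, invokeDetached(adder.arrowSum), invokeDetached(adder.boundSum));
```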

Excellent point! I can see that getting confusing quickly.

Edit: I was confused about how this could work, so I dug through [1] for a bit. It appears that for each object of that class created, an arrow function will be created on that object and its `this` will indeed be bound to the same scope that the constructor function is bound to. This is really clever and I applaud whoever thought it up!

It is interesting to note that this creates a new arrow function on each object as opposed to the normal definitions which create a single function which is stored in the prototype of the class. (it's easier to check this in a browser's dev console than it is to decode the spec)
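
For anyone who wants to run that check, something like this works in a dev console. `Record` is just an example class.

```javascript
class Record {
  // Stored once, on Record.prototype.
  total() {
    return this.a + this.b;
  }

  // Created again for every instance.
  totalArrow = () => this.a + this.b;
}

const first = new Record();
const second = new Record();

const sharedMethod = first.total === second.total;
const sharedArrow = first.totalArrow === second.totalArrow;
const arrowIsOwnProperty = Object.prototype.hasOwnProperty.call(first, 'totalArrow');
const methodIsOwnProperty = Object.prototype.hasOwnProperty.call(first, 'total');

console.log({ sharedMethod, sharedArrow, arrowIsOwnProperty, methodIsOwnProperty });
```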

This would suggest that one should use different approaches for different types of objects: It makes sense to use arrow functions for "resource" or "actor" objects, of which there are few but they may have callback functions. It makes sense to use normal method definitions for "plain old data", of which there may be many, (which would make the arrow functions too expensive) but they should not have callback functions.

[1] https://tc39.es/proposal-class-fields/unified.html

> This is really clever and I applaud whoever thought it up!

Not really. It's contortionist and wasteful and one of the many reasons why mainstream web apps are one big celebration of bloat on a boat.

The neophyte programmers who have turned into expert Modern JS programmers are always recommending arrow functions like this because they've never actually looked at the event listener interface. What happens is they try to make things more complicated than they need to be and bodge their event registration. So they apply a "fix" by doing what they do with everything else: layering on even more. "What we need," they say, "are arrow functions."

No.

Go the other way. Approach it more sensibly. You'll end up with a fix that is shorter than the answer that the cargo cult NPM/GitHub/Twitter programmers give. It's familiar to anyone who comes from a world with interfaces as a language-level construct and therefore knows to go look at the definition of the interface that you're trying to implement.

Make your line for registering an event listener look like this: `this.button.addEventListener("click", this)`, and change the name of your `alertSum` method to `handleEvent`. (Read it aloud. The object that we're dealing with (`this`) is something that we need to be able to respond to clicks, so we have it listen for them. Gee, what a concept.) In other words, the real fix is to make sure that the thing we're passing in to `addEventListener` is... actually an event listener.
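
A minimal sketch of that pattern, assuming an environment where `EventTarget` and `Event` are globals (browsers and recent Node.js versions). `SumButton` and `lastSum` are invented names, and the sum is stored instead of passed to `alert` so the example also runs outside a browser.

```javascript
class SumButton {
  constructor(target, a, b) {
    this.a = a;
    this.b = b;
    this.lastSum = null;
    // Pass the object itself; no bind and no arrow function needed.
    target.addEventListener('click', this);
  }

  // Part of the EventListener interface: invoked with `this` set to the listener object.
  handleEvent(event) {
    if (event.type === 'click') {
      this.lastSum = this.a + this.b;
    }
  }
}

// A bare EventTarget stands in for the button element here.
const button = new EventTarget();
const widget = new SumButton(button, 1, 2);
button.dispatchEvent(new Event('click'));
console.log(widget.lastSum);
```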

This goes over 90% of frontend developers' heads (and even showing them this leads to them crying foul in some way; I've seen them try to BS their way through the embarrassment before) because most of the codebases they learned from were written by other people who, like themselves, only barely knew what they were doing. Get enough people taking this monkey-see-monkey-do approach, and from there you get "idioms" and "best practices" (no matter whether they were even "good" in the first place, let alone best).

As a data scientist who does more frontend, I think this is a really valuable concept. How users/stakeholders engage with our work is the way to push it forward in the org, and a dash of frontend can do wonders for getting that message across. It’s wonderful that people are making resources about the frontend for data scientists.
Glad you also see it this way! Would love to chat with you and get some feedback on a platform we are building at hal9.ai, my email is javier at hal9.ai -- Looking forward to chat.
Just putting this out there: stdlib - a standard library for js, https://stdlib.io/.
I thought of writing a Javascript + tensorflow.js + NLP + web scraping + linked data + etc. book about a year ago. tensorflow.js is especially very cool: well documented with great examples. In fact, it was the great tensorflow.js examples and demos that convinced me to not write the book because I didn't feel like I could do much value add on that subject.
Data scientists are the new webmasters.
Could you elaborate?
hard pass.

even python is not used for data science, all heavy lifting is done in C/fortran, and python is just a glue

Really cool but no one needs this... as a data scientist learning javascript, teach me how to run data science models using javascript! That's where the real gold is... I'm even thinking of writing articles about this myself... JS is great for making things more tangible and interactive
well, I was expecting training a neural network with web-assembly through gpu support in its last chapter :)
They use data-forge.js, which has fewer stars than danfo.js.

I can't find any benchmarks of how they compare to data.table or pandas.

Without a dominant and high performance data frame library as a foundation, I wouldn't even try.

Why on earth would you want to use JavaScript for Data Science?
Because some people are monoglots :(
A few reasons, https://venturebeat.com/2021/04/23/4-reasons-to-learn-machin...

Personally, I'm excited to build apps that don't require cloud computing and if they do, have access to one of the largest software engineering libraries through NPM. Sure, I'm not doing just Data Science in JavaScript but rather building apps that use DS/ML/AI, but that's still a valid use case. The alternative would be to use Python for prototyping then rewrite for production apps.