by scottrogowski 1814 days ago
ONSITE - Location TBD

I'm a solo founder and am interviewing next week in the final stage of a somewhat well-known tech incubator.

The open source framework / SaaS I'm developing is called Fastmap (https://fastmap.io). The one-liner is "Fastmap offloads and parallelizes arbitrary Python functions on the cloud", but I'm also considering the more colloquial "Fastmap is a server in your pocket". My short-term strategy is to market to data scientists. My long-term strategy is to offer a smarter way to do backend infrastructure.

As far as skill set goes, I am looking for someone with deep expertise in distributed systems and/or data engineering. I would also be open to a non-technical co-founder with business experience in SaaS.

Building a co-founder relationship at a late stage is inherently tricky. For what it's worth, I believe that relationships work best when they are equitable and that 95% of the work of Fastmap is in the future.

Interested? Email me! Let's see if we can find a way to test whether we click: scott@fastmap.io

1 comments

Interesting product - I've felt an acute need for something like this while writing machine learning preprocessing pipelines, so that I wouldn't have to spin up a Dask or PySpark cluster.

How are you dealing with data streaming latency here? For most of the things I've worked on, the compute needs grow O(n) or O(n^2) with the dataset size. Farming out the compute to a remote server might solve the CPU bottlenecks, but at the expense of network transfer overhead. Given the speed of most pipelines, I'm not sure that's a viable tradeoff.

Hey, a couple of thoughts:

I think you identified the tradeoff correctly. What I'm building is designed for when compute needs outweigh the data transfer overhead. This works for some applications and not others. In particular, I think my approach will work especially well for ML model training and web scrapers.
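To make the tradeoff concrete, here is a back-of-the-envelope sketch. It is not how Fastmap actually schedules work: the function name, the perfect linear scaling across workers, and the example numbers are all assumptions made for illustration.

```python
def offload_speedup(local_compute_s, payload_bytes, bandwidth_bytes_per_s,
                    workers, round_trip_s=0.0):
    """Estimate the speedup from offloading a job; above 1.0 means offloading wins.

    Assumes (hypothetically) that remote compute scales perfectly with the
    number of workers and that the payload is transferred once, up front.
    """
    transfer_s = payload_bytes / bandwidth_bytes_per_s + round_trip_s
    remote_compute_s = local_compute_s / workers
    return local_compute_s / (transfer_s + remote_compute_s)


# Compute-heavy job: 1 hour of local compute, 100 MB payload, 10 MB/s link.
heavy = offload_speedup(3600, 100e6, 10e6, workers=50)

# Transfer-heavy job: 20 seconds of local compute, 2 GB payload, same link.
light = offload_speedup(20, 2e9, 10e6, workers=50)

print(f"compute-heavy speedup: {heavy:.1f}x")
print(f"transfer-heavy speedup: {light:.2f}x")
```

With these made-up numbers, the first job comes out well ahead when offloaded, while the second is slower than just running it locally, because the transfer time alone exceeds the local compute time.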

That being said, the fewer bytes transferred the better. Apart from colocating servers, there's not much I can do for the data. No matter what, it has to go over an ethernet cable. It is possible to cache source code, though, and I'm working on doing that at multiple layers.

In your example, for preprocessing pipelines, Fastmap probably wouldn't make sense - or at least that's my instinct. In my experience, it's rare to see pipeline steps where the compute significantly outweighs the data transfer. I'd be curious to hear your problem though. I might have a blind spot?

Makes sense.

A lot of my work involves parsing crawled pages, i.e. HTML -> structured DOM tree -> some heuristic search or other algorithm to tag or extract elements. So in this case the compute per page is relatively high, but I still think the overall latency of having to sync to a remote endpoint is too heavy.

One hybrid approach would be some hosted data product that is colocated with your compute cluster. On nonsensitive datasets I'd be happy to upload my initial data upfront if it means I can get linear scalability when I need to run compute.

Ah. I think I understand a little better. Knowing nothing else about what you're doing, the way I would structure that would be to run multiple Fastmap tasks with crawler + processor code and then pull down just the processed data. So basically, have Fastmap as the most upstream node.
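A rough sketch of that structure, using the standard library's thread pool as a local stand-in for remote workers. This is not Fastmap's API: `FAKE_PAGES`, `TitleParser`, `crawl_and_process`, and `run` are hypothetical names, and the "crawl" is a dictionary lookup rather than a real network fetch.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical stand-in for the web; a real task would fetch each URL.
FAKE_PAGES = {
    "https://example.com/a": "<html><head><title>Page A</title></head><body>"
                             + "<p>filler</p>" * 1000 + "</body></html>",
    "https://example.com/b": "<html><head><title>Page B</title></head><body>"
                             + "<p>filler</p>" * 1000 + "</body></html>",
}


class TitleParser(HTMLParser):
    """Collects the text inside the <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data


def crawl_and_process(url):
    """Crawl and parse in the same task; return only the small extracted record."""
    html = FAKE_PAGES[url]  # stand-in for the crawl step
    parser = TitleParser()
    parser.feed(html)
    return {"url": url, "title": parser.title, "raw_chars": len(html)}


def run(urls):
    # The pool stands in for remote workers; map() keeps results in input order.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(crawl_and_process, urls))


results = run(list(FAKE_PAGES))
for record in results:
    print(record)
```

The point is that the raw HTML never leaves the worker. Only the extracted record comes back, and it is far smaller than the page it was derived from.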

Whether or not that's possible with the realities of your application is another question of course :).

If you're curious, I'd be happy to provide some compute time (say 100 vCPU-hours) in exchange for feedback. Email me if that's at all interesting.