by AndrewKemendo 2620 days ago
Depends on what you're trying to do.

Are you putting a trained inference model into production as a product? Is it an RL system (a completely different architecture from an inference system)? Are you trying to build a model from scratch with your application data? Are you doing NLP or CV?

As a rule of thumb, I look at the event diagram of the application/systems you're trying to integrate ML into. That should tell you how to structure your data workflows in line with the application's existing data flows. If it's strictly a constrained research effort, then pipelines are less important, so go with what's fast and easy to document/read.

Generally speaking, you want your ML/DS/DE systems to be loosely coupled to your application data structures, with well-defined RPC standards informed by the data team. I generally hate data pooling, but if we're talking about pooling 15 microservices vs pooling 15 monoliths, then the microservices pooling might be necessary.
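
To make the loose-coupling point concrete, here is a minimal Python sketch of what such a contract could look like. Everything in it is an assumption for illustration: the `ScoreRequest`/`ScoreResponse` types, the field names, the `to_score_request` adapter, and the placeholder "model" are hypothetical, and JSON stands in for whatever RPC format you actually use.

```python
import json
from dataclasses import asdict, dataclass


# Hypothetical contract, owned by the data team. Field names are illustrative.
@dataclass(frozen=True)
class ScoreRequest:
    schema_version: int
    entity_id: str
    features: dict[str, float]


@dataclass(frozen=True)
class ScoreResponse:
    schema_version: int
    entity_id: str
    score: float


def to_score_request(app_record: dict) -> ScoreRequest:
    """Application-side adapter: maps an internal record onto the contract."""
    return ScoreRequest(
        schema_version=1,
        entity_id=str(app_record["id"]),
        features={
            "order_count": float(app_record["orders"]),
            "days_active": float(app_record["days_active"]),
        },
    )


def handle(request_json: str) -> str:
    """ML-side handler: sees only the contract, never the application's tables."""
    req = ScoreRequest(**json.loads(request_json))
    if req.schema_version != 1:
        raise ValueError(f"unsupported schema_version {req.schema_version}")
    # Placeholder model: a fixed linear combination standing in for real inference.
    score = 0.1 * req.features["order_count"] + 0.01 * req.features["days_active"]
    return json.dumps(asdict(ScoreResponse(1, req.entity_id, score)))
```

The application can rename columns or reshape its records freely as long as the adapter keeps producing a valid `ScoreRequest`, which is the kind of decoupling meant above.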

Realistically this ends up being a business decision based on organizational complexity.

1 comments

Thanks for the reply. Could you give some more insight into how you choose tools, and which ones, for the different sorts of tasks (say NLP vs CV vs RL)? Also, how and why are different tools/pipelines better for production and product building?

How you parse and manage the inputs is significantly different between those types.

With NLP as one example, you need to decide when you're going to do tokenization, i.e. break up the inputs into "tokens." Do you do this at ingest, in transit, or at rest?
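
A rough Python sketch of two of those choices, tokenizing at ingest versus leaving the data raw at rest and tokenizing at read time. The `tokenize` function and both store classes are hypothetical and deliberately naive (a regex split, in-memory lists); a real pipeline would use a proper tokenizer and real storage.

```python
import re


def tokenize(text: str) -> list[str]:
    # Deliberately naive: lowercase, then pull out runs of word characters.
    return re.findall(r"\w+", text.lower())


class IngestTokenizedStore:
    """Tokenize once at ingest; readers get stored tokens directly."""

    def __init__(self) -> None:
        self._rows: list[dict] = []

    def ingest(self, text: str) -> None:
        self._rows.append({"raw": text, "tokens": tokenize(text)})

    def tokens(self, index: int) -> list[str]:
        return self._rows[index]["tokens"]


class RawStore:
    """Keep raw text at rest; tokenize on every read."""

    def __init__(self) -> None:
        self._rows: list[str] = []

    def ingest(self, text: str) -> None:
        self._rows.append(text)

    def tokens(self, index: int, tokenizer=tokenize) -> list[str]:
        # The tokenizer can be swapped per read without touching stored data.
        return tokenizer(self._rows[index])
```

Tokenizing at ingest pays the cost once but bakes the tokenizer choice into stored data, so changing it means a backfill. Tokenizing at read time repeats the work on every read but lets each consumer pick its own tokenizer.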

With CV you don't need to do tokenization at all (probably).

So the tools really come out of the use case and how/when you put them into the production chain.