Hacker News new | ask | show | jobs
by phyalow 660 days ago
Genuine question, whats the difference between your startup and just calling the below code with a different model on a cloud machine, other than some ML/Dev OP's engineer not knowing what they are doing...?

  model = get_model(\*model_config)
  state_dict = torch.load(model_path, weights_only=True)
  new_state_dict = {k.replace('_orig_mod.', ''): v for k, v in state_dict.items()}
  model.load_state_dict(new_state_dict)
  model.eval()
  with torch.no_grad():
  output = model(torch.FloatTensor(X))
  probabilities = torch.softmax(output, dim=X)
  return probabilities.numpy()
1 comments

The advantage of loading from a daemon over loading all the weights at once in Python is that it can support multiple processes or even the same process consecutively (if it dies or something, or had to switch to something else).

Loading from disk to VRAM can be super slow- so doing this every time you have a new process is wasteful. Instead, if you have a daemon process that keeps multiple model weights in pinned RAM, you can load them much quicker (~1.5 seconds for a 8B model like we show in the demo).

You _could_ also make a single mega router process, but then there are issues like all services needing to agree on dependency versioning. This has been a problem for me in the past (like LAVIS requiring a certain transformer version that was not compatible with some other diffusion libraries)