| > I wonder how hard it would be to modify this code to run on a 64GB M2 Mac. It isn't that hard; I was able to run it on an M1.
The changes are:
- Remove or modify the multiprocessing code; it doesn't work on a Mac the same way the code expects.
- Replace `device = "cuda"` with `device = "mps"`.
- In the line `att_idxs = (torch.clamp(torch.arange(context_size)[None, :] - torch.arange(context_size)[:, None], -pos_emb_radius, pos_emb_radius-1) % pos_emb_size).to("cuda")`, replace `"cuda"` with `"mps"`.
- In `optim.AdamW`, remove `fused=True`; we can't use it without CUDA.
- Replace
```
with autocast(device_type='cuda', dtype=torch.float16):
    _, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])
```
with simply `_, loss = mlm_head(bert(batch_data_torch_xs[mb_start_idx:mb_end_idx]), batch_data_torch_ys[mb_start_idx:mb_end_idx])`.
- Replace `scaler.scale(corrected_loss).backward()` with `corrected_loss.backward()`.
- Replace
```
scaler.unscale_(optimizer)
scaler.step(optimizer)
scaler.update()
```
with `optimizer.step()`. It should work. |
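
For reference, here is a rough sketch of what a training step looks like once the changes listed above are made. It is not the original code: the tiny `nn.Linear` model, the random `xs`/`ys` tensors, the MSE loss, and the `pick_device` helper are all made-up stand-ins so the snippet runs on its own. The only point is the shape of the step: no `fused=True`, no `autocast`, no `GradScaler`.

```python
import torch
from torch import nn, optim


def pick_device() -> str:
    """Hypothetical helper: prefer CUDA, then Apple's MPS backend, then CPU."""
    if torch.cuda.is_available():
        return "cuda"
    mps = getattr(torch.backends, "mps", None)  # absent in older PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()

# Stand-in for the real model (assumption, not the code from the post).
model = nn.Linear(8, 1).to(device)

# No fused=True here, as suggested above for machines without CUDA.
optimizer = optim.AdamW(model.parameters(), lr=1e-3)

# Create tensors directly on the chosen device instead of hard-coding "cuda".
xs = torch.randn(16, 8, device=device)
ys = torch.randn(16, 1, device=device)

# Plain float32 step: no autocast context and no GradScaler.
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(xs), ys)
loss.backward()
optimizer.step()
```

Picking the device once and passing it around, rather than replacing every `"cuda"` string by hand, would also cover the `att_idxs` line and keeps the same script usable on both a Mac and a CUDA machine.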