by jah242 1195 days ago
Whilst not the same, I recommend you look at the DeepMind Gato paper to see how surprisingly (relatively) simple multimodal models can be - https://openreview.net/forum?id=1ikK0kHjvj

Essentially, to merge lots of modalities they just go 'let's convert all modalities into integers in the same given range', e.g. the word 'me' = 1001, up in Atari = 11002, joint torque of right motor of robot = 33000 and so on.

From the paper:

There are infinite possible ways to transform data into tokens, including directly using the raw underlying byte stream. Below we report the tokenization scheme we found to produce the best results for Gato at the current scale using contemporary hardware and model architectures.

• Text is encoded via SentencePiece (Kudo & Richardson, 2018) with 32000 subwords into the integer range [0, 32000).

• Images are first transformed into sequences of non-overlapping 16 × 16 patches in raster order, as done in ViT (Dosovitskiy et al., 2020). Each pixel in the image patches is then normalized between [−1, 1] and divided by the square-root of the patch size (i.e. √16 = 4).

• Discrete values, e.g. Atari button presses, are flattened into sequences of integers in row-major order. The tokenized result is a sequence of integers within the range of [0, 1024).

• Continuous values, e.g. proprioceptive inputs or joint torques, are first flattened into sequences of floating point values in row-major order. The values are mu-law encoded to the range [−1, 1] if not already there (see Figure 14 for details), then discretized to 1024 uniform bins. The discrete integers are then shifted to the range of [32000, 33024).
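
To make the continuous-value bullet concrete, here is a rough sketch of how I read it, not the paper's actual code. The function names are made up, and the mu-law parameters (mu = 100, M = 256) are my assumption of what Figure 14 specifies, so check them against the paper. It also applies mu-law unconditionally, whereas the paper only does so when values are not already in [−1, 1].

```python
import math

TEXT_VOCAB = 32000    # text tokens occupy [0, 32000), per the quoted scheme
NUM_BINS = 1024       # continuous values are discretized to 1024 uniform bins
MU, M = 100.0, 256.0  # ASSUMED mu-law parameters; verify against Figure 14


def mu_law(x: float) -> float:
    """Compress x with a mu-law curve, then clip the result to [-1, 1]."""
    y = math.copysign(math.log(abs(x) * MU + 1.0) / math.log(M * MU + 1.0), x)
    return max(-1.0, min(1.0, y))


def tokenize_continuous(values):
    """Map a flat list of floats to integer tokens in [32000, 33024)."""
    tokens = []
    for v in values:
        y = mu_law(v)
        # Uniformly bin [-1, 1]; y == 1.0 is folded into the last bin.
        bin_index = min(int((y + 1.0) / 2.0 * NUM_BINS), NUM_BINS - 1)
        # Shift past the text vocabulary so the ranges never collide.
        tokens.append(TEXT_VOCAB + bin_index)
    return tokens
```

With this, `tokenize_continuous([0.0])` lands in the middle bin, and very large positive or negative inputs saturate at the two ends of [32000, 33024). The offset is what lets a joint torque and a subword share one vocabulary without overlapping.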

1 comments

The interesting thing to me is that our brains probably do something similar, converting multi-modal sensory data into the same 'model' that we experience as our consciousness.