I read the full post, thanks for writing it. It is very clear, but I do have a couple of questions:
1. In Step (2), Bidirectional RNN: what are you making the forward/backward passes over? How do the tokens get turned into a "matrix"? What is the dimensionality of this matrix?
2. Step 3 is a bit unclear. Where do Parikh et al. get their 2 matrices from?
It would be nice to bring in some concreteness: talk about sentences, documents, etc. and how they map into this scheme.
1) Input: (ids1, ids2). These are integer-typed arrays of length len1 and len2.
2) sent1 = embed(ids1); sent2 = embed(ids2). Data is now real-valued arrays of shape (len1, vector_dim) and (len2, vector_dim) respectively. 300 is a common value for vector_dim, e.g. from the GloVe common crawl model.
3) sent1 = encode(sent1); sent2 = encode(sent2). Data is now real-valued arrays of shape (len1, fwd_dim+bwd_dim), (len2, fwd_dim+bwd_dim).
4a) attention = create_attention_matrix(sent1, sent2). This is a real-valued array of shape (len1, len2).
4b) align1 = soft_align(sent1, attention); align2 = soft_align(sent2, transpose(attention)). These are real-valued arrays of shape (len1, compare_dim), (len2, compare_dim).
4c) feats1 = sum(map(compare(sent1, align2))); feats2 = sum(map(compare(sent2, align1))). These are real-valued arrays of shape (predict_dim,), (predict_dim,)
5) class_id = predict(feats1, feats2)
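To make steps 1 to 3 concrete, here is a minimal NumPy sketch of the embed and encode stages. Everything numeric in it is an assumption for illustration: the vocabulary size, the token ids, fwd_dim and bwd_dim are made up, the weights are random rather than trained, and a plain tanh RNN stands in for whatever recurrent cell the post actually uses. The helper names init_rnn and run_rnn are hypothetical. The point is only to show what the forward and backward passes run over (the rows of the embedded sentence, one row per token) and how the shapes change.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
vocab_size, vector_dim = 1000, 300
fwd_dim = bwd_dim = 64

# Random stand-in for a trained embedding table (e.g. GloVe vectors).
embed_table = rng.normal(size=(vocab_size, vector_dim))

def init_rnn(in_dim, out_dim):
    """Random weights for one direction of a plain tanh RNN."""
    return {
        "W_in": rng.normal(scale=0.1, size=(in_dim, out_dim)),
        "W_hid": rng.normal(scale=0.1, size=(out_dim, out_dim)),
        "b": np.zeros(out_dim),
    }

fwd_params = init_rnn(vector_dim, fwd_dim)
bwd_params = init_rnn(vector_dim, bwd_dim)

def embed(ids):
    """Step 2: look up one row per token id -> (len, vector_dim)."""
    return embed_table[ids]

def run_rnn(vectors, params):
    """Run a tanh RNN over the rows of `vectors`, keeping one state per token."""
    state = np.zeros(params["b"].shape[0])
    states = []
    for vec in vectors:
        state = np.tanh(vec @ params["W_in"] + state @ params["W_hid"] + params["b"])
        states.append(state)
    return np.stack(states)

def encode(sent):
    """Step 3: concatenate forward and backward states for each token."""
    fwd = run_rnn(sent, fwd_params)              # reads left to right
    bwd = run_rnn(sent[::-1], bwd_params)[::-1]  # reads right to left, then flipped back
    return np.concatenate([fwd, bwd], axis=1)    # (len, fwd_dim + bwd_dim)

ids1 = np.array([4, 17, 256, 9, 3])  # step 1: a 5-token sentence as integer ids
ids2 = np.array([4, 88, 12])         # a 3-token sentence

sent1, sent2 = embed(ids1), embed(ids2)
enc1, enc2 = encode(sent1), encode(sent2)

print(sent1.shape, enc1.shape)
print(sent2.shape, enc2.shape)
```

So the "matrix" for a sentence is just one row per token: row i of the encoded output holds the forward state after reading tokens 0..i and the backward state after reading tokens i..end.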
The post describes steps 4a, 4b and 4c as a single operation that takes the two 2-dimensional sentence representations as input and outputs a single vector (obtained by concatenating the representations feats1 and feats2 in this description).
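As a rough sketch of how 4a, 4b and 4c collapse into that single operation, the NumPy code below takes two encoded sentences and returns one concatenated vector, then applies step 5. Several things here are assumptions and not taken from the post or from Parikh et al.: the attention score is a plain dot product (the paper passes each token through a feed-forward layer first), compare is a single ReLU layer with a hypothetical weight matrix W_compare, predict is a linear layer plus argmax with a hypothetical W_out, and soft_align normalises each row of the attention matrix and averages the other sentence's rows, which is one possible argument convention and may not match the one intended in step 4b. The two matrices being aligned are simply the two encoded sentences.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def create_attention_matrix(sent1, sent2):
    """4a: raw similarity of every token pair -> (len1, len2)."""
    return sent1 @ sent2.T

def soft_align(other, attention):
    """4b: for each row of `attention`, a weighted average of the rows of `other`."""
    weights = softmax(attention, axis=1)
    return weights @ other

def compare(sent, aligned, W):
    """4c, per token: pass [token; aligned summary] through one ReLU layer."""
    return np.maximum(0.0, np.concatenate([sent, aligned], axis=1) @ W)

def attend_and_compare(sent1, sent2, W):
    """Steps 4a to 4c as one operation: two matrices in, one vector out."""
    attention = create_attention_matrix(sent1, sent2)     # (len1, len2)
    aligned_to_1 = soft_align(sent2, attention)           # (len1, dim)
    aligned_to_2 = soft_align(sent1, attention.T)         # (len2, dim)
    feats1 = compare(sent1, aligned_to_1, W).sum(axis=0)  # (predict_dim,)
    feats2 = compare(sent2, aligned_to_2, W).sum(axis=0)  # (predict_dim,)
    return np.concatenate([feats1, feats2])               # (2 * predict_dim,)

def predict(feats, W_out):
    """Step 5: linear layer over the concatenated features, then argmax."""
    return int(np.argmax(feats @ W_out))

# Random stand-ins for encoded sentences and trained weights (illustrative sizes).
rng = np.random.default_rng(0)
dim, predict_dim, n_classes = 8, 5, 3
sent1 = rng.normal(size=(4, dim))  # 4 tokens
sent2 = rng.normal(size=(6, dim))  # 6 tokens
W_compare = rng.normal(size=(2 * dim, predict_dim))
W_out = rng.normal(size=(2 * predict_dim, n_classes))

feats = attend_and_compare(sent1, sent2, W_compare)
class_id = predict(feats, W_out)
print(feats.shape, class_id)
```

Summing over the token axis in 4c is what removes the dependence on sentence length, so the final vector has the same size whatever len1 and len2 are.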
Thanks!