by kajecounterhack
305 days ago
I tried mapping back to the closest token embeddings. Here's what I got:
global_step = 1377; phase = continuous; lr = 5.00e-03; average_loss = 0.609497
current tokens: ' Superman' '$MESS' '.");' '(sentence' '");' '.titleLabel' ' Republican' '?-'
global_step = 1956; phase = continuous; lr = 5.00e-03; average_loss = 0.589661
current tokens: ' Superman' 'marginLeft' 'iers' '.sensor' '";' '_one' '677' 'ยป.'
global_step = 2468; phase = continuous; lr = 5.00e-03; average_loss = 0.027065
current tokens: ' cited' '*>(' ' narrative' '_toggle' 'founder' '(V' '(len' ' pione'
global_step = 4871; phase = continuous; lr = 5.00e-03; average_loss = 0.022909
current tokens: ' bgcolor' '*>(' ' nomin' 'ust' ' She' 'NW' '(len' ' pione'
"Republican?" was kind of interesting! But most of the strings were unintelligible. This was for classifying sentiment on Yelp review polarity.
|
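For anyone wanting to try the "map back to closest token embeddings" step from the comment above, here is a rough sketch of one way it could be done. The comment doesn't say which distance was used, so squared Euclidean distance is an assumption here (cosine similarity would be another reasonable choice), and `nearest_tokens`, the toy vocabulary, and the 2-d embeddings are all hypothetical stand-ins for a real model's embedding matrix.

```python
import numpy as np

def nearest_tokens(soft_prompt, embedding_table, vocab):
    """Map each learned vector to the vocabulary token with the closest embedding.

    soft_prompt:     (n_prompt, dim) continuous prompt vectors
    embedding_table: (vocab_size, dim) token embeddings
    vocab:           list of vocab_size token strings
    """
    # Pairwise squared Euclidean distances via broadcasting: (n_prompt, vocab_size).
    diffs = soft_prompt[:, None, :] - embedding_table[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest token for each prompt vector.
    ids = np.argmin(dists, axis=1)
    return [vocab[i] for i in ids], dists[np.arange(len(ids)), ids]

# Toy data (hypothetical): 4 tokens in a 2-d embedding space.
vocab = [" Superman", " Republican", "?-", " cited"]
embedding_table = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
soft_prompt = np.array([[0.9, 0.2], [-0.1, 0.8], [0.1, -0.7]])

tokens, distances = nearest_tokens(soft_prompt, embedding_table, vocab)
print("current tokens:", " ".join(repr(t) for t in tokens))
print("distances:", distances)
```

The returned distances are worth logging too: a large distance means the printed token is only a loose label for where the vector actually sits, which may be part of why the decoded strings look unintelligible.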
Consider one of the embedding vectors in the input tensor: nothing guarantees it's exactly on, or even close to, a specific token. Hence the probabilities with respect to each token form a distribution. Ideally that distribution would be one-hot (lowest entropy); in the worst case it is uniform (highest entropy). So just add a loss term penalizing the entropy of the quasi-tokens, to push them toward actual token values.
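A minimal sketch of that entropy term, under some assumptions the comment leaves open: the per-token probabilities are taken to be a softmax over negative squared distances to each token embedding, with a hypothetical `temperature` knob, and `token_entropy_penalty` is an illustrative name. NumPy only shows the arithmetic; in actual training this would be written in an autodiff framework so the penalty can be backpropagated.

```python
import numpy as np

def token_entropy_penalty(soft_prompt, embedding_table, temperature=1.0):
    """Mean entropy of each prompt vector's distribution over vocabulary tokens."""
    # Assumed similarity: negative squared Euclidean distance, scaled by temperature.
    diffs = soft_prompt[:, None, :] - embedding_table[None, :, :]
    logits = -np.sum(diffs ** 2, axis=-1) / temperature
    # Softmax over the vocabulary axis, shifted for numerical stability.
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # Entropy per prompt vector; clip to avoid log(0).
    entropy = -np.sum(probs * np.log(np.clip(probs, 1e-12, None)), axis=1)
    return entropy.mean()

# Toy embedding table (hypothetical): 4 tokens in 2-d.
embedding_table = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])

# A vector equidistant from every token: uniform distribution, maximum entropy.
uniform_case = token_entropy_penalty(np.array([[0.0, 0.0]]), embedding_table)

# A vector sitting exactly on a token, with a sharp temperature: near one-hot.
on_token_case = token_entropy_penalty(
    np.array([[1.0, 0.0]]), embedding_table, temperature=0.05
)

print(uniform_case, on_token_case)
```

The total objective would then look something like `task_loss + lam * penalty`, where `lam` is a hypothetical weight that would need tuning: too large and the vectors snap to tokens before the task loss has found anything useful.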