by joerick
597 days ago
The thing that puzzles me about embeddings is that they're so untargeted, they represent everything about the input string. Is there a method for dimensionality reduction of embeddings for different applications? Let's say I'm building a system to find similar tech support conversations and I am only interested in the content of the discussion, not the tone of it. How could I derive an embedding that represents only content and not tone?
To adapt this to your problem of ignoring writing style in queries, you could collect a few text samples with different writing styles but the same content to compute a "style direction". Then when you do a query for some specific content, subtract the projection of your query embedding onto the style direction to eliminate the style.
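Here is a rough sketch of that subtraction, in plain Python so it runs without dependencies (in practice you would probably use NumPy). The function names, the idea of averaging difference vectors over pairs, and the toy 3-dimensional vectors are my assumptions for illustration; real embeddings would come from whatever model you use.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def style_direction(pairs):
    """Estimate a unit 'style direction' from (style_a, style_b) embedding
    pairs that share the same content: average the differences, normalize."""
    dim = len(pairs[0][0])
    mean_diff = [0.0] * dim
    for a, b in pairs:
        for i in range(dim):
            mean_diff[i] += (b[i] - a[i]) / len(pairs)
    return normalize(mean_diff)

def remove_direction(vec, direction):
    """Subtract the projection of vec onto a unit-length direction."""
    scale = dot(vec, direction)
    return [x - scale * d for x, d in zip(vec, direction)]

# Toy data (assumption): the third axis happens to carry all the style.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 2.0]),
    ([0.0, 1.0, 0.0], [0.0, 1.0, 2.0]),
]
direction = style_direction(pairs)
query = [0.5, 0.5, 3.0]
cleaned = remove_direction(query, direction)
print(cleaned)  # [0.5, 0.5, 0.0]
```

After the subtraction the query has no component left along the style direction, so a dot-product or cosine comparison against other cleaned embeddings no longer sees that axis. You would likely want to apply the same cleaning to the stored embeddings, not just the query.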
I suspect this also works with text embeddings, but you might have to train the embedding network in some special way to maximize the effectiveness of embedding arithmetic. Vector normalization might also be important, or maybe not. Probably depends on the training.
Another approach would be to compute a "content direction" instead of a "style direction" and eliminate every aspect of a query that is not content. Depending on what kind of texts you are working with, data collection for one or the other direction might be easier or have more/fewer biases.
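A sketch of the "keep only content" variant, again in plain Python. One caveat that is my own assumption: projecting onto a single direction would collapse every query onto one line, so this version takes several hypothetical content directions (say, difference vectors between texts of the same style but different content), orthonormalizes them with Gram-Schmidt, and keeps only the part of the query inside their span. All names and the toy vectors are made up for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthonormal_basis(directions, eps=1e-12):
    """Gram-Schmidt: turn raw direction vectors into an orthonormal basis,
    dropping any vector that is (numerically) already in the span."""
    basis = []
    for v in directions:
        w = list(v)
        for b in basis:
            scale = dot(w, b)
            w = [x - scale * y for x, y in zip(w, b)]
        norm = math.sqrt(dot(w, w))
        if norm > eps:
            basis.append([x / norm for x in w])
    return basis

def keep_subspace(vec, basis):
    """Project vec onto the span of an orthonormal basis; drop the rest."""
    kept = [0.0] * len(vec)
    for b in basis:
        scale = dot(vec, b)
        kept = [k + scale * y for k, y in zip(kept, b)]
    return kept

# Toy data (assumption): content lives in the first two axes.
content_directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
basis = orthonormal_basis(content_directions)
query = [0.5, -2.0, 3.0]
content_only = keep_subspace(query, basis)
print(content_only)
```

The trade-off versus subtracting a style direction is that anything not covered by your content directions gets thrown away, including content you failed to sample, so this version is more sensitive to how the directions were collected.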
And if you feel especially lazy when collecting data to compute embedding directions, you can generate texts with different styles using e.g. ChatGPT. This will probably not work as well as carefully handpicked texts, but you can make up for it with volume to some degree.