Hacker News
by ilaksh 1083 days ago
Please state the assignment in detail. For example, what exactly are you classifying?

If it's just some simple text that is fairly regular then you might not need actual fine tuning. You could just use the OpenAI API.

https://github.com/leehanchung/lora-instruct

What exactly are the HPC resources? Are they GPUs, and if so, what type and how many?

Thank you for your answer. The objective of the assignment is to classify agents of misinformation based on their tweets. An example element of the dataset can be found at this link: https://ibb.co/16VMTCN. There is a dataset for the control group and a dataset of misinformation agents. The idea is to make the model closer to how misinformation agents are via fine-tuning on these datasets. The available HPC resources, including information about GPUs and their quantity, can be viewed at this link: https://www.carc.usc.edu/user-information/user-guides/hpc-ba...
Ok. I think I understand the assignment. I don't believe you need to fine tune any model.

You can probably just use the OpenAI ChatGPT model and ask it something like:

"Does this user's tweet say anything negative about the government of ______ or contradict any of these official party statements? __________"

You can probably just ask Falcon or Llama the same thing without any training. But if you decide you have to do the fine tuning then try with my link above using the A100 GPU nodes.
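A minimal sketch of the zero-shot approach above, in Python. The model call is deliberately left as a plain callable (`ask_model`, a hypothetical name) so the same code works whether the question goes to ChatGPT, Falcon, or Llama; nothing here is a real client API. The prompt layout beyond the quoted question, and the "answer yes or no" instruction, are assumptions added to make the reply easy to parse.

```python
from typing import Callable, Sequence

# The first sentence is the question quoted above; the remaining lines are
# an assumed layout for passing in the statements and the tweet.
PROMPT_TEMPLATE = (
    "Does this user's tweet say anything negative about the government of "
    "{government} or contradict any of these official party statements?\n"
    "Statements: {statements}\n"
    "Tweet: {tweet}\n"
    "Answer with a single word: yes or no."
)


def build_prompt(tweet: str, government: str, statements: Sequence[str]) -> str:
    """Fill the template with one tweet and the reference statements."""
    return PROMPT_TEMPLATE.format(
        government=government,
        statements="; ".join(statements),
        tweet=tweet,
    )


def classify_tweet(
    tweet: str,
    government: str,
    statements: Sequence[str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> bool:
    """Return True when the model's reply starts with 'yes'."""
    reply = ask_model(build_prompt(tweet, government, statements))
    return reply.strip().lower().startswith("yes")
```

To try it, pass any function that takes a prompt string and returns the model's reply as `ask_model`. Replies that do not start with "yes" are treated as a negative, which is a simplification.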

I think the whole thing is nonsense though, because whoever the arbiter of truth is always has an agenda and often makes mistakes.

Thanks! Btw, a link to resources would still be appreciated if I need to apply the knowledge to personal projects in the future.
If the assignment is to classify the type of misinformation (assuming each tweet is misinformation), then it’s essentially topic modeling, which is very doable without fine tuning as well.
This is what I was thinking about using an LLM for:
1. As a feature extractor. For example, given the text of misinformation agents, what are the characteristics? C1, C2, C3, etc. Then, do these characteristics appear in these new texts? Assign a label accordingly.
2. I'll give the LLM the text on how they usually behave and ask if these new ones are behaving similarly. If so, label them accordingly. (There may also be the possibility to pass graph data in a graph-less way.)
3. Use the extracted information to enhance topology-driven classification.
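A rough sketch of idea 1 in Python, assuming the characteristics (C1, C2, C3, ...) have already been extracted. The `ask_model` callable, the prompt wording, the label names, and the `min_fraction` threshold are all hypothetical choices made for illustration, not something specified in the thread.

```python
from typing import Callable, List, Sequence


def matched_characteristics(
    text: str,
    characteristics: Sequence[str],
    ask_model: Callable[[str], str],  # hypothetical: prompt in, reply text out
) -> List[str]:
    """Ask the model about each characteristic and keep the ones it confirms."""
    matched = []
    for characteristic in characteristics:
        prompt = (
            f"Characteristic: {characteristic}\n"
            f"Text: {text}\n"
            "Does the text show this characteristic? Answer yes or no."
        )
        if ask_model(prompt).strip().lower().startswith("yes"):
            matched.append(characteristic)
    return matched


def label_text(
    text: str,
    characteristics: Sequence[str],
    ask_model: Callable[[str], str],
    min_fraction: float = 0.5,  # assumed threshold; tune on labeled data
) -> str:
    """Label a text by the fraction of characteristics the model confirms."""
    if not characteristics:
        raise ValueError("at least one characteristic is required")
    hits = matched_characteristics(text, characteristics, ask_model)
    if len(hits) / len(characteristics) >= min_fraction:
        return "misinformation_agent"
    return "control"
```

The list of matched characteristics could also be kept as a feature vector and fed into the topology-driven classifier from idea 3, rather than thresholded directly.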
Those might work to some extent but keep in mind the model doesn’t have access to outside information, and it’s going to be nearly impossible to build a social graph given Twitter API limits.

IMO the easiest way to fine tune your model would be to use something like BERT embeddings fine tuned with triplet loss, i.e. on (example, positive, negative) triples, so the model learns to minimize the distance between similar examples and maximize it between dissimilar ones.
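To make the triplet loss concrete, here is a small Python sketch of the loss for a single (example, positive, negative) triple. It uses plain lists as stand-ins for BERT embeddings, Euclidean distance, and a margin of 1.0; all three are assumptions for illustration. Actual fine tuning would compute this inside a deep learning framework so gradients can flow back into the encoder.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value; a tunable hyperparameter
) -> float:
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
        0.0,
    )


# Toy 2-D "embeddings": the positive is close to the anchor and the
# negative is far away, so this triple is already well separated.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))  # 0.0
# Swapping positive and negative gives a large loss: 5 - 1 + 1.
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))  # 5.0
```

For this task, the anchor and positive would be tweets from the same group (for example two misinformation agents) and the negative a tweet from the control group.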

Very interesting! Thank you for the idea. I will try to figure out how to do that
This is a better link since it mentions custom data: https://yashugupta-gupta11.medium.com/qlora-efficient-finetu...