by simonw 1158 days ago
Here's a link to open up and explore that training data in Datasette Lite: https://lite.datasette.io/?json=https://github.com/databrick...
2 comments

I'm going through the dataset with your datasette tool and it looks like it might be a good idea to clean things up a bit. There are many duplicates[1], creepypastas[2] and other strange things in there.

[1] https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser...

[2] https://lite.datasette.io/?json=https://github.com/databrick...

EDIT: Maybe I'm passing the link wrong. The query I'm using is:

  select count(instruction), instruction,
    group_concat(context, ' ============= ') as c,
    group_concat(response, ' ============= ') as r,
    group_concat(category, ' ============= ') as cat
  from [databricks-dolly-15k]
  group by instruction
  having count(instruction) > 1
  order by count(instruction) desc
  limit 100

[databricks-dolly-15k] should be the name of the dataset; the first column is the number of times each instruction is duplicated.
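For anyone who wants to try the same duplicate check outside Datasette Lite, here is a minimal sketch using Python's built-in sqlite3 module. The rows are made-up toy data, not taken from the real dataset, and the helper name find_duplicates is just something I picked for illustration. It assumes the table has the four columns the query references (instruction, context, response, category).

```python
import sqlite3

# Toy rows standing in for the real dataset (invented for illustration only).
ROWS = [
    ("What is SQL?", "", "A query language.", "open_qa"),
    ("What is SQL?", "", "Structured Query Language.", "general_qa"),
    ("Name three colors.", "", "Red, green, blue.", "brainstorming"),
    ("What is SQL?", "", "A language for databases.", "open_qa"),
    ("Name three colors.", "", "Cyan, magenta, yellow.", "brainstorming"),
    ("Who wrote Hamlet?", "", "Shakespeare.", "open_qa"),
]

# Same shape as the query in the comment above: group by instruction,
# keep only groups with more than one row, most-duplicated first.
DUPLICATES_SQL = """
select count(instruction) as n, instruction,
       group_concat(context, ' ============= ') as c,
       group_concat(response, ' ============= ') as r,
       group_concat(category, ' ============= ') as cat
from [databricks-dolly-15k]
group by instruction
having count(instruction) > 1
order by count(instruction) desc
limit 100
"""


def find_duplicates(rows):
    """Load rows into an in-memory SQLite table and run the duplicate query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table [databricks-dolly-15k] "
        "(instruction text, context text, response text, category text)"
    )
    conn.executemany(
        "insert into [databricks-dolly-15k] values (?, ?, ?, ?)", rows
    )
    result = conn.execute(DUPLICATES_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for n, instruction, _context, _responses, categories in find_duplicates(ROWS):
        print(n, instruction, "|", categories)
```

Concatenating the category column per group also makes it easy to spot instructions that were labelled inconsistently across their duplicates. Note that SQLite does not guarantee the order of values inside group_concat.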

The creepypastas are responses to this instruction:

Imagine you are the last person on Earth. Write a diary entry describing your thoughts and feelings.

Typo on row 7!
Row 7 is the name of the dataset; you might need to load it yourself.
Can someone help me understand why the categories for these two differ?

row #51 "Think of some family rules to promote a healthy family relationship" - brainstorming [1]

row #68 "What is the future for human?" - general_qa [2]

To me they are both brainstorming by nature. Is the question mark what got #68 assigned to _qa?

[1] https://lite.datasette.io/?json=https://github.com/databrick...

[2] https://lite.datasette.io/?json=https://github.com/databrick...

The labelling doesn't seem entirely consistent to me, but I think the idea is that 51 is inviting you to brainstorm, while 68 is asking a question that just happens to be open-ended.
Hey! I worked on this here at Databricks: the blog post goes into the dataset collection design a bit (https://www.databricks.com/blog/2023/04/12/dolly-first-open-...). In summary, you're right: brainstorming and GeneralQA will overlap, because the categories in the taxonomy are not mutually exclusive.