I'm going through the dataset with your datasette tool and it looks like it might be a good idea to clean things up a bit. There are many duplicates[1], creepypastas[2] and other strange things in there.
EDIT: Maybe I'm passing the link wrong. The query I'm using is:
select count(instruction), instruction, group_concat(context, '
=============
') as c, group_concat(response, '
=============
') as r, group_concat(category, '
=============
') as cat from [databricks-dolly-15k] group by instruction having count(instruction) > 1 order by count(instruction) desc limit 100
[databricks-dolly-15k] should be the name of the dataset; the first column is the number of times each instruction is duplicated.
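For anyone who wants to try the same grouping outside datasette, here is a minimal sketch using Python's built-in sqlite3 module. It is not how datasette runs the query internally: the helper name find_duplicate_instructions and the three sample rows are made up for illustration, and it assumes the dataset has the four columns the query references (instruction, context, response, category).

```python
import sqlite3

# Separator between concatenated values, same as in the query above.
SEP = "\n=============\n"


def find_duplicate_instructions(rows, limit=100):
    """Return instructions that occur more than once, most frequent first.

    rows: iterable of (instruction, context, response, category) tuples.
    Hypothetical helper, written only to illustrate the query.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table [databricks-dolly-15k] "
        "(instruction text, context text, response text, category text)"
    )
    conn.executemany(
        "insert into [databricks-dolly-15k] values (?, ?, ?, ?)", rows
    )
    query = """
        select count(instruction) as n,
               instruction,
               group_concat(context, :sep) as c,
               group_concat(response, :sep) as r,
               group_concat(category, :sep) as cat
        from [databricks-dolly-15k]
        group by instruction
        having count(instruction) > 1
        order by n desc
        limit :limit
    """
    result = conn.execute(query, {"sep": SEP, "limit": limit}).fetchall()
    conn.close()
    return result


# Made-up sample rows, NOT taken from the real dataset.
sample_rows = [
    ("What is SQL?", "", "A query language.", "open_qa"),
    ("What is SQL?", "", "Structured Query Language.", "general_qa"),
    ("Name a color.", "", "Blue.", "brainstorming"),
]

dupes = find_duplicate_instructions(sample_rows)
for n, instruction, c, r, cat in dupes:
    print(n, instruction)
    print(r)
```

Only instructions that appear at least twice come back, so "Name a color." is dropped from the result. SQLite does not guarantee the order of values inside group_concat, so the concatenated responses should be treated as an unordered set.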
The creepypastas are responses to this instruction:
Imagine you are the last person on Earth. Write a diary entry describing your thoughts and feelings.
The labelling doesn't seem to be entirely consistent to me, but I think the idea is that 51 is inviting you to brainstorm, while 68 is asking a question that just happens to be open ended.
Hey! Worked on this here at Databricks: the blog post goes into the dataset collection design a bit (https://www.databricks.com/blog/2023/04/12/dolly-first-open-...). In summary, you're right: brainstorming and GeneralQA will overlap, because the taxonomy's categories naturally overlap.
[1] https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser...
[2] https://lite.datasette.io/?json=https://github.com/databrick...