| I'm going through the dataset with your datasette tool and it looks like it might be a good idea to clean things up a bit. There are many duplicates[1], creepypastas[2] and other strange things in there. [1] https://lite.datasette.io/?json=https%3A%2F%2Fraw.githubuser... [2] https://lite.datasette.io/?json=https://github.com/databrick... EDIT: Maybe I'm passing the link wrong. The query I'm using is: select count(instruction), instruction, group_concat(context, '
=============
') as c, group_concat(response, '
=============
') as r, group_concat(category, '
=============
') as cat from [databricks-dolly-15k] group by instruction having count(instruction)>1 order by count(instruction) desc limit 100
[databricks-dolly-15k] should be the name of the dataset, and the first column is the number of duplicates of each instruction. The creepypastas are responses to the instruction: "Imagine you are the last person on Earth. Write a diary entry describing your thoughts and feelings." |
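If the link doesn't load for you, the same duplicate check can be tried locally. This is only a minimal sketch using Python's built-in sqlite3 module: the three sample rows and the `find_duplicates` helper are made up for illustration, and I'm assuming the table has the same four columns the query above uses (instruction, context, response, category). It is trimmed to the response column to keep it short.

```python
import sqlite3

# Hypothetical sample rows (instruction, context, response, category);
# these stand in for the real dataset and are not taken from it.
ROWS = [
    ("What is SQL?", "", "A query language.", "open_qa"),
    ("What is SQL?", "", "Structured Query Language.", "open_qa"),
    ("Name a color.", "", "Blue.", "brainstorming"),
]

# Same visual separator as in the query above.
SEP = "\n=============\n"


def find_duplicates(rows):
    """Return (count, instruction, joined responses) for repeated instructions."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE "databricks-dolly-15k" '
        "(instruction TEXT, context TEXT, response TEXT, category TEXT)"
    )
    conn.executemany(
        'INSERT INTO "databricks-dolly-15k" VALUES (?, ?, ?, ?)', rows
    )
    query = """
        SELECT count(instruction) AS n,
               instruction,
               group_concat(response, ?) AS r
        FROM "databricks-dolly-15k"
        GROUP BY instruction
        HAVING count(instruction) > 1
        ORDER BY n DESC
        LIMIT 100
    """
    result = conn.execute(query, (SEP,)).fetchall()
    conn.close()
    return result


duplicates = find_duplicates(ROWS)
for n, instruction, responses in duplicates:
    print(n, instruction)
    print(responses)
```

Only instructions that appear more than once come back, most frequent first. SQLite doesn't guarantee the order in which group_concat joins the responses, so don't rely on it.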