Hacker News
by smachlis 4224 days ago
Here's how I think of it, which has been working for me:

matrix - If you have data that would make sense to be in a spreadsheet-type format and all your data are numbers.

dataframe - If you have data that would make sense to be in a spreadsheet-type format and some columns are numbers but other columns are something else (character strings, dates, TRUE/FALSE), with each column holding only one type. That is, you have one column that's all dates, another column that's all numbers, yet another column that's all character strings, etc.

list - If you need to mix data types within a single entity (a vector or a column of data).
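
For what it's worth, here is a minimal sketch of those three cases in base R. All the values and column names are made up for illustration; the last few lines show why a list is needed for mixed types, since a plain vector quietly coerces everything to one type.

```r
# Matrix: spreadsheet-shaped, and every cell is a number.
m <- matrix(c(1.5, 2.0, 3.5, 4.0), nrow = 2,
            dimnames = list(NULL, c("height", "width")))

# Data frame: columns can differ in type, but each column holds one type.
df <- data.frame(
  name   = c("a", "b"),
  when   = as.Date(c("2014-01-01", "2014-06-30")),
  score  = c(10.5, 7.25),
  passed = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# List: elements of different types can sit side by side.
mixed <- list("a", 42, TRUE, as.Date("2014-01-01"))

# A plain vector cannot do that: c() coerces everything to character here.
coerced <- c("a", 42, TRUE)

# Inspect the type of each data frame column and each list element.
column_types  <- sapply(df, function(col) class(col)[1])
element_types <- sapply(mixed, function(x) class(x)[1])
```

Printing `column_types` and `element_types` is a quick way to check which of the three structures a given piece of data really wants.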

3 comments

Unless you're doing linear algebra (or really care about memory usage), you almost never need to use a matrix in R.
To piggyback a bit on what hadley said, I find it a bit better to think of a data frame as a "collection of records" and a matrix as "two-dimensional data".

One useful heuristic is to ask, "Does it make sense to sort this data by something?" If it does, you have a data frame. Whereas if you want to perform matrix math on something (inverting it, multiplying it by another matrix, reducing it, etc.), you have a matrix. Things that I use a matrix for can generally also be expressed as a data frame with columns rowId, colId, and value. If it doesn't make sense in that format, a matrix is generally not the appropriate structure.
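
A rough sketch of that rowId / colId / value idea in base R, in case it helps. The 2x2 matrix is an arbitrary example; the column names follow the comment above, everything else is just illustration.

```r
# A small invertible matrix (filled column by column).
m <- matrix(c(2, 0, 1, 3), nrow = 2)

# The same data as a "collection of records": one row per cell.
long <- data.frame(
  rowId = as.vector(row(m)),
  colId = as.vector(col(m)),
  value = as.vector(m)
)

# Sorting makes sense on the record form...
sorted <- long[order(long$value), ]

# ...and the matrix can be rebuilt by indexing with (row, col) pairs.
rebuilt <- matrix(NA_real_, nrow = max(long$rowId), ncol = max(long$colId))
rebuilt[cbind(long$rowId, long$colId)] <- long$value

# Matrix math stays on the matrix form.
inverse <- solve(m)
```

The round trip only works because every cell has the same type, which is a decent sanity check that the data was matrix-shaped to begin with.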

That's a great explanation! Data frame for data analysis; matrix for math.
I'd amend that a little: use a matrix when you're actually calculating statistics (internally to the function). Clean your data so it always fits in a data frame when you load it. Lists are for representing things like data scraped from HTML before converting it to a data frame.
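
Roughly, that workflow might look like the sketch below. The `records` list is hypothetical stand-in data for whatever a scraper returns (no real scraping library is used here), and the field names are invented.

```r
# Hypothetical scraped records: a list of lists, everything still text.
records <- list(
  list(name = "a", price = "10.5", in_stock = "yes"),
  list(name = "b", price = "7.25", in_stock = "no"),
  list(name = "c", price = "3",    in_stock = "yes")
)

# Clean into a data frame: one row per record, one type per column.
df <- data.frame(
  name     = vapply(records, function(r) r$name, character(1)),
  price    = as.numeric(vapply(records, function(r) r$price, character(1))),
  in_stock = vapply(records, function(r) r$in_stock, character(1)) == "yes",
  stringsAsFactors = FALSE
)

# Drop down to a matrix only inside the calculation itself.
x <- as.matrix(df[, "price", drop = FALSE])
mean_price <- colMeans(x)
```

With a single numeric column the matrix step is overkill, but the same pattern carries over once there are several numeric columns to feed into something like `colMeans` or `solve`.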