In this article I am going to show you three different ways to create a dataframe in R:
- Creating a new data frame row-wise
- Creating a new data frame in column notation
- Creating an empty data frame
Creating a new dataframe by hand
A dataframe is, pretty much, just like a sheet in an Excel workbook. It’s a lot better in many ways, but if you are just starting out learning R that’s the image you should start with.
That means you can think about it like a table, and like any table it has two dimensions: left-to-right and top-to-bottom. Going from left to right across the table, you are seeing columns, and going from top to bottom you are seeing rows.
When you are creating data frames from scratch in R, instead of reading them in from an Excel spreadsheet or any other source, you have two ways to accomplish the task.
The first way is by using row-wise notation, and the second one is column notation.
I am going to show you how to create a dataframe using row-wise notation first, because it’s more intuitive for manual entry, and then I’ll show you column notation.
A new dataframe using row-wise notation
So, open up RStudio, create a new script file, and enter the following code. Do this by hand, so no cheating and doing copy-pasta. If you do that you won’t learn, and you might as well watch YouTube videos of chocolate rain or double rainbows [1]. Now on with it:
library(tidyverse)
cool_people = tribble(
  ~first_name, ~last_name,   ~age,
  "Ali",       "Khouri",     72,
  "Wei",       "Li",         14,
  "Julia",     "Harrington", 25,
  "Olaf",      "Johnsson",   9
)
print(cool_people)
If you run this code, this is what you should see:
# A tibble: 4 x 3
  first_name last_name    age
  <chr>      <chr>      <dbl>
1 Ali        Khouri        72
2 Wei        Li            14
3 Julia      Harrington    25
4 Olaf       Johnsson       9
But what’s that? Why does it say # A tibble: 4 x 3? What does that mean?
That header simply says that you are looking at a tibble with 4 rows and 3 columns. Tibbles are a modern version of data frames. They are much better in a number of ways, none of which I’m going to explain right now. The important thing to know is that they come as the standard with any package in the tidyverse and that they are the reason we can build data frames in row-wise notation as we just did. So let’s have a closer look at that.
In the code block above, we create a data frame called cool_people. We do so by using the tribble() command, which is short for “transposed tibble” and builds a tibble row by row. Then, we write the column names, each preceded by a ~ character:
cool_people = tribble(
  ~first_name, ~last_name, ~age,
  ...
)
Below that come the individual rows, with their values separated by commas. Kinda looks like a table already, doesn’t it?
Usually, however, you do not enter data manually like this into your R code itself. It’s great for little examples like this, but most of the time you are working with long vectors of information that you got from somewhere else.
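If you want to check what tribble() actually built, here is a minimal sketch that rebuilds the same table and inspects it. It loads only the tibble package, which is an assumption made to keep the example small (library(tidyverse) loads it too). The calls to dim(), is.data.frame() and sapply() are additions for illustration and not part of the example above.

```r
library(tibble)

# Same table as before, entered row by row.
cool_people = tribble(
  ~first_name, ~last_name,   ~age,
  "Ali",       "Khouri",     72,
  "Wei",       "Li",         14,
  "Julia",     "Harrington", 25,
  "Olaf",      "Johnsson",   9
)

# Number of rows first, then number of columns.
dim(cool_people)

# A tibble is still a data frame, so code that expects one keeps working.
is.data.frame(cool_people)

# Quoted values end up in character columns, bare numbers in numeric ones.
sapply(cool_people, class)
```

The <chr> and <dbl> labels in the printed tibble are short for character and double, which is how R stores text and decimal numbers.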
A new dataframe using column notation
Most often when you come across R code online, you won’t see the tribble() code above. Instead, if you want to build a data frame manually, you’ll see the column notation. Let’s try that out and create the same data frame as above, but the more conventional way:
cool_people = tibble(first_name = c("Ali", "Wei", "Julia", "Olaf"),
                     last_name = c("Khouri", "Li", "Harrington", "Johnsson"),
                     age = c(72, 14, 25, 9))
print(cool_people)
This looks less like a table while typing it, but it gives you exactly the same result as the code earlier.
# A tibble: 4 x 3
  first_name last_name    age
  <chr>      <chr>      <dbl>
1 Ali        Khouri        72
2 Wei        Li            14
3 Julia      Harrington    25
4 Olaf       Johnsson       9
Note that we are using the tibble() function here, not the tribble() function.
What’s happening here is that we are listing the column names and pinning a vector of values to each of them. What do I mean by that?
Well, consider this example, which looks a bit strange but will work and give you a totally empty data frame:
sooo_empty = tibble(first_name = character(),
                    last_name = character(),
                    age = numeric())
print(sooo_empty)
# A tibble: 0 x 3
# … with 3 variables: first_name <chr>, last_name <chr>, age <dbl>
As you can see, the result is an empty data frame, with zero rows but three columns. What we did was that we listed out (separated by commas) what we wanted our column names to be and also stated what data type the columns should have. This is something that we only have to do when we create empty data frames, since R is smart enough to automatically know this when we add vectors that contain data.
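To make the point about data types a bit more concrete, here is a small sketch that inspects the empty tibble and then puts one row into it. It assumes the tibble package is installed. The add_row() function comes from that package, and the name one_person is made up for this illustration.

```r
library(tibble)

# An empty tibble: the column names and types are declared, but there are no rows.
sooo_empty = tibble(first_name = character(),
                    last_name = character(),
                    age = numeric())

# Zero rows, three columns.
dim(sooo_empty)

# The declared type of each column.
sapply(sooo_empty, class)

# add_row() returns a new tibble with the extra row; sooo_empty itself stays empty.
one_person = add_row(sooo_empty,
                     first_name = "Ali",
                     last_name = "Khouri",
                     age = 72)
print(one_person)
```

Declaring the types up front means that values added later have to fit the columns you set up, instead of R guessing the types from the data.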
Just as we don’t have to put any data into a data frame when we create it, like we just did, we don’t have to write out the contents of each vector inside the tibble() call. We can create the vectors beforehand, which makes our code look nicer. Let’s try this.
fnames = c("Ali", "Wei", "Julia", "Olaf")
lnames = c("Khouri", "Li", "Harrington", "Johnsson")
ages = c(72, 14, 25, 9)
cool_people = tibble(first_name = fnames,
                     last_name = lnames,
                     age = ages)
print(cool_people)
# A tibble: 4 x 3
  first_name last_name    age
  <chr>      <chr>      <dbl>
1 Ali        Khouri        72
2 Wei        Li            14
3 Julia      Harrington    25
4 Olaf       Johnsson       9
Of the three ways of building a data frame from nothing that I’ve shown you so far, this is the least intuitive one for beginners. It is also the one you’ll see most often in the future. That’s because it allows you to separate creating the data from putting it in the data frame, which makes it nicer to read once you know what’s going on.
So, what’s going on?
First, we are creating three vectors called fnames, lnames and ages and entering our values into them.
Then, we create a new data frame with the tibble() function and say: “Hey, make me a column called first_name, and put the values from fnames into it. Then, make me a column called last_name, and put the values from lnames into it.” Etc.
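If you want to convince yourself that the different notations really produce the same table, you can build it each way and compare the results. The following is a sketch, assuming the tibble package is installed. The names by_row, by_column and from_vectors are made up for this comparison, and all.equal() is base R’s general-purpose comparison function.

```r
library(tibble)

# 1. Row-wise notation.
by_row = tribble(
  ~first_name, ~last_name,   ~age,
  "Ali",       "Khouri",     72,
  "Wei",       "Li",         14,
  "Julia",     "Harrington", 25,
  "Olaf",      "Johnsson",   9
)

# 2. Column notation with the values written inside the tibble() call.
by_column = tibble(first_name = c("Ali", "Wei", "Julia", "Olaf"),
                   last_name = c("Khouri", "Li", "Harrington", "Johnsson"),
                   age = c(72, 14, 25, 9))

# 3. Column notation with the vectors created beforehand.
fnames = c("Ali", "Wei", "Julia", "Olaf")
lnames = c("Khouri", "Li", "Harrington", "Johnsson")
ages = c(72, 14, 25, 9)
from_vectors = tibble(first_name = fnames,
                      last_name = lnames,
                      age = ages)

# all.equal() returns TRUE when it finds no differences between two objects.
all.equal(by_row, by_column)
all.equal(by_column, from_vectors)
```

Which notation you pick is a matter of readability: the data frame you end up with is the same.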
With these different ways of creating data frames from scratch, you should be all set. If you have any questions, drop me an email and I’ll try to help.
[1] Yes, I’m that old in Internet years. Go look both of them up if you don’t know them. After you’re done here.