Filter rows in R with dplyr
The filter function from dplyr subsets rows of a data frame based on one or more conditions. In this tutorial you will learn how to select rows using comparison and logical operators and how to filter by row number with slice.
Sample data
The examples in this tutorial use the women data set provided by R. This data set contains two numeric columns: height and weight.
Filtering rows based on a single condition
The filter function lets you subset the rows of a data frame based on a condition. You can keep the rows whose values are equal to, not equal to, lower than or greater than a given value by specifying the desired condition within the function.
The comparison operators in R are > (greater than), < (lower than), >= (greater than or equal to), <= (lower than or equal to), == (equal to) and != (not equal to).
For example, if you want to filter the rows where the height column is greater than 68 you can write the following:
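A minimal sketch of this filter, assuming the dplyr package is installed. The object name tall is only illustrative.

```r
library(dplyr)

# Keep only the rows of the built-in women data set whose height exceeds 68
tall <- filter(women, height > 68)
tall
```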
The filtering can also be based on a function. The following example selects the rows of the data frame where the height is lower than or equal to the mean of the column.
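One way this could look, assuming dplyr is installed. The name below_mean is hypothetical.

```r
library(dplyr)

# mean(height) is evaluated on the whole column, then each row is compared to it
below_mean <- filter(women, height <= mean(height))
below_mean
```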
It is also possible to filter rows using logical operators, functions that return TRUE or FALSE, or a combination of them. The most common are ! (logical negation), & (AND), | (OR), %in% (membership in a vector), is.na (missing values) and grepl (pattern matching).
Suppose that you want to filter the rows in which the height column takes the value 65, 70 or 72. For this you can use the %in% operator and filter the rows by a vector.
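A short sketch of filtering by a vector of values, assuming dplyr is installed. The name selected is illustrative.

```r
library(dplyr)

# %in% returns TRUE for each height that appears in the vector on the right
selected <- filter(women, height %in% c(65, 70, 72))
selected
```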
The opposite of a condition can be selected with the logical negation operator !. The example below shows how to select the opposite of the filtering made in the previous code.
To filter rows containing a specific string you can use grepl or str_detect (the latter from the stringr package). The following example filters the rows containing a specific pattern (e.g. rows of height containing a 5).
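As a sketch, negating the membership test keeps every row whose height is not 65, 70 or 72. This assumes dplyr is installed, and the name others is illustrative.

```r
library(dplyr)

# Wrap the condition in parentheses and negate it with !
others <- filter(women, !(height %in% c(65, 70, 72)))
others
```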
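A minimal sketch with grepl from base R, assuming dplyr is installed. grepl converts the numeric column to text before matching, so the pattern "5" matches any height with a 5 among its digits. The name with_five is hypothetical.

```r
library(dplyr)

# grepl() returns TRUE for each value whose text representation contains "5"
with_five <- filter(women, grepl("5", height))
with_five
```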
Multiple conditions
Row filtering can also be based on multiple conditions, for instance to keep rows where a value lies in a specific range or between two dates. For this you need logical operators, such as & to require one condition AND another, and | to require one condition OR another.
The example below selects rows whose values in the height column are greater than 65 and lower than 68.
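A sketch of this range filter, assuming dplyr is installed. The name in_range is illustrative.

```r
library(dplyr)

# Both conditions must hold for a row to be kept
in_range <- filter(women, height > 65 & height < 68)
in_range
```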
The multiple conditions can be based on multiple columns. In the following block of code we select the rows whose values in height are greater than 65 and whose values in weight are lower than or equal to 150.
In case you need to subset rows based on one condition OR another you can use |. The example below filters the rows whose values in height are greater than 65 or whose values in weight are greater than or equal to 150.
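A possible version of this two-column filter, assuming dplyr is installed. The name both is hypothetical.

```r
library(dplyr)

# Combine a condition on height with a condition on weight using &
both <- filter(women, height > 65 & weight <= 150)
both
```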
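A sketch of the OR case, assuming dplyr is installed. The name either is illustrative.

```r
library(dplyr)

# A row is kept when at least one of the two conditions is TRUE
either <- filter(women, height > 65 | weight >= 150)
either
```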
Filter by row number with slice
A function related to filter is slice, which selects rows based on their index/position. The function takes a sequence or vector of indices (integers) as input.
In addition, the slice_head function selects the first row of the data frame. This function provides an argument named n to select the first n rows.
Finally, if you need to select the last row you can use slice_tail. This function also provides an argument named n to select the last n rows of the data frame.
slice_sample selects rows randomly, and slice_min and slice_max select the rows with the lowest or highest values of a variable, respectively.
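A minimal sketch of slice with a sequence and with a vector of positions, assuming dplyr is installed. The object names are illustrative.

```r
library(dplyr)

# Rows 1 to 3, given as a sequence
first_three <- slice(women, 1:3)

# Rows 1, 5 and 10, given as a vector of positions
picked <- slice(women, c(1, 5, 10))

first_three
picked
```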
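As a sketch, assuming dplyr 1.0.0 or later is installed (the version in which the slice_head family was introduced). The object names are hypothetical.

```r
library(dplyr)

# Without n, slice_head() returns the first row only
first_row <- slice_head(women)

# With n = 3 it returns the first three rows
head_three <- slice_head(women, n = 3)

first_row
head_three
```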
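The counterpart for the end of the data frame could look as follows, again assuming dplyr 1.0.0 or later. The object names are illustrative.

```r
library(dplyr)

# Without n, slice_tail() returns the last row only
last_row <- slice_tail(women)

# With n = 3 it returns the last three rows, in their original order
tail_three <- slice_tail(women, n = 3)

last_row
tail_three
```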
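A short sketch of the three functions, assuming dplyr 1.0.0 or later. The seed and the object names are arbitrary choices for illustration.

```r
library(dplyr)

# Five rows drawn at random; set.seed() only makes the draw reproducible
set.seed(1)
random_rows <- slice_sample(women, n = 5)

# The two rows with the lowest height
shortest <- slice_min(women, height, n = 2)

# The two rows with the highest weight, sorted from highest to lowest
heaviest <- slice_max(women, weight, n = 2)

random_rows
shortest
heaviest
```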
Basic Analytics in R
Lesson 4: Filtering data
We often want to “subset” our data. That is, we only want to look at data for a certain year, or from a certain class of products or customers. We generally call this process “filtering” in Excel or “selection” in SQL. The key idea is that we use some criteria to extract a subset of rows from our data and use only those rows in subsequent analysis.
There are two ways to subset data in R:
- Use R’s built-in data manipulation tools. These are easily identified by their square bracket [] syntax.
- Use the dplyr library. Think of dplyr as “data pliers” (where pliers are very useful tools around the house).
I personally find dplyr much easier to use than the square bracket notation, so that is what we will use.
4.1 Preliminaries
4.1.1 Import data
Import the Bank data in the normal way in R Studio. You can either use Tools –> Import Dataset from within R Studio or run the command line version of the import functions from the tidyverse. I typically use the menu the first time but then save the command line version created by R Studio.
Click on the Bank tibble in the panel at the top right of R Studio to inspect the contents of the imported file.
4.2 Filters
4.2.1 Using a logical criterion
The easiest way to filter is to call dplyr’s filter function to create a new, smaller tibble: <new tibble> <- filter(<tibble>, <criterion>)
For example:
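A minimal sketch of this call, assuming dplyr is installed. Because the real Bank file is not available here, the code builds a small hypothetical stand-in tibble with made-up values; only the column names Gender and JobGrade come from the lesson.

```r
library(dplyr)

# Hypothetical stand-in for the imported Bank tibble (values are made up)
Bank <- tibble(
  Employee = 1:6,
  Gender   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  JobGrade = c(1L, 2L, 4L, 5L, 3L, 6L)
)

# Keep only the rows where Gender is exactly "Female"
FemaleEmployees <- filter(Bank, Gender == "Female")
FemaleEmployees
```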
The new tibble is called FemaleEmployees (although you can call it anything). The source tibble is, of course, the Bank tibble. The logical criterion is Gender=="Female". A few things to note about the logical criterion:
- Gender is the name of a column in the Bank tibble.
- The logical comparison operator for equals is ==, not =. This is the convention in many computer programming languages in which the single equals sign is the assignment operator. In R, <- is the assignment operator and == is the equals comparison operator. If you make a mistake in filtering, it is almost always because you used = instead of ==.
- “Female” is a literal string. It means: Only keep rows in which the value of Gender is exactly equal to “Female”. The string “female” is not close enough. A literal string is a literal string.
4.2.2 Filtering Using a List
One very powerful trick in R is to extract rows that match a list of values. For example, say we wanted to extract a list of managers. In this dataset, managers have a value of JobGrade >= 4, so we could use a logical criterion:
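As a sketch, assuming dplyr is installed and using a hypothetical stand-in for the Bank tibble (the values are made up):

```r
library(dplyr)

# Hypothetical stand-in for the imported Bank tibble (values are made up)
Bank <- tibble(
  Employee = 1:6,
  Gender   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  JobGrade = c(1L, 2L, 4L, 5L, 3L, 6L)
)

# No assignment: the filtered rows are simply printed to the console
filter(Bank, JobGrade >= 4)
```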
Note that there is no assignment operator here, so I have not created a new tibble. R simply summarizes the results in the console window.
The problem with this approach is that it requires job grades to be numeric (and thus ordinal). I could accomplish the same thing in a more general way using a list of the job grades I want to include:
- Create a new vector of managerial job grades using the “combine” function, c(). I call the resulting vector “Mgmt”.
- Use the is.element() function to test membership in the list for each employee. The full syntax is: is.element(x, y). The function returns TRUE if x is a member of y and FALSE otherwise.
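The two steps above can be sketched as follows, assuming dplyr is installed. The Bank tibble is a hypothetical stand-in, and the grades 4, 5 and 6 in Mgmt are an assumption about which job grades count as managerial.

```r
library(dplyr)

# Hypothetical stand-in for the imported Bank tibble (values are made up)
Bank <- tibble(
  Employee = 1:6,
  Gender   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  JobGrade = c(1L, 2L, 4L, 5L, 3L, 6L)
)

# Step 1: the managerial job grades (assumed values)
Mgmt <- c(4, 5, 6)

# Step 2: keep the rows whose JobGrade is a member of Mgmt
filter(Bank, is.element(JobGrade, Mgmt))
```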
I did not have to put the members of Mgmt in quotation marks because JobGrade is an integer. If my list contains text I have to use quotation marks:
Animals <- c("cat", "dog", "horse", "pig")
4.3 Syntactic sugar
Many computer languages offer “syntactic sugar”: shortcuts that make long or complex commands a bit easier to type. R and the tidyverse packages offer a couple of sweeteners. The important thing to remember is that the pipe %>% is only available once a tidyverse package such as dplyr is loaded, whereas %in% is part of base R and works everywhere.
4.3.1 Membership
Instead of remembering the syntax of is.element(x, y), you can use the alternative %in%. This makes the filter syntax a bit more readable. As you see from the output, the results are identical to the un-sweetened version.
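A small sketch comparing the two forms, assuming dplyr is installed. The Bank tibble and the contents of Mgmt are hypothetical stand-ins.

```r
library(dplyr)

# Hypothetical stand-in for the imported Bank tibble (values are made up)
Bank <- tibble(
  Employee = 1:6,
  Gender   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  JobGrade = c(1L, 2L, 4L, 5L, 3L, 6L)
)
Mgmt <- c(4, 5, 6)  # assumed managerial grades

# Un-sweetened and sweetened versions of the same membership filter
plain <- filter(Bank, is.element(JobGrade, Mgmt))
sweet <- filter(Bank, JobGrade %in% Mgmt)

# Check that both versions return the same rows
identical(plain, sweet)
```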
4.3.2 Pipes
Pipes are used to solve the problem of nested function calls. A nested function call occurs whenever the argument of f() is itself a function call g(). As you have probably discovered, it is hard to keep the parentheses straight when you write long statements of the form: f(g(x)).
A pipe takes the result of the interior function and passes it along to the exterior function. So f(g(x)) can be rewritten using a pipe: g(x) %>% f. This can be helpful for very long, multi-line statements in R. Just read the pipe operator %>% as “THEN”.
To illustrate, start with the tibble Bank THEN filter it THEN view it:
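A minimal sketch of such a pipeline, assuming dplyr is installed (it makes %>% available). The Bank tibble is a hypothetical stand-in, and print() is used in place of RStudio's interactive viewer so that the code also runs outside RStudio.

```r
library(dplyr)

# Hypothetical stand-in for the imported Bank tibble (values are made up)
Bank <- tibble(
  Employee = 1:6,
  Gender   = c("Female", "Male", "Female", "Male", "Female", "Male"),
  JobGrade = c(1L, 2L, 4L, 5L, 3L, 6L)
)

# Start with Bank, THEN filter it, THEN show it
Bank %>%
  filter(Gender == "Female") %>%
  print()

# The nested and piped forms give the same result
n_nested <- nrow(filter(Bank, Gender == "Female"))
n_piped  <- Bank %>% filter(Gender == "Female") %>% nrow()
```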