STAY WITH US

53. [Hindi]Machine Learning : Filtering a Dataframe Based on a Condition| 2018 |Python 3

Filtering a Dataframe Based on a Condition



 53. [Hindi]Machine Learning : Filtering a Dataframe Based on a Condition| 2018 |Python 3

       All right in this lesson will finally dive into filtering our data frames. And when I say filter I mean extracting those rows that meet a certain condition. So let's begin by executing all of our code from the previous lesson and taking a look at the first three rows of our data frame. So here we have our fake list of employees and let's say I want to put all of the male employees in our company and in order to do that I have to pull those rows where the value in the gender column is equal to male. So to begin the filter process I have to focus on a single series and that's going to be the series where I'm going to be checking for a condition. 

            So in this example I have to begin by selecting our gender series so I can do that by writing DMF and then using my square brackets to extract and then writing gender in there. And there we have our series now in order to compare it. All of the values in a series to a specific value. And this is going to be a broadcasting operation. I'm actually going to use the python sign the language sign for comparison which is the double equal sign. Now you have to be very careful here because if you accidentally use the single equal sign what it's going to do is assign the values on the right to the values in that series.

            So for example what I want to do here is compare to mail compare the values in the gender column to mail. If I accidentally forgot that second equals it's simply going to overwrite all of the values in the gender column with that string value of mail. So just be careful and remember that you always want those double equals or comparison. And what this is going to return to us is a brand new series and it's going to be a boolean series. It's going to be a series that only has trues and falses and those trues and falses will reflect the accuracy of the condition. So for example in the first two rows we have males and because it is equal to male. Those two rows are going to return true. However when it starts comparing the value in row to or rather the row with the index position of two. we have female which is not going to be equal to male. 

So the evaluation for that row is going to return false.

     Let's take a look. So when I execute this I'm going to get a series of booleans and those billions reflect whether the condition that I'm evaluating is valid or not true or false. Now this is where it gets a little bit funky. In order to extract these specific rows ones where the gender column is male or in other words where this is true. In this new boy and series I have to pass the entire code into the square brackets following my data frame.


       So what I'm going to do is write decaff here which is my data frame and then I'm going to put those square brackets. We use those in the previous lesson to extract single columns. It's the exact same syntax but now within those square brackets I'm going to pass the code that I just wrote. Now what it's doing is it's operating on that internal code first and that's generating the Boolean series that we see below whenever we pass a Boolean series within those square brackets panderers knows that they own. We want it to only extract those rows that have a true value. And in this case those rows with a true value will be those that have male.

            So now if I execute this code we're going to get a new data frame where you can see that it's only taken the rows that have male so you can see as I scroll down there's no females present. Let's do another example just to get as familiar with the syntax. So I'm going to hide this right here by double clicking. Let's say I want to pull every person who's on the team of Finance I can see one example here again again I have to begin by focusing on the specific areas that I want to perform a test on. Let's say I want to extract the team series and now I want to compare every value within this series to the word finance. So I'm going to use the double equal sign again. That's the literal comparison. It's going to check whether every single value here is equal to finance. And if it is as we see here in rows 2 and 3 it's going to return true for everything else including null values. It's going to return false because those values are not equal to finance. Now is also not equal to finance. So now that we have this boolean series of trues and falses if we pass this within square brackets right after our name of our data frame which is D.S. we're going to get that data frame where the values in the team column are equal only to finance.



            Now the reason the syntax is a little bit funky is because we're going to have the data from nameless to twice We're going to have the outside data frame variable. And then the square brackets.
And then inside we're using that data frame variable again to extract one of the series. So we're going to have DFA bracket. That syntax can be a little bit confusing. So there is is a slightly more elegant way to do this if you don't like the syntax. What we can do is assign the resulting boolean series. That's the first thing we generate here to it's own separate variable. So let me give you an example. I'm once again going to repeat my code. I'm going to begin by extracting the team series which is where I want to check my conditions. I'm now going to make it equal to finance. Or rather I'm going to check whether the values are equal to finance which is my string value here.


             I'm going to get my Boullion series and here I'm going to assign this to a variable and usually a very common variable name for this is mask. But you can use whatever you'd like. Sometimes I use extract. Sometimes I do condition. So now we have mass which is storing the boolean series that's generated by this code. And now after DSF and my square brackets instead of copying all the code that we have here which is what we did in the previous examples I can just write in mask which represents that boolean series. And I'm going to get the exact same result which is a data frame where the values in the team column are all finance.

          Let's do another example and this one is going to be a little bit trickier but if you're smart you can figure it out pretty quickly. Let's say I want to extract the rows from our senior management column where we have a true. So wherever we do have a senior management employee so things like this. I want to extract the zeroes if I begin by pulling out my senior management column we can see that it already consists of boolean values. It's already a boolean series. So while we could do something like compare these values to true we don't have to because if we do this is going to return the exact same series where ever it's true is going to say oh true is equal to true. That must be true. As long as we have a boolean series of any type and we have one right here we can pass that directly in our square brackets and pandas will return only the rows where there is a true value in the senior
management column as we can see here.


                  Let's do a couple of more examples. Let's talk about filtering based on a condition not being met. So let's say I want to pull any person who is not on the team of marketing. Once again I'm going to begin by pulling the team column and this time instead of using the double equal sign to compare. Literally I'm going to use the not equal comparison which is an exclamation point followed by an equal sign. And I'm going to check whether it's not equal to marketing. So any time we run into marketing it's going to return a false. Now because the value is not equal to marketing it's false because the value in fact is equal. So now you can see here where we have marketing on the first row.
We're going to get a false and every other value below including the other teens and the values are
going to return true because those values are not equal to marketing.

                  Now if I have my condition I can either feed it into those square brackets right here. That's one way of doing it or I can simply assign this bullion series to its own separate variable and pass that variable in between my square brackets. Let's do a couple more examples. Instead of using the comparison and non comparison or non quality operator lets use greater than and less than this time let's pull everybody who's salary is greater than 110000. So you can see here we have a salary column. That's the one that we're going to apply our condition on. So we have to begin by extracting it with DMF bracket salary. And now instead of using either the double equals to compare literally to an exact salary or not equal to compare not against the specific salary I'm just going to use the greater than sign and give it a value of 110000. So now it's going to return another other series of trues and salsas and wherever we see true the salary is greater than 110000. And where ever we see false we know the salary is not greater than 110000. Now that we have that condition we can wrap that inside our square brackets and feed that right after the name of our data frame which is D.S. and now I felt it for the rows where the salary is greater than 110000 let's do yet another example let's do an example where we do less than let's pull any employee who has a bonus percentage less than 1.5. I'll begin by extracting my bonus percentage column is going to type it out and then press tab. So there's my bonus percentage column of floats.



        I'm going to use the less than sign in order to compare to those that are less than 1.5 wherever the value in the Sierras is less than 1.5. I'm going to get a true and a false otherwise. And now if I wrap that in square brackets and write the data frame in front of it d f. Now I have extracted those rows where the bonus as we can see is less than 1.5. And the great thing about this as well is that this less than or greater than Centex also works with dates and this will be our last example. Let me collapse this. Let's say we want to pull all employees who started before or on January 1st 1985. So we have a start date column in our data frame. Let's begin by extracting it. So I'm going to do DMF square brackets start date and there we have our series of dates. And now I'm going to use the exact same syntax less than but in order to make it less than or equal to I can do this syntax less than or equal to with the less than sign immediately followed by the equal sign. And I can actually write in the date exactly as a string. So I'm going to mirror the format we at the end. And I'm just going to enter my date.



             So I want to compare it to January 1st 1985 which is 1985-01-01 that's going to return a boolean series which is going to be true if the date is prior to or on that date and they false. If it does not fall before January 1st 1985. Now that I have that Boullion series I can assign it to a variable like mask. And finally after the name of my data frame which is DMF I can write those square brackets feed that variable that represents my Boullion series and take a look and there as you can see the start date in every single column here is less than or equal to January 1st 1985. So here we have all of our 1980s employees as we scroll down we can see that all of these dates do fall before. So we do have the comparison operator with the double equals the non quality operator which is the exclamation point and the equals. We have less than less than or equal to or greater than and greater than an equal to. And you can use all of these to set up your own boolean series and as long as you feed in that boolean series into those square brackets following the data frame name it's going to extract only those rows that fit that condition or in other words only those rows where the boolean series has a true. For that specific row. And if that syntax gets a little bit too inelegant or a little bit too ugly due to the multiple square brackets and the multiple names of the data frames you can always store that boolean series in its own separate variable and simply feed that variable into the square brackets after the data frame name.

#!/usr/bin/env python
# coding: utf-8

# In[3]:


import pandas as pd

df = pd.read_csv("employees.csv", parse_dates=["Start Date","Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.head()


# In[6]:


df["Gender"] == "Male"

df[df["Gender"] == "Male"].head()


# In[9]:


df["Team"] == "Finance"

df[df["Team"] == "Finance"].head()


# In[11]:


mask = df["Team"] == "Finance"
df[mask]
df[mask].head()


# In[13]:


df[df["Senior Management"]].head()


# In[16]:


df[df["Team"] != "Marketing"]
df[df["Team"] != "Marketing"].head()


# In[19]:


df["Salary"] > 110000

df[df["Salary"] > 110000].head()


# In[22]:


df["Bonus %"] < 1.5

df[df["Bonus %"] < 1.5].head()


# In[24]:


mask = df["Start Date"] <= "1985-01-01"

df[mask]




So that's how we Filter a data frame based on a single condition. And in the next lesson we'll expand on this topic to talk about how we can filter with more than one condition.





Post a Comment

0 Comments