
59. [Hindi] Machine Learning : .drop_duplicates() Method in Pandas | 2018 | Python 3

           
.drop_duplicates() Method in Pandas





All right, in this lesson I'll introduce the drop_duplicates method, which is helpful for the same kinds of operations the duplicated method offers us. But drop_duplicates can be called on a DataFrame instead of a Series, so it allows us to do some of those similar filtering operations in a few less lines of code. Let's begin by executing the code from the previous lesson. I still have that sort_values method in there so that the names in the First Name column are sorted alphabetically. That lets us see some of the duplicates that pop up, such as the first name Aaron right here, which occurs at least three times in our DataFrame. I'll actually begin by passing my df DataFrame into Python's built-in len function. This gives me the number of rows in the DataFrame as it currently is, and we see that we have a thousand rows.
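The row-count check can be reproduced on any DataFrame. A minimal sketch with a toy DataFrame, since the employees.csv file from the lesson isn't included here:

```python
import pandas as pd

# Toy stand-in for the employees data used in the lesson
df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Beth", "Beth"],
    "Team": ["Marketing", "Finance", "Marketing", "Finance", "Finance"],
})

# len() on a DataFrame counts its rows
print(len(df))  # 5
```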

Now my logic here is: let's say I'm coming into this brand new, with no clue how drop_duplicates works. I might think that as soon as I call drop_duplicates, all of my duplicates will be removed, and there's my brand new DataFrame. But let's say I pass the result into len to see how many rows are in my new de-duped DataFrame. I actually get the exact same number: there are still a thousand rows after the drop_duplicates method has been called. Why is this? The reason is that it doesn't matter whether a duplicate value occurs within a single column. For example, we know for a fact that we have the same first names in the First Name column. We know we have duplicates in Gender, we know we have duplicates in Senior Management, we know we have duplicates in Team. It's only going to remove those rows where the cells across all of the columns are identical in more than one row. And there are no two rows in this DataFrame where the first name and the gender and the start date and the last login time and the salary and all of the other columns are equal. We may have the same values across one column or two columns or even five columns, but there are no two rows where the values are exactly identical. So no rows are removed by drop_duplicates with its default arguments. We have to provide a little bit more customization in order to get what we want.
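To see why the default call removes nothing unless an entire row repeats, here is a sketch using a toy DataFrame in which exactly one row is a full duplicate:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Beth", "Beth"],
    "Team": ["Marketing", "Finance", "Finance", "Finance"],
})

# By default, a row is dropped only when EVERY column value matches
# another row. Only the second ("Beth", "Finance") row qualifies here;
# the two Aarons survive because their teams differ.
deduped = df.drop_duplicates()
print(len(df), len(deduped))  # 4 3
```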

So let's say I want to once again remove duplicate first names from my DataFrame. I'm going to use .drop_duplicates (note the underscore in that method name; as always, the method concludes with a set of parentheses). Let's take a look at the documentation with Shift+Tab. The very first parameter is subset, and it's set to None by default. subset accepts a list of strings, and those strings represent the names of the columns in which we want to check for duplicates. So if I only enter "First Name", it's only going to look through the values in my First Name column. Now keep in mind we also have an additional parameter here called keep, and it functions exactly as it does on the duplicated method. Its default argument, which I'm going to write out, is "first". What that means is that it's not going to remove a row just because its value occurs more than once; it's going to keep the first occurrence of that value regardless. So for example, if I take a look at this, you can see that we're only going to have one Aaron value, even though Aaron occurs more than once in our First Name column and is a duplicate value. The reason is that it's still keeping the first occurrence of Aaron. That's what keep="first" means: keep the first occurrence of each value, whether it is duplicated or unique.
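The subset and keep parameters described above can be sketched on the same toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Beth", "Beth"],
    "Team": ["Marketing", "Finance", "Marketing", "Finance", "Finance"],
})

# Check for duplicates in "First Name" only; keep each name's first occurrence
first_kept = df.drop_duplicates(subset=["First Name"], keep="first")
print(list(first_kept["First Name"]))  # ['Aaron', 'Adam', 'Beth']
```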


So some of these names may actually occur legitimately once in our column, while some of them occur multiple times, but from pandas' perspective it's only looking at the first time it runs into each one, and those first occurrences are marked as non-duplicate. Conversely, if we replace this with "last", you'll see that we'll still have Aaron present, but it's going to keep the last occurrence of Aaron, on row 937. So again, we're not going to see any name more than once here, because we have removed duplicates but still kept the first, or in this case the last, occurrence of each one. Now, if we want to perform an operation where we remove a row whenever there is a duplicate in the First Name column, period (we don't care if it occurs two times or five times; if it occurs more than once we want those rows gone), we can replace this last argument with False. That's a boolean, without quotes. Now, because Aaron and Adam and Allen and all those names occur more than once, it's not going to keep the first occurrence, and it's not going to keep the last occurrence; it's going to completely remove those rows, because those names appear more than once in our original DataFrame. And here we again have a pretty short DataFrame that contains only those names that occur exactly once throughout our entire DataFrame.
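Here is a sketch of the two remaining keep settings on the same toy DataFrame, where "Adam" is the only name that occurs exactly once:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Beth", "Beth"],
    "Team": ["Marketing", "Finance", "Marketing", "Finance", "Finance"],
})

# keep="last": the final occurrence of each name survives
last_kept = df.drop_duplicates(subset=["First Name"], keep="last")
print(list(last_kept.index))  # [1, 2, 4]

# keep=False: any name that appears more than once is removed entirely
unique_only = df.drop_duplicates(subset=["First Name"], keep=False)
print(list(unique_only["First Name"]))  # ['Adam']
```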


Now, you have to be pretty careful here. I actually want to do another example that'll be kind of funny: if a value is completely non-unique, what's going to happen is you're going to get a blank DataFrame. For example, let's say I want to perform this drop_duplicates operation on my Team column. The Team column actually does include null values, and the duplicate rules apply to those null values as well: if there is more than one null value, they count as the same thing. Now let's say I use my keep=False setting here. What's actually going to happen is we're going to get an empty DataFrame. The reason is that there is no single team value that occurs only once in the Team column. Whether it's Marketing or Finance or Business Development, all of those values occur more than once. There are no unique ones, so the keep argument isn't going to keep any of the duplicates; it's going to completely wipe them out. So just keep that in mind: if you're using this method and you're seeing these bizarre empty DataFrames, that's why. And just one more thing I want to emphasize. If you want to look at duplicates across multiple columns, for example both First Name and Team, what I mean by that is: I don't care if the first name is the same but the team is different, and I don't care if the team is the same but the first name is different. Only if those two values are the same in those two columns do I want to remove the row. I can pass both column names as values within the comma-separated list that I provide as the argument to the subset parameter. What that's going to do is again keep Aaron here. In fact we're going to have two occurrences of Aaron, because the team is different in each case.
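Both traps can be sketched on the toy DataFrame: every team value repeats, so keep=False empties the frame, while a two-column subset keeps both Aarons because their teams differ:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Beth", "Beth"],
    "Team": ["Marketing", "Finance", "Marketing", "Finance", "Finance"],
})

# Every Team value occurs more than once, so keep=False leaves nothing behind
print(df.drop_duplicates(subset=["Team"], keep=False).empty)  # True

# Two-column subset: a row counts as a duplicate only if BOTH values repeat
# together. Only the second ("Beth", "Finance") pair is dropped.
pairs = df.drop_duplicates(subset=["First Name", "Team"])
print(len(pairs))  # 4
```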


So the other two Aarons that were removed must have had one of these same teams as well, and that's why those rows have been removed. It performs the duplicate-dropping operation while looking at the values across two columns, and it's not limited to two, by the way: you can continue adding commas and additional column names here if you want to look at duplicates across three columns or four columns. And as always, the drop_duplicates method does not modify the original DataFrame. If we want to do that, it does have an inplace parameter, which we can set to True if we want to overwrite it. Then we're going to have our de-duped df DataFrame, which we can prove is definitely shorter than our original one, so we can confirm that at least some rows have been removed. So that's the drop_duplicates method. Be sure to watch out for the little tricks and traps that come with it. Keep in mind you usually have to use the subset parameter to specify what you want to look for. Remember that the keep parameter affects what you get back: by itself the method is not going to remove all of the rows that have duplicates; you have to pass keep=False if you want to strictly remove those rows.
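The inplace behavior can be sketched as follows; note that the call then returns None, and reassigning the result (df = df.drop_duplicates(...)) is the more commonly recommended style today:

```python
import pandas as pd

df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam"],
    "Team": ["Marketing", "Finance", "Marketing"],
})

# inplace=True overwrites df itself instead of returning a new DataFrame
df.drop_duplicates(subset=["First Name"], inplace=True)
print(len(df))  # 2
```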

    So just watch out and be careful when you're playing around with this method.


Code Link : ML_59

Code :

#!/usr/bin/env python
# coding: utf-8

# In[4]:


import pandas as pd

# Load the employees dataset, parse the date columns, and tighten dtypes
df = pd.read_csv("employees.csv", parse_dates=["Start Date", "Last Login Time"])
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
df.sort_values("First Name", inplace=True)
df.head()


# In[6]:


len(df)


# In[7]:


# Only rows identical across ALL columns are dropped, so the count is unchanged
len(df.drop_duplicates())


# In[9]:


# Keep only the first occurrence of each first name
df.drop_duplicates(subset=["First Name"], keep="first")

len(df.drop_duplicates(subset=["First Name"], keep="first"))


# In[11]:


# Judge duplicates on the (First Name, Team) pair; keep the last occurrence
df.drop_duplicates(subset=["First Name", "Team"], keep="last").head()


# In[17]:


# inplace=True overwrites df itself rather than returning a new DataFrame
df.drop_duplicates(subset=["First Name", "Team"], inplace=True)

df.head()


