.Duplicated() Method in Pandas
58. [Hindi]Machine Learning : .Duplicated() Method in Pandas| 2018 |Python 3
All right, in this lesson I'll introduce another method that we can call on a pandas Series to get a Boolean Series. That's the duplicated method, and it allows us to extract the rows from a DataFrame that are duplicates. So I'll begin by re-executing our code from the previous lessons, and I'm actually going to add one more line here. We currently do have duplicates in this data set, but they're scattered throughout, and I'm going to be focusing on the First Name column. So what I'm going to do is sort the values in this DataFrame by the values in the First Name column, or Series, with sort_values. And I'm going to make that operation in place, with inplace=True, just so we permanently modify the DataFrame. And so here you can see I have a bunch of duplicates immediately showing up.
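As a rough sketch of that sorting step, assuming a tiny made-up DataFrame in place of the lesson's employees.csv (the names and teams here are purely illustrative):

```python
import pandas as pd

# Hypothetical stand-in for the employees data set used in the lesson.
df = pd.DataFrame({
    "First Name": ["Brian", "Aaron", "Adam", "Aaron"],
    "Team": ["Sales", "Finance", "Legal", "Sales"],
})

# With inplace=True, sort_values modifies df itself and returns None.
result = df.sort_values("First Name", inplace=True)

# After sorting, repeated names sit next to each other.
sorted_names = df["First Name"].tolist()
print(sorted_names)
```

Because the call returns None when it runs in place, there is nothing useful to assign; the sorted data lives in df itself.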
I have multiple people with the same first name of Aaron, and so on. Let's actually take a look at that Series. So I'm going to extract it with my bracket syntax. And there we see we have all the Aarons, all the Adams, all the Alberts and so on, and there are some names in this First Name column that are unique, in which case they're only going to have a single row. So the method I'm going to introduce here is duplicated, and this is a tricky one. It takes a little bit of time to get comfortable with, so I'll try to explain this as best I can. The duplicated method has a parameter called keep, and by default it is set to "first", which marks the very first occurrence of each value as a non-duplicate. So what that means is, when we proceed downwards through the Series, it's going to take a look at Aaron here and say: I haven't seen Aaron at all before. It's my first time running into this value, therefore I'm not going to count it as a duplicate. It's then going to run into these three Aaron values right below, and these are the ones that are going to be marked as duplicates. So when it's looking at a Series, it's not marking Aaron as a duplicate just because it exists more than once; rather, it's only marking each subsequent occurrence after the first one as a duplicate.
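To make the default behaviour concrete, here is a minimal sketch on a small made-up Series (not the lesson's real data) with four Aarons, two Adams and one Angela:

```python
import pandas as pd

# Illustrative names only; already sorted so repeats are adjacent.
names = pd.Series(
    ["Aaron", "Aaron", "Aaron", "Aaron", "Adam", "Adam", "Angela"]
)

# No arguments means keep="first": the first occurrence of each value
# is False, and every later occurrence is True.
flags = names.duplicated()
print(flags.tolist())
# [False, True, True, True, False, True, False]
```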
So this Adam here, again, is going to be marked as unique. It's not going to be viewed as a duplicate row even though there are other Adams in the Series; rather, all the other Adams after the first one are going to be marked as duplicates. So just keep in mind, watch these first four rows for Aaron and you'll see that the very top one is going to be False because it is not a duplicate, and the next three are going to be True because they are duplicated. There we have it. Our very first occurrence of Aaron is marked as False; pandas does not view it as a duplicate, while the other rows are viewed as duplicates. And if we pass this Boolean Series into our square brackets, we're going to be able to extract those duplicate rows. So keep in mind, we can see three Aarons here as the first example. That first Aaron row, even though it is technically a duplicate, is not being counted as one. So it's only going to show us the rows after that one, so the three Aarons that occur after our first
occurrence of Aaron. And similarly, if I erase all this and come back to our original Series, there is an additional parameter that we can provide or modify on the duplicated method. As you may recall, we had that keep parameter and it was set to "first". What that means is that the first time it runs into any value, that value is not going to be marked as a duplicate.
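The extraction step with the default setting might look like the following sketch, which assumes a small invented DataFrame and spells out keep="first" explicitly:

```python
import pandas as pd

# Hypothetical rows standing in for the sorted employees DataFrame.
df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Aaron", "Aaron", "Adam", "Adam", "Angela"],
    "Team": ["Sales", "Legal", "Finance", "Sales", "Legal", "Sales", "Finance"],
})

# keep="first" is the default; writing it out just makes the intent visible.
is_repeat = df["First Name"].duplicated(keep="first")

# Passing the Boolean Series in square brackets keeps only the True rows,
# so the first Aaron and the first Adam are left out.
repeats = df[is_repeat]
print(repeats)
```

Here only three of the four Aaron rows come back, which matches what the walkthrough describes.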
That's a regular value. If I change that keep to "last", what it's going to do is view this last Aaron as a unique value, or rather not a duplicate, and view the first three as the duplicate values. So it's going to wait until it arrives at the very last occurrence of each value in the Series, and that's the one that's going to be marked as unique, so to speak, in pandas' mind. You can think of this almost as proceeding from the bottom of the Series upwards. So it's going to start at the very bottom and come up from there. It's going to run into this Aaron and say: this Aaron, I haven't seen that before, it's not a duplicate. Then it's going to run into these three Aarons and say: oh, these have already occurred before, therefore I'm going to mark them as duplicates. So as you can see, we do have the first three Aarons here marked as True, True for yes it is duplicated, and we have False for the last Aaron. And if I extract this, we'll again see three different Aaron values, but one of them is going to be different; it's going to be swapped. Now you may ask, and it's probably the first question I asked when I was looking at this: what if I want to remove all rows that are duplicates? I don't care if it's the first time or the second time. As long as it occurs more than once, I want it gone. So Aaron in this case occurs four times; it is a duplicate. I want to remove all of these rows that have duplicated values.
In that case, when we call the duplicated method to return a Series, the keep parameter accepts an additional argument: instead of "first" or "last" as a string, we can pass the Boolean False. Now what False will do is mark a value as duplicated if it occurs more than once, period. It doesn't matter if it's the first time or the last time. Here we have an example where we have all the Aarons, and all of these are going to be marked as True, True for yes it is a duplicate, because Aaron occurs more than once. So now that we have a Boolean Series that flags all of the duplicates in the DataFrame, what we can do here is extract only the rows whose First Name value occurs more than once. So these are all the duplicates.
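The two non-default settings of keep can be compared side by side in a short sketch, again on a made-up Series rather than the real First Name column:

```python
import pandas as pd

# Illustrative names only.
names = pd.Series(
    ["Aaron", "Aaron", "Aaron", "Aaron", "Adam", "Adam", "Angela"]
)

# keep="last": only the final occurrence of each value is False.
last_flags = names.duplicated(keep="last")
print(last_flags.tolist())
# [True, True, True, False, True, False, False]

# keep=False: every value that occurs more than once is True.
all_flags = names.duplicated(keep=False)
print(all_flags.tolist())
# [True, True, True, True, True, True, False]
```

Note that False here is the Boolean, not the string "False"; only "first" and "last" are passed as strings.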
You can see that the Aarons are represented four times, and the Aaron name does occur four times. Now another common operation: what if you want to keep only the rows that are unique? So I want to only have the rows where there are no duplicate first names, just the unique names; they have to occur only once. This is a little bit trickier, but I have here my original Boolean Series that came back when I called the duplicated method with keep=False. And as you know, it returns True for all duplicate values. And whenever we pass a Boolean Series to our brackets, it's going to extract all those rows where we have a True. Now, if we can negate these values, in other words if we could turn the Trues here to False and the Falses to True, keeping in mind that the Falses in this current Boolean Series represent the unique values because they are not duplicated, then those Falses would become Trues, and extracting those rows would give us the unique values. Now there is a special symbol that we can put in front of a Boolean Series to negate it, or convert it the other way, so all of the Trues will become False and all of the Falses will become True. That symbol is called the tilde, t-i-l-d-e, and it's located to the left of your 1 key on the keyboard. It looks something like this: ~, that little squiggly line that's placed kind of in the middle.
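A small sketch of the tilde on its own, assuming the same kind of invented Series as before, shows the flip from "is repeated" to "is unique":

```python
import pandas as pd

# Illustrative names only.
names = pd.Series(["Aaron", "Aaron", "Adam", "Adam", "Angela"])

# True wherever the name occurs more than once.
is_repeated = names.duplicated(keep=False)

# The tilde inverts a Boolean Series element by element.
is_unique = ~is_repeated

print(is_repeated.tolist())  # [True, True, True, True, False]
print(is_unique.tolist())    # [False, False, False, False, True]
```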
Now when I execute this, you'll see that it reverses all of the Trues to False and all of the Falses to True. So now the Trues that we have in this Boolean Series represent the unique values, the values that only occur once in the First Name column. Finally, I can either assign this to a variable or pass it directly to my square brackets; let's assign it to a variable because it's a little bit cleaner. Now I have that Boolean Series in mask. If I open my square brackets and pass mask, you'll see that all of these names are only going to be represented once. And this is the one and only time that each of these names occurs in our original DataFrame. So Angela does not occur more than once; it only happens once, on row 8. And the same goes for Brian and Carol and David and Dennis and so on. You can see that out of our thousand employees, there's actually a pretty small number of employees that have a unique name, something like nine here. So that's the duplicated method. It can be used to return a Boolean Series that specifies whether each value is a duplicate or not. Keep in mind that the default option, which is the "first" argument for the keep parameter, will mark any subsequent occurrence of each value as a duplicate, but it will not mark the first occurrence as a duplicate.
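Putting the mask to work on a DataFrame could look like this sketch; the rows are invented for illustration, so the unique names here will not match the ones in the lesson's data:

```python
import pandas as pd

# Hypothetical rows standing in for the employees DataFrame.
df = pd.DataFrame({
    "First Name": ["Aaron", "Aaron", "Adam", "Adam", "Angela", "Brian"],
    "Team": ["Sales", "Legal", "Finance", "Sales", "Legal", "Finance"],
})

# True only for rows whose first name occurs exactly once.
mask = ~df["First Name"].duplicated(keep=False)

# Keep just those rows.
unique_rows = df[mask]
print(unique_rows)
```

Storing the Boolean Series in a variable such as mask keeps the extraction line short and makes the condition reusable.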
So you're probably, more often than not, going to be using this keep=False argument in order to extract all the rows that are duplicates, or negating it with the tilde to extract all the unique values. So that's an introduction to the duplicated method. Much like the isnull or notnull methods, it can be particularly helpful whenever we want to create a Boolean Series for the purposes of extraction. And in the next lesson we'll dive into yet another method, the drop_duplicates method, which will allow us to remove duplicates in a slightly easier operation.
I'll see you there.
Code Link : ML_58
Code :
#!/usr/bin/env python
# coding: utf-8
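# In[6]:
import pandas as pd
# Load the employees data set, parsing the two date columns as datetimes
df = pd.read_csv("employees.csv", parse_dates=["Start Date","Last Login Time"])
# Convert columns to more appropriate dtypes
df["Senior Management"] = df["Senior Management"].astype("bool")
df["Gender"] = df["Gender"].astype("category")
# Sort by first name in place so repeated names sit next to each other
df.sort_values("First Name", inplace=True)
df.head()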
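# In[10]:
# Default keep='first': every occurrence after the first is flagged True
df["First Name"].duplicated()
df[df["First Name"].duplicated()].head()
# In[15]:
# keep='last': every occurrence before the last is flagged True
df["First Name"].duplicated(keep = 'last')
df[df["First Name"].duplicated(keep = 'last')].head()
# In[18]:
# keep=False: every value that occurs more than once is flagged True
df["First Name"].duplicated(keep = False)
df[df["First Name"].duplicated(keep = False)].head()
# In[19]:
# Negate with ~ to keep only rows whose first name occurs exactly once
mask = ~df["First Name"].duplicated(keep = False)
df[mask]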