7

I have the following data:

userid itemid
  1       1
  1       1
  1       3
  1       4
  2       1
  2       2
  2       3

I want to drop userIDs who has viewed the same itemID more than or equal to twice. For example, userid=1 has viewed itemid=1 twice, and thus I want to drop the entire record of userid=1. However, since userid=2 hasn't viewed the same item twice, I will leave userid=2 as it is.

So I want my data to be like the following:

userid itemid
  2       1
  2       2
  2       3

Can someone help me?

import pandas as pd    
df = pd.DataFrame({'userid':[1,1,1,1, 2,2,2],
                   'itemid':[1,1,3,4, 1,2,3] })
1
  • 1
    Please post reproducible code that creates that data.
    – smci
    May 1, 2017 at 3:23

5 Answers 5

11

You can use duplicated to determine the row level duplicates, then perform a groupby on 'userid' to determine 'userid' level duplicates, then drop accordingly.

To drop without a threshold:

df = df[~df.duplicated(['userid', 'itemid']).groupby(df['userid']).transform('any')]

To drop with a threshold, use keep=False in duplicated, and sum over the Boolean column and compare against your threshold. For example, with a threshold of 3:

df = df[~df.duplicated(['userid', 'itemid'], keep=False).groupby(df['userid']).transform('sum').ge(3)]

The resulting output for no threshold:

   userid  itemid
4       2       1
5       2       2
6       2       3
2
  • 1
    What if I want to drop regarding a threshold? That is, what if I want to drop userids with more than 3 duplicates?
    – Mansumen
    May 1, 2017 at 3:26
  • 1
    This is a special-case trick for THRESHOLD = 2. For larger THRESHOLD, you need @DYZ's more general code.
    – smci
    May 1, 2017 at 3:30
7

filter

Was made for this. You can pass a function that returns a boolean that determines if the group passed the filter or not.

filter and value_counts
Most generalizable and intuitive

df.groupby('userid').filter(lambda x: x.itemid.value_counts().max() < 2)

filter and is_unique
special case when looking for n < 2

df.groupby('userid').filter(lambda x: x.itemid.is_unique)

   userid  itemid
4       2       1
5       2       2
6       2       3
0
3

Group the dataframe by users and items:

views = df.groupby(['userid','itemid'])['itemid'].count()
#userid  itemid
#1       1         2 <=== The offending row
#        3         1
#        4         1
#2       1         1
#        2         1
#        3         1
#Name: dummy, dtype: int64

Find out who saw any item only once:

THRESHOLD = 2
viewed = ~(views.unstack() >= THRESHOLD).any(axis=1)
#userid
#1    False
#2     True
#dtype: bool

Combine the results and keep the 'good' rows:

combined = df.merge(pd.DataFrame(viewed).reset_index())
combined[combined[0]][['userid','itemid']]
#   userid  itemid
#4       2       1
#5       2       2
#6       2       3
2
# group userid and itemid and get a count
df2 = df.groupby(by=['userid','itemid']).apply(lambda x: len(x)).reset_index()
#Extract rows where the max userid-itemid count is less than 2.
df2 = df2[~df2.userid.isin(df2[df2.ix[:,-1]>1]['userid'])][df.columns]
print(df2)
   itemid  userid
3       1       2
4       2       2
5       3       2

If you want to drop at a certain threshold, just set

df2.ix[:,-1]>threshold]
0

I do not know whether there is a function available in Pandas to do this task. However, I tried to make a workaround to deal with your problem.

Here is the full code.

import pandas as pd
dictionary = {'userid':[1,1,1,1,2,2,2],
              'itemid':[1,1,3,4,1,2,3]}

df = pd.DataFrame(dictionary, columns=['userid', 'itemid'])

selected_user = []

for user in df['userid'].drop_duplicates().tolist():

    items = df.loc[df['userid']==user]['itemid'].tolist()
    if len(items) != len(set(items)): continue
    else: selected_user.append(user)

result = df.loc[(df['userid'].isin(selected_user))]

This code will result the following outcome.

    userid  itemid
4   2       1
5   2       2
6   2       3

Hope it helps.

1
  • for user in df['userid'].drop_duplicates().tolist() can always be replaced by df.groupby('userid')
    – smci
    May 1, 2017 at 7:18

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Not the answer you're looking for? Browse other questions tagged or ask your own question.