From: TSim: a system for discovering similar users on Twitter
MapReduce job | Map function | Reduce function |
---|---|---|
Signal 1: Followings and followers relationship similarity | It takes in the examined user ID and each of his/her followings and followers. It simply produces pairs of (follower/following user ID, “1”). | The input is every follower/following user ID and a list of “1”s, one for each time this user ID appeared in the different lists. The reduce adds up these “1”s to produce the follower/following user ID along with the sum, which is its score on this signal. |
Signal 2: Mention similarity | It takes in the examined user ID and each of his/her tweet threads. It extracts the user IDs mentioned in these tweets (preceded by the @ symbol), calculates the score for each user ID using the formula in Table 1, and outputs each user ID along with its score. | The input is every user ID mentioned in the tweets of the examined user, along with a list of scores, one for each thread in which this user was mentioned. The reduce adds up these scores to produce the mentioned user ID along with the sum, which is the user’s score on this signal. |
Signal 3: Retweet similarity | It takes in the examined user ID and each of his/her retweets. It simply produces pairs of (original tweeter user ID, “1”). | The input is every user ID whose tweets the examined user has retweeted, along with a list of “1”s, one for each time the examined user retweeted that particular user. The reduce adds up these “1”s to produce the retweeted user ID along with the sum, which is its score on this signal. |
Signal 4: Favorite similarity | It takes in the examined user ID and each of his/her favorited tweets. It simply produces pairs of (original tweeter user ID, “1”). | The input is every user ID whose tweets the examined user has favorited, along with a list of “1”s, one for each time the examined user favorited a tweet by that particular user. The reduce adds up these “1”s to produce the favorited user ID along with the sum, which is its score on this signal. |
Signal 5: Common hashtags similarity | It takes in the candidate user ID and each of his/her tweets that contain the hashtag symbol (#). It compares the sentiment of these tweets against the sentiment of the examined user’s tweets under the same hashtag (obtained in preprocessing) using the formula in Table 1 (HTOffset). It produces (candidate ID, hashtag + score). | The reduce function receives a candidate user ID and a list of (hashtag, score) pairs. It loops through this list and sums the scores that share the same hashtag, then uses the similarity formula in Table 1 to compute the final score for each candidate. It produces the candidate ID and score. |
Signal 6: Common interests similarity | It takes in the candidate user ID and a list of his/her tweets. It applies LDA to get the top 5 interests, then computes the score by comparing them with the examined user’s top 5 interests (obtained in preprocessing) according to the formula in Table 1. It produces (candidate ID, score). | The reduce function simply passes its input through as output. |
Signal 7: Profile similarity | It takes in the candidate user ID and his/her profile info. It computes the score by comparing this info with the examined user’s gender, location and language (obtained in preprocessing) according to the formula in Table 1. It produces (candidate ID, score). | The reduce function simply passes its input through as output. |
Mid and final MapReduce | Takes in the candidate user ID along with his/her score. Produces (candidate ID, signal weight + score). | The reduce function receives a candidate user ID and a list of (signal weight, score) pairs. It loops through this list and sums the scores that share the same weight, then multiplies each summed score by its associated weight and adds up the weighted sums to produce the score for that candidate. It produces the candidate ID and score. |
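
The counting jobs (Signals 1, 3 and 4) and the final weighted aggregation in the table share one map/group/reduce shape, which the following sketch illustrates. It is a minimal in-memory illustration in Python, not the authors' Hadoop implementation: the helper `run_mapreduce`, the function names, the sample user IDs and the signal weights are all hypothetical and chosen only to make the data flow visible.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for one MapReduce job: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


# Counting signals (Signals 1, 3, 4): emit (other user ID, 1), then sum.
def count_mapper(record):
    _examined_user, other_user = record
    yield other_user, 1


def sum_reducer(user_id, ones):
    return user_id, sum(ones)


# Final job: emit (candidate ID, (signal weight, score)), then combine.
def make_weight_mapper(weights):
    def mapper(record):
        signal, candidate, score = record
        yield candidate, (weights[signal], score)
    return mapper


def weighted_reducer(candidate, pairs):
    # Sum the scores that share the same weight ...
    per_weight = defaultdict(float)
    for weight, score in pairs:
        per_weight[weight] += score
    # ... then multiply each sum by its weight and add the weighted sums.
    return candidate, sum(w * s for w, s in per_weight.items())


# Hypothetical input: examined user "u0" retweeted "a" twice and "b" once.
retweets = [("u0", "a"), ("u0", "b"), ("u0", "a")]
retweet_scores = run_mapreduce(retweets, count_mapper, sum_reducer)
# retweet_scores == {"a": 2, "b": 1}

# Hypothetical weights and per-signal scores (not the paper's values).
weights = {"retweet": 0.5, "favorite": 0.25}
signal_scores = [
    ("retweet", "a", 2),
    ("retweet", "b", 1),
    ("favorite", "a", 4),
]
final_scores = run_mapreduce(
    signal_scores, make_weight_mapper(weights), weighted_reducer
)
# final_scores == {"a": 2.0, "b": 0.5}
```

Signals 6 and 7 would fit the same helper with an identity reducer, since their map functions already emit one final score per candidate.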