TSim: a system for discovering similar users on Twitter

AlMahmoud, Hind; AlKhalifa, Shurug

doi:10.1186/s40537-018-0147-2

Journal of Big Data

Table 2 Brief descriptions of the map and the reduce functions used in processing each signal


MapReduce job	Map function	Reduce function
Signal 1: Followings and followers relationship similarity	It takes in the examined user ID and each of his/her followings and followers. It simply produces pairs of (follower/following user ID, “1”).	The input is every follower/following user ID and a list of “1”s, one for each time this user ID appeared in the different lists. The reduce function adds up these “1”s to produce the follower/following user ID along with their sum, which is its score on this signal.
Signal 2: Mention similarity	It takes in the examined user ID and each of his/her tweet threads. It extracts the user IDs mentioned in these tweets (preceded by the @ symbol), calculates the score for each user ID based on the formula in Table 1, and outputs each user ID along with its score.	The input is every user ID mentioned in the tweets of the examined user along with a set of scores, one for each thread this user was mentioned in. The reduce function adds up these scores to produce the mentioned user ID along with their sum, which is the user’s score on this signal.
Signal 3: Retweet similarity	It takes in the examined user ID and each of his/her retweets. It simply produces pairs of (original tweeter user ID, “1”).	The input is every user ID whose tweets the examined user has retweeted and a list of “1”s, one for each time the examined user retweeted this particular user. The reduce function adds up these “1”s to produce the retweeted user ID along with their sum, which is its score on this signal.
Signal 4: Favorite similarity	It takes in the examined user ID and each of his/her favorited tweets. It simply produces pairs of (original tweeter user ID, “1”).	The input is every user ID whose tweets the examined user has favorited and a list of “1”s, one for each time the examined user favorited a tweet by this particular user. The reduce function adds up these “1”s to produce the favorited user ID along with their sum, which is its score on this signal.
Signal 5: Common hashtags similarity	It takes in the candidate user ID and each of his/her tweets that contain the hashtag symbol (#). It compares the sentiment of these tweets against the sentiment of the examined user’s tweets in the same hashtag (obtained in preprocessing) using the formula in Table 1 (HTOffset). It produces (candidate ID, hashtag + score).	The reduce function receives a candidate user ID and a list of (hashtag, score) pairs. It loops through this list and sums the scores with the same hashtag, then uses the similarity formula in Table 1 to compute the final score for each candidate. It produces the candidate ID and score.
Signal 6: Common interests similarity	It takes in the candidate user ID and a list of his/her tweets. It applies LDA to get the top 5 interests, then computes the score by comparing them with the examined user’s top 5 interests (obtained in preprocessing) according to the formula in Table 1. It produces (candidate ID, score).	The reduce function simply passes its input through as output.
Signal 7: Profile similarity	It takes in the candidate user ID and his/her profile info. It computes the score by comparing the profile with the examined user’s gender, location and language (obtained in preprocessing) according to the formula in Table 1. It produces (candidate ID, score).	The reduce function simply passes its input through as output.
Mid and final MapReduce	It takes in the candidate user ID along with his/her score. It produces (candidate ID, signal weight + score).	The reduce function receives a candidate user ID and a list of (signal weight, score) pairs. It loops through this list and sums the scores with the same weight. It then multiplies each summed score by its associated weight and adds up these weighted sums to produce the score for that candidate. It produces the candidate ID and score.
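
The table describes two recurring patterns: a count-based job (Signals 1, 3 and 4), where the map emits (user ID, 1) and the reduce sums the ones, and the mid and final job, where scores are grouped by signal weight, summed, and combined into a weighted total. The following is a minimal in-memory sketch of both patterns, not the authors' implementation. The helper `run_mapreduce`, all function names, the input record shapes, and the sample weights and scores are assumptions made for illustration only.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory stand-in for a MapReduce job:
    apply the map, group emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


# Count-based pattern (Signals 1, 3 and 4 in the table).
def count_map(record):
    # Assumed record shape: (examined user ID, list of related user IDs).
    _examined_user, related_users = record
    for user_id in related_users:
        yield user_id, 1


def sum_reduce(user_id, ones):
    # The score is the number of times the user ID appeared.
    return user_id, sum(ones)


# Mid and final pattern: combine per-signal scores using signal weights.
def weighted_map(record):
    # Assumed record shape: (candidate ID, signal weight, score).
    candidate_id, weight, score = record
    yield candidate_id, (weight, score)


def weighted_reduce(candidate_id, pairs):
    # Sum the scores that share a weight, then add up the weighted sums.
    per_weight = defaultdict(float)
    for weight, score in pairs:
        per_weight[weight] += score
    return candidate_id, sum(w * s for w, s in per_weight.items())


# Made-up sample data: one followings list and one followers list.
follow_lists = [("u0", ["a", "b"]), ("u0", ["b", "c"])]
signal_scores = run_mapreduce(follow_lists, count_map, sum_reduce)

# Made-up weights and scores for a single candidate.
weighted_records = [("a", 0.5, 2.0), ("a", 0.5, 4.0), ("a", 0.25, 4.0)]
final_scores = run_mapreduce(weighted_records, weighted_map, weighted_reduce)
```

With this sample data, user `b` appears in both lists and so receives a count of 2, while candidate `a` receives 0.5 × (2.0 + 4.0) + 0.25 × 4.0 as a final score. The pass-through reduce of Signals 6 and 7 would correspond to a reduce function that returns its input unchanged.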
