.
B. Retweet Times of RTI
When the lifespan of a tweet is maintained for more than j minutes, we divide j equally into p time units. RTI indicates the number of retweets occurring in each of the p time units. We can convert RTI(t_i) into a numeric vector as follows:
RTI(t_i) = .
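A minimal sketch of how such a vector might be built, assuming that each component is simply the retweet count of one time unit and that retweet times are given as minutes elapsed since the seed tweet was posted. The function name `rti_vector` and the input format are illustrative assumptions, not part of the original method; the defaults j = 60 and p = 6 follow the parameter values used later in the paper.

```python
from typing import List


def rti_vector(retweet_minutes: List[float], j: int = 60, p: int = 6) -> List[int]:
    """Count retweets in each of p equal time units covering the first j minutes."""
    unit = j / p  # length of one time unit, in minutes
    counts = [0] * p
    for minute in retweet_minutes:
        # Retweets outside the observed window [0, j) are ignored (assumption).
        if 0 <= minute < j:
            counts[int(minute // unit)] += 1
    return counts
```

With hypothetical offsets of 1, 3, 12, 25, 59, and 75 minutes, `rti_vector([1, 3, 12, 25, 59, 75])` yields `[2, 1, 1, 0, 0, 1]`: the retweet at 75 minutes falls outside the 60-minute window and is dropped.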
V. Framework for Predicting Tweet Popularity
We propose a framework for predicting the popularity of a tweet. As shown in Fig. 3, the framework is partitioned into two phases. In the phase that generates the prediction knowledge bases, we use a Hadoop cluster to first remove the tweets posted by spammers and build retweet graphs. We then extract the features used for prediction from the retweet graphs and generate the knowledge bases according to tweet creation time. In the phase that predicts tweet popularity, a given target tweet is matched against the knowledge base. First, N tweets with a topic similar to that of the target tweet are extracted from the knowledge base. Of these N tweets, only the M tweets that have an analogous retweet pattern are passed on to the next processing stage, where we extract the T tweets whose properties are similar to those of the target tweet. Finally, we predict the popularity of the target tweet based on the knowledge extracted from the T tweets.
Fig. 3. Framework for predicting tweet popularity.
- 1. Prediction Knowledge Base Algorithm
We extract the features described in section IV from previous popular tweets. Algorithm 1 shows the generative processes of the prediction knowledge bases.
Algorithm 1 Generating prediction knowledge bases
Input
  S: a set of seed tweets
  SL: spammer list
  TWEET: tweet collection
Output
  KB_i: knowledge bases
1: T = MapReduce(TWEET, SL)
2: for seed_t in S
3:   enQueue(tweet_id of seed_t)
4:   while queue_size > 0 do
5:     tweet_id = deQueue()
6:     for rtg_i in T
7:       if tweet_id = name of rtg_i then
8:         addGraph(seed_t, rtg_i)
9:         enQueue(each tweet_id of rtg_i)
10:      end if
11:    end for
12:  end while
13:  feature_i = extractFeature(seed_t)
14:  store(feature_i, KB_i)
15: end for
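The queue-driven expansion in lines 2 through 12 of Algorithm 1 can be sketched in Python as a breadth-first traversal. This is an illustration under stated assumptions rather than the authors' implementation: the groups are held in a dictionary keyed by group name (so the inner scan over T becomes a lookup), and the function name `build_retweet_graph` and the adjacency-list output are hypothetical choices.

```python
from collections import deque
from typing import Dict, List


def build_retweet_graph(seed_id: str, groups: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Expand a seed tweet into its retweet graph, breadth first.

    groups maps a group name (the id of the tweet that was retweeted)
    to the ids of the tweets that retweeted it.
    """
    graph: Dict[str, List[str]] = {}
    queue = deque([seed_id])
    visited = {seed_id}
    while queue:
        tweet_id = queue.popleft()
        children = groups.get(tweet_id, [])  # group whose name equals tweet_id
        if children:
            graph[tweet_id] = list(children)  # attach the group to the graph
        for child in children:
            if child not in visited:  # guard against revisiting a tweet
                visited.add(child)
                queue.append(child)
    return graph
```

Groups that are never reached from the seed tweet are left out of the graph, which matches the matching rule in line 7 of the algorithm.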
A. Data Collection
Only tweets written in Korean were included in this study, since the posting time changes depending on the time zone of users. We collected the tweets using a Twitter Stream API that allows the tweets to be searched based on keywords or strings.
B. Data Preprocessing
This step preprocesses the collected tweets and corresponds to line 1 in Algorithm 1. A tweet provides abundant metadata describing its own status. Examples of simple metadata for each type of tweet are shown in Figs. 4 and 5. Figure 4 shows the status when a seed tweet is posted. In this example, the tweet has metadata that includes the identity (id) of both the tweet and the user. Figure 5 shows the status when a follower retweets the seed tweet. In this example, the tweet includes not only its own id and the id of the follower, but also the id of the seed tweet and the id of the user who posted the seed tweet.
If several users retweet the same tweet, their retweets share the same seed-tweet id. We therefore regarded tweets that share the same seed-tweet id as one group and named the group after this id. In Algorithm 1, rtg_i denotes one such group.
The MapReduce algorithm is suitable for processing retweet data of the kind shown in Fig. 5. Algorithm 2 is the map function, which is designed to filter out spammers and collect retweet data. After the map function finishes, the reduce function in Algorithm 3 receives <key, List<value>> pairs. The key corresponds to the id of the user who posted the seed tweet, and List<value> corresponds to the group of users that share the id of the seed tweet. The reduce function provides the results of the preprocessing step. Its output is a set of tweet-id groups that exclude tweets written by spammers.
Algorithm 2 Map function for preprocessing
1: global var spammer_list ← allocate()
2: map(LongWritable key /* default */, Text tweet /* collection of tweets */)
3:   if (spammer_list_contains(user id of tweet))
4:     continue
5:   if (is retweets(tweet))
6:     continue
7:   write(id of user who posts seed tweet, id of tweet)
8: spammer_list ← free()
Algorithm 3 Reduce function for preprocessing
1: reduce(id of seed tweet, List<id of tweet> list) /* <key, List<value>> */
2:   for id of tweet in list
3:     write(id of seed tweet, id of tweet)
4:   end for
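The preprocessing step can be imitated in plain Python without a Hadoop cluster. The sketch below is only an in-memory approximation and rests on two assumptions that go beyond the pseudocode: pairs are keyed by the seed-tweet id (following the grouping rule described for rtg_i and the signature of Algorithm 3), and only retweets, which are the tweets that carry a seed-tweet id, are emitted. The record fields `id`, `user_id`, and `seed_id` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

Tweet = Dict[str, str]  # hypothetical record: "id", "user_id", optional "seed_id"


def map_tweets(tweets: Iterable[Tweet], spammers: Set[str]) -> Iterator[Tuple[str, str]]:
    """Emit (seed-tweet id, retweet id) pairs, skipping spammers."""
    for tweet in tweets:
        if tweet["user_id"] in spammers:
            continue  # drop tweets posted by known spammers
        seed_id = tweet.get("seed_id")
        if seed_id is None:
            continue  # assumption: tweets without a seed id are not retweets
        yield seed_id, tweet["id"]


def reduce_pairs(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group retweet ids under the seed-tweet id they share."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for seed_id, tweet_id in pairs:
        groups[seed_id].append(tweet_id)
    return dict(groups)
```

Each resulting dictionary entry plays the role of one group: the key is the group name and the list holds the ids of the retweets in that group.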
Fig. 4. Example metadata of a tweet.
Fig. 5. Example metadata of a retweeted tweet.
C. Creating Retweet Graphs
Lines 4 through 12 in Algorithm 1 show the steps for creating retweet graphs. We create retweet graphs using the seed tweets, which are classified as popular tweets, and the set of groups produced by MapReduce. In the first step, we check whether the set of groups contains a group named after the id of the seed tweet. If such a group is found, we generate a retweet graph by adding the group to the seed tweet. In the second step, since a graph can have sub-graphs, we look for further groups by comparing each tweet id in the graph generated in the first step with the names of the groups in the set.
D. Extracting Features
We extract the social features, content features, posting-time features, and local features in line 13 of Algorithm 1. We extract the TRI and RTI from the retweet graphs, and the social, content, and posting-time features from the seed tweet. Some of these are used to derive other features, such as the informativeness of a tweet and the reliability or activity of its author.
E. Constructing Prediction Knowledge Bases
We store the extracted features in two knowledge bases, as discussed in Section IV.3, which corresponds to line 14 in Algorithm 1. We analyzed the amount of Twitter traffic to divide our data into two parts: one for user-active time and the other for user-inactive time. Figure 6 shows that traffic in Twitter varies depending on the time of day. As a result, we defined user-active time as the period from 12 p.m. to 2 a.m. and user-inactive time as the remaining hours. When predicting the popularity of a tweet, we select the prediction knowledge base according to when the tweet was created.
Fig. 6. Traffic distribution in Twitter.
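Selecting a knowledge base by creation time reduces to a check on the hour of day. A minimal sketch follows, assuming that the active window runs from 12:00 up to but not including 02:00 and that the timestamp is already in the users' local time zone; the function name and the labels `"active"` and `"inactive"` are illustrative.

```python
from datetime import datetime


def select_knowledge_base(created_at: datetime) -> str:
    """Return which knowledge base applies to a tweet created at created_at."""
    hour = created_at.hour
    # The active window wraps past midnight: 12:00-23:59 and 00:00-01:59.
    return "active" if hour >= 12 or hour < 2 else "inactive"
```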
- 2. Predicting the Popularity of a Tweet
When the target tweet is entered, we extract tweets with similar topics, retweet patterns, and properties sequentially from the prediction knowledge bases. Algorithm 4 shows this prediction process.
Algorithm 4 Predicting tweet popularity
Input
  et_i: target tweet
  KB_i: prediction knowledge bases
  α, β, γ: the number of knowledge items extracted at each step
Output
  Popularity(lifespan, retweet times)
1: KB_i = selectKnowledgeBase(et_i)
2: TS_i = topicSimilarity(et_i, KB_i, α)
3: GS_i = RetweetPatternSimilarity(et_i, TS_i, β)
4: US_i = PropertiesSimilarity(et_i, GS_i, γ)
5: Let estimated popularity = 0, c = γ
6: while c > 0 do
7:   estimated popularity += REALPOPULARITY(US_i, c)
8:   c = c − 1
9: end while
10: Popularity = estimated popularity / γ
A. Extracting Tweets of Similar Topics
When the target tweet is given, we select the prediction knowledge base based on the posting time and extract the top α tweets that have a topic similar to that of the target tweet. This corresponds to lines 1 and 2 in Algorithm 4. We measure topic similarity as the Jaccard similarity of the text bigrams of the two tweets.
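The Jaccard similarity of two bigram sets is the size of their intersection divided by the size of their union. The sketch below assumes character bigrams, since the text does not say whether bigrams are taken over characters or words; `bigrams`, `jaccard`, and `top_alpha` are illustrative names.

```python
from typing import Dict, List, Set


def bigrams(text: str) -> Set[str]:
    """Set of character bigrams of text (assumption: characters, not words)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the bigram sets of a and b."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 0.0  # no bigrams on either side: treat as dissimilar
    return len(ga & gb) / len(ga | gb)


def top_alpha(target: str, knowledge_base: Dict[str, str], alpha: int) -> List[str]:
    """Ids of the alpha tweets whose text is most similar to target."""
    ranked = sorted(knowledge_base,
                    key=lambda tid: jaccard(target, knowledge_base[tid]),
                    reverse=True)
    return ranked[:alpha]
```

For example, "abcd" and "bcde" share the bigrams "bc" and "cd" out of four distinct bigrams, so their similarity is 0.5.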
B. Extracting Tweets with Similar Retweet Patterns
This step corresponds to line 3 in Algorithm 4. We extract the top β tweets that have retweet patterns similar to those of the target tweet. The retweet patterns are the RTI and TRI. In this paper, we set the parameters n = 100 and k = 5 in extracting TRI(t_i) and the parameters j = 60 and p = 6 in extracting RTI(t_i). We measure the similarity between the target tweet et_i and each of the extracted top α tweets ht_i using the Euclidean distance.
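A sketch of this ranking step, assuming that the TRI and RTI vectors of a tweet are concatenated into one pattern vector before the distance is taken (the text does not state how the two are combined). The names `euclidean` and `top_beta` are illustrative.

```python
import math
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length pattern vectors."""
    if len(u) != len(v):
        raise ValueError("pattern vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def top_beta(target: Sequence[float],
             candidates: Dict[str, Sequence[float]],
             beta: int) -> List[str]:
    """Ids of the beta candidates closest to the target pattern."""
    ranked = sorted(candidates, key=lambda tid: euclidean(target, candidates[tid]))
    return ranked[:beta]
```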
C. Extracting Tweets with Similar Properties
In this stage, which corresponds to line 4 of Algorithm 4, we extract the top γ tweets with similar properties from the top β tweets, considering reliability, activity, and informativeness. We define R(et_i), A(et_i), and I(et_i) as the reliability, activity, and informativeness of the target tweet, and R(ht_i), A(ht_i), and I(ht_i) as the reliability, activity, and informativeness of each of the extracted top β tweets, respectively.
(5) $\mathrm{DIST}(et_i, ht_i) = [R(et_i) - R(ht_i)]^2 + [A(et_i) - A(ht_i)]^2 + [I(et_i) - I(ht_i)]^2$
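Equation (5) as written is a sum of squared differences over the three properties. A minimal sketch of the distance and the resulting top-γ selection follows; the `Properties` container and the function names are hypothetical, and the property values are assumed to be precomputed numbers on comparable scales.

```python
from typing import Dict, List, NamedTuple


class Properties(NamedTuple):
    reliability: float
    activity: float
    informativeness: float


def dist(e: Properties, h: Properties) -> float:
    """Sum of squared differences over reliability, activity, informativeness."""
    return sum((x - y) ** 2 for x, y in zip(e, h))


def top_gamma(target: Properties,
              candidates: Dict[str, Properties],
              gamma: int) -> List[str]:
    """Ids of the gamma candidates with the smallest DIST to the target."""
    ranked = sorted(candidates, key=lambda tid: dist(target, candidates[tid]))
    return ranked[:gamma]
```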
D. Predicting the Popularity of Tweets
This stage predicts the lifespan and retweet times of the target tweet, corresponding to lines 6 through 10 in Algorithm 4. We acquire a set of tweets KB_i = {ht_0, ht_1, …, ht_r} that are the most similar to the target tweet in terms of topic, retweet pattern, and properties. Finally, we predict the lifespan and retweet times of the target tweet et_i by
(6) $\mathrm{Popularity}(et_i) = \frac{1}{\gamma} \sum_{i=1}^{\gamma} \mathrm{Popularity}(ht_i).$
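Equation (6) is an arithmetic mean over the γ retained historical tweets. As a small illustration, assuming that each historical tweet is represented by a hypothetical (lifespan, retweet count) pair and that both quantities are averaged independently:

```python
from typing import Sequence, Tuple


def predict_popularity(neighbours: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Average (lifespan, retweet count) over the most similar historical tweets."""
    gamma = len(neighbours)
    if gamma == 0:
        raise ValueError("at least one similar historical tweet is required")
    lifespan = sum(n[0] for n in neighbours) / gamma
    retweets = sum(n[1] for n in neighbours) / gamma
    return lifespan, retweets
```

Dividing by the number of neighbours rather than by a loop counter avoids a division by zero once the counter has been decremented to 0.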
VI. Experiment
- 1. Data Analysis
To evaluate our approach, we collected data from Twitter between June 2012 and October 2012. The dataset contains 473 million postings generated by about 3.7 million users. Figure 7 shows the distribution of retweets.
Fig. 7. Retweet distribution.
In Fig. 7, the ratio of tweets retweeted more than 100 times is very small, at about 1% (50,717 tweets) of the total number of tweets. In terms of the number of retweets, we assume that tweets retweeted more than 100 times were popular tweets in the past. In addition, we analyze the lifespan of tweets that were retweeted more than 100 times, as shown in Fig. 8. Although lifespans vary widely, few tweets have a lifespan longer than ten days. Figure 8 shows that tweets with a lifespan within a ten-day period account for 70% (34,862 tweets). We use only tweets whose lifespan falls within the first ten days, since the tweets beyond this point are too few to serve as a prediction knowledge base, and it can be assumed that they were retweeted accidentally. Throughout our analysis, we define popular tweets as those that have been retweeted more than 100 times and have a lifespan within the ten-day period.
Fig. 8. Lifespan distribution of tweets.
- 2. Experimental Data
We divided our five-month dataset into training data generated from June 2012 to September 2012 and test data generated in October 2012. Table 2 shows the dataset used for predicting the lifespan of a tweet. The authors in [2] suggested using a limited duration of 0 hours to 72 hours to evaluate the performance. We used the prediction range suggested by [2] and also extended it to 240 hours. Table 3 shows the dataset used in predicting the number of retweets.
Table 2. Experimental data for predicting lifespan.
Duration of a tweet’s lifespan (hours) | Number of graphs used in prediction knowledge base | Number of graphs used in test data |
0 - 24 | 8,697 | 2,394 |
24 - 48 | 5,638 | 1,783 |
48 - 72 | 3,071 | 1,352 |
72 - 96 | 2,189 | 779 |
96 - 120 | 1,709 | 572 |
120 - 144 | 1,470 | 455 |
144 - 168 | 1,314 | 384 |
168 - 192 | 1,049 | 330 |
192 - 216 | 741 | 204 |
216 - 240 | 580 | 151 |
Table 3. Experimental data for predicting retweet times.
Number of graphs used in prediction knowledge base | Number of graphs used in test data |
26,458 | 8,404 |
- 3. Comparison of Prediction Models
We compare our model with conventional models. Only one previous study has addressed the prediction of a tweet's lifespan. Of the algorithms proposed in [2], ATR-KNN (K-Nearest Neighbor) outperformed the other approaches. ATR stands for the same author, similar post time, and retweet patterns. However, that model cannot predict the lifespan when there is no historical data at all or when fewer than five postings were written by the same author. We worked around this problem of the previous model [2] as follows. For instance, when an author had posted only three tweets in the past, we extracted two similar tweets from other authors by simply considering the retweet patterns and posting time. Regarding the prediction of the number of retweets, we compared our model with the classification based on user preference, which outperformed the other approaches proposed in [4]. The interestingness scores of all candidate users were trained by extracting the retweet information that represents the relation between an author who posts the seed tweet and a follower who retweets it in a specific category. The interestingness score indicates how likely it is that a user will retweet the seed tweet in a specific category. The number of retweets can be calculated by counting the candidate users whose score exceeds a certain preference threshold. However, because the training set did not contain all user preferences, it may be impossible to measure some user preferences in the test data. Therefore, we considered only candidate users whose preference information was included in the training data. In [7], tweets were classified into several categories depending on the event, and the top N most retweeted tweets were selected from each category. The target tweet's number of retweets was inferred by measuring the curve similarity, which relies on the ratio between the target tweet and the top N tweets in each time unit.
- 4. Evaluation Metrics
In this research, we used different evaluation methods according to the prediction condition. In predicting the lifespan of the tweets, we use the root-mean-square error (RMSE) to obtain the time tolerance between the actual observed lifespan and the estimated lifespan. In (7), N represents the number of test tweets, Lifespan_r(t_i) represents the actual observed lifespan of tweet t_i, and Lifespan_p(t_i) represents its estimated lifespan. RMSE is calculated by
(7) $\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} [\mathrm{Lifespan}_r(t_i) - \mathrm{Lifespan}_p(t_i)]^2}.$
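The RMSE can be computed directly from paired lists of observed and estimated lifespans. A minimal sketch, with lifespans assumed to be given in hours and the function name `rmse` chosen for illustration:

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root-mean-square error between observed and estimated lifespans."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(r - p) ** 2 for r, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))
```

For hypothetical observed lifespans of 24 and 48 hours with estimates of 18 and 56 hours, the squared errors are 36 and 64, their mean is 50, and the RMSE is the square root of 50, about 7.07 hours.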
Instead of requiring the exact number of retweets to be predicted, we evaluated the accuracy of the prediction within a tolerance. The prediction error of tweet t_i, PredictionError(t_i), is the absolute difference between the actual observed number of retweets, RetweetTimes_r(t_i), and the estimated number of retweets, RetweetTimes_p(t_i), divided by the actual observed number, as shown in the following formula:
(8) $\mathrm{PredictionError}(t_i) = \frac{|\mathrm{RetweetTimes}_r(t_i) - \mathrm{RetweetTimes}_p(t_i)|}{\mathrm{RetweetTimes}_r(t_i)}.$
If PredictionError(t_i) is less than the error threshold, we say that t_i is correctly predicted. We used various error thresholds, ranging from 5% to 30%. In other words, the error threshold sets the level of difficulty: the smaller the threshold, the harder the task. The precision is the ratio of the number of tweets whose PredictionError(t_i) is less than the error threshold to the total number of tweets, as shown in the following formula:
(9) $\mathrm{Precision} = \frac{\text{number of tweets whose PredictionError is less than the error threshold}}{\text{total number of tweets}}.$
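Equations (8) and (9) translate into a relative-error check followed by a count. The sketch below assumes that every observed retweet count is positive, which holds for the popular tweets considered here (more than 100 retweets); the function names are illustrative.

```python
from typing import Sequence


def prediction_error(actual: float, predicted: float) -> float:
    """Relative error of the estimated retweet count, as in Eq. (8)."""
    if actual <= 0:
        raise ValueError("the observed retweet count must be positive")
    return abs(actual - predicted) / actual


def precision(actual: Sequence[float],
              predicted: Sequence[float],
              threshold: float) -> float:
    """Fraction of tweets whose prediction error is below threshold, as in Eq. (9)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for r, p in zip(actual, predicted)
               if prediction_error(r, p) < threshold)
    return hits / len(actual)
```

With hypothetical observed counts of 100, 200, and 400 and estimates of 110, 300, and 390, the errors are 0.1, 0.5, and 0.025, so two of the three tweets are correctly predicted at a 20% threshold.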
- 5. Feature Analysis
We analyzed the features to measure their usefulness. First, we evaluated the performance of each feature in the prediction tasks. We also calculated how combining features impacts the performance. Figures 9 and 10 show the analysis results.
Figures 9(a) and 10(a) are the results of predicting the popularity of a tweet using a single feature. Both results indicate that RTI and TRI, the features related to retweet patterns, are useful. Informativeness of the tweet shows the lowest value in predicting the lifespan of a tweet. Considering the limited number of characters in a tweet, we found that it is difficult to predict the popularity of a tweet with only its informativeness.
Figures 9(b) and 10(b) show the results of predicting the popularity of a tweet using a group of features. Each group consists of features of a similar nature. The retweet patterns consist of RTI and TRI. The properties pattern consists of user reliability, user activity, and informativeness of a tweet. Both results indicate that the retweet patterns outperform the other groups of features. In addition, we found that combining features yields a higher performance than using single features. Based on the results of the feature analysis shown in Figs. 9(b) and 10(b), we combined the groups to find the optimum combination of features.
Figures 9(c) and 10(c) show the results of predicting the popularity of a tweet using combinations of feature groups. Both results show that the combination consisting of tweet topic, retweet patterns, and property patterns outperformed the other combinations.
Fig. 9. Analyzing features for predicting lifespan of a tweet: (a) single feature, (b) group of features, and (c) combining groups of features.
Fig. 10. Analyzing features for predicting the number of retweets: (a) single feature, (b) group of features, and (c) combining groups of features.
- 6. Experimental Results
We carried out the experiment over various prediction ranges. Table 4 shows that the tolerance is about 6 hours within a prediction range of 24 hours and about 55 hours within a prediction range of 240 hours. Because the previous model limits the prediction range to 72 hours, we evaluated this limited range to construct a similar experimental environment. In addition, we extended the prediction range to evaluate whether the model works well under flexible conditions.
Table 4. Time tolerance within prediction range.
Prediction range (hours) | Time tolerance (hours) |
0 – 24 | 6.16 |
0 – 48 | 11.43 |
0 – 72 | 18.32 |
0 – 96 | 23.73 |
0 – 120 | 29.27 |
0 – 144 | 35.22 |
0 – 168 | 41.23 |
0 – 192 | 46.85 |
0 – 216 | 50.76 |
0 – 240 | 54.87 |
Table 5 shows the comparison of the two models within a period of 72 hours. The proposed model is more accurate than the previous model [2], with a tolerance of about 18 hours compared with about 22 hours. In the previous model [2], the author's historic postings are the most significant feature for predicting the lifespan of a tweet. In other words, its tolerance increases when the same author has not written enough historic postings. The performance results for the extended prediction range are shown in Fig. 11. The performance of the two models is similar within three days, where the immense prediction knowledge base of the first three days can be useful for the model in [2]. However, as the prediction range widens, the performance difference becomes increasingly larger.
Table 5. Comparison of model performance within a range of 72 hours.
Algorithm | Time tolerance (hours) |
ATR-KNN (Kong 12) | 22.24 |
Proposed method | 18.32 |
Fig. 11. Comparison of model performance according to time range.
The precision in predicting the number of retweets was evaluated according to the respective error threshold. Because the previous models [4] and [7] were evaluated on only 33 test data, which gives low reliability, we evaluated them on the test data shown in Table 3 to enhance the reliability. Figure 12 shows the results of the experiment. When we set the error threshold to about 20%, the proposed model achieved a precision of about 0.5, in contrast to the conventional models, which showed precisions of around 0.3 and 0.2, respectively. The model in [7] shows the lowest performance of the three, despite the exclusion of 4,790 unpredictable test data from all 8,944 test data. We concluded that the user-preference property changes as time passes and is not handled flexibly when new users are detected.
Fig. 12. Comparison of model performance according to error threshold.
The model in [4] is similar to the proposed model in that it is based on similar historic tweets. However, it relies significantly on the original number of retweets in the training dataset, and its results can therefore be variable.
VII. Conclusion
In this paper, we propose an algorithm to predict the lifespan of a tweet and the number of retweets, which are a proxy for measuring its popularity. To achieve this, we suggest a prediction framework of tweet popularity consisting of two phases: one is generating prediction knowledge bases, and the other is predicting the tweet’s popularity. In the phase of generating prediction knowledge bases, we analyzed the features that affect a retweet to construct the prediction knowledge bases. In the phase of predicting the tweet’s popularity, we extract historical tweets that have similar properties to those of the target tweet in a step-by-step manner.
As shown in the experimental results, our model performs better than previous prediction models, for the following reasons. First, there are few constraints on the target for prediction. Previous models predicted the popularity of a tweet based on either user preferences or the historical tweets posted by the same author. Under this constraint, prediction is possible only if sufficient information is available, which makes it difficult to predict popularity in many cases. The proposed model does not have this problem, because it is based on similarity with the target tweet. Second, our model has excellent scalability. In reality, the lifespan of a tweet is wide ranging, and to use conventional methods of prediction, the historical tweets written by the author of the target tweet must cover a variety of lifespans. In other words, sufficient historical data for each prediction range is required. Our approach handles this constraint well because it considers collaborative features. In addition, this method has the advantage of extracting a larger number of similar historical tweets.
This work was supported by the IT R&D program of MSIP/KEIT (10044577, Development of Knowledge Evolutionary WiseQA Platform Technology for Human Knowledge Augmented Services).
BIO
yongjin@etri.re.kr
Yongjin Bae received his BS degree in computer education from Mokwon University, Daejeon, Rep. of Korea, in 2012 and his MS degree in computer software and engineering from the University of Science and Technology, Daejeon, Rep. of Korea, in 2014. His main research interests are social big data analytics and text mining.
pmryu@etri.re.kr
Pum-Mo Ryu received his BS degree in computer engineering from Kyungpook National University, Daegu, Rep. of Korea, in 1995 and his MS degree in computer engineering from POSTECH, Pohang, Rep. of Korea, in 1997. He received his PhD degree in computer science from KAIST, Daejeon, Rep. of Korea, in 2009. Currently, he is a senior researcher in Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea. His research interests include natural-language processing, text mining, knowledge engineering, and question answering.
hkkim@etri.re.kr
Hyunki Kim received his BS and MS degrees in computer science from Chunbuk National University, Jeonju, Rep. of Korea, in 1994 and 1996, respectively. He received his PhD degree in computer science from the University of Florida, Gainesville, USA, in 2005. Currently, he is a principal researcher in Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea. His research interests include natural-language processing, machine learning, question answering, and social big data analytics.
References
[1] Oh H.J., Lee C.K., Lee C.H., "Analysis of the Empirical Effects of Contextual Matching Advertising for Online News," ETRI J., vol. 34, no. 2, 2012, pp. 292–295. DOI: 10.4218/etrij.12.0211.0171
[2] Kong S., "Predicting Lifespans of Popular Tweets in Microblog," Int. ACM SIGIR, Portland, Oregon, USA, Aug. 12–16, 2012, pp. 1129–1130.
[3] Hong L., "Predicting Popular Messages in Twitter," Int. Conf. WWW, Hyderabad, India, Mar. 28 – Apr. 1, 2011, pp. 57–58.
[4] Unankard S., "On the Prediction of Re-tweeting Activities in Social Networks - A Report on WISE 2012 Challenge," WISE, Paphos, Cyprus, Nov. 28–30, 2012, vol. 7651, pp. 744–754.
[5] Zhang L., Zhang Z., Jin P., "Classification-Based Prediction on the Retweet Actions over Microblog Dataset," WISE, Paphos, Cyprus, Nov. 28–30, 2012, pp. 771–776.
[6] Artzi Y., Pantel P., Gamon M., "Predicting Responses to Microblog Posts," NAACL HLT, Montreal, Canada, June 3–8, 2012, pp. 602–606.
[7] Petrovic S., "RT to Win! Predicting Message Propagation in Twitter," ICWSM, Barcelona, Catalonia, Spain, July 17–21, 2011, pp. 586–589.
[8] Zaman T.R., "Predicting Information Spreading in Twitter," Workshop CSSWC NIPS, Whistler, Canada, Dec. 10, 2010.
[9] Suh B., "Want to be Retweeted? Large Scale Analytics on Factors Impacting Retweet in Twitter Network," IEEE Int. Conf. SocialCom, Minneapolis, MN, USA, Aug. 20–22, 2010, pp. 177–184.
[10] Twitter Inc., The Twitter Rules of Spam and Abuse. http://support.twitter.com/articles/18311-the-twitterrules
[11] Castillo C., Mendoza M., Poblete B., "Information Credibility on Twitter," Int. Conf. WWW, Hyderabad, India, Mar. 28 – Apr. 1, 2011, pp. 675–684.
[12] Zhao W.X., "Comparing Twitter and Traditional Media Using Topic Models," ECIR, Dublin, Ireland, Apr. 18–21, 2011, pp. 338–349.
[13] Wang C., Huberman B.A., "Long Trend Dynamics in Social Media," EPJ Data Sci., vol. 1, no. 1, 2012. DOI: 10.1140/epjds2
[14] Krishnamurthy B., Gill P., Arlitt M., "A Few Chirps about Twitter," WOSN, Glasgow, Scotland, UK, Apr. 1–4, 2008, pp. 19–24.