08 Jun, 2022
10% of Twitter’s active accounts are posting spam content, estimates GlobalData
Posted in Business Fundamentals
A mathematical model designed by GlobalData has estimated that around 10% of Twitter’s active accounts are posting spam content. The leading data and analytics company notes that this is double Twitter’s reported figure, likely because of a difference in the criteria for what counts as ‘spam’.
Sidharth Kumar, Senior Data Scientist at GlobalData, comments: “What is or is not spam is suddenly an important discussion point for the social media platform, given that Elon Musk’s bid to take over Twitter is now on hold due to a disagreement on the proportion of spam accounts on the platform. Twitter claims that bot/spam accounts on Twitter represent less than 5% of accounts while Elon Musk’s team thinks otherwise.
“The precise proportion of spam accounts is difficult to compute, as it is almost impossible to confirm the identity of the entity behind a tweet handle. Additionally, the definition of a spam account may differ from person to person. Incessant tweeting of non-original content can be considered spam, but some may choose to see it as a very active user sharing articles/opinions.”
Keeping all this in mind, GlobalData’s mathematical model estimated the number of spam accounts using multiple parameters to provide a weighted score, which was then used to determine the classification of ‘spam’ or ‘non-spam’. GlobalData decided on these parameters by focusing on the differences in activity between typical spam accounts and the average Twitter user. Accounts performing poorly on many parameters received a higher score, indicating a higher probability of being spam. GlobalData analysts then independently observed handles at different score levels, and decided the cutoff for the classification (‘spam’ or ‘non-spam’) by consensus. The parameters used in the model were as follows:
- Is the tweet handle verified? Verified handles are unlikely to engage in spam
- Is a tweet coming from third-party avenues? Tweets posted through third-party applications are more likely to be spam. Private Twitter API-based apps are often used for posting spam content
- What is the number of historic Tweets that the handle has produced, divided by the days since its creation? Typically, spam accounts have a very high number of tweets per day over a lifetime
- How frequent were the last 200 tweets? A very high number of Tweets published over a short span of time is more likely to be spam
- What is the proportion of retweets in the last 200 tweets? Some spam accounts only retweet certain target accounts/topics on a regular basis
- Of the last 200 Tweets, how many did not contain any hashtags or links? Spam accounts are unlikely to have plain-text content. They typically promote a certain link, tweet or hashtag
- What is the standard deviation in typical tweet length? Some spam accounts keep posting similar messages in high frequency and do not have high variance in the content or its length
- What is the median time between two tweets? Non-bot accounts typically have a higher median time between tweets
- What is the length of the description in the profile? Typically, non-bot active accounts have more detailed bios
- Of the last 200 Tweets, what is the proportion of links shared? Spam accounts tend to share a lot of links on Twitter
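To make the scoring idea concrete, the following is a minimal Python sketch of how the ten parameters above could be turned into a weighted score and a ‘spam’/‘non-spam’ label. It is an illustration only, not GlobalData’s model: the `Tweet` and `Account` records, the `RULES` weights and thresholds, and the `CUTOFF` value are all hypothetical, chosen purely so the example runs. GlobalData has not published its weights, and its cutoff was set by analyst consensus rather than by a fixed number.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median, pstdev
from typing import List


@dataclass
class Tweet:  # hypothetical record, not a Twitter API type
    text: str
    created_at: datetime
    is_retweet: bool = False
    via_third_party: bool = False  # posted from a third-party app
    has_link: bool = False
    has_hashtag: bool = False


@dataclass
class Account:  # hypothetical record, not a Twitter API type
    verified: bool
    created_at: datetime
    lifetime_tweets: int
    bio: str
    recent: List[Tweet]  # up to the last 200 tweets, oldest first


def features(acct: Account, now: datetime) -> dict:
    """Compute one raw value per parameter in the list above."""
    tweets = acct.recent
    n = len(tweets)
    if n < 2:
        raise ValueError("need at least two recent tweets")
    age_days = max((now - acct.created_at).days, 1)
    gaps = [(b.created_at - a.created_at).total_seconds()
            for a, b in zip(tweets, tweets[1:])]
    # Clamp the span to one second so identical timestamps cannot divide by zero.
    span_hours = max(sum(gaps), 1.0) / 3600
    return {
        "verified": acct.verified,
        "third_party_share": sum(t.via_third_party for t in tweets) / n,
        "lifetime_tweets_per_day": acct.lifetime_tweets / age_days,
        "recent_tweets_per_hour": n / span_hours,
        "retweet_share": sum(t.is_retweet for t in tweets) / n,
        "plain_text_share": sum(not (t.has_link or t.has_hashtag) for t in tweets) / n,
        "length_stdev": pstdev(len(t.text) for t in tweets),
        "median_gap_seconds": median(gaps),
        "bio_length": len(acct.bio),
        "link_share": sum(t.has_link for t in tweets) / n,
    }


# Hypothetical rules: (weight, test that is True when the value looks spam-like).
# Every weight and threshold here is an assumption made for illustration.
RULES = {
    "verified": (2.0, lambda v: not v),
    "third_party_share": (1.5, lambda v: v > 0.5),
    "lifetime_tweets_per_day": (1.0, lambda v: v > 50),
    "recent_tweets_per_hour": (1.0, lambda v: v > 10),
    "retweet_share": (1.0, lambda v: v > 0.9),
    "plain_text_share": (1.0, lambda v: v < 0.1),
    "length_stdev": (1.0, lambda v: v < 10),
    "median_gap_seconds": (1.0, lambda v: v < 60),
    "bio_length": (0.5, lambda v: v < 20),
    "link_share": (1.0, lambda v: v > 0.8),
}
CUTOFF = 0.6  # hypothetical stand-in for the cutoff agreed by analyst consensus


def spam_score(feats: dict) -> float:
    """Weighted share of rules that fire, between 0 and 1."""
    total = sum(weight for weight, _ in RULES.values())
    fired = sum(weight for name, (weight, looks_spammy) in RULES.items()
                if looks_spammy(feats[name]))
    return fired / total


def classify(feats: dict) -> str:
    return "spam" if spam_score(feats) >= CUTOFF else "non-spam"
```

Calling `classify(features(account, now))` on an `Account` returns the label. Keeping each parameter as a separate rule with its own weight mirrors the article’s description of accounts that perform poorly on many parameters receiving a higher score; in practice the weights and the cutoff would need to be tuned against manually reviewed handles.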
Kumar continues: “There were a few research pieces published earlier in the media looking at the followers of certain handles to estimate spam or bot proportions. We felt that the correct approach would be to analyze samples of live streams, as that is more indicative of Twitter activity. Our estimate is conservative, as we wanted to be sure that we were correctly identifying accounts as spam. It is important to note that this is still an estimation. There is no conclusive way of knowing if a certain account is a bot or spam.”
The following chart shows the median values for spam/non-spam accounts for the parameters used by the model.