DIS Computational Analysis of Big Data Course
December 5, 2018
The internet age has ushered in rapid technological change and expanded the ways in which the world communicates. It has led to greater diversity in news sources and strengthened the reach of the press, which can now deliver stories to millions of readers cheaply in a globally connected world. This comes with drawbacks, however: with more content comes a growing inability to comprehend it all, given the sheer scale of data available and published daily.
This information overload, combined with the rise of social media, has led to greater reliance on headlines to understand current events. Yet the way people interacted with news (especially during the 2016 American elections), often sharing content based purely on its title, allowed misleading and in many cases outright false information to spread more widely. The volume of news based on false claims became so infamous that Oxford Dictionaries named "post-truth" its 2016 Word of the Year.
A study by the Pew Research Center showed that an increasing number of Americans (particularly younger Americans) rely on social media as a source for news. Knowing this, we decided to look at news shared on social media platforms to see whether we could train a model to detect false news headlines and help determine what a user could trust.
This is an educational study; some of its data-set labels reflect our own judgments about factual reliability, and it should not be used as a source of criticism or approval. The study does not go in depth into why certain articles were shared or into the contents of those articles. Our findings look purely at trends in the data and do not imply causation of any kind.
Data
Our source of information was Reddit, as it was familiar to us and has helpful Python packages for retrieving the data we needed. More specifically, we used PSAW, a Python wrapper around the PushShift API, which maintains its own searchable archive of Reddit posts. The PushShift API allowed us to set date ranges on our data scraping, which was helpful because we wanted data from around the 2016 elections to see whether a model could be built to filter those results (some exceptions were made to ensure enough data was pulled).
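Below is a minimal sketch of the kind of query we ran with PSAW; the subreddit, date window, and limit shown here are illustrative rather than our exact script.

    import datetime as dt
    from psaw import PushshiftAPI

    api = PushshiftAPI()

    # Restrict pulls to a window around the 2016 elections.
    start = int(dt.datetime(2016, 1, 1).timestamp())
    end = int(dt.datetime(2016, 12, 31).timestamp())

    # Pull only the fields we need from one of the real-news subreddits.
    posts = api.search_submissions(
        subreddit='news',
        after=start,
        before=end,
        filter=['title', 'url', 'created_utc'],
        limit=10000,
    )
    rows = [(p.title, p.url, p.created_utc) for p in posts]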
We categorized subreddits based on their reputations, descriptions of purpose, and posting rules to ensure we chose content-rich sources that would help our analysis. For real news, we relied on three subreddits: news, worldnews, and politics, as they focus on covering current events, are relatively politically neutral, and do not allow false headlines to be posted. For our initial false news detection, we relied on the subreddits satire and theonion (2015-2018, to get enough data), since they are dedicated to purposely sharing false information that is somewhat believable, especially based on the title.
As we thought about the data, we were curious whether politically biased subreddits had a tendency to share news that might not necessarily be true but that aligned with their views. We added four more subreddits covering the American right and left wing: conservative, republican, democrats, and liberal. On looking at these subreddits, however, we saw that they also moderated their own links, so we turned to more extreme, more controversial subreddits and pulled data from The_Donald, HillaryForPrison, Impeach_Trump, and Fuckthealtright (post-2016 election, as these subreddits did not exist before then), reasoning that they might be more likely to share news backing their views regardless of the source.
Process
To train our model, we used headlines from our defined 'real news' and 'fake news' sources to create a Bag of Words matrix that held counts of each word's occurrences after filtering out stop-words. We further filtered the data to analyze only English headlines, since the majority of articles were in English and we did not want non-English titles to throw off our model. Finally, as we pulled each post's title, url, and timestamp, we made sure the linked sources came from outside Reddit.
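As a rough illustration of this step, the sketch below builds a binary bag-of-words matrix with sklearn's CountVectorizer after dropping non-English titles. The langdetect package and the variable names here are stand-ins for illustration, not necessarily what our notebook used.

    from langdetect import detect
    from sklearn.feature_extraction.text import CountVectorizer

    def is_english(title):
        # langdetect raises on very short or empty strings, so fall back to False
        try:
            return detect(title) == 'en'
        except Exception:
            return False

    # titles: headline strings pulled from Reddit posts that link outside Reddit
    english_titles = [t for t in titles if is_english(t)]

    # Binary bag-of-words (1 if a word appears in a headline, 0 otherwise),
    # with common English stop-words removed.
    vectorizer = CountVectorizer(stop_words='english', binary=True)
    X = vectorizer.fit_transform(english_titles)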
From this, we created a feature vector for each headline and added a final label indicating which set of news it came from, then used it to train our model. We chose the Bernoulli Naive Bayes classifier (from the sklearn Python package) since our label was binary (0 for false, 1 for true). Our model's training accuracy was approximately 0.875, or about 88%. A histogram of the model's predicted scores for our news and satire data is shown below.
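The training step itself reduces to a few lines with sklearn; the split fraction below is an illustrative choice, and X and y stand for the headline matrix and the real/false labels from the previous step.

    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import BernoulliNB

    # y: 0 for satire/false-news headlines, 1 for real-news headlines
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = BernoulliNB()
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # our run landed at roughly 0.88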
We then ran the model on headlines from each of the biased subreddits to see whether it could predict how many false headlines were being shared there.
Our first test compared moderately biased subreddits from both the right- and left-leaning sides of American politics to see whether our model's results showed any interesting trends. We chose a histogram because it would show whether our model had difficulty predicting the truth values of headlines. As we can see, however, our model tends to rate headlines as either real or false news, with few it was indecisive about. This is interesting because it closely mirrors our news and satire results, perhaps indicating bias in our model.
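One way to produce such a histogram (assuming predicted probabilities are what is plotted) is sketched below with matplotlib and an illustrative subreddit. Note the use of transform() rather than fit_transform(): only words seen during training are counted.

    import matplotlib.pyplot as plt

    # conservative_titles: headlines pulled from one biased subreddit (illustrative)
    X_bias = vectorizer.transform(conservative_titles)
    probs = clf.predict_proba(X_bias)[:, 1]  # probability of being 'real news'

    plt.hist(probs, bins=20)
    plt.xlabel('Predicted probability of being real news')
    plt.ylabel('Number of headlines')
    plt.title('Predictions for r/conservative headlines')
    plt.show()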
The same pattern appears in the histograms for the more extremely biased subreddits, which was a bit of a surprise, as we expected them to show a mixture of values. It may be that the news sources shared there tended to be reputable rather than fringe, which would help explain the split our model produces. It could also be that our model is biased and tends to push data toward either end of the spectrum rather than the middle.
At first glance, it appears either that our model is not performing well or that most news sources shared in the biased and extremely biased subreddits were more reputable than anticipated. To determine which was the case, we plotted the top 10 news sources for each of our groups: real news, false news, left bias, right bias, extreme left bias, and extreme right bias. Our results are shown below.
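Extracting the top sources amounts to reducing each post url to its domain and counting; a small sketch follows (the function and variable names here are ours, for illustration).

    from collections import Counter
    from urllib.parse import urlparse

    def top_domains(urls, n=10):
        # Reduce each post url to its domain, e.g. 'www.nytimes.com'
        domains = [urlparse(u).netloc for u in urls]
        return Counter(domains).most_common(n)

    # urls: the post urls for one group (real, false, left bias, right bias, ...)
    print(top_domains(urls))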
The real news group holds many sources familiar and recognizable to US and international readers. The only one we found odd was newsjs.com, which, according to this Medium article, is a free news aggregator. Some Googling, however, led us to believe that although the site carries real content, it should not be trusted, as it tends to redirect to internal links instead of the source article. This may have had a negative effect on our model. In the satire group, we see expected sites like The Onion and Clickhole, which are well known satire sites.
The moderately biased subreddits show a similar story, with many links to reputable outlets alongside sites with biased views. For example, breitbart and redstate write articles that lean to the American political right, whereas sites like thinkprogress lean to the American left. Having a bias does not automatically make a news source less truthful, just more likely to carry opinionated articles. In fact, the model may have performed well here precisely because the biased subreddits tended to link to sources regarded for their factual writing, which may explain the earlier histograms.
Ah, here is our problem. Many of the urls listed for the extremely biased subreddits do not appear to be news sources but rather image links, whose post titles are likely not verbatim news headlines. Our model is probably unable to handle these edge cases, and combined with our word filtering it may tend to err toward predicting real rather than false. The extremely left-leaning subreddits, by contrast, show more recognizable sources, which may explain why that group's histogram leans more toward real than not.
For the first few top sources in our biased groups, our model appears to be doing fine, as many of those sources also appear in our 'real news' category. Upon closer inspection, however, it appears that our model is not handling titles correctly. This is most evident in the extremely biased groups, where most links are not news sources but images that were not filtered out, due to the limitations of handling different url formats. In the moderately right-wing subreddits, breitbart is cited frequently and is known for its controversial views and articles, yet perhaps this should not be a surprise, since headlines tend to look similar to one another when they report on the same events.
Why does our model fail to perform as well as expected?
A large part of our model's performance, or lack thereof in some cases, can be attributed to the filtering process used to create our Bag of Words matrix. When scoring headlines from other subreddits, only words already present in the training vocabulary are counted; the model applies no weighting to words and does not adapt by adding new words to its corpus. This probably hurts our model, since it relies on new headlines sharing words with the training data and must discard any new words that might have had a larger impact on the predicted truth value. Our model's accuracy could also have been improved had we had time to run K-Fold cross-validation.
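K-Fold cross-validation would have been a small addition with sklearn; a sketch, with an arbitrary choice of five folds, follows.

    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB

    # Five-fold cross-validation over the full headline matrix and labels
    scores = cross_val_score(BernoulliNB(), X, y, cv=5)
    print(scores.mean(), scores.std())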
Another issue was that we did not have time to add further features to train on, such as how likely a given news source is to be factual. Had we done so, it is likely that the non-news links in the more biased subreddits would have been pushed by the model toward the false end. Some of the blame also falls on our model's overall training accuracy of about 88%, which would need to improve considerably to increase confidence in its predictions.
Most importantly, it shows that headlines alone are not a reliable indicator of whether a news story is real or false, and that other factors must be taken into account.
Future Outlook
We only scratched the surface of the data available (we collected over 100,000 headlines from a small time-frame, when many times more articles exist) and drew from just one relatively small social media platform. Future model training should use a much larger data-set and should pull from multiple platforms, including Facebook, Twitter, and Reddit, to analyze the news sources. Even more important is the content of the articles themselves, which often holds clues as to whether a piece is factual, opinionated, or an outright lie. (This raises the question of how opinionated articles should be classified, since they are more arguments from the writer to the reader than news stories, yet few would call them false.) Models that incorporate Natural Language Processing into their training would also likely perform better, as they could analyze the context of an article and rate each one rather than relying on the headline and source url alone.
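As one hedged sketch of a step in that direction, a pipeline that weights words by TF-IDF over full article text (which we did not collect) rather than counting headline words could look like the following; the parameter values and variable names are illustrative only.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # article_texts: full article bodies rather than headlines (not collected here)
    # labels: 0 for false, 1 for real, as before
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=50000),
        LogisticRegression(max_iter=1000),
    )
    pipeline.fit(article_texts, labels)
    print(pipeline.predict_proba(article_texts[:5]))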
Other challenges we were not able to build into our model include handling news that sounds fake but is actually real, and accounting for retracted articles from real news outlets when new evidence comes to light. These would be hard tests for a model to pass, as many such articles are so outlandish that people would also believe them false. A model would also have to deal with misleading articles that omit key facts: technically true, but ones many people would classify as false due to the poor journalism.
This project has shown us the initial difficulties of training a model to help discern real from false news. Our model's relatively poor accuracy meant that our predictions for other news sources were not as reliable as they could have been. But when most Americans believe fake news headlines, is it really a surprise how difficult the problem is to solve? With clickbait articles and sensationalist journalism in the so-called era of 'post-truth', it is more important than ever that people worldwide can trust what they read in the news, no matter where it comes from.
Our code can be found here.