Data Processing and SQL
Know our data
The design of the schema that contains the data from the FakeNews Corpus data set
In our design of the schema, we have tried to reduce various anomalies as much as possible. We started by putting all the data into a single relation, but we quickly found that this created a lot of anomalies. Therefore, we iteratively decomposed this relation until we ended up with the design that can be seen in figure \ref{fig:er-diagram-fakenews-corpus}. By designing our database this way, we have significantly reduced redundancy, update anomalies, and deletion anomalies.
The reason why we have a time relation is that we found out that there are only 5 different timestamps, so we could remove many anomalies by making it a separate relation. For all the relationships that are one-to-many or many-to-many, we have likewise separated them into their own relations. This includes the following relations: keyword, typ, author, webpage, and domain.
We have assumed that no two articles contain the same content, and that no two articles contain the same meta_description. We did find out that some of the articles actually have the same title, so we could have reduced anomalies by separating this attribute into its own relation. However, we decided to keep it as an attribute: since we use the title often, it would be cumbersome to perform a natural join every time we need it.
Inherent problems with the data from the Fakenews Corpus data set during development
We found that some of the data is in the wrong fields, and some of it is not formatted correctly. Furthermore, we also found out that the fields keyword and summary are completely empty for all the articles. For many of the articles, the field called meta_description does not contain anything.
Another problem we found while working with the data set was that we did not understand the difference between meta_keyword and keyword. We assume that meta_keyword contains the keywords of the articles. Likewise, we did not understand the difference between meta_description and summary. Here we assume that meta_description contains a summary of the articles.
Another problem with the data set is that there are two different IDs. The first ID is the unnamed field, which repeats itself after 100000 articles. The second ID, which is called id in the CSV file, is quite random. Because of these two problems, we created our own article_id instead of using the IDs that were provided.
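Below is a minimal sketch, not our exact import script, of how such a fresh sequential article_id could be assigned while reading the corpus CSV in chunks with pandas. The file name and chunk size are assumptions for illustration.
import pandas as pd

next_id = 0
for chunk in pd.read_csv("news_cleaned.csv", chunksize=100_000):
    # Drop both the unnamed index column and the unreliable 'id' column.
    chunk = chunk.drop(columns=[c for c in ("Unnamed: 0", "id") if c in chunk.columns])
    # Assign a fresh, globally sequential article_id instead.
    chunk["article_id"] = range(next_id, next_id + len(chunk))
    next_id += len(chunk)
    # ... insert the chunk into the database here ...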
Properties of the Fakenews Corpus data set
The queries we have used to compute the statistics and visualizations shown below can also be found in \hyperref[Appendix A]{Appendix A}.
From the table below, one can see that our database contains a lot of articles.
Basic Statistics | |
---|---|
Number of articles: | 999992 |
Avg. number of keywords per article: | 6.96 |
Avg. number of words per article: | 271.32 |
In the figure below, one can see that the database does not contain a lot of reliable articles, and that the distribution of articles over the types is far from uniform. These two properties of the data set can create difficulties when we have to train our model for the predictor.
Our initial thought was that clickbait articles would contain a lot of keywords, since keywords could be used to oversell the articles, but this is not the case, as can be seen in the figure.
The 20 domains from which the most articles were scraped:
From the figure below, one can see that approximately 40% of the articles come from just three domains: dailykos, wikileaks, and beforeitsnews. If these three sites are labeled wrongly, all the articles from these three domains are labeled wrongly, which would have a huge impact on our predictor.
SELECT count(*),
domain_url,
type_name
FROM fakenewscorpus.article
NATURAL JOIN fakenewscorpus.webpage
NATURAL JOIN fakenewscorpus.domain
NATURAL JOIN fakenewscorpus.typ
GROUP BY domain_url,
type_name
ORDER BY COUNT DESC
LIMIT 20;
The average content length of the articles of each type:
Below we can see a figure of the article length (number of characters in the article) for each type that an article can be labeled with. It is clear that the “hate” article type has a far higher average content length than the other article types.
SELECT type_name,
avg(content_length) AS avg_content_length
FROM
(SELECT article_id,
title,
length(content) AS content_length,
type_id
FROM fakenewscorpus.Article) AS tmpTable
NATURAL JOIN fakenewscorpus.typ
GROUP BY type_name
ORDER BY avg_content_length DESC;
The average number of words of the articles of each type:
Below we see that the number of words corresponds to the length of the article. It is therefore not because articles of the type hate contain far longer words that they have a longer average article length.
SELECT type_name,
avg(word_count) AS avg_word_count
FROM
(SELECT article_id,
length(replace(content, ' ', '')) AS content_length_no_spaces,
-- the word count is approximated as the number of spaces in the content
length(content) - length(replace(content, ' ', '')) AS word_count,
type_id
FROM fakenewscorpus.Article) AS tmpTable
NATURAL JOIN fakenewscorpus.typ
GROUP BY type_name
ORDER BY avg_word_count DESC;
The average word length of the articles of each type:
This is confirmed by computing the average word length for each of the article types.
SELECT type_name,
avg(content_length_no_spaces) / avg(word_count) AS avg_word_length
FROM
(SELECT article_id,
length(replace(content, ' ', '')) AS content_length_no_spaces,
length(content) - length(replace(content, ' ', '')) AS word_count,
type_id
FROM fakenewscorpus.Article) AS tmpTable
NATURAL JOIN fakenewscorpus.typ
GROUP BY type_name
ORDER BY avg_word_length DESC;
Experiences with scraping from the “Politics and Conflict” section of the Wikinews site
We started with the approach of using all the start URLs of the form:
https://en.wikinews.org/w/index.php?title=Category:Politics_and_conflicts&from=letter where letter is [D-N].
The downside of this was that not all articles were on the page of their starting letter. This meant that some F articles were on another page and were thus not included. Furthermore, we got some articles from other letters. We solved this by checking the URLs against WantedArticles = r"/wiki/[D-N]".
The next approach was to start on the D page and have parse() yield a request to itself when going to the next page. The spider would thus crawl to the next page until it reached the letter O.
Extracting the specific information from each article was a bit of a hassle. Some articles have their content structured in tables, while over 90% have their content in paragraphs.
The first parser of the spider runs over all of the article links on a category page, yields the URL of each individual article, and passes it on to the parse2 function. After having scraped all articles on one subpage, it proceeds to the next page, if such a page exists. A sketch of both parsers is shown after the description of parse2 below.
The second parser handles each individual article that is passed on from the parse function above. The parse2 function yields the title, the URL, the keywords (only present for some articles), the date the article was written, and the sources used in the article.
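Since the original listings are omitted here, the following is a minimal, hedged sketch of what the two parse functions could look like, assuming Scrapy. The CSS/XPath selectors, field names, and the exact stop condition are assumptions for illustration, not necessarily those of our spider.
import re
import scrapy

class WikinewsSpider(scrapy.Spider):
    name = "wikinews_politics"
    start_urls = [
        "https://en.wikinews.org/w/index.php?title=Category:Politics_and_conflicts&from=D"
    ]

    def parse(self, response):
        # Follow every article link on the category page, but only articles
        # whose title starts with a letter in the D-N range.
        for href in response.css("div.mw-category a::attr(href)").getall():
            if re.match(r"/wiki/[D-N]", href):
                yield response.follow(href, callback=self.parse2)
        # Follow the "next page" link and parse it with this same function,
        # so the spider crawls page by page (a full spider would also stop
        # once it reaches the letter O).
        next_page = response.xpath("//a[contains(text(), 'next page')]/@href").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse2(self, response):
        # Extract the fields described above from a single article page.
        yield {
            "title": response.css("h1#firstHeading ::text").get(),
            "url": response.url,
            "keywords": response.css("div#mw-normal-catlinks li a::text").getall(),
            "written_at": response.css("strong.published ::text").get(),
            "sources": response.css("span.sourceTemplate a::attr(href)").getall(),
            "content": " ".join(response.css("div.mw-parser-output > p ::text").getall()),
        }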
The design of the schema that contains the data from the Wikinews Fragment data set
We have designed the schema that contains the data from the Wikinews Fragment data set more or less like the schema that contains the data from the Fakenews Corpus data set. The only differences between the Fakenews Corpus schema design and this Wikinews Fragment schema design are that we do not have a meta_description attribute, that we have a written_at date instead of the updated_at and inserted_at dates, and that the articles do not have any authors but instead have sources. The E/R diagram can be seen below.
Basic properties of the Wikinews Fragment data set
In the table below, one can see that our Wikinews Fragment does not contain a lot of articles. However, since they are all classified as reliable, combining them with our Fakenews Corpus data set gives us approximately 50% more reliable articles than were contained in the Fakenews Corpus data set alone.
From the statistics in the table, we can also observe that the articles are quite short: the average number of words per article is 176, compared to 271 in the Fakenews Corpus data set.
Basic Statistics | |
---|---|
Number of articles: | 2747 |
Avg. number of keywords per article: | 9.51 |
Avg. number of words per article : | 176.89 |
In the figure below, we can observe that the distribution of publication dates is not uniform. We can see that the vast majority of the articles are from before 2009.
In the figure below, we can see that we could truncate the content of the articles to 500 words and still retain the vast majority of the data. This could be a good way to reduce memory use and computation.
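As a small illustration (not part of our pipeline), such a truncation could be done with a hypothetical helper like the following.
def truncate_content(content: str, max_words: int = 500) -> str:
    # Keep only the first 500 words of an article's content.
    return " ".join(content.split()[:max_words])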
Integration of the Fakenews Corpus data set and Wikinews Fragment data set
The SQL command that creates the view can be found in \hyperref[sql-command to create view]{Appendix E}. We have tried to integrate as much of the metadata as possible into this view.
In the view, each article has the following attributes: content, title, type, domain, and url. We could have included scraped_at, but we do not think it is very useful with regard to our predictor. We decided not to include the information that is represented as many-to-many relationships, such as keywords, since that would blow up the size of the view.
Our view contains 1002602 articles.
Choosing a dataset
Since our predictor cannot be trained on more than 50k articles, and the Fakenews Corpus dataset contains around 6000 reliable articles from which to build a balanced training set, we are not able to include the Wikinews Fragment dataset. Furthermore, the classification of all Wikinews articles as reliable does not have any evidence to back it up. If we had access to better machines, it would not be a problem to use the view described in 1.7 that contains both data sets.
Classification into fake and real
We will classify the articles that are reliable or political as real, and all articles of the other types as fake. The reasoning behind this is that we would rather have false negatives than false positives: considering the majority of the article types as fake gives us better assurance that we do not falsely classify fake content as real. These types either show a clear malicious intent (hate, clickbait, fake) or contain unintentionally false or unjustified content (junk-science, satire, conspiracy, unreliable). Content that the fakenewscorpus considers “on the line” will be classified as fake as well (unknown, bias).
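A minimal sketch of this binary labelling; the type names are taken from the text above, while the helper function itself is hypothetical.
REAL_TYPES = {"reliable", "political"}

def to_binary_label(type_name: str) -> str:
    # Everything that is not explicitly reliable or political is treated as
    # fake, including the "on the line" types (unknown, bias).
    return "real" if type_name in REAL_TYPES else "fake"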
Establish a baseline
Basic models on content
In our baseline we use a TF-IDF vectorizer to represent each article by the frequency of its words relative to the general frequency across all the training articles. We utilize 457828 article contents and two baseline models: a linear SVC and a dummy classifier. With these two models we get the results described in the table below.
Classifier | Liar | Fakenews | Kaggle |
---|---|---|---|
LinearSVC | 0.514 | 0.78 | 0.648 |
Dummy | 0.499 | 0.501 | 0.502 |
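A minimal sketch of such a baseline, assuming scikit-learn; the train/test split and the dummy strategy are assumptions for illustration, not necessarily the exact setup behind the numbers in the table.
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

def run_baselines(contents, labels):
    X_train, X_test, y_train, y_test = train_test_split(
        contents, labels, test_size=0.2, random_state=42
    )
    for clf in (LinearSVC(), DummyClassifier(strategy="stratified")):
        # TF-IDF turns each article into a sparse vector of term weights
        # before it is fed to the classifier.
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(X_train, y_train)
        print(type(clf).__name__, model.score(X_test, y_test))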
Including additional fields
In the beginning we thought it could make sense to use keywords. However, a problem with this is that a lot of the articles in our database do not have any keywords.
It might make sense to use the title of the article as a feature, although it is short and we expect that more distinctive information is found in the content. However, the title in and of itself could be used to determine whether an article is clickbait.
Overall, we decided that these metadata features would not contribute to the general performance of a model.
Appendices
Appendix A - SQL queries to get the properties of the Fakenews Corpus data set
Number of articles:
SELECT count(*)
FROM fakenewscorpus.article;
The average number of keywords for each article:
SELECT Avg(keyword_count) AS avg_keyword_count
FROM
(SELECT article_id,
Count(*) AS keyword_count
FROM fakenewscorpus.tags GROUP BY article_id) AS tmpTable
NATURAL JOIN fakenewscorpus.article;
The average number of words for each article:
SELECT avg(word_count) AS avg_word_count
FROM
(SELECT length(content) - length(replace(content, ' ', '')) AS word_count
FROM fakenewscorpus.article) AS tmpTable;
Distribution of articles over types:
SELECT count(*),
type_name
FROM fakenewscorpus.article
NATURAL JOIN fakenewscorpus.typ
GROUP BY type_name
ORDER BY COUNT DESC;
The average number of keywords of the articles of each type:
SELECT type_name,
Avg(keyword_count) AS avg_keyword_count
FROM (SELECT article_id,
Count(*) AS keyword_count
FROM fakenewscorpus.tags
GROUP BY article_id) AS tmpTable
NATURAL JOIN fakenewscorpus.article
NATURAL JOIN fakenewscorpus.typ
GROUP BY type_name
ORDER BY avg_keyword_count DESC;
Appendix D - SQL queries to get the properties of the Wikinews Fragment data set
Number of articles:
SELECT count(*)
FROM wikinewsfragment.article;
The distribution of the number of articles over dates:
SELECT count(*),
time
FROM wikinewsfragment.article
INNER JOIN wikinewsfragment.time ON wikinewsfragment.article.written_at_time_id = wikinewsfragment.time.time_id
GROUP BY time
ORDER BY COUNT DESC;
The average, minimum, and maximum number of keywords for each article:
SELECT Avg(keyword_count) AS avg_keyword_count,
Min(keyword_count) AS min_keyword_count,
Max(keyword_count) AS max_keyword_count
FROM
(SELECT article_id,
Count(*) AS keyword_count
FROM wikinewsfragment.tags GROUP BY article_id) AS tmpTable
NATURAL JOIN wikinewsfragment.article;
The distribution of the number of articles with a specific number of words:
SELECT word_count,
count(*)
FROM
(SELECT length(content) - length(replace(content, ' ', '')) AS word_count
FROM wikinewsfragment.article) AS tmpTable
GROUP BY word_count
ORDER BY word_count;
The average, minimum, and maximum number of words for each article:
SELECT avg(word_count) AS avg_word_count,
min(word_count) AS min_word_count,
max(word_count) AS max_word_count
FROM
(SELECT length(content) - length(replace(content, ' ', '')) AS word_count
FROM wikinewsfragment.article) AS tmpTable;
Appendix E - Query to create the view that integrates the two data sets
-- Combine the articles from the Wikinews Fragment and Fakenews Corpus schemas into one view
CREATE MATERIALIZED VIEW integrateFakenewsSources AS
(SELECT title,
type_name,
content,
url,
domain_url
FROM wikinewsfragment.article
NATURAL JOIN wikinewsfragment.typ
NATURAL JOIN wikinewsfragment.webpage
NATURAL JOIN wikinewsfragment.domain)
UNION ALL
(SELECT title,
type_name,
content,
url,
domain_url
FROM fakenewscorpus.article
NATURAL JOIN fakenewscorpus.typ
NATURAL JOIN fakenewscorpus.webpage
NATURAL JOIN fakenewscorpus.domain);
Made in collaboration with