Alina Petukhova
Nuno Fachada
COPELABS, Lusófona University, Campo Grande 376, 1749-024 Lisbon, Portugal
Author to whom correspondence should be addressed.
Data 2023, 8(5), 74; https://doi.org/10.3390/data8050074
Submission received: 19 March 2023 / Revised: 17 April 2023 / Accepted: 19 April 2023 / Published: 23 April 2023
This article presents a dataset of 10,917 news articles with hierarchical news categories collected between 1 January 2019 and 31 December 2019. We manually labeled the articles based on a hierarchical taxonomy with 17 first-level and 109 second-level categories. The dataset can be used to train machine learning models that automatically classify news articles by topic, and it can support research on news structuring, classification, and the prediction of future events from published news.
A news dataset is a collection of news articles classified into different categories. In the past decade, there has been a sharp increase in news datasets available for analysis [1]. These datasets can be used to understand various topics, from politics to the economy.
Several types of news datasets are commonly used for analysis. The first is raw data, which includes all the data that a news organization collects. These data can be used to understand how a news organization operates, what stories are covered, and how they are covered. The second type is processed data, which have undergone some processing, such as aggregation or cleaning. Processed data are often easier to work with than raw data and can be used to answer specific questions, for instance by providing additional information for a decision-making process. The third type is derived data, which are created by combining multiple datasets, often from different sources [2]. News datasets can be used for various purposes in a machine learning context, for example:
Predicting future events based on past news articles.
Understanding the news cycle.
Determining the sentiment of news articles.
Extracting information from news articles (e.g., named entities, location, dates).
Classifying news articles into predefined categories.

To adequately answer research questions, news datasets should contain sufficient data points and span a significant enough period. There are many labeled news datasets available, each with specific limitations. For example, they may only cover a specific period or geographical area or be confined to a particular topic. Additionally, the categories may not be completely accurate, and the datasets may be biased in some way [3,4].
Some of the more popular news datasets include the 20 Newsgroups dataset [5], AG’s news topic classification dataset [6], L33-Yahoo News dataset [7,8], News Category dataset [9], and Media Cloud dataset [10]. Each of these datasets has been used extensively by researchers in the fields of natural language processing and machine learning, and each has its advantages and disadvantages. The 20 Newsgroups dataset was created in 1997 and contains 20 different categories of news, each with a training and test set. The data is already pre-processed and tokenized, which makes it very easy to use. However, the dataset is outdated and relatively small, with only about 1000 documents in each category.
The AG’s news topic classification dataset is a collection of news articles from the academic news search engine “ComeToMyHead” during more than one year of activity. Articles were classified into 13 categories: business, entertainment, Europe, health, Italia, music feeds, sci/tech, software & dev., sports, toons, top news, U.S., and world. The dataset contains more than 1 million news articles. However, there are several limitations to this dataset. First, it is currently outdated since data were collected in 2005. Second, the taxonomy covers specific countries such as the US and Italy but has general references such as Europe or world, creating overlaps in the classification (e.g., Italy and Europe) as well as potential imbalances (e.g., events in China are likely to be underrepresented and/or under-reported compared to those in the US). Finally, the dataset does not include methods for type or category description.
The L33-Yahoo News dataset is a collection of news articles from the Yahoo News website provided as part of the Yahoo! Webscope program. The articles are labeled into 414 categories such as music, movies, crime justice, and others. The dataset includes the random article id followed by possible associated categories. The L33-Yahoo News dataset is available under Yahoo’s data protection standards. It can be used for non-commercial purposes if researchers credit the source and license new creations under identical terms. The limitations of the L33 dataset are the license terms, restricting companies from using this dataset for commercial purposes, and the amount of data per class, with the category “supreme court decisions” having only five articles, for example. In addition, there is some overlap in the categories, which makes it challenging to train a model that can accurately predict multiple categories.
The News Category Dataset is a collection of around 210k news articles from the Huffington Post, labeled with their respective categories, which include business, entertainment, politics, science and technology, and sports. However, the dataset has several limitations. First, the dataset is not comprehensive since it only includes articles from one source. Second, news categories are not standardized, including broad categories such as “Media” and “Politics” and very narrow ones like “Weddings” and “Latino voices”.
The Media Cloud Data Set is a collection of over 1.7 billion articles from more than 60 thousand media sources around the world. The dataset includes articles from both mainstream and alternative news sources, including newspapers, magazines, blogs, and online news outlets. Data can be queried by keyword, tag, category, sentiment, and location. This dataset is useful for researchers who are interested in studying media coverage of specific topics or trends over time. Media Cloud is a large multilingual dataset that has good media coverage but limited use in topic classification models since it does not include a mapping of articles to a specific news taxonomy.
The main motivation for this work is to provide a dataset for building specific topic models. It consists of a categorized subset taken from an existing news dataset. We show that such a dataset, with up-to-date articles mapped into a standardized news taxonomy, can contribute to the accuracy improvement of news classification models.
In this paper, we present a new dataset based on the NELA-GT-2019 data source [11], classified with IPTC’s NewsCodes Media Topic taxonomy [12] (The International Press Telecommunications Council, or IPTC, is an organization that creates and maintains standards for exchanging news and other information between news organizations). The original NELA-GT-2019 dataset contains 1.12 M news articles from 260 sources collected between 1 January 2019 and 31 December 2019, providing essential content diversity and topic coverage. Sources include a wide range of mainstream and alternative news outlets.
In turn, the IPTC taxonomies are a set of controlled vocabularies used to describe news stories’ content. The NewsCodes Media Topic taxonomy has been one of IPTC’s main subject taxonomies for text classification since 2010. We used the 2020 version of NewsCodes Media Topic taxonomy [13]. News organizations use it to categorize and index their content, while search engines use it to improve the discoverability of news stories [14].
Algorithm of the article selection process:
1. Obtain a random article from the NELA dataset;
2. Classify it into a second-level category of the NewsCodes Media Topic taxonomy by checking the keywords and thoroughly reading the article; the news article is assigned to exactly one category;
3. If there are already 100 articles in that category, discard it; otherwise, assign the second-level category to the article;
4. Return to step 1 and repeat until each second-level category has 100 articles assigned.

The described algorithm overcomes a limitation of the NELA-GT datasets, in which a large proportion of the articles is fringe, conspiracy-based news, since an article is discarded if its category already has 100 articles.
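The selection loop can be sketched in Python as follows. This is a minimal illustration rather than the procedure actually run by the authors: the function name `select_articles`, the `classify` callback (a stand-in for the manual labeling step), and the choice to sample without replacement from a shuffled pool are all assumptions made for the example.

```python
import random
from collections import defaultdict


def select_articles(articles, classify, categories, quota=100, seed=0):
    """Draw articles at random until every category holds `quota` articles.

    `classify` is a hypothetical stand-in for manual labeling: it receives an
    article and returns exactly one second-level category (or None).
    """
    rng = random.Random(seed)
    pool = list(articles)
    rng.shuffle(pool)  # iterating a shuffled pool = random draws without replacement

    selected = defaultdict(list)
    for article in pool:
        # Stop as soon as every category has reached its quota.
        if all(len(selected[c]) >= quota for c in categories):
            break
        category = classify(article)
        if category not in categories:
            continue  # the article fits no category of the taxonomy
        if len(selected[category]) >= quota:
            continue  # category already full: discard the article
        selected[category].append(article)
    return dict(selected)
```

Because the loop also ends when the pool is exhausted, a category may finish with fewer than `quota` articles if too few matching articles exist.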
We observed that the first-level categories of the NewsCodes Media Topic taxonomy are not specific enough to catalogue an article. For example, the “sport” category may include different aspects, such as information about specific sports, sports event announcements, and the sports industry in general, which have more specific meanings than the first-level category label is able to convey. Therefore, we used the second-level categories of the NewsCodes Media Topic taxonomy to assign a more specific category to each article. In comparison to the previously published datasets, our dataset includes unique categories such as “arts and entertainment”, “mass media”, “armed conflict”, “weather statistic”, and “weather warning”. We created the proposed Multilabeled News Dataset (MN-DS) by hand-picking and labeling approximately 100 news articles for each second-level category (https://www.iptc.org/std/NewsCodes/treeview/mediatopic/mediatopic-en-GB.html (accessed on 13 March 2022)) of the NewsCodes Media Topic taxonomy.
After manually selecting news articles relevant to each category, we obtained 10,917 articles in 17 first-level and 109 second-level categories from 215 media sources. During the selection process, each article was processed by a single coder. An overview of the released MN-DS dataset by category is provided in Table 1. All data are available in CSV format at https://doi.org/10.5281/zenodo.7394850 under a Creative Commons license.
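As an illustration of how the CSV release might be inspected, the sketch below tallies articles per first-level and second-level category using only the Python standard library. The column names `category_level_1` and `category_level_2` are assumptions that should be checked against the header of the downloaded file, and the four rows are made-up sample data, not records from MN-DS.

```python
import csv
import io
from collections import Counter, defaultdict


def category_counts(csv_file):
    """Count articles per (first-level, second-level) category pair."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        # Column names are assumed; adjust them to the actual CSV header.
        counts[(row["category_level_1"], row["category_level_2"])] += 1
    return counts


def group_by_first_level(counts):
    """Nest the pair counts as {first_level: {second_level: n}}."""
    tree = defaultdict(dict)
    for (level_1, level_2), n in counts.items():
        tree[level_1][level_2] = n
    return dict(tree)


# Made-up rows standing in for the real file.
sample = io.StringIO(
    "title,category_level_1,category_level_2\n"
    "Storm hits coast,weather,weather warning\n"
    "Rainfall record,weather,weather statistic\n"
    "Cup final preview,sport,sport event\n"
    "Flood alert issued,weather,weather warning\n"
)
counts = category_counts(sample)
tree = group_by_first_level(counts)
```

With the real data, the `io.StringIO` object would be replaced by a file handle opened with `newline=""` and `encoding="utf-8"`.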
The MN-DS contains articles published in 2019; the distribution of selected articles over the year is balanced, with slightly more articles for the month of January 2019. The majority of the articles were selected from mainstream sources such as ABC News, the BBC, The Sun, TASS, The Guardian, Birmingham Mail, The Independent, Evening Standard, and others. The dataset also includes a relatively small percentage of articles from alternative sources such as Sputnik, FREEDOMBUNKER, or Daily Buzz Live.
To describe the dataset, we created a word cloud representation of each category, as shown in Figure 1. The central concept of a word cloud is to visualize for each category the most popular words with a size corresponding to the degree of popularity. This representation allows us to quickly assess the quality of the text annotation since it displays the most common words of the category. In the bar chart shown in Figure 2, we can observe that the “science and technology” first-level category contains the highest count of topic-specific words, while in more general categories, such as “weather” or “human interest”, there is less variety in the texts, probably because they represent shorter and more similar articles.
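The word frequencies underlying such a word cloud can be approximated with a simple count per category. The sketch below is not the authors' code: the tokenizer, the small stop-word list, and the function name `top_words_by_category` are assumptions chosen to keep the example short.

```python
import re
from collections import Counter, defaultdict

# A deliberately tiny stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on"}


def top_words_by_category(articles, n=3):
    """Return the `n` most frequent non-stop words for each category.

    `articles` is an iterable of (category, text) pairs.
    """
    counters = defaultdict(Counter)
    for category, text in articles:
        tokens = re.findall(r"[a-z]+", text.lower())  # lowercase alphabetic tokens
        counters[category].update(t for t in tokens if t not in STOP_WORDS)
    return {cat: counter.most_common(n) for cat, counter in counters.items()}
```

In a word cloud, each returned count would determine the display size of the corresponding word.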
The purpose of this dataset is to provide labeled data to train and test classifiers that predict the topic of a news article. Since MN-DS is a subset of the NELA-GT dataset, it could also be used to study the veracity of news articles, but it is not limited to this application. Due to the nature of the NELA-GT dataset, the style of the articles is less formal, and we expect the dataset to be best suited to classifying articles from alternative/conspiracy sources or social media.
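When the dataset is used to train and test a classifier, a split that keeps every category represented in both partitions is one reasonable starting point. The following sketch shows a stratified split by category label; the function name `stratified_split`, the 20% test fraction, and the fixed seed are assumptions for illustration and are not prescribed by the dataset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Split article indices into train/test sets, category by category."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    train, test = [], []
    for indices in by_label.values():
        rng.shuffle(indices)
        # Keep at least one test article per category.
        n_test = max(1, round(len(indices) * test_fraction))
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)
```

Note that a category with a single article would end up only in the test set under this rule, so very small categories need separate handling.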