Data
Welcome to the data page! Here I explain where my data comes from and the hurdles I had to overcome to build my visualization with it. I have also included some discoveries I made along the way, as well as code snippets that may help you build your own project with this data or understand the problems I faced.
Data Source
The dataset I used comes from The New York Times's Archive API, an API that returns the collection of NYT articles for a given month, from 1851 through 2020. You simply put the year and month into the request, and it returns all articles published that month.
Example Call
https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={yourkey}
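For instance, a single month can be pulled with a few lines of Python (a minimal sketch using the `requests` library; replace the key placeholder with your own, and note that the articles live under `response.docs` in the returned JSON):

```python
import requests

def fetch_month(year, month, api_key):
    """Fetch every NYT article for one month from the Archive API."""
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    resp = requests.get(url, params={"api-key": api_key})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]  # list of article documents

articles = fetch_month(1981, 1, "yourkey")  # e.g., everything from January 1981
print(len(articles))
```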
The articles I used were U.S. news articles, because I wanted to keep my scope to the United States rather than the whole world for now. There are also many types of material that I did not want to use (such as Biography, Blogs, or Letters) because I felt they were too much to explore. Thus, I chose to keep only U.S. news articles: I believe they capture the theme of exploring how history changes, since news articles are a way to look back on past events.
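In practice, that filter boils down to checking a handful of fields on each document (a sketch; the exact label strings such as "article", "News", and "U.S." are my assumptions about how the NYT tags its documents, see the schema below):

```python
def is_us_news(doc):
    """Keep a document only if it is filed as a U.S. news article.

    The label strings here ("article", "News", "U.S.") are assumed values;
    check them against the actual data before relying on this.
    """
    return (doc.get("document_type") == "article"
            and doc.get("type_of_material") == "News"
            and doc.get("section_name") == "U.S.")
```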
Schema
The data returned from the API call is very rich. For each article, it includes the title, lead paragraph, word count, publish date, keywords, author, news desk, and much more. I am currently using the keywords and publish dates to build my visualizations, but my future plans involve visualizations that draw on other fields as well. Listed below are some of the fields I am using so far; a trimmed-down sample record follows the list:
- Keywords: a list of keywords describing the content of the document. Each keyword entry records the category it falls under, its rank, and whether it is a major keyword.
- Pub_date: the timestamp of when the document was published.
- Document_type: the type of document returned, such as an article or a paid post.
- Type_of_material: the type of material the document represents, such as News, Blog, Video, Summary, Letter, etc.
- Section_name: the section the document falls under, such as Arts, Books, Education, Fashion, World, or U.S.
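For reference, here is a trimmed-down example of what a single returned document looks like, keeping only the fields listed above (the field names follow the Archive API; the values are made up for illustration):

```python
# One document from response.docs, reduced to the fields used on this page.
# Field names follow the Archive API; the values below are invented examples.
sample_doc = {
    "pub_date": "1981-01-15T00:00:00+0000",
    "document_type": "article",
    "type_of_material": "News",
    "section_name": "U.S.",
    "keywords": [
        {"name": "subject", "value": "Elections", "rank": 1, "major": "N"},
        {"name": "glocations", "value": "United States", "rank": 2, "major": "N"},
    ],
}
```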
License
"The NYT APIs are owned by NYT and are licensed to you on a worldwide (except as limited below), non-exclusive, non-sublicenseable basis on the terms and conditions set forth herein. These Terms of Use define legal use of the NYT APIs, all updates, revisions, substitutions, and any copies of the NYT APIs made by or for you. All rights not expressly granted to you are reserved by NYT."
- https://developer.nytimes.com/terms
Wrangling
After countless trials and errors, I collected the New York Times U.S. news articles from 1981 through 2020. I started at 1981 because, while reviewing the data, I noticed that keywords became more prominent starting in that year. I also did not include articles that are missing the document type or type of material field; some documents lack one of the two. Since I used both fields to decide whether a document is a news article, I did not feel it was right to assume a document was an article just because it was marked as news, or vice versa. Therefore, I skipped those documents and instead counted how often this happened (the "Missed" column in the table below).
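As a minimal sketch of the collection step (assuming the `requests` library; the exact filter strings and the rate-limit pause are illustrative choices rather than the code I actually ran):

```python
import time
import requests

URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

def collect_us_news(year, api_key):
    """Collect one year of U.S. news articles, counting skipped documents."""
    articles, missed = [], 0
    for month in range(1, 13):
        resp = requests.get(URL.format(year=year, month=month),
                            params={"api-key": api_key})
        resp.raise_for_status()
        for doc in resp.json()["response"]["docs"]:
            # Skip (but count) documents missing either field the filter relies on.
            if not doc.get("document_type") or not doc.get("type_of_material"):
                missed += 1
                continue
            if (doc["document_type"] == "article"
                    and doc["type_of_material"] == "News"
                    and doc.get("section_name") == "U.S."):
                articles.append(doc)
        time.sleep(6)  # the Archive API is rate limited, so pause between calls
    return articles, missed
```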
(You can also check out the full collection code on the Gist page!)
After finally getting all the data and testing it out, I realized there were more problems than I expected. The two major challenges were duplicate articles and irregular special characters in the keywords. I initially thought the duplicated articles might be corrections of earlier articles, but that turned out not to be the case: further investigation showed that the duplicates had the same publish date and information, with no indication of being corrections. This was a major problem I had to fix, since it breaks the integrity of the data, so I wrote a function that removes all duplicate articles from the data. At the same time, I noticed that some keywords contained irregular symbols such as "<", "[", "-", "@", etc. These symbols seemed randomly placed; they made no sense in context and added no value to the keyword. I therefore replaced those irregular symbols with a space, and I also removed keywords that were empty or a single letter.
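In outline, the cleanup looked something like this (the duplicate key of publish date plus headline and the list of stripped symbols are illustrative simplifications):

```python
import re

IRREGULAR = re.compile(r"[<>\[\]@\-]")  # a sample of the stray symbols, not the full set

def remove_duplicates(articles):
    """Keep the first copy of each article, keyed on publish date + headline."""
    seen, unique = set(), []
    for doc in articles:
        key = (doc.get("pub_date"), doc.get("headline", {}).get("main"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def clean_keyword(value):
    """Replace irregular symbols with a space; drop empty or one-letter keywords."""
    cleaned = re.sub(r"\s+", " ", IRREGULAR.sub(" ", value)).strip()
    return cleaned if len(cleaned) > 1 else None
```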
Year | Old size (with duplicates) | Missed (skipped: missing fields) | New size (after removing duplicates) | Difference |
--- | --- | --- | --- | --- |
1981 | 15558 | 0 | 10330 | 5228 |
1982 | 15649 | 0 | 10408 | 5241 |
1983 | 14336 | 0 | 9515 | 4821 |
1984 | 18371 | 0 | 12142 | 6229 |
1985 | 16497 | 0 | 10955 | 5542 |
1986 | 16773 | 0 | 11225 | 5548 |
1987 | 15065 | 0 | 9972 | 5093 |
1988 | 15913 | 0 | 10537 | 5376 |
1989 | 12451 | 0 | 8265 | 4186 |
1990 | 10500 | 0 | 6985 | 3515 |
1991 | 10241 | 0 | 6762 | 3479 |
1992 | 8927 | 0 | 5905 | 3022 |
1993 | 7476 | 0 | 4994 | 2482 |
1994 | 7114 | 0 | 4727 | 2387 |
1995 | 7964 | 0 | 5313 | 2651 |
1996 | 8917 | 45 | 5926 | 2991 |
1997 | 6952 | 0 | 4648 | 2304 |
1998 | 7449 | 10 | 4928 | 2521 |
1999 | 6773 | 24 | 4493 | 2280 |
2000 | 10111 | 1 | 6741 | 3370 |
2001 | 12498 | 1 | 8333 | 4165 |
2002 | 12277 | 9 | 8198 | 4079 |
2003 | 10419 | 6 | 6950 | 3469 |
2004 | 13424 | 3 | 8876 | 4548 |
2005 | 13664 | 9 | 9108 | 4556 |
2006 | 15668 | 90 | 10377 | 5291 |
2007 | 9764 | 246 | 6501 | 3263 |
2008 | 12334 | 414 | 8191 | 4143 |
2009 | 9030 | 16308 | 6013 | 3017 |
2010 | 6493 | 68510 | 4306 | 2187 |
2011 | 10346 | 24673 | 6945 | 3401 |
2012 | 12275 | 0 | 8148 | 4127 |
2013 | 7677 | 0 | 5101 | 2576 |
2014 | 9527 | 0 | 6271 | 3256 |
2015 | 11172 | 0 | 7416 | 3756 |
2016 | 10363 | 0 | 6858 | 3505 |
2017 | 8266 | 0 | 5498 | 2768 |
2018 | 4987 | 13098 | 3318 | 1669 |
2019 | 10781 | 0 | 7208 | 3573 |
2020 | 9244 | 0 | 9244 | 0 |
Total | 443246 | 123447 | 297631 | 145615 |
**Note:** I do not know the accurate "missed" count after removing duplicates, because articles with missing fields were never added to the data in the first place, so I cannot tell whether those missed articles were duplicates themselves or had duplicates.
Besides the massive cleanup of the data, I also made some modifications to improve the interaction between the visualization and its users. One change was making all keywords lowercase. Doing so allowed for a more accurate count of each keyword and made searching more efficient: some keywords were capitalized word-by-word while others were fully capitalized, for example "United States" and "UNITED STATES," and these should be treated as the same keyword despite the different capitalization. In the future, I am thinking of stemming the keywords further so that acronyms and different tenses of a word can be folded into a single count.

The other customization concerns people's names, because the NYT writes names in "last name, first name, middle initial" order. Even though that order is not a problem in itself, I felt it could confuse users when searching and viewing the visualizations. So I parsed the names into the "first name, middle initial, last name" order that is familiar to most people.
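A minimal sketch of these two adjustments (real person keywords can contain suffixes or extra commas, so they may need more careful handling):

```python
def normalize_keyword(value):
    """Lowercase so 'United States' and 'UNITED STATES' count as one keyword."""
    return value.lower().strip()

def reorder_name(value):
    """Turn NYT's 'last, first middle' order into 'first middle last'."""
    if "," not in value:
        return value  # not in last-name-first form; leave it alone
    last, rest = value.split(",", 1)
    return f"{rest.strip()} {last.strip()}"

print(reorder_name(normalize_keyword("Obama, Barack H")))  # -> "barack h obama"
```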
(You can also check out the full cleaning code on the Gist page!)