Does History Repeat Itself?

Data

Welcome to the data page! Here I explain where my data comes from and the hurdles I had to overcome to build my visualizations with it. I also share some discoveries I made along the way, along with code snippets that may help you build your own project with this data or understand the problems I faced.

Data Source

The dataset I used comes from The New York Times's Archive API, which returns a collection of NYT articles for a given month from 1851 through 2020. You simply supply the year and month in the API request, and it returns all articles for that month.

Example Call

https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={yourkey}
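
For reference, here is a minimal sketch of how such a call might look in Python, assuming the requests library and a placeholder API key obtained from developer.nytimes.com; the Archive API nests the articles under response -> docs.

```python
import requests

API_KEY = "your-api-key"  # placeholder: register at developer.nytimes.com for a key

def fetch_archive(year, month):
    """Fetch all NYT articles for a given year and month from the Archive API."""
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    response = requests.get(url, params={"api-key": API_KEY})
    response.raise_for_status()
    return response.json()["response"]["docs"]  # list of article objects

# Example: every article from January 1981
docs = fetch_archive(1981, 1)
```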

The articles I used were U.S. news articles, because I wanted to limit my scope to the United States rather than the whole world for now. The API also returns many types of material that I did not want to use (such as Biography, Blog, or Letter) because I felt they were too much to explore. So I filtered for U.S. news articles only, which I believe captures the theme of exploring how history changes, since news articles are a way to look back on past events.

Schema

The data returned from the API call is very rich. For each article, it includes information such as the title, leading paragraph, word count, publish date, keywords, author, news desk, and much more. I am currently using keywords and publish dates to build my visualizations, but I plan to create other visualizations that draw on additional fields. Listed below are the fields I am using so far (a short sketch of how to read them follows the list):

  1. Keywords: a list of keywords that describe the content of the document. Each keyword includes the category it falls under, its rank, and whether it is a major keyword.
  2. Pub_date: the timestamp of when the document was published.
  3. Document_type: the type of document returned, such as an article or a paid post.
  4. Type_of_material: the type of material the document covers, such as News, Blog, Video, Summary, Letter, etc.
  5. Section_name: the section or category the document falls under, such as Arts, Books, Education, Fashion, World, or U.S.
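
Below is a small sketch of how these fields might be pulled out of a single document returned by the API. The field names follow the schema above; the .get defaults are a defensive assumption for documents that are missing a field.

```python
def extract_fields(doc):
    """Collect the fields used for the visualizations from one Archive API document."""
    return {
        "keywords": [kw.get("value", "") for kw in doc.get("keywords", [])],
        "pub_date": doc.get("pub_date"),
        "document_type": doc.get("document_type"),
        "type_of_material": doc.get("type_of_material"),
        "section_name": doc.get("section_name"),
    }
```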

License

"The NYT APIs are owned by NYT and are licensed to you on a worldwide (except as limited below), non-exclusive, non-sublicenseable basis on the terms and conditions set forth herein. These Terms of Use define legal use of the NYT APIs, all updates, revisions, substitutions, and any copies of the NYT APIs made by or for you. All rights not expressly granted to you are reserved by NYT."
- https://developer.nytimes.com/terms

Wrangling

After countless trials and errors, I collected New York Times U.S. news articles from 1981 to 2020. I started in 1981 because, while reviewing the data, I noticed that keywords became much more prominent from that year onward. I also excluded articles that are missing the document type or type of material field; some documents lack one of the two. Since I used both fields to determine whether a document is a news article, I felt it was not right to assume a document was an article just because it was news, or vice versa. So instead of including those documents, I counted how many times these occurrences happened.
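
A minimal sketch of that filtering step, using the docs list from the fetch sketch above; the exact field values matched here ("article", "News", and the "U.S." section) are assumptions, not necessarily the ones in my final code.

```python
us_news, missed = [], 0

for doc in docs:
    doc_type = doc.get("document_type")
    material = doc.get("type_of_material")
    if not doc_type or not material:
        missed += 1  # missing field: do not guess, just tally the occurrence
        continue
    if doc_type == "article" and material == "News" and doc.get("section_name") == "U.S.":
        us_news.append(doc)
```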

(You can also check it out at the Gist page!)

After finally getting all the data and having a chance to test it, I realized there were more problems than I expected. The two major challenges I faced were duplicates of the same articles and irregular special characters in keywords. I initially thought the duplicated articles might be corrections of earlier articles, but that was not the case: further investigation revealed that the duplicates had the same publish date and information as the originals, with no indication of being a correction. This became a major problem I had to fix, because it breaks the integrity of the data, so I wrote a function that removes all duplicate articles from the data. At the same time, I noticed that some keywords contained irregular symbols such as "<", "[", "-", "@", etc. These symbols seemed randomly placed, made no sense in context, and added no value to the article. I therefore replaced them with an empty space, and I also removed single-letter and empty keywords from the data.
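
The sketch below shows one way to implement both fixes. The deduplication key (URL plus publish date) and the exact symbol set in the regular expression are assumptions, not necessarily what the final Gist uses.

```python
import re

def remove_duplicates(articles):
    """Keep only the first occurrence of each article, keyed on URL and publish date."""
    seen, unique = set(), []
    for doc in articles:
        key = (doc.get("web_url"), doc.get("pub_date"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def clean_keyword(keyword):
    """Replace irregular symbols with a space and collapse the extra whitespace."""
    cleaned = re.sub(r'[<>\[\]@"-]', " ", keyword)
    return re.sub(r"\s+", " ", cleaned).strip()

def usable(keyword):
    """Drop keywords that end up empty or a single character after cleaning."""
    return len(keyword) > 1
```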

Year | Old size (with duplicates) | Missed (missing fields) | New size (after removing duplicates) | Difference in size
1981 15558 0 10330 5228
1982 15649 0 10408 5241
1983 14336 0 9515 4821
1984 18371 0 12142 6229
1985 16497 0 10955 5542
1986 16773 0 11225 5548
1987 15065 0 9972 5093
1988 15913 0 10537 5376
1989 12451 0 8265 4186
1990 10500 0 6985 3515
1991 10241 0 6762 3479
1992 8927 0 5905 3022
1993 7476 0 4994 2482
1994 7114 0 4727 2387
1995 7964 0 5313 2651
1996 8917 45 5926 2991
1997 6952 0 4648 2304
1998 7449 10 4928 2521
1999 6773 24 4493 2280
2000 10111 1 6741 3370
2001 12498 1 8333 4165
2002 12277 9 8198 4079
2003 10419 6 6950 3469
2004 13424 3 8876 4548
2005 13664 9 9108 4556
2006 15668 90 10377 5291
2007 9764 246 6501 3263
2008 12334 414 8191 4143
2009 9030 16308 6013 3017
2010 6493 68510 4306 2187
2011 10346 24673 6945 3401
2012 12275 0 8148 4127
2013 7677 0 5101 2576
2014 9527 0 6271 3256
2015 11172 0 7416 3756
2016 10363 0 6858 3505
2017 8266 0 5498 2768
2018 4987 13098 3318 1669
2019 10781 0 7208 3573
2020 9244 0 9244 0
Total 443246 123447 297631 145615

**Note: I do not know the exact missed counts after removing duplicates, because articles with missing fields were never added to the dataset in the first place, so I cannot tell whether those missed articles were, or had, duplicates.


Besides the massive cleanup of the data, I also made some modifications to improve the interaction between the visualization and its users. One change was making all keywords lowercase, which allows for a more accurate count of the keywords and makes searching more efficient. For instance, some keywords capitalized only the first letter of each word while others were fully capitalized, such as "United States" and "UNITED STATES"; these should be treated as the same keyword despite the different capitalization. In the future, I am thinking of stemming the keywords further so that acronyms and different tenses of a word fold into a single count. The other customization concerns people's names, because the NYT writes names in "last name, first name, middle initial" order. Even though that order is not wrong, I felt it could confuse users when searching and viewing the visualizations, so I parsed the names into the "first name, middle initial, last name" order that is familiar to most people.
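
Here is a rough sketch of both adjustments; it assumes a person keyword always contains a single comma separating the last name from the rest, which may not hold for every entry.

```python
def normalize_keyword(keyword):
    """Lowercase a keyword so "United States" and "UNITED STATES" count as one."""
    return keyword.lower()

def reorder_name(name):
    """Turn "last name, first name middle initial" into "first name middle initial last name"."""
    if "," not in name:
        return name  # not in the NYT name format, leave it alone
    last, rest = name.split(",", 1)
    return f"{rest.strip()} {last.strip()}"

# Example: reorder_name("obama, barack h") -> "barack h obama"
```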

(You can also check it out at the Gist page!)