Data
Welcome to the data page! Here I explain where my data comes from and the hurdles I had to overcome to build my visualization with it. I have also included some discoveries I made along the way, as well as code snippets that may help you build your own project with this data or understand the problems I faced.
Data Source
The dataset I used comes from The New York Times's Archive API, an API that returns the collection of NYT articles for a given month, from 1851 through 2020. You simply put the year and month into the request, and it returns all articles published that month.
Example Call
https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={yourkey}
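For instance, a single month can be pulled with a few lines of Python (a minimal sketch using the `requests` library; replace the key placeholder with your own, and note that the articles live under `response.docs` in the returned JSON):

```python
import requests

def fetch_month(year, month, api_key):
    """Fetch every NYT article for one month from the Archive API."""
    url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
    resp = requests.get(url, params={"api-key": api_key})
    resp.raise_for_status()
    return resp.json()["response"]["docs"]  # list of article documents

articles = fetch_month(1981, 1, "yourkey")  # e.g., everything from January 1981
print(len(articles))
```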
The articles I used were U.S. news articles, because I wanted to keep my scope to the United States rather than the whole world for now. There are also many types of material that I did not want to use (such as Biography, Blogs, or Letters) because I felt they were too much to explore. Thus, I chose to keep only U.S. news articles: I believe they capture the theme of exploring how history changes, since news articles are a way to look back on past events.
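In practice, that filter boils down to checking a handful of fields on each document (a sketch; the exact label strings such as "article", "News", and "U.S." are my assumptions about how the NYT tags its documents, see the schema below):

```python
def is_us_news(doc):
    """Keep a document only if it is filed as a U.S. news article.

    The label strings here ("article", "News", "U.S.") are assumed values;
    check them against the actual data before relying on this.
    """
    return (doc.get("document_type") == "article"
            and doc.get("type_of_material") == "News"
            and doc.get("section_name") == "U.S.")
```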
Schema
The data returned from the API call is very rich. For each article, it includes the title, lead paragraph, word count, publish date, keywords, author, news desk, and much more. I am currently using the keywords and publish dates to build my visualizations, but my future plans involve visualizations that draw on other fields as well. Listed below are some of the fields I am using so far; a trimmed-down sample record follows the list:
- Keywords: a list of keywords describing the content of the document. Each keyword entry records the category it falls under, its rank, and whether it is a major keyword.
- Pub_date: the timestamp of when the document was published.
- Document_type: the type of document returned, such as an article or a paid post.
- Type_of_material: the type of material the document represents, such as News, Blog, Video, Summary, Letter, etc.
- Section_name: the section the document falls under, such as Arts, Books, Education, Fashion, World, or U.S.
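For reference, here is a trimmed-down example of what a single returned document looks like, keeping only the fields listed above (the field names follow the Archive API; the values are made up for illustration):

```python
# One document from response.docs, reduced to the fields used on this page.
# Field names follow the Archive API; the values below are invented examples.
sample_doc = {
    "pub_date": "1981-01-15T00:00:00+0000",
    "document_type": "article",
    "type_of_material": "News",
    "section_name": "U.S.",
    "keywords": [
        {"name": "subject", "value": "Elections", "rank": 1, "major": "N"},
        {"name": "glocations", "value": "United States", "rank": 2, "major": "N"},
    ],
}
```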
License
"The NYT APIs are owned by NYT and are licensed to you on a worldwide (except as limited below), non-exclusive, non-sublicenseable basis on the terms and conditions set forth herein. These Terms of Use define legal use of the NYT APIs, all updates, revisions, substitutions, and any copies of the NYT APIs made by or for you. All rights not expressly granted to you are reserved by NYT."
- https://developer.nytimes.com/terms
Wrangling
After countless trials and errors, I collected the New York Times U.S. news articles from 1981 through 2020. I started at 1981 because, while reviewing the data, I noticed that keywords became more prominent starting in that year. I also did not include articles that are missing the document type or type of material field; some documents lack one of the two. Since I used both fields to decide whether a document is a news article, I did not feel it was right to assume a document was an article just because it was marked as news, or vice versa. Therefore, I skipped those documents and instead counted how often this happened (the "Missed" column in the table below).
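As a minimal sketch of the collection step (assuming the `requests` library; the exact filter strings and the rate-limit pause are illustrative choices rather than the code I actually ran):

```python
import time
import requests

URL = "https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

def collect_us_news(year, api_key):
    """Collect one year of U.S. news articles, counting skipped documents."""
    articles, missed = [], 0
    for month in range(1, 13):
        resp = requests.get(URL.format(year=year, month=month),
                            params={"api-key": api_key})
        resp.raise_for_status()
        for doc in resp.json()["response"]["docs"]:
            # Skip (but count) documents missing either field the filter relies on.
            if not doc.get("document_type") or not doc.get("type_of_material"):
                missed += 1
                continue
            if (doc["document_type"] == "article"
                    and doc["type_of_material"] == "News"
                    and doc.get("section_name") == "U.S."):
                articles.append(doc)
        time.sleep(6)  # the Archive API is rate limited, so pause between calls
    return articles, missed
```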
(You can also check out the full collection code on the Gist page!)
After finally getting all the data and testing it out, I realized there were more problems than I expected. The two major challenges were duplicate articles and irregular special characters in the keywords. I initially thought the duplicated articles might be corrections of earlier articles, but that turned out not to be the case: further investigation showed that the duplicates had the same publish date and information, with no indication of being corrections. This was a major problem I had to fix, since it breaks the integrity of the data, so I wrote a function that removes all duplicate articles from the data. At the same time, I noticed that some keywords contained irregular symbols such as "<", "[", "-", "@", etc. These symbols seemed randomly placed; they made no sense in context and added no value to the keyword. I therefore replaced those irregular symbols with a space, and I also removed keywords that were empty or a single letter.
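In outline, the cleanup looked something like this (the duplicate key of publish date plus headline and the list of stripped symbols are illustrative simplifications):

```python
import re

IRREGULAR = re.compile(r"[<>\[\]@\-]")  # a sample of the stray symbols, not the full set

def remove_duplicates(articles):
    """Keep the first copy of each article, keyed on publish date + headline."""
    seen, unique = set(), []
    for doc in articles:
        key = (doc.get("pub_date"), doc.get("headline", {}).get("main"))
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

def clean_keyword(value):
    """Replace irregular symbols with a space; drop empty or one-letter keywords."""
    cleaned = re.sub(r"\s+", " ", IRREGULAR.sub(" ", value)).strip()
    return cleaned if len(cleaned) > 1 else None
```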
Year | Old size (with duplicates) | Missed (skipped: missing fields) | New size (after removing duplicates) | Difference |
--- | --- | --- | --- | --- |
1981 | 15558 | 0 | 10330 | 5228 |
1982 | 15649 | 0 | 10408 | 5241 |
1983 | 14336 | 0 | 9515 | 4821 |
1984 | 18371 | 0 | 12142 | 6229 |
1985 | 16497 | 0 | 10955 | 5542 |
1986 | 16773 | 0 | 11225 | 5548 |
1987 | 15065 | 0 | 9972 | 5093 |
1988 | 15913 | 0 | 10537 | 5376 |
1989 | 12451 | 0 | 8265 | 4186 |
1990 | 10500 | 0 | 6985 | 3515 |
1991 | 10241 | 0 | 6762 | 3479 |
1992 | 8927 | 0 | 5905 | 3022 |
1993 | 7476 | 0 | 4994 | 2482 |
1994 | 7114 | 0 | 4727 | 2387 |
1995 | 7964 | 0 | 5313 | 2651 |
1996 | 8917 | 45 | 5926 | 2991 |
1997 | 6952 | 0 | 4648 | 2304 |
1998 | 7449 | 10 | 4928 | 2521 |
1999 | 6773 | 24 | 4493 | 2280 |
2000 | 10111 | 1 | 6741 | 3370 |
2001 | 12498 | 1 | 8333 | 4165 |
2002 | 12277 | 9 | 8198 | 4079 |
2003 | 10419 | 6 | 6950 | 3469 |
2004 | 13424 | 3 | 8876 | 4548 |
2005 | 13664 | 9 | 9108 | 4556 |
2006 | 15668 | 90 | 10377 | 5291 |
2007 | 9764 | 246 | 6501 | 3263 |
2008 | 12334 | 414 | 8191 | 4143 |
2009 | 9030 | 16308 | 6013 | 3017 |
2010 | 6493 | 68510 | 4306 | 2187 |
2011 | 10346 | 24673 | 6945 | 3401 |
2012 | 12275 | 0 | 8148 | 4127 |
2013 | 7677 | 0 | 5101 | 2576 |
2014 | 9527 | 0 | 6271 | 3256 |
2015 | 11172 | 0 | 7416 | 3756 |
2016 | 10363 | 0 | 6858 | 3505 |
2017 | 8266 | 0 | 5498 | 2768 |
2018 | 4987 | 13098 | 3318 | 1669 |
2019 | 10781 | 0 | 7208 | 3573 |
2020 | 9244 | 0 | 9244 | 0 |
Total | 443246 | 123447 | 297631 | 145615 |
**Note:** I do not know the accurate "missed" count after removing duplicates, because articles with missing fields were never added to the data in the first place, so I cannot tell whether those missed articles were duplicates themselves or had duplicates.
Besides the massive cleanup of the data, I also made some modifications to improve the interaction between the visualization and its users. One change was making all keywords lowercase. Doing so allowed for a more accurate count of each keyword and made searching more efficient: some keywords were capitalized word-by-word while others were fully capitalized, for example "United States" and "UNITED STATES," and these should be treated as the same keyword despite the different capitalization. In the future, I am thinking of stemming the keywords further so that acronyms and different tenses of a word can be folded into a single count.

The other customization concerns people's names, because the NYT writes names in "last name, first name, middle initial" order. Even though that order is not a problem in itself, I felt it could confuse users when searching and viewing the visualizations. So I parsed the names into the "first name, middle initial, last name" order that is familiar to most people.
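A minimal sketch of these two adjustments (real person keywords can contain suffixes or extra commas, so they may need more careful handling):

```python
def normalize_keyword(value):
    """Lowercase so 'United States' and 'UNITED STATES' count as one keyword."""
    return value.lower().strip()

def reorder_name(value):
    """Turn NYT's 'last, first middle' order into 'first middle last'."""
    if "," not in value:
        return value  # not in last-name-first form; leave it alone
    last, rest = value.split(",", 1)
    return f"{rest.strip()} {last.strip()}"

print(reorder_name(normalize_keyword("Obama, Barack H")))  # -> "barack h obama"
```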
(You can also check out the full cleaning code on the Gist page!)