#pulitzer-scraper
Digging through the new pulitzer.org site, I found the API used to pull winners by year or category. Calls are based on term id (tid), a three digit number that corresponds to a specific category. I wrote a quick node-based scraper that pulls together lists of all the Pulitzer prize winners and finalists in separate CSV files, as well as a list of all the revisions to categories over the years.
NOTE: Running the app requires 205 separate API calls to the Pulitzer site. Use responsibly.
##To run:
Clone repo:
git clone https://github.com/jonseitz/pulitzer-scraper.git
cd pulitzer-scraper
Install dependencies:
npm install
run app:
node app
##Pulitzer API endpoints
http://www.pulitzer.org/cms/api/1/winners/cat/ [tid] /raw.json
The tid
range for current categories goes from 204 ("Public Service") - 225 ("Music"):
- 204: Public Service
- 205: Breaking News Reporting
- 206: Investigative Reporting
- 207: Explanatory Reporting
- 208: Local Reporting
- 209: National Reporting
- 210: International Reporting
- 211: Feature Writing
- 212: Commentary
- 213: Criticism
- 214: Editorial Writing
- 215: Editorial Cartooning
- 216: Breaking News Photography
- 217: Feature Photography
- 218: Drama
- 219: Fiction
- 220: History
- 221: Blank
- 222: Biography or Autobiography
- 223: General Nonfiction
- 224: Poetry
- 225: Music
Deprecated/Special categories are in a separate range from 260 (Special Awards & Citations) - 278 (Beat Reportting):
- 260: Special Awards and Citations
- 261: Novel
- 262: Blank
- 263: Correspondence
- 264: Explanatory Journalism
- 265: General News Reporting
- 266: Local General or Spot News Reporting
- 267: Local Investigative Specialized Reporting
- 268: Local Reporting - Edition time
- 269: Local Reporting - No edition time
- 270: Newspaper History Award
- 271: Photography
- 272: Reporting
- 273: Specialized Reporting
- 274: Spot News Photography
- 275: Spot News Reporting
- 276: Telegraphic Reporting - International
- 277: Telegraphic Reporting - National
- 278: Beat Reporting
http://www.pulitzer.org/cms/api/1/finalist/ [tid] /all/raw.json
tid
ranges mirror those above.
http://www.pulitzer.org/cms/api/1/winners/year/ [tid] /raw.json
The tid
range for years goes from 105 (2015) - 203 (1917).
http://www.pulitzer.org/cache/api/1/finalist/all/ [tid] /raw.json
tid
ranges mirror those above.
http://www.pulitzer.org/cms/api/1/term/ [tid] /raw.json
Enter any tid
to translate the category or year id into a real category.
FYI: tid
values above 278 appear to be categories or tags. The highest value I found is 549, which is "Alaska".