TEDscraper

Scrape TED talk data, including transcripts in over 100 languages, from TED.com.

Requirements

Python 3
Requests
Beautiful Soup 4
fake-useragent
Pandas

Usage

# move to TEDscraper directory
# import module (or use Jupyter Notebook)
from TEDscraper import TEDscraper

# instantiate the scraper & pass in optional arguments
scraper = TEDscraper(lang_code='en', urls='all', topics='all')

# scrape the data and save it to a dictionary
ted_dict = scraper.get_data()

# transform the dictionary to a sorted pandas DataFrame
df = scraper.to_dataframe(ted_dict)

# output DataFrame as CSV
df.to_csv('../data/ted_talks.csv', index=False)

For other output formats, see the pandas documentation.

Parameters

lang_code
- English is the default language (lang_code='en')
- You can pass in any other language code with the lang_code param
- TED translators don't always translate every feature
  - Example: the title and 'About Speaker' text might remain in English while the transcript is translated to French
urls
- All URLs are scraped by default for the selected language (urls='all')
- You may pass in a list of URLs instead, with a few limitations:
  - TED must have the talks available in the language you specify
  - Only one language can be provided per scrape call
topics
- All topics are scraped by default (topics='all')
- You may pass in a list of topics to scrape only talks tagged with them
force_fetch
- Talks with known issues are skipped by default (force_fetch=False)
- Set force_fetch=True to attempt to scrape them anyway
- See talks with known issues
exclude_transcript
- All features are scraped by default (exclude_transcript=False)
- Set exclude_transcript=True to skip the transcript
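
The parameter rules above can be checked before starting a long scrape. The following is a minimal sketch and not part of TEDscraper: build_scraper_kwargs is a hypothetical helper, the topic name 'science' is a placeholder, and the checks encode only the constraints listed in this section (one language code per call, urls and topics as either 'all' or a list, boolean flags).

```python
def build_scraper_kwargs(lang_code='en', urls='all', topics='all',
                         force_fetch=False, exclude_transcript=False):
    """Validate arguments and return them as a dict of keyword arguments.

    Hypothetical helper: parameter names and defaults come from the
    Parameters section; the validation rules themselves are assumptions.
    """
    # Only one language can be provided per scrape call.
    if not isinstance(lang_code, str) or not lang_code:
        raise ValueError("lang_code must be a single language code, e.g. 'fr'")

    # urls and topics are either the string 'all' or a list of strings.
    for name, value in (('urls', urls), ('topics', topics)):
        if value == 'all':
            continue
        if not isinstance(value, (list, tuple)) or not all(
                isinstance(item, str) for item in value):
            raise ValueError(f"{name} must be 'all' or a list of strings")

    # The flags are booleans, not the strings 'True' / 'False'.
    for name, value in (('force_fetch', force_fetch),
                        ('exclude_transcript', exclude_transcript)):
        if not isinstance(value, bool):
            raise ValueError(f"{name} must be True or False")

    return {
        'lang_code': lang_code,
        'urls': urls if urls == 'all' else list(urls),
        'topics': topics if topics == 'all' else list(topics),
        'force_fetch': force_fetch,
        'exclude_transcript': exclude_transcript,
    }


# French talks for one placeholder topic, without transcripts.
kwargs = build_scraper_kwargs(lang_code='fr', topics=['science'],
                              exclude_transcript=True)
# scraper = TEDscraper(**kwargs)  # requires the TEDscraper module
```

Failing early on a bad argument is cheaper than discovering it partway through a scrape of every talk.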

Attributes

Attribute	Description	Data Type
talk_id	Talk identification number provided by TED	int
title	Title of the talk	string
speaker_1	First speaker in TED's speaker list	string
speakers	Speakers in the talk	dictionary
occupations	*Occupations of the speakers	dictionary
about_speakers	*Blurb about each speaker	dictionary
views	Count of views	int
recorded_date	Date the talk was recorded	string
published_date	Date the talk was published to TED.com	string
event	Event or medium in which the talk was given	string
native_lang	Language the talk was given in	string
available_lang	All available languages (lang_code) for a talk	list
comments	Count of comments	int
duration	Duration in seconds	int
topics	Related tags or topics for the talk	list
related_talks	Related talks (key='talk_id', value='title')	dictionary
url	URL of the talk	string
description	Description of the talk	string
transcript	Full transcript of the talk	string

*Each dictionary key matches the key of the corresponding speaker in 'speakers'.
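
Because speakers, occupations, and about_speakers are dictionaries, a CSV export stores them as their string representation, and they come back as plain strings when the file is read again. The sketch below parses them back and pairs each speaker with an occupation entry. It assumes integer keys such as 0 and 1 and list-valued occupations; the sample row is invented for illustration and is not real scraped data.

```python
import ast
import csv
import io

# Invented sample row in the shape the export is assumed to have.
row = {
    'talk_id': 1,
    'speakers': {0: 'Speaker A', 1: 'Speaker B'},
    'occupations': {0: ['author'], 1: ['engineer', 'educator']},
}

# Write the row to an in-memory CSV. The csv module stringifies the
# dictionaries with str(), which mimics what a CSV export does to them.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)


def parse_dict(text):
    """Turn a stringified dictionary back into a dict (empty if blank)."""
    return ast.literal_eval(text) if text else {}


def speakers_with_occupations(record):
    """Map each speaker name to the occupations stored under the same key."""
    speakers = parse_dict(record['speakers'])
    occupations = parse_dict(record['occupations'])
    return {name: occupations.get(key) for key, name in speakers.items()}


paired_rows = [speakers_with_occupations(record)
               for record in csv.DictReader(buffer)]
print(paired_rows)
```

ast.literal_eval only evaluates Python literals, so it is a safer choice than eval for reading these columns back.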

Languages

TED talks have been subtitled in over 100 languages. Here are the top languages:

Code	Language
en	English
es	Spanish
pt-br	Portuguese (Brazilian)
fr	French
it	Italian
zh-cn	Chinese (simplified)
zh-tw	Chinese (traditional)
ko	Korean
ja	Japanese
tr	Turkish
ru	Russian
he	Hebrew

Here is a link to all language codes available as of May 2020.

You can see all the talks for each language at TED – Our Languages.
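
Since only one language can be scraped per call, collecting several languages means running one scrape per language code. A rough sketch of that loop follows. scrape_languages is a hypothetical helper, and the scraper class is passed in as an argument so the loop can be tried without network access. In real use you would pass TEDscraper itself; StubScraper and the data it returns exist only for this demonstration and say nothing about the real structure of the scraped dictionary.

```python
def scrape_languages(scraper_cls, lang_codes, **options):
    """Run one scrape per language code and collect the results.

    Returns a dict mapping each language code to whatever get_data() returns.
    """
    results = {}
    for code in lang_codes:
        scraper = scraper_cls(lang_code=code, **options)
        results[code] = scraper.get_data()
    return results


class StubScraper:
    """Stand-in for TEDscraper that returns canned data instead of scraping."""

    def __init__(self, lang_code='en', urls='all', topics='all'):
        self.lang_code = lang_code

    def get_data(self):
        # Placeholder payload; the real scraper returns talk data.
        return {'lang_code': self.lang_code}


data_by_language = scrape_languages(StubScraper, ['es', 'pt-br', 'fr'])
# With the real class:
# data_by_language = scrape_languages(TEDscraper, ['es', 'pt-br', 'fr'])
```

Each entry can then be passed to to_dataframe separately, which keeps the per-language results apart until you decide how to combine them.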

Contributing

1. Fork it (https://github.com/corralm/ted-scraper/fork)
2. Create your feature branch (git checkout -b feature/fooBar)
3. Commit your changes (git commit -am 'Add some fooBar')
4. Push to the branch (git push origin feature/fooBar)
5. Create a new Pull Request
