Scrape TED talk data including transcripts in over 100 languages from TED.com
Python 3
Requests
Beautiful Soup 4
fake-useragent
Pandas
# move to TEDscraper directory
# import module (or use Jupyter Notebook)
from TEDscraper import TEDscraper
# instantiate the scraper & pass in optional arguments
scraper = TEDscraper(lang_code='en', urls='all', topics='all')
# scrape the data and save it to a dictionary
ted_dict = scraper.get_data()
# transform the dictionary to a sorted pandas DataFrame
df = scraper.to_dataframe(ted_dict)
# output DataFrame as CSV
df.to_csv('../data/ted_talks.csv', index=False)
For other output formats, see the pandas I/O documentation.
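Since several columns hold lists or dictionaries, a format that keeps nested values can be convenient. The following is a minimal sketch, assuming a small stand-in DataFrame (the rows are made up for illustration, not scraped data) and temporary file paths; it uses pandas' `to_json` and `to_pickle`.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the scraped DataFrame; these rows are illustrative only.
df = pd.DataFrame({
    "talk_id": [1, 2],
    "title": ["Example talk A", "Example talk B"],
    "topics": [["science", "technology"], ["design"]],
})

# Write to a temporary directory so the sketch leaves no files in the repo.
out_dir = tempfile.mkdtemp()
json_path = os.path.join(out_dir, "ted_talks.jsonl")
pickle_path = os.path.join(out_dir, "ted_talks.pkl")

# JSON Lines: one talk per line; list columns are written as JSON arrays.
df.to_json(json_path, orient="records", lines=True)

# Pickle: stores the Python objects as they are, but is only readable from Python.
df.to_pickle(pickle_path)

# Read both files back to confirm the list column survives the round trip.
restored_json = pd.read_json(json_path, lines=True)
restored_pickle = pd.read_pickle(pickle_path)
```

Either file can be swapped in for the CSV step above when the list-valued columns need to be read back as lists.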
- lang_code
  - English is the default language: `lang_code='en'`
  - You can pass in other language codes using the `lang_code` param
  - TED translators don't always translate all features
    - Ex: Title and 'About Speaker' might be in English while the transcript is translated to French
- urls
  - All URLs are scraped by default for the selected language: `urls='all'`
  - You may pass in a list of URLs. However, there are a few limitations:
    - TED must have the talks available in the language you specify
    - Only one language can be provided per scrape call
- topics
  - All topics are scraped by default: `topics='all'`
  - You may pass in a list of topics to filter by them
- force_fetch
  - Talks with known issues are skipped by default: `force_fetch=False`
  - Set it to `True` to attempt to scrape them
  - See talks with known issues
- exclude_transcript
  - All features are scraped by default: `exclude_transcript=False`
  - Set it to `True` to exclude the transcript
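Because each scrape call handles a single language, collecting several languages means one scraper instance per language code. The following is a rough sketch of that loop, not part of the package: `scrape_languages` is a hypothetical helper, and the topic name in the usage comment is a placeholder. The scraper class is passed in as an argument so the helper can be tried without network access.

```python
def scrape_languages(scraper_cls, lang_codes, **options):
    """Run one scrape per language code and return {lang_code: scraped data}.

    scraper_cls is expected to behave like TEDscraper: it accepts lang_code
    plus the optional keyword arguments, and exposes get_data().
    """
    results = {}
    for code in lang_codes:
        # One scraper instance per language, since a call covers one language.
        scraper = scraper_cls(lang_code=code, **options)
        results[code] = scraper.get_data()
    return results


# Usage with the real scraper (performs network requests, so left commented out):
# from TEDscraper import TEDscraper
# data = scrape_languages(
#     TEDscraper,
#     ["en", "es"],
#     topics=["science"],        # hypothetical topic name
#     exclude_transcript=True,   # skip the transcript feature
# )
```

Each value in the returned dictionary can then be passed to `to_dataframe` in the same way as `ted_dict` in the usage example.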
Attribute | Description | Data Type |
---|---|---|
talk_id | Talk identification number provided by TED | int |
title | Title of the talk | string |
speaker_1 | First speaker in TED's speaker list | string |
speakers | Speakers in the talk | dictionary |
occupations | *Occupations of the speakers | dictionary |
about_speakers | *Blurb about each speaker | dictionary |
views | Count of views | int |
recorded_date | Date the talk was recorded | string |
published_date | Date the talk was published to TED.com | string |
event | Event or medium in which the talk was given | string |
native_lang | Language the talk was given in | string |
available_lang | All available languages (lang_code) for a talk | list |
comments | Count of comments | int |
duration | Duration in seconds | int |
topics | Related tags or topics for the talk | list |
related_talks | Related talks (key='talk_id', value='title') | dictionary |
url | URL of the talk | string |
description | Description of the talk | string |
transcript | Full transcript of the talk | string |
*The dictionary key maps to the speaker in ‘speakers’.
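Once the DataFrame is written to CSV, list and dictionary columns are stored as text, so reading the file back yields strings rather than Python objects. The following is a minimal sketch of one way to restore them with the standard library's `ast.literal_eval`; `parse_cell` and `restore_columns` are hypothetical helpers, and the sample cell values are made up for illustration.

```python
import ast


def parse_cell(text):
    """Turn the text form of a list or dictionary back into a Python object."""
    return ast.literal_eval(text)


def restore_columns(df, columns):
    """Return a copy of df with the given string columns parsed into objects."""
    restored = df.copy()
    for col in columns:
        restored[col] = restored[col].apply(parse_cell)
    return restored


# Text as it might appear in a CSV cell (made-up values for illustration).
speakers = parse_cell("{0: 'Speaker One', 1: 'Speaker Two'}")
topics = parse_cell("['science', 'technology']")
```

After `pd.read_csv('../data/ted_talks.csv')`, a call such as `restore_columns(df, ['speakers', 'topics'])` would apply the same parsing to whole columns. Cells with missing values would need to be handled before parsing.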
TED talks have been subtitled in over 100 languages. Here are the top languages:
Code | Language |
---|---|
en | English |
es | Spanish |
pt-br | Portuguese (Brazilian) |
fr | French |
it | Italian |
zh-cn | Chinese (simplified) |
zh-tw | Chinese (traditional) |
ko | Korean |
ja | Japanese |
tr | Turkish |
ru | Russian |
he | Hebrew |
Here is a link to all language codes available as of May 2020.
You can see all the talks for each language at TED – Our Languages.
Author: Miguel Corral Jr.
LinkedIn: https://www.linkedin.com/in/miguelcorraljr/
GitHub: https://github.com/corralm
Distributed under the MIT license. See LICENSE for more information.
- Fork it (https://github.com/yourname/yourproject/fork)
- Create your feature branch (`git checkout -b feature/fooBar`)
- Commit your changes (`git commit -am 'Add some fooBar'`)
- Push to the branch (`git push origin feature/fooBar`)
- Create a new Pull Request