Scrape Ratings from Glassdoor.com

A prototype for scraping Glassdoor ratings for a given portfolio holdings file (powered by Google's Custom Search API and Selenium).

Process Overview

Take a list of company names from a holdings file.
Determine each company's Glassdoor page via Google's Custom Search API, using the "cleaned up" company name as the query.
Scrape Glassdoor's main company page for company information and top-level ratings, plus an additional modal/pop-up for granular ratings (mimicking user clicks via Selenium).
Organize the output and merge it back into the original holdings file as columns.

This is a prototype for further development. While you might find the snippets here helpful, moving from one step to the next still takes some hand-holding, on top of setting up a Google Custom Search Engine API account, a VPN, etc. Feel free to message me if you need any guidance.

Process Details

1. Create Google Queries from List of Company Names `clean_names.py`

Removes common junk and share-class designations from each name.
Input: all_2020_12_18.csv (just a list of the names we want to collect information on)
Output: company_queries_2020_12_18.csv
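
As a rough illustration of the cleaning step, here is a minimal sketch. The patterns in `JUNK_PATTERNS` and the helper names `clean_company_name` and `build_query` are assumptions made for this example; the actual rules in `clean_names.py` are not described in this README and may differ.

```python
import re

# Hypothetical patterns for illustration only; not taken from clean_names.py.
JUNK_PATTERNS = [
    r"\bclass\s+[a-z]\b",  # share-class markers such as "Class A"
    r"\b(inc|corp|corporation|co|ltd|plc|llc|holdings?)\b\.?",  # legal suffixes
    r"[^\w\s&]",  # stray punctuation (keeps "&")
]


def clean_company_name(raw_name: str) -> str:
    """Lower-case a name and strip suffixes, share classes and punctuation."""
    name = raw_name.lower()
    for pattern in JUNK_PATTERNS:
        name = re.sub(pattern, " ", name)
    # Collapse the whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", name).strip()


def build_query(raw_name: str) -> str:
    """Turn a raw holdings name into a search query (assumed query format)."""
    return f"{clean_company_name(raw_name)} glassdoor"


print(clean_company_name("Alphabet Inc. Class A"))
print(build_query("Coca-Cola Co"))
```

Appending "glassdoor" to the query is one possible design; a custom search engine restricted to glassdoor.com would make it unnecessary.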

2. Find Top 10 Glassdoor Pages via Google API `get_glassdoor_page.py`

Input: company_queries_2020_12_18.csv
Output: ./google_results/json/<company_id>.json
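
A minimal sketch of this step, assuming the Google Custom Search JSON API endpoint with its `key`, `cx`, `q` and `num` parameters. The function names, the API key and the engine ID are placeholders for this example, and the real `get_glassdoor_page.py` may be organized differently.

```python
import json
import urllib.parse
import urllib.request
from pathlib import Path

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id, num=10):
    """Build the request URL; the API returns at most 10 results per call."""
    params = {"key": api_key, "cx": engine_id, "q": query, "num": num}
    return f"{API_ENDPOINT}?{urllib.parse.urlencode(params)}"


def save_search_results(company_id, query, api_key, engine_id,
                        out_dir="./google_results/json"):
    """Run one search and store the raw JSON response as <company_id>.json."""
    url = build_search_url(query, api_key, engine_id)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    out_path = Path(out_dir) / f"{company_id}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return out_path
```

Calling `save_search_results("12345", "alphabet glassdoor", "YOUR_API_KEY", "YOUR_ENGINE_ID")` would write `./google_results/json/12345.json`; the ID and credentials here are illustrative only.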

3. Unpack Google Results `unpack_google_searches.py`

Input: ./google_results/json/<company_id>.json
Output: ./google_results/top_google_results_2020_12_18.csv
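
The unpacking step could look roughly like the sketch below. It assumes each saved response follows the Custom Search JSON layout, with results under an `items` list carrying `title` and `link`. The column names in `FIELDS` and the function names are choices made for this example, not the actual schema of the CSV produced by `unpack_google_searches.py`.

```python
import csv
import json
from pathlib import Path

# Assumed output columns, for illustration only.
FIELDS = ["company_id", "rank", "title", "link"]


def unpack_result(company_id, payload):
    """Flatten one search response into one row per result."""
    rows = []
    # Responses with no hits have no "items" key, so default to an empty list.
    for rank, item in enumerate(payload.get("items", []), start=1):
        rows.append({
            "company_id": company_id,
            "rank": rank,
            "title": item.get("title", ""),
            "link": item.get("link", ""),
        })
    return rows


def unpack_directory(json_dir, csv_path):
    """Write the rows from every <company_id>.json file into a single CSV."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        for json_file in sorted(Path(json_dir).glob("*.json")):
            payload = json.loads(json_file.read_text(encoding="utf-8"))
            writer.writerows(unpack_result(json_file.stem, payload))
```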

4. Download HTMLs for Top Results `get_ceo_rating.py`

Input: ./google_results/top_google_results_2020_12_18.csv

Outputs:

./extracts/overview/<glassdor_link.html> (main page)
./extracts/overview_extra/<glassdor_link.html> (additional info)
./extracts/errors/<glassdor_link.html> (pages that encountered errors)

Note: The script sleeps for a random interval of 10 to 30 seconds.
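
For illustration, a download loop with a randomized pause might be sketched as follows. The helper names, the file-naming scheme and the placement of the pause after each page are assumptions, not the actual logic of `get_ceo_rating.py`. The sketch covers only the main page and leaves out the modal/pop-up that feeds `overview_extra`.

```python
import random
import re
import time
from pathlib import Path


def polite_delay(min_seconds=10.0, max_seconds=30.0):
    """Pick a random pause length within the bounds given in the note above."""
    return random.uniform(min_seconds, max_seconds)


def link_to_filename(link):
    """Turn a URL into a filesystem-safe name (hypothetical naming scheme)."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", link).strip("_") + ".html"


def download_pages(links, out_dir="./extracts/overview",
                   error_dir="./extracts/errors"):
    """Save each page's HTML, routing pages that fail to load to error_dir."""
    # Imported here so the helpers above work without Selenium installed.
    from selenium import webdriver
    from selenium.common.exceptions import WebDriverException

    driver = webdriver.Chrome()
    try:
        for link in links:
            try:
                driver.get(link)
                target = Path(out_dir)
            except WebDriverException:
                target = Path(error_dir)
            target.mkdir(parents=True, exist_ok=True)
            path = target / link_to_filename(link)
            path.write_text(driver.page_source, encoding="utf-8")
            time.sleep(polite_delay())
    finally:
        driver.quit()
```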

5. Extract Structured Data from HTML `extract_from_html.py`

Input:

./extracts/overview/<glassdor_link.html> (main page)
./extracts/overview_extra/<glassdor_link.html> (additional info)

Output:

./extracted_glassdoor.csv

Note: Uses multiprocessing to loop through all the raw HTML files.
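
A minimal sketch of the multiprocessing pattern, using the standard library's `multiprocessing.Pool`. The single field pulled out here (the page `<title>`) is a stand-in for this example; the fields and parsing logic that `extract_from_html.py` actually uses are not described in this README.

```python
import re
from multiprocessing import Pool
from pathlib import Path

# Stand-in extraction rule for illustration: grab the page title.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_one(html_path):
    """Parse a single saved HTML file into a flat record."""
    text = Path(html_path).read_text(encoding="utf-8", errors="replace")
    match = TITLE_RE.search(text)
    title = match.group(1).strip() if match else ""
    return {"file": Path(html_path).name, "title": title}


def extract_all(html_dir, workers=4):
    """Fan the per-file parsing out across worker processes."""
    paths = sorted(str(p) for p in Path(html_dir).glob("*.html"))
    with Pool(processes=workers) as pool:
        # pool.map returns the records in the same order as paths.
        return pool.map(extract_one, paths)
```

On platforms that start worker processes with "spawn" (Windows, macOS), a call such as `extract_all("./extracts/overview")` should sit under an `if __name__ == "__main__":` guard.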

6. Build Final Output `build_final_output.py`

Formats the data to the output specification. Uses the company websites from the original data to verify the mapping against the company homepage field from Glassdoor.

Input:

./extracted_glassdoor.csv

Output:

./glassdoor_ratings.csv (main page)
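
One way to sketch the website check is to compare normalized domains. The helper names and the normalization rules below (lower-casing, dropping the scheme, port and a leading "www.") are assumptions for this example; `build_final_output.py` may use a different matching rule.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Reduce a URL or bare hostname to a comparable domain string."""
    if not url:
        return ""
    candidate = url.strip().lower()
    # urlparse only fills netloc when a scheme is present.
    if "://" not in candidate:
        candidate = "http://" + candidate
    host = urlparse(candidate).netloc.split(":")[0]
    return host[4:] if host.startswith("www.") else host


def websites_match(holdings_url, glassdoor_url):
    """True when both sides resolve to the same non-empty domain."""
    left = normalize_domain(holdings_url)
    right = normalize_domain(glassdoor_url)
    return bool(left) and left == right


print(websites_match("https://www.apple.com/", "apple.com"))
print(websites_match("apple.com", "www.microsoft.com"))
```

Treating two empty values as a non-match keeps rows with missing websites from being counted as verified.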
