Program to scrape and store posted jobs in the United States from www.indeed.com
Collects the following information from the website:
- original id generated by Indeed;
- job title (job_title);
- posting date (job_date);
- location (job_loc);
- short description (job_summary);
- salary (or salary range) in a list format (job_salary);
- URL of the job (job_url);
- company name (company_name).
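For illustration, one scraped posting could be held in a record like the one below. This is only a sketch: the field names follow the list above, but the JobRecord class, the job_id name, and the parse_salary helper are hypothetical, and the salary text format ("$50,000 - $70,000 a year") is an assumption about what the site displays.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


# Hypothetical record type: the class and "job_id" are illustrative names,
# the remaining field names come from the list above.
@dataclass
class JobRecord:
    job_id: str
    job_title: str
    job_date: str
    job_loc: str
    job_summary: str
    job_url: str
    company_name: str
    job_salary: List[int] = field(default_factory=list)


def parse_salary(text: Optional[str]) -> List[int]:
    """Return every dollar amount found in `text` as an int.

    A single salary gives a one-item list, a range gives two items,
    and missing text gives an empty list.
    """
    if not text:
        return []
    amounts = re.findall(r"\$\s*(\d[\d,]*)", text)
    return [int(amount.replace(",", "")) for amount in amounts]


if __name__ == "__main__":
    job = JobRecord(
        job_id="abc123",
        job_title="Auditor",
        job_date="2022-06-20",
        job_loc="Austin, TX",
        job_summary="Perform internal audits.",
        job_url="https://www.indeed.com/",
        company_name="Example Corp",
        job_salary=parse_salary("$50,000 - $70,000 a year"),
    )
    print(job.job_salary)
```

Keeping the salary as a list lets a single value and a range share one column without a separate "is range" flag.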
- Install all required packages from requirements.txt.
$ pip install -r requirements.txt
- Assign search parameters in parameters.py: positions should be a list of strings with all position names or keywords for the search. Even if there is only one keyword, keep it in a list: positions = ["auditor"]
- Run app.py:
$ python3 app.py
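The reason positions must stay a list, even for one keyword, is that looping over a bare string would yield single characters instead of keywords. A minimal sketch of how the list might be turned into search URLs is shown below. The build_search_urls function, the BASE_URL value, and the "q" query key are assumptions made for this example, not taken from the project code.

```python
from urllib.parse import urlencode

# What parameters.py might contain, per the step above.
positions = ["auditor"]

# Assumed search endpoint and query key; check them against the live site.
BASE_URL = "https://www.indeed.com/jobs"


def build_search_urls(positions):
    """Return one search URL per position keyword (hypothetical helper)."""
    if isinstance(positions, str):
        # Iterating over a string would search for each character separately.
        raise TypeError('positions must be a list, e.g. ["auditor"]')
    urls = []
    for position in positions:
        query = urlencode({"q": position})  # spaces are encoded as "+"
        urls.append(BASE_URL + "?" + query)
    return urls


if __name__ == "__main__":
    for url in build_search_urls(positions):
        print(url)
```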
- Scraping jobs by the key parameters: search key-words
- Cleaning / formatting data.
- Each scraping session saves the results as a CSV data dump to the data_dumps/ folder.
- Each step of the scraping is logged to log.txt, and the outcomes are printed to the console.
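The last two steps above can be sketched as follows. This is not the project's dumping.py or logger.py: the function names, the timestamped file name, and the log format are assumptions, and the example uses the standard csv module instead of pandas so that it runs without extra packages.

```python
import csv
import logging
import sys
from datetime import datetime
from pathlib import Path


def get_logger(log_path="log.txt"):
    """Return a logger that writes to both a file and the console."""
    logger = logging.getLogger("scraper_sketch")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        formatter = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        handlers = (
            logging.FileHandler(log_path, encoding="utf-8"),
            logging.StreamHandler(sys.stdout),
        )
        for handler in handlers:
            handler.setFormatter(formatter)
            logger.addHandler(handler)
    return logger


def save_dump(rows, folder="data_dumps"):
    """Write rows (a list of dicts) to a timestamped CSV file and return its path."""
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed naming scheme: one file per session, named by start time.
    path = out_dir / f"dump_{datetime.now():%Y%m%d_%H%M%S}.csv"
    fieldnames = list(rows[0].keys()) if rows else []
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    log = get_logger()
    rows = [{"job_title": "Auditor", "company_name": "Example Corp"}]
    dump_path = save_dump(rows)
    log.info("Saved %d rows to %s", len(rows), dump_path)
```

A timestamp in the file name keeps earlier dumps from being overwritten when the scraper runs more than once.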
- app.py - entry point
- main.py - the main workflow of the program
- indeed_com_scraper.py - scraping functionality module
- dumping.py - data cleaning / formatting module, also saves data dumps
- logger.py - logging functionality
- parameters.py - scraping parameters, kept in a separate module for easy access
Additional:
- db_scheme.py or db_scheme.sql - for initial database setup
- requirements.txt - required Python packages
python 3
Packages:
pandas 1.4.2
requests 2.28.0
beautifulsoup4 4.11.1