Medium Scraper

medium-scraper (MS) is a scraper built with the Scrapy framework to scrape Medium posts.

🔧 How it works

MS consists of two Scrapy spiders: post_id and post.

First, the post_id spider looks for post_ids (the unique identifier of a post) inside the sitemap of the Medium website and stores the post_ids it finds in a SQLite database.

Then the second spider (post) takes the post_ids from the database and performs a request to obtain the data for each post. Data about a post falls into two groups: post metadata and paragraphs, which are stored in different tables inside the database.
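
To make the flow concrete, here is a minimal sketch of the post_id idea: fetch a sitemap and pull the trailing hex ID out of each post URL. The spider name, sitemap URL and ID extraction below are assumptions for illustration, not the actual MS code.

    import scrapy

    class PostIdSketch(scrapy.Spider):
        name = "post_id_sketch"
        # Assumed entry point: Medium's sitemaps are nested by date,
        # so a complete spider would follow the sub-sitemaps first.
        start_urls = ["https://medium.com/sitemap/sitemap.xml"]

        def parse(self, response):
            # Strip XML namespaces so plain //loc queries work
            response.selector.remove_namespaces()
            for url in response.xpath("//loc/text()").getall():
                # Post URLs end with the hex post_id,
                # e.g. .../some-title-316d066db3d6
                yield {"post_id": url.rsplit("-", 1)[-1]}

Running it with scrapy runspider yields items like {"post_id": "316d066db3d6"}; MS instead persists them to the SQLite database described below.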

📚 Database structure

All scraped data are stored in a SQLite file (.db). To create a new database file, duplicate example.db and rename the copy to medium.db (medium.db is the default name for the database), e.g. cp example.db medium.db. If you want a graphical interface to interact with the database, I suggest DB Browser for SQLite.

As I said before, the database consists of two tables: post and paragraph.

post

post_id      | available | creator_id   | language | first_published_at | title    | word_count | claps | tags
316d066db3d6 | 1         | 245c7224d0ce | en       | 1577865630099      | Intro... | 4231       | 341   | cow,dog
5edbf9af44af | 0         | NULL         | NULL     | NULL               | NULL     | NULL       | NULL  | NULL
fec8331faa9d | NULL      | NULL         | NULL     | NULL               | NULL     | NULL       | NULL  | NULL
...          | ...       | ...          | ...      | ...                | ...      | ...        | ...   | ...
  • post_id
    • a unique identifier for the post
  • available
    • NULL the post spider never tried to scrape this post_id
    • 1 the post spider scraped this post_id successfully
    • 0 the post spider failed to scrape this post_id
  • creator_id
    • a unique identifier for the creator of the post
  • language
    • the language of the content of the post (detected by Medium)
  • first_published_at
    • timestamp (in milliseconds) of the post's first publication
  • title
    • the title of the post (not necessarily unique)
  • word_count
    • the number of words contained in the post content
  • claps
    • the total number of claps (on medium.com claps == likes)
  • tags
    • the tags related to the post (comma-separated)

paragraph

post_id      | index | name | type | text
316d066db3d6 | 0     | 6f86 | 3    | One important thing productful ...
316d066db3d6 | 1     | eabd | 1    | Quality ≠ Money
3526667dacfb | 0     | 94db | 1    | Income in Development ...
...          | ...   | ...  | ...  | ...
  • post_id
    • a unique identifier for the post
  • index
    • the position of the paragraph within the post (starting from 0)
  • name
    • a unique identifier for the paragraph (within its post)
  • type
    • 1 normal
    • 3 big bold header
    • 6 quote
    • 7 bigger quote, centered
    • 9 bullet list
    • 10 ordered list
    • 13 small bold header
  • text
    • the text inside the paragraph

Information about italics, bold, code and links is stored in the markup list, which is currently not scraped by MS.
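
To see how the two tables fit together, here is a sketch that rebuilds the text of one post by joining them with Python's sqlite3 module, assuming the default medium.db and the columns described above:

    import sqlite3

    con = sqlite3.connect("medium.db")
    post_id = "316d066db3d6"  # example post_id from the tables above

    # Assumes the post was scraped successfully (available = 1)
    (title,) = con.execute(
        "SELECT title FROM post WHERE post_id = ?", (post_id,)
    ).fetchone()

    # "index" is double-quoted because it is a SQL keyword
    body = "\n\n".join(
        text
        for (text,) in con.execute(
            'SELECT text FROM paragraph WHERE post_id = ? ORDER BY "index"',
            (post_id,),
        )
    )

    print(title, body, sep="\n\n")
    con.close()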

⬇️ Installation

  1. Clone this repo: git clone https://github.com/S1M0N38/medium-scraper.git
  2. Move inside the cloned repo: cd medium-scraper
  3. Install dependencies with pipenv: pipenv install
  4. Enter the virtualenv: pipenv shell
  5. Check the installation: scrapy version

⚡ Usage

First you need a .db file where the data will be stored (see Database Structure above). Then make sure you are at the root level of the medium-scraper repo and activate the virtualenv with pipenv shell.

post_id spider

  • Description: this spider populates the post_id column of the post table

  • Arguments: if no argument is provided, this spider scrapes the whole site starting from the year Medium was founded and saves all the data in the medium.db file. With spider arguments (-a) you can specify year, month and day. With the settings argument (-s) you can specify the name of the SQLite database. Of course, you have to create that .db first (e.g. cp example.db another_database.db)

  • Examples

    • scrapy crawl post_id scrape the post_ids of the whole website (not recommended)
    • scrapy crawl post_id -a year=2020 scrape the post_ids of posts published in 2020
    • scrapy crawl post_id -a year=2020 -a month=01 scrape the post_ids of posts published in Jan 2020
    • scrapy crawl post_id -a year=2020 -a month=01 -a day=01 scrape the post_ids of posts published on the 1st of Jan 2020
    • scrapy crawl post_id -a year=2020 -a month=01 -s DB=another_database.db scrape the post_ids of posts published in Jan 2020 and save them to another_database.db

post spider

  • Description: looks in the database for post_ids whose available is NULL and collects further information, which is saved in the post and paragraph tables (a progress-check sketch follows the examples below)

  • Arguments: you can specify which database to store the data in, via the -s DB setting

  • Examples

    • scrapy crawl post scrape post data and save it to medium.db
    • scrapy crawl post -s DB=another_database.db scrape post data and save it to another_database.db
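
Since the post spider only picks up rows whose available is still NULL, a quick query shows how far a scrape has progressed; a sketch, again assuming the default medium.db:

    import sqlite3

    con = sqlite3.connect("medium.db")
    for available, n in con.execute(
        "SELECT available, COUNT(*) FROM post GROUP BY available"
    ):
        # NULL = never tried, 1 = scraped, 0 = failed
        print(available, n)
    con.close()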
