Domain Spell Checker

The Domain-specific Spell Checker consists of three modules: the Web Scraping module, the Text Processing module, and the Spell Checker tool.

Web Scraping module

The Web Scraping module accesses and downloads papers hosted on the bioRxiv site. The user enters the number of papers to download and the location where the papers should be saved.

Text processing module

This module builds a word corpus from the downloaded PDFs. The user enters the location where the papers are stored, the number of papers to parse, and the location where the corpus should be built.
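As a rough illustration of the corpus-building step, the sketch below counts word frequencies across already-extracted text. It is not the project's actual code: the names `CorpusSketch`, `tokenize`, and `buildCorpus` are hypothetical, and it assumes the PDF text has already been extracted into plain strings (the project uses Apache PDFBox for that step).

```scala
object CorpusSketch {
  // Split text into lower-case alphabetic tokens.
  // Digits, whitespace, and punctuation all act as separators (an assumption of this sketch).
  def tokenize(text: String): List[String] =
    "[a-z]+".r.findAllIn(text.toLowerCase).toList

  // Count how often each word occurs across all documents.
  def buildCorpus(documents: Seq[String]): Map[String, Int] =
    documents
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

A frequency map like this is a convenient corpus format because the spell checker can use the counts to rank suggestions.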

Spell checker tool

A Scala implementation of Peter Norvig's spelling-correction algorithm. The tool takes a word as input and checks whether it is spelled correctly. If the word is misspelled, it returns a set of possible suggestions to the user.
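For readers unfamiliar with the approach, here is a minimal sketch of a Norvig-style corrector in Scala. It is an illustration under stated assumptions, not the code in this repository: the names `SpellSketch`, `edits1`, `known`, and `suggestions` are hypothetical, the corpus is assumed to be a word-to-frequency map, and only the letters a to z are considered.

```scala
object SpellSketch {
  private val alphabet = 'a' to 'z'

  // All strings one edit away from `word`: deletes, transposes, replaces, inserts.
  def edits1(word: String): Set[String] = {
    val splits = (0 to word.length).map(i => (word.take(i), word.drop(i)))
    val deletes = for ((l, r) <- splits if r.nonEmpty) yield l + r.tail
    val transposes = for ((l, r) <- splits if r.length > 1) yield l + r(1) + r(0) + r.drop(2)
    val replaces = for ((l, r) <- splits if r.nonEmpty; c <- alphabet) yield l + c + r.tail
    val inserts = for ((l, r) <- splits; c <- alphabet) yield l + c + r
    (deletes ++ transposes ++ replaces ++ inserts).toSet
  }

  // Keep only the candidate words that appear in the corpus.
  def known(words: Set[String], corpus: Map[String, Int]): Set[String] =
    words.filter(corpus.contains)

  // A word found in the corpus is treated as correctly spelled and returned as-is.
  // Otherwise, return known words at edit distance 1, falling back to distance 2,
  // ordered by descending corpus frequency (ties broken alphabetically).
  def suggestions(word: String, corpus: Map[String, Int], limit: Int = 5): Seq[String] = {
    val w = word.toLowerCase
    if (corpus.contains(w)) Seq(w)
    else {
      val distance1 = known(edits1(w), corpus)
      val candidates =
        if (distance1.nonEmpty) distance1
        else known(edits1(w).flatMap(edits1), corpus)
      candidates.toSeq.sortBy(c => (-corpus(c), c)).take(limit)
    }
  }
}
```

With a corpus such as `Map("protein" -> 10, "receptor" -> 5)`, calling `SpellSketch.suggestions("protien", corpus)` yields `protein`, because swapping the adjacent letters `i` and `e` is a single transposition.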

Setting up the project

This project uses Scala version 2.12.8 and sbt version 1.3.8. It also depends on jsoup, Apache PDFBox, Apache HttpComponents, ScalaTest, and Log4j for logging. These dependencies are declared in the build.sbt file.

To set up the project, clone the master branch to your local machine and run the following commands inside the project directory.

sbt compile

sbt assembly

sbt run

On running "sbt run", the main classes in the project are listed, and you choose one by its option number.

To perform web scraping, choose option 2. Enter the number of papers to download and the file location to save the papers.

To parse PDFs and build the corpus, choose option 1. Enter the location where the papers are stored, the number of papers to parse, and the location where the corpus should be built.

To use the Spell Checker functionality, choose option 3. Enter the word whose spelling you want to check, or enter Q to exit the tool.

prnan4/domain-spell-checker
