This application extracts pairs of similar documents from a corpus. Documents containing similar text are marked as a plagiarized pair. Locality-sensitive hashing (LSH) is used to find similar pairs, with one of three distance measures: Jaccard distance, Cosine distance, or Hamming distance. For each distance measure, a set of predicted plagiarized pairs is returned.
- Returns pairs of plagiarised or similar documents, where each document is an answer to a question in our corpus.
- Three different measures (Jaccard distance, Cosine distance, Hamming distance) can be used to find similar documents.
- Reports the precision and the number of correct documents returned for each distance measure.
- The signature matrix needs to be generated only once for each distance measure.
- Fully documented code.
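
To illustrate the idea behind the Jaccard signature matrix, here is a minimal MinHash sketch. It is not the code in this repo: the function names, the character-level shingle size `k`, the number of hash functions, and the choice of `zlib.crc32` as the shingle hash are all assumptions made for the example. The fraction of positions where two signatures agree is an estimate of the Jaccard similarity of the underlying shingle sets.

```python
import random
import zlib

# Smallest prime above 2**32, so every crc32 value fits below the modulus.
PRIME = 4294967311


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalised text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def make_hash_params(n, seed=0):
    """Draw n (a, b) pairs for hash functions h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(n)]


def minhash_signature(shingle_set, hash_params):
    """Signature = minimum hash value over the set, one entry per hash function."""
    ids = [zlib.crc32(s.encode("utf-8")) for s in shingle_set]
    return [min((a * x + b) % PRIME for x in ids) for a, b in hash_params]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions on which the two documents agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    params = make_hash_params(100)
    doc_a = shingles("Locality sensitive hashing finds similar documents quickly.")
    doc_b = shingles("Locality sensitive hashing finds similar texts quickly.")
    sig_a = minhash_signature(doc_a, params)
    sig_b = minhash_signature(doc_b, params)
    print("exact:", jaccard(doc_a, doc_b))
    print("estimate:", estimated_jaccard(sig_a, sig_b))
```

More hash functions give a tighter estimate at the cost of a larger signature matrix. Jaccard distance is one minus the similarity computed here.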
- Clone this repo / click "Download as Zip" and extract the files.
- Ensure Python 3.7 is installed and in your system `PATH`.
- Install pipenv using `pip install -U pipenv`.
- In the project folder, run `pipenv install` to install all Python dependencies.
- Generate the shingle-document matrix by running `pipenv run python matrix.py`. The matrix will be stored in `shingles_matrix.csv`.
- To create the signature matrix:
  - Jaccard distance: `pipenv run python jaccard_sig.py`. The signature matrix is stored in `jaccard_signatures.csv`.
  - Cosine distance: `pipenv run python cosine_sig.py`. The signature matrix is stored in `cosine_signatures.csv`.
  - Hamming distance: `pipenv run python hamming_sig.py`. The signature matrix is stored in `hamming_signatures.csv`.
- To run the LSH algorithm: `pipenv run python main.py`.
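
For readers curious about what the LSH step does with a signature matrix, the following is a rough sketch of the standard banding technique, together with a precision calculation. It is an illustration only and is not taken from `main.py`: the function names, the in-memory `signatures` dictionary, and the band and row counts are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, bands, rows):
    """Return document pairs whose signatures agree on at least one full band.

    signatures: dict mapping a document id to a list of bands * rows values.
    """
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            # Documents with an identical slice in this band share a bucket.
            key = tuple(sig[band * rows:(band + 1) * rows])
            buckets[key].append(doc_id)
        for docs in buckets.values():
            for a, b in combinations(sorted(docs), 2):
                candidates.add((a, b))
    return candidates


def precision(predicted, actual):
    """Share of predicted pairs that are also in the set of true pairs."""
    if not predicted:
        return 0.0
    return len(predicted & actual) / len(predicted)


if __name__ == "__main__":
    # Toy signatures (hypothetical values): "a" and "b" match on the first band.
    signatures = {
        "a": [1, 2, 3, 4],
        "b": [1, 2, 9, 9],
        "c": [7, 8, 3, 5],
    }
    pairs = lsh_candidate_pairs(signatures, bands=2, rows=2)
    print(pairs)
    print(precision(pairs, {("a", "b")}))
```

Using more bands with fewer rows each makes the filter more permissive (more candidate pairs, lower precision), while fewer, wider bands make it stricter.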