The motivation for building this system is to provide AI-powered, data-driven prediction assistance that helps judicial practitioners make better decisions. To address the enormous backlog of pending cases, we apply modern ML and AI techniques to make the process more efficient. The CJPR system gives legal practitioners better insight into a case by predicting its outcome and surfacing relevant historical cases, assisting them in reaching a better result.
- Features
- Frameworks and Libraries
- Datasets
- Prerequisites
- Recommendations
- Directory Tree
- Installation & Running
- Results
- Wandb
- Prediction of Court Petitions: CJPR predicts the outcome of a court petition based on the given case description (see the sketch after this list).
- Recommendation on Acceptance: if a petition is predicted to be accepted, CJPR recommends similar historical cases based on the given case description.
- Easy to Access: the system is packaged with Docker and pushed to Docker Hub for easy access. Anyone can use it by pulling the Docker image from Docker Hub and running it on their local machine.
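As an illustration of the prediction feature, here is a minimal sketch using the Transformers library; the checkpoint path and the label mapping (1 = Accepted) are assumptions for illustration, not the exact interface of this repo:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical path to a fine-tuned checkpoint; substitute your own.
MODEL_DIR = "./Transformers-GPU/bert"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_DIR)

def predict_petition(case_text: str) -> str:
    """Return 'Accepted' or 'Rejected' for a case description."""
    inputs = tokenizer(case_text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumes label 1 = Accepted and 0 = Rejected, as in the dataset.
    return "Accepted" if logits.argmax(dim=-1).item() == 1 else "Rejected"
```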
- Hugging Face: Hugging Face is an NLP-focused startup with a large open-source community, in particular around the Transformers library.
- Scikit-learn: Simple and efficient tools for predictive data analysis.
- TensorFlow / Keras: Deep learning framework used to build and train our models.
- PyTorch: Deep learning framework used to build and train our models.
- Numpy: NumPy is a Python library used for working with arrays.
- Pandas: Pandas is a Python library used for working with data sets.
- Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python.
- Beautiful Soup: Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- Docker: Docker is a set of platform as a service products that use OS-level virtualization to deliver software in packages called containers.
The dataset used for this project is the ILDC Large dataset, which contains 54,000 court cases from the Supreme Court of India, scraped from the Indian Kanoon website. ILDC Large only contains data from the Supreme Court of India. The dataset contains the following columns:

- ID: Unique ID for each case
- Text: Petition text of the case
- Decision: Decision of the case (1: Accepted, 0: Rejected)
- Label: Label of the case (1: Criminal, 0: Civil)
- Year: Year of the case
The dataset is distributed as follows:

| Dataset | No. of Cases | Percentage | Purpose |
|---|---|---|---|
| Train | 34,655 | 64% | Training the model |
| Validation | 10,830 | 20% | Validating the model |
| Test | 8,664 | 16% | Testing the model |
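A split with these proportions can be reproduced with a simple stratified split; a minimal sketch, assuming the dataset is available as a single CSV with the columns listed above (the file name ILDC_large.csv is illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical file name; adjust to the actual ILDC Large file.
df = pd.read_csv("ILDC_large.csv")  # columns: ID, Text, Decision, Label, Year

# 64/20/16 split: carve out 16% for test, then 20% of the total for validation.
train_val, test = train_test_split(df, test_size=0.16, random_state=42,
                                   stratify=df["Decision"])
train, val = train_test_split(train_val, test_size=0.20 / 0.84, random_state=42,
                              stratify=train_val["Decision"])
print(len(train), len(val), len(test))
```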
Data preprocessing is an important step that improves the quality and consistency of the raw legal text scraped from indiankanoon.org. The first step removes non-breaking spaces. The data is then split into lines, and each line goes through sentence-level cleaning to strip characters that are not needed. Abbreviations are then expanded to their full forms, and formatting problems are fixed. The main content is extracted, and validity checks ensure the data is correct. The final preprocessed data gives the subsequent analyses a clean and solid base. All these steps are implemented with Python regular expressions.
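A minimal sketch of this kind of regex-based cleaning; the patterns and the abbreviation map below are illustrative assumptions, not the exact rules used in this repo:

```python
import re

# Illustrative abbreviation map; the repo's actual list may differ.
ABBREVIATIONS = {r"\bu/s\b": "under section", r"\bHon'ble\b": "Honourable"}

def clean_petition(raw: str) -> str:
    text = raw.replace("\xa0", " ")  # remove non-breaking spaces
    lines = []
    for line in text.splitlines():
        # Sentence-level cleaning: strip unwanted characters, collapse whitespace.
        line = re.sub(r"[^\w\s.,;:()'\"-]", " ", line)
        line = re.sub(r"\s{2,}", " ", line).strip()
        if line:
            lines.append(line)
    text = " ".join(lines)
    for pattern, expansion in ABBREVIATIONS.items():  # expand abbreviations
        text = re.sub(pattern, expansion, text, flags=re.IGNORECASE)
    return text
```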
All the prerequisites are listed in the code files themselves. However, since the application is deployed on Docker, you only need to pull the Docker image from Docker Hub and run it on your local machine. The image is available on Docker Hub as chagantireddy/cjpr:latest; its dependencies are listed in the requirements.txt file.
The recommendations are provided based on the cosine similarity between the given case description and the historical cases. The cosine similarity is calculated as

$$ \text{Cosine Similarity}(A, B) = \frac{\sum_{i=1}^{n} A_i \, B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} $$

where $A$ and $B$ are the vector representations of the given case and a historical case, and $n$ is their dimension.
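A minimal sketch of the recommendation step; TF-IDF vectors are an assumption here (the deployed system may encode cases differently):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_text: str, historical_texts: list[str], top_k: int = 5):
    """Return indices of the top_k historical cases most similar to the query."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(historical_texts + [query_text])
    # Last row is the query; compare it against every historical case.
    sims = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return sims.argsort()[::-1][:top_k]
```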
βββ assets
βββ CJPR_docker
βββ Classical
βΒ Β βββ Logistic
βΒ Β βββ Random_Forest
βΒ Β βββ XGBOOST
βββ Papers
βββ test_cases
βββ TPU
βΒ Β βββ albert
βΒ Β βββ bert
βΒ Β βββ deberta
βΒ Β βββ distilbert
βΒ Β βββ roberta
βΒ Β βββ xlnet
βββ Transformers-GPU
βββ albert
βββ bert
βββ deberta
βββ distilbert
βββ roberta
βββ xlnet
23 directories
- Pull the docker image from Docker Hub.
$ docker pull chagantireddy/cjpr:latest
- All the instructions for running the docker image are also available on Docker Hub for reference, but they are repeated below.
- If you are running the image for the first time, run the following command to create a container from the image.
$ docker run -it --name CJPR <IMAGE_ID>
- Get the Image ID using the command below, look for chagantireddy/cjpr, and copy the IMAGE ID.
$ docker images
- If you have already created a container from the image, copy the test data from the test_cases directory into the container by running the following command.
$ docker cp <file_path> CJPR:/app/test
- Now run the following commands to start the application.
$ docker start CJPR
$ docker attach CJPR
- Your output is stored in the container itself. You can copy it to your local machine by running the following command.
$ docker cp CJPR:/app/recommanded_petitions <output_path>
A screenshot of the running application:
The CJPR system can predict and recommend for test cases that the model was not trained on. The results are shown below:
You can copy the results to your local machine by running the command given above.
.
├── Petition0.txt
├── Petition1.txt
├── Petition2.txt
├── Petition3.txt
├── Petition4.txt
├── Petition5.txt
├── Petition6.txt
├── Petition7.txt
├── Petition8.txt
├── Petition9.txt
└── result_table.csv

0 directories, 11 files
The results of the model are stored in Weights & Biases (wandb) for better visualization and tracking; we chose wandb for its monitoring and experiment-tracking features.
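A minimal sketch of this kind of wandb logging; the project name, run name, and metric values below are placeholders:

```python
import wandb

NUM_EPOCHS = 3

# Hypothetical project and run names; replace with your own.
run = wandb.init(project="cjpr", name="bert-baseline")
for epoch in range(NUM_EPOCHS):
    # Placeholder metrics; in practice these come from the training loop.
    train_loss, val_acc = 0.5 / (epoch + 1), 0.60 + 0.05 * epoch
    wandb.log({"epoch": epoch, "train_loss": train_loss, "val_accuracy": val_acc})
run.finish()
```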
Feel free to email me with any doubts or queries (Mail to Me).
- Make the system more robust by adding more historical cases.
- Use machine-learning-based encoding techniques to encode the case description.
- Make the system more efficient by adding more models.
- Make the system more user-friendly by adding a GUI.
Apache-2.0 Β© Chaganti Reddy