Web Crawl is a web application for scraping web pages and using Retrieval-Augmented Generation (RAG) to answer questions based on the scraped content. It is built with Streamlit and uses OpenAI's language models for text generation. Its main features are:
- Scrape web pages in parallel
- Store scraped content in a knowledge base
- Perform similarity search on the stored content (illustrated in the sketch below)
- Use OpenAI's language models to answer questions based on the scraped content
- Configuration options for chunk size, overlap, and crawling depth
- View and manage scraped JSON data and FAISS vector stores
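Together, these features form a scrape → chunk → embed → retrieve pipeline. As a rough illustration of how the knowledge base and similarity search fit together, the sketch below builds a FAISS index from scraped page text. It is an assumption-laden sketch: the `build_knowledge_base` helper, the default chunk sizes, and the LangChain-style wrappers around OpenAI embeddings and FAISS are illustrative and may not match this project's actual code.

```python
# Illustrative sketch only: helper name, defaults, and the LangChain-style
# stack are assumptions and may differ from the project's implementation.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter


def build_knowledge_base(pages, chunk_size=1000, chunk_overlap=200):
    """Split scraped page text into overlapping chunks and index them in FAISS."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = []
    for page in pages:
        chunks.extend(splitter.split_text(page))
    # Embed each chunk with OpenAI embeddings and store the vectors in FAISS
    return FAISS.from_texts(chunks, OpenAIEmbeddings())


# Example: index scraped pages, then retrieve chunks similar to a query
# kb = build_knowledge_base(scraped_pages)
# docs = kb.similarity_search("What does the site say about pricing?", k=4)
```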
To install and run the app:

- Clone the repository:

  ```bash
  git clone https://github.com/mhadeli/web-crawler.git
  cd deep-crawl
  ```
- Create a virtual environment and install the dependencies:

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows, use env\Scripts\activate
  pip install -r requirements.txt
  ```
- Set up the configuration by editing the `settings.json` file or using the settings page in the Streamlit app.
- Run the Streamlit app:

  ```bash
  streamlit run chat.py
  ```
- Enter your OpenAI API key in the sidebar.
- Enter the URLs to scrape in the sidebar and click "Scrape and Add to Knowledge Base".
- Ask questions about the scraped content using the chat interface (one possible answer flow is sketched below).
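Behind the chat interface, a typical RAG answer step retrieves the `top_k` most similar chunks and passes them to the configured OpenAI model. The sketch below shows one plausible shape of that step; the `answer_question` helper, the prompt wording, and the reuse of a FAISS store `kb` (as in the earlier sketch) are assumptions rather than the project's actual code.

```python
# Illustrative sketch only: function name, prompt text, and the `kb` store
# are assumptions; the app's real answer flow may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_question(kb, question, model="gpt-3.5-turbo", top_k=4):
    # Retrieve the top_k most similar chunks from the vector store
    docs = kb.similarity_search(question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Ask the chat model to answer using only the retrieved context
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```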
The configuration options can be set in `settings.json` or through the Streamlit settings page (an example file follows the list below):

- `model`: The OpenAI model to use (e.g., "gpt-3.5-turbo", "gpt-4o")
- `top_k`: The number of similar documents to retrieve
- `chunk_size`: The size of text chunks for processing
- `chunk_overlap`: The overlap between text chunks
- `min_content_length`: The minimum length of HTML content to consider
- `max_depth`: The maximum crawling depth
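For reference, a `settings.json` using these keys might look like the example below; the specific values are illustrative defaults, not the project's shipped configuration.

```json
{
  "model": "gpt-3.5-turbo",
  "top_k": 4,
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "min_content_length": 100,
  "max_depth": 2
}
```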
The repository contains the following files:

- `chat.py`: Main Streamlit app script
- `crawler.py`: Script for scraping web pages
- `settings.py`: Script for configuring the settings
- `knowledge_base.py`: Script for managing the knowledge base
- `settings.json`: JSON file for storing configuration settings
This project is licensed under the MIT License.
Contributions are welcome! Please open an issue or submit a pull request.