Web Crawl is a web application for scraping web pages and using Retrieval-Augmented Generation (RAG) to answer questions based on the scraped content. It is built with Streamlit and uses OpenAI's language models for text generation. Its main features are:
- Scrape web pages in parallel
- Store scraped content in a knowledge base
- Perform similarity search on the stored content (illustrated in the sketch below)
- Use OpenAI's language models to answer questions based on the scraped content
- Configuration options for chunk size, overlap, and crawling depth
- View and manage scraped JSON data and FAISS vector stores
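Together, these features form a scrape → chunk → embed → retrieve pipeline. As a rough illustration of how the knowledge base and similarity search fit together, the sketch below builds a FAISS index from scraped page text. It is an assumption-laden sketch: the `build_knowledge_base` helper, the default chunk sizes, and the LangChain-style wrappers around OpenAI embeddings and FAISS are illustrative and may not match this project's actual code.

```python
# Illustrative sketch only: helper name, defaults, and the LangChain-style
# stack are assumptions and may differ from the project's implementation.
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter


def build_knowledge_base(pages, chunk_size=1000, chunk_overlap=200):
    """Split scraped page text into overlapping chunks and index them in FAISS."""
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    chunks = []
    for page in pages:
        chunks.extend(splitter.split_text(page))
    # Embed each chunk with OpenAI embeddings and store the vectors in FAISS
    return FAISS.from_texts(chunks, OpenAIEmbeddings())


# Example: index scraped pages, then retrieve chunks similar to a query
# kb = build_knowledge_base(scraped_pages)
# docs = kb.similarity_search("What does the site say about pricing?", k=4)
```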
To install and run the app:

- Clone the repository:

  ```bash
  git clone https://github.com/mhadeli/web-crawler.git
  cd deep-crawl
  ```
- Create a virtual environment and install the dependencies:

  ```bash
  python -m venv env
  source env/bin/activate  # On Windows, use env\Scripts\activate
  pip install -r requirements.txt
  ```
- Set up the configuration by editing the `settings.json` file or using the settings page in the Streamlit app.
- Run the Streamlit app:

  ```bash
  streamlit run chat.py
  ```
- Enter your OpenAI API key in the sidebar.
- Enter the URLs to scrape in the sidebar and click "Scrape and Add to Knowledge Base".
- Ask questions about the scraped content using the chat interface (one possible answer flow is sketched below).
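Behind the chat interface, a typical RAG answer step retrieves the `top_k` most similar chunks and passes them to the configured OpenAI model. The sketch below shows one plausible shape of that step; the `answer_question` helper, the prompt wording, and the reuse of a FAISS store `kb` (as in the earlier sketch) are assumptions rather than the project's actual code.

```python
# Illustrative sketch only: function name, prompt text, and the `kb` store
# are assumptions; the app's real answer flow may differ.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def answer_question(kb, question, model="gpt-3.5-turbo", top_k=4):
    # Retrieve the top_k most similar chunks from the vector store
    docs = kb.similarity_search(question, k=top_k)
    context = "\n\n".join(doc.page_content for doc in docs)
    # Ask the chat model to answer using only the retrieved context
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```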
The configuration options can be set in `settings.json` or through the Streamlit settings page (an example file follows the list below):

- `model`: The OpenAI model to use (e.g., "gpt-3.5-turbo", "gpt-4o")
- `top_k`: The number of similar documents to retrieve
- `chunk_size`: The size of text chunks for processing
- `chunk_overlap`: The overlap between text chunks
- `min_content_length`: The minimum length of HTML content to consider
- `max_depth`: The maximum crawling depth
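For reference, a `settings.json` using these keys might look like the example below; the specific values are illustrative defaults, not the project's shipped configuration.

```json
{
  "model": "gpt-3.5-turbo",
  "top_k": 4,
  "chunk_size": 1000,
  "chunk_overlap": 200,
  "min_content_length": 100,
  "max_depth": 2
}
```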
The repository contains the following files:

- `chat.py`: Main Streamlit app script
- `crawler.py`: Script for scraping web pages
- `settings.py`: Script for configuring the settings
- `knowledge_base.py`: Script for managing the knowledge base
- `settings.json`: JSON file for storing configuration settings
This project is licensed under the MIT License.
Contributions are welcome! Please open an issue or submit a pull request.