NetExtract

NetExtract extracts the core content of a webpage and converts it into clean, LLM-friendly text. Built with Express.js, TypeScript, and Puppeteer, it exposes a simple API for content extraction and transformation, which makes it useful for feeding up-to-date web information into LLM and RAG systems and for API-based web scraping.

Features

Core Content Extraction: Extracts the main content from a given URL.
Markdown Conversion: Converts webpage content into clean, well-formatted Markdown.
Social Media Scraping: Scrapes and formats X (Twitter) posts.
Simple API Integration: Integrates with existing systems through a simple HTTP API.
LLM-Powered Conversion: Uses open-source large language models to improve the quality of the extracted and converted output.

📖 Usage

To use NetExtract, pass the target URL to the API endpoint as the url query parameter:

http://{your_address}/api?url={url}
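
The request above can be sketched in TypeScript as follows. This is a minimal sketch, not part of the project: the helper names `buildExtractUrl` and `extractPage` are hypothetical, the base address is a placeholder for wherever your instance runs, and reading the response as plain text is an assumption about the response format. The target URL is percent-encoded so that its own query string is not mistaken for parameters of the API call.

```typescript
// Hypothetical helper: builds the request URL for a NetExtract instance.
// `baseAddress` stands in for {your_address}, including the scheme and port.
function buildExtractUrl(baseAddress: string, targetUrl: string): string {
  const endpoint = new URL("/api", baseAddress);
  // searchParams.set percent-encodes the value, so "?" and "&" inside
  // the target URL stay part of the `url` parameter.
  endpoint.searchParams.set("url", targetUrl);
  return endpoint.toString();
}

// Hypothetical helper: requests the extracted content.
// Assumption: the body is read as plain text; adjust if your instance
// returns a different format.
async function extractPage(baseAddress: string, targetUrl: string): Promise<string> {
  const response = await fetch(buildExtractUrl(baseAddress, targetUrl));
  if (!response.ok) {
    throw new Error(`NetExtract request failed with status ${response.status}`);
  }
  return response.text();
}
```

For example, `extractPage("http://localhost:3000", "https://example.com/post?id=1")` would request the page through a locally running instance; the host and port here are placeholders, not values taken from the project configuration.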

🗂️ Getting started with Docker

Clone the repository:

git clone https://github.com/sabber-slt/NetExtract
cd NetExtract

Then run the application with Docker:

docker compose up -d

⚡️ Acknowledgments

Inspired by jina.ai
Built with Node.js, Express.js, TypeScript, and Puppeteer

🧩 Structure

.
├── cookie
│   └── twitter.json            # Twitter cookie for X (Twitter) post scraping
├── docs                        # Documentation files
├── search                      # Searxng engine
├── src                         # Source code
│   ├── interfaces              # TypeScript interfaces
│   ├── lib                     # Utility libraries
│   ├── routes                  # Express route handlers
│   ├── services                # Core service layer for business logic
│   ├── utils                   # Helper functions and utilities
│   └── app.ts                  # Main application entry point
├── .env                        # Environment variables
├── .gitignore                  # Git ignored files
├── .prettierignore             # Prettier ignored files
├── .prettierrc.js              # Prettier configuration
├── app.log                     # Log file
├── Dockerfile                  # Dockerfile
├── docker-compose.yml          # Docker Compose configuration
├── package.json                # Node.js project metadata
├── README.md                   # Project README
├── tsconfig.json               # TypeScript configuration
└── yarn.lock                   # Yarn lockfile for dependency management

🤝 Contributing

I welcome and appreciate contributions! Feel free to open issues, fork the repository, and send pull requests.
