Freespoke/swan (forked from thatguystone/swan)

Swan

An implementation of the Goose HTML Content / Article Extractor algorithm in Go.

Swan extracts cleaned-up text and HTML content from almost any webpage by removing the extra junk that so many pages carry these days.

Check out the Go documentation page for full usage and examples.
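
To give a rough feel for what "removing the extra junk" means, here is a toy sketch of the kind of heuristic Goose-style extractors are generally described as using: candidate blocks of text are scored by how many stopwords they contain, and link-heavy blocks are penalized. This is not Swan's code or API. The `Block` type, the `Score` and `BestBlock` functions, the tiny stopword list, and the scoring formula are all assumptions made up for this illustration, and it uses only the standard library.

```go
package main

import (
	"fmt"
	"strings"
)

// stopwords is a tiny, hypothetical English stopword list for this sketch only.
var stopwords = map[string]bool{
	"the": true, "a": true, "an": true, "and": true, "of": true, "to": true,
	"in": true, "is": true, "that": true, "it": true, "for": true, "on": true,
	"with": true, "as": true, "was": true, "by": true,
}

// Block is a hypothetical candidate chunk of page text,
// for example the text content of one <p> or <div>.
type Block struct {
	Text      string
	LinkWords int // number of words in Text that sit inside links
}

// Score counts stopwords in the block and scales the count down by the
// fraction of words that are link text. Prose tends to score high;
// menus and share bars tend to score near zero.
func Score(b Block) float64 {
	words := strings.Fields(strings.ToLower(b.Text))
	if len(words) == 0 {
		return 0
	}
	stops := 0
	for _, w := range words {
		// Strip surrounding punctuation before the lookup.
		if stopwords[strings.Trim(w, ".,;:!?\"'()")] {
			stops++
		}
	}
	linkDensity := float64(b.LinkWords) / float64(len(words))
	if linkDensity > 1 {
		linkDensity = 1
	}
	return float64(stops) * (1 - linkDensity)
}

// BestBlock returns the index of the highest-scoring block,
// or -1 if no block scores above zero.
func BestBlock(blocks []Block) int {
	best, bestScore := -1, 0.0
	for i, b := range blocks {
		if s := Score(b); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	blocks := []Block{
		{Text: "Home News Sports Weather Login", LinkWords: 5},
		{Text: "The council voted on Tuesday to approve the budget, and it is expected to take effect in the spring.", LinkWords: 0},
		{Text: "Share this on Twitter", LinkWords: 4},
	}
	if i := BestBlock(blocks); i >= 0 {
		fmt.Println(blocks[i].Text)
	}
}
```

A real extractor works on the parsed DOM, propagates scores to parent nodes, and cleans the chosen subtree afterwards; the sketch only shows why navigation bars and share widgets tend to lose to article prose under this kind of scoring.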


Features

  • Main content extraction from almost any source
  • Extract HTML content with images
  • Get article metadata, publish dates, and a lot more
  • Recognize different content types and apply special extractions (currently only recognizes comic sites and normal sites)

Planned

  • Inline videos into HTML content when found in an article
  • Recognize news sources and extract corresponding video / audio content
  • Recognize and extract more types of content
  • An interesting idea: buriy/python-readability#57 (comment)
