This file contains library functions and commands useful for retrieving web page content and processing it into Org-mode content.
For example, you can copy a URL to the clipboard or kill-ring, then run a command that downloads the page, isolates the “readable” content with eww-readable, converts it to Org-mode content with Pandoc, and displays it in an Org-mode buffer. Another command does all of that but inserts it as an Org entry instead of displaying it in a new buffer.
- Emacs 27.1 or later.
- Commands that process HTML into Org require Pandoc. Note: The output of current Pandoc versions differs substantially from that of the older versions that may still ship in stable Linux distros. If you encounter any issues, please install a more recent version of Pandoc.
If you installed from MELPA, just run one of the commands below. If you want to use any of the functions in your own code, you should (require 'org-web-tools).
Install dash.el, esxml, request, and s.el. Then require this package in your init file:
(require 'org-web-tools)
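If you use the commands often, binding them to keys can save some typing. The following is only a sketch of an init-file fragment: the key sequences are assumptions chosen for illustration, not bindings provided by this package, so pick any keys that are free in your configuration.

```emacs-lisp
;; Load the package so its commands and functions are available.
(require 'org-web-tools)

;; Hypothetical key bindings (not defined by the package itself).
;; Each command works on the URL in the clipboard or kill-ring.
(global-set-key (kbd "C-c w l") #'org-web-tools-insert-link-for-url)
(global-set-key (kbd "C-c w e") #'org-web-tools-insert-web-page-as-entry)
(global-set-key (kbd "C-c w r") #'org-web-tools-read-url-as-org)
```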
- org-web-tools-insert-link-for-url: Insert an Org-mode link to the URL in the clipboard or kill-ring. Downloads the page to get the HTML title.
- org-web-tools-insert-web-page-as-entry: Insert the web page for the URL in the clipboard or kill-ring as an Org-mode entry, as a sibling heading of the current entry.
- org-web-tools-read-url-as-org: Display the web page for the URL in the clipboard or kill-ring as Org-mode text in a new buffer, processed with eww-readable.
- org-web-tools-convert-links-to-page-entries: Convert all URLs and Org links in current Org entry to Org headings, each containing the web page content of that URL, converted to Org-mode text and processed with eww-readable. This should be called on an entry that solely contains a list of URLs or links.
- org-web-tools-archive-attach: Download archive of page at URL and attach with org-attach. If CHOOSE-FN is non-nil (interactively, with universal prefix), prompt for the archive function to use. If VIEW is non-nil (interactively, with two universal prefixes), view the archive immediately after attaching. (See also org-board.)
- org-web-tools-archive-view: Open Zip file archive of web page. Extracts to a temp directory and opens with browse-url-default-browser. Note: the extracted files are left on-disk in the temp directory.
These are used in the commands above and may be useful in building your own commands.
- org-web-tools--dom-to-html: Return parsed HTML DOM as an HTML string. Note: This is an approximation and is not necessarily correct HTML (e.g. IMG tags may be rendered with a closing “</img>” tag).
- org-web-tools--eww-readable: Return “readable” part of HTML with title.
- org-web-tools--get-url: Return content for URL as string.
- org-web-tools--html-title: Return title of HTML page.
- org-web-tools--html-to-org-with-pandoc: Return string of HTML converted to Org with Pandoc. When SELECTOR is non-nil, the HTML is filtered using esxml-query SELECTOR and re-rendered to HTML with org-web-tools--dom-to-html, which see.
- org-web-tools--url-as-readable-org: Return string containing Org entry of URL’s web page content. Content is processed with eww-readable and Pandoc. Entry will be a top-level heading, with article contents below a second-level “Article” heading, and a timestamp in the first-level entry for writing comments.
- org-web-tools--demote-headings-below: Demote all headings in buffer so the highest level is below LEVEL.
- org-web-tools--get-first-url: Return URL in clipboard, or first URL in the kill-ring, or nil if none.
- org-web-tools--read-url: Return a URL by searching at point, then in clipboard, then in kill-ring, and finally prompting the user.
- org-web-tools--read-org-bracket-link: Return (TARGET . DESCRIPTION) for Org bracket LINK or next link on current line.
- org-web-tools--remove-dos-crlf: Remove all DOS CRLF (^M) in buffer.
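As an illustration of combining these functions into your own command, here is a minimal sketch. It assumes the argument lists implied by the descriptions above (a URL string for org-web-tools--get-url, an HTML string for org-web-tools--html-title and org-web-tools--html-to-org-with-pandoc, and no required arguments for org-web-tools--read-url). The command name and buffer name are hypothetical, and Pandoc must be installed for the conversion step.

```emacs-lisp
(require 'org-web-tools)

(defun my/org-web-tools-page-as-org (url)
  "Show the whole page at URL, converted to Org, in a scratch buffer.
Hypothetical example command; unlike the package's own commands,
it does not isolate the \"readable\" part of the page."
  (interactive (list (org-web-tools--read-url)))
  (let* ((html (org-web-tools--get-url url))
         ;; Fall back to the URL when the page has no title.
         (title (or (org-web-tools--html-title html) url))
         (org-text (org-web-tools--html-to-org-with-pandoc html)))
    (with-current-buffer (get-buffer-create "*my-org-web-page*")
      (erase-buffer)
      ;; Use a #+TITLE keyword so the page's own headings keep their levels.
      (insert "#+TITLE: " title "\n\n" org-text)
      (org-mode)
      (goto-char (point-min))
      (pop-to-buffer (current-buffer)))))
```

Run it with M-x my/org-web-tools-page-as-org while a URL is at point, in the clipboard, or in the kill-ring. Because these are internal (double-dash) functions, their signatures may change between releases.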
Nothing new yet.
Improvements
- Archiving tools:
  - Can use multiple functions to attempt archiving.
  - Associated options control retry attempts, delays, and fallbacks to other functions.
  - Functions to archive Web pages with wget and tar:
    - Function org-web-tools-archive--wget-tar archives a URL’s Web page, including page resources.
    - Function org-web-tools-archive--wget-tar-html-only archives a URL’s HTML only.
  - Command org-web-tools-archive-view handles both zip and tar archives.
  - The default settings use wget and tar to archive pages (because the archive.today service has not worked reliably with external tools for a long time).
Changes
- Option org-web-tools-archive-fn defaults to using wget and tar to archive pages to XZ archives with HTML and page resources. (The archive.is service has not worked reliably with other tools for a long time.)
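If you prefer smaller archives, the option can presumably be pointed at the HTML-only function instead. This is a sketch that assumes org-web-tools-archive-fn accepts a function symbol as its value; check the option's documentation (C-h v org-web-tools-archive-fn) before relying on it.

```emacs-lisp
;; Assumption: the option's value is the symbol of an archive function.
;; This selects the HTML-only variant instead of the default, which
;; also downloads page resources.
(setq org-web-tools-archive-fn 'org-web-tools-archive--wget-tar-html-only)
```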
Fixes
- org-web-tools--org-link-for-url now returns the URL if the HTML page has no title tag. This avoids an error, e.g. when used in an Org capture template.
Compatibility
- Emacs 27.1 or later is now required.
- Updated for Org 9.3’s changes to org-bracket-link-regexp. (Thanks to Aaron Zeng and Akira Komamura.)
- Activate org-mode in temporary buffer for org-web-tools--html-to-org-with-pandoc. (#56. Thanks to mooseyboots.)
- Use compat library.
Fixed
- Only test non-nil items in org-web-tools--get-first-url. This makes it work properly in non-GUI Emacs sessions. (Thanks to Ben Sima for reporting.)
Fixed
- Require org-attach.
Additions
- Command org-web-tools-attach-url-archive.
- Command org-web-tools-view-archive.
- Function org-web-tools--read-url.
Changes
- Remove all property drawers that contain the CUSTOM_ID property from Pandoc output.
- First declared stable release.
Contributions and suggestions are welcome.
GPLv3