Floki

Floki is a simple HTML parser that enables search for nodes using CSS selectors.

Check the documentation.

Usage

Take this HTML as an example:

<!doctype html>
<html>
<body>
  <section id="content">
    <p class="headline">Floki</p>
    <span class="headline">Enables search using CSS selectors</span>
    <a href="https://github.com/philss/floki">Github page</a>
    <span data-model="user">philss</span>
  </section>
  <a href="https://hex.pm/packages/floki">Hex package</a>
</body>
</html>

Here are some queries that you can perform (with return examples):

Floki.find(html, "p.headline")
# => [{"p", [{"class", "headline"}], ["Floki"]}]


Floki.find(html, "p.headline")
|> Floki.raw_html
# => <p class="headline">Floki</p>


Floki.find(html, "a[href^=https]")
# => [{"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]},
# =>  {"a", [{"href", "https://hex.pm/packages/floki"}], ["Hex package"]}]


Floki.find(html, "#content a")
# => [{"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]}]


Floki.find(html, "[data-model=user]")
# => [{"span", [{"data-model", "user"}], ["philss"]}]


Floki.find(html, ".headline:nth-child(1), a")
# => [{"p", [{"class", "headline"}], ["Floki"]},
# =>  {"a", [{"href", "https://github.com/philss/floki"}], ["Github page"]},
# =>  {"a", [{"href", "https://hex.pm/packages/floki"}], ["Hex package"]}]

Each HTML node is represented by a tuple like:

{tag_name, attributes, children_nodes}

Example of node:

{"p", [{"class", "headline"}], ["Floki"]}

So even if the only child node is the element text, it is represented inside a list.
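Because children are always a list that can mix text (plain binaries) with nested node tuples, the tree is easy to walk with ordinary pattern matching. The following is a minimal sketch of that idea, not Floki's own implementation; the module name NodeTextSketch and its text/1 function are hypothetical and exist only to illustrate the node shape described above.

```elixir
defmodule NodeTextSketch do
  # Hypothetical helper (not part of Floki): collects the text of a node tree.

  # A list of nodes: concatenate the text of each one.
  def text(nodes) when is_list(nodes), do: Enum.map_join(nodes, &text/1)

  # A text node is a plain binary.
  def text(text) when is_binary(text), do: text

  # An element: ignore tag and attributes, descend into the children list.
  def text({_tag, _attrs, children}), do: text(children)

  # Anything else (assumed here to be comments or similar) contributes no text.
  def text(_other), do: ""
end

tree =
  {"section", [{"id", "content"}],
   [{"p", [{"class", "headline"}], ["Floki"]}, " by philss"]}

IO.puts(NodeTextSketch.text(tree))
# prints: Floki by philss
```

Since the single text child "Floki" sits inside a list, the same list clause handles both one child and many children.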

You can write a simple HTML crawler with Floki and HTTPoison:

html
|> Floki.find(".pages a")
|> Floki.attribute("href")
|> Enum.map(fn(url) -> HTTPoison.get!(url) end)

It is as simple as that!
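One practical caveat: href values come back exactly as written in the page, so they may be relative, while HTTP clients generally expect absolute URLs. Below is a minimal sketch of normalizing them with the standard library's URI.merge/2 before fetching. The module name LinkSketch, the function absolutize/2, and the example.com base URL are assumptions made for illustration.

```elixir
defmodule LinkSketch do
  # Hypothetical helper: resolves each href against the URL of the page
  # it was extracted from. Absolute hrefs pass through unchanged.
  def absolutize(hrefs, base_url) do
    Enum.map(hrefs, fn href ->
      base_url |> URI.merge(href) |> URI.to_string()
    end)
  end
end

# Shaped like the output of Floki.attribute/2: a list of href strings.
hrefs = ["/about", "2.html", "https://hex.pm/packages/floki"]

urls = LinkSketch.absolutize(hrefs, "https://example.com/pages/index.html")
IO.inspect(urls)
```

The resulting list can then be piped into Enum.map with HTTPoison.get!, as in the crawler above.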

Installation

Add Floki to your mix.exs:

defp deps do
  [
    {:floki, "~> 0.17.0"}
  ]
end

After that, run mix deps.get.

Dependencies

Floki needs the leex module in order to compile. Normally this module is installed with Erlang in a complete installation.

If compilation fails because leex is missing, install the erlang-dev and erlang-parsetools packages to get the leex module. Package names may differ depending on your OS.

Optional - Using html5ever as the HTML parser

You can configure Floki to use html5ever as your HTML parser. This is recommended if you need better performance and a more accurate parser. However, html5ever is under active development and may be unstable.

Since html5ever is written in Rust, you need to install Rust and compile the project. Luckily, the html5ever Elixir NIF makes the integration very easy.

You still need to install Rust on your system. To do that, follow the instructions on the official page.

Installing html5ever

After setting up Rust, add the html5ever NIF to your dependency list:

defp deps do
  [
    {:floki, "~> 0.17.0"},
    {:html5ever, "~> 0.3.0"}
  ]
end

Run mix deps.get and compile the project with mix compile to make sure it works.

Then you need to configure your app to use html5ever:

# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.Html5ever

After that you are able to use html5ever as your HTML parser with Floki.

For more info, check the article Rustler - Safe Erlang and Elixir NIFs in Rust.

More about Floki API

To parse an HTML document, try:

html = """
  <html>
  <body>
    <div class="example"></div>
  </body>
  </html>
"""

Floki.parse(html)
# => {"html", [], [{"body", [], [{"div", [{"class", "example"}], []}]}]}

To find elements with the class example, try:

Floki.find(html, ".example")
# => [{"div", [{"class", "example"}], []}]

To convert your node tree back to raw HTML (spaces are ignored):

Floki.find(html, ".example")
|> Floki.raw_html
# =>  <div class="example"></div>
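Conceptually, serializing is the inverse of parsing: each tuple becomes an opening tag with its attributes, followed by its rendered children and a closing tag. The sketch below shows that idea under simplifying assumptions (no escaping, no void elements such as br, no comments). RawHtmlSketch is a hypothetical module, not Floki's implementation.

```elixir
defmodule RawHtmlSketch do
  # Hypothetical, simplified renderer for {tag, attributes, children} tuples.

  # A list of nodes: render each one and concatenate the results.
  def render(nodes) when is_list(nodes), do: Enum.map_join(nodes, &render/1)

  # A text node is emitted as-is (no escaping in this sketch).
  def render(text) when is_binary(text), do: text

  # An element: opening tag with attributes, rendered children, closing tag.
  def render({tag, attrs, children}) do
    attr_string = Enum.map_join(attrs, fn {name, value} -> ~s( #{name}="#{value}") end)
    "<#{tag}#{attr_string}>#{render(children)}</#{tag}>"
  end
end

IO.puts(RawHtmlSketch.render([{"div", [{"class", "example"}], []}]))
# prints: <div class="example"></div>
```

No whitespace is stored between the tuples, which is why the indentation of the original document does not appear in the output.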

To fetch some attribute from elements, try:

Floki.attribute(html, ".example", "class")
# => ["example"]

You can get attributes from elements that you already have:

Floki.find(html, ".example")
|> Floki.attribute("class")
# => ["example"]
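Since attributes are stored as a list of {name, value} tuples, reading one from nodes you already hold amounts to a filtered comprehension. This is a minimal sketch, assuming only the nodes passed in are inspected and not their descendants; AttributeSketch and values/2 are hypothetical names, not part of Floki.

```elixir
defmodule AttributeSketch do
  # Hypothetical helper: returns the values of attribute `name` found on
  # the given nodes. Text nodes and nodes without the attribute are skipped.
  def values(nodes, name) do
    for {_tag, attrs, _children} <- nodes,
        {^name, value} <- attrs,
        do: value
  end
end

nodes = [{"div", [{"class", "example"}], []}, "some text"]

IO.inspect(AttributeSketch.values(nodes, "class"))
# prints: ["example"]
```

The pinned ^name in the generator keeps only the attribute tuples whose name matches, so no explicit filter is needed.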

If you want to get the text from an element, try the following (here html is the document from the Usage section):

Floki.find(html, "p.headline")
|> Floki.text
# => "Floki"

License

Floki is under MIT license. Check the LICENSE file for more details.
