Merge pull request #240 from rinpatch/feature/fast_html
Add fast_html parser and test all parsers on CI
philss authored Dec 31, 2019
2 parents bfad9ca + 17eab75 commit f3cf95b
Showing 6 changed files with 137 additions and 16 deletions.
13 changes: 10 additions & 3 deletions .github/workflows/ci.yml
@@ -5,32 +5,39 @@ on: [push, pull_request]
jobs:
test:
runs-on: ubuntu-latest
env:
PARSER: ${{ matrix.parser }}

container:
image: elixir:${{ matrix.elixir }}-slim

name: Elixir ${{ matrix.elixir }}
name: Elixir ${{ matrix.elixir }} with ${{ matrix.parser }}

strategy:
fail-fast: false
matrix:
elixir: [1.9, 1.8, 1.7, 1.6, 1.5]
parser: [fast_html, html5ever, mochiweb]

steps:
- uses: actions/checkout@v1.0.0

- name: Install dependencies
run: |-
apt-get update
if [ "$PARSER" = "fast_html" ]; then apt-get -y install build-essential; fi
if [ "$PARSER" = "html5ever" ]; then apt-get -y install cargo; fi
mix local.rebar --force
mix local.hex --force
mix deps.get
- name: Check format
if: matrix.elixir >= 1.6
if: matrix.elixir >= 1.8
run: mix format --check-formatted

- name: Run tests
run: |-
mix test
mix test.$PARSER
- name: Run inch.report
if: matrix.elixir >= 1.7
70 changes: 58 additions & 12 deletions README.md
@@ -86,21 +86,39 @@ If you get this [kind of error](https://github.com/philss/floki/issues/35),
you need to install the `erlang-dev` and `erlang-parsetools` packages in order get the `leex` module.
The packages names may be different depending on your OS.

### Optional - Using html5ever as the HTML parser
### Alternative HTML parsers

You can configure Floki to use [html5ever](https://github.com/servo/html5ever) as your HTML parser.
This is recommended if you need [better performance](https://gist.github.com/philss/70b4b0294f29501c3c7e0f60338cc8bd)
and a more accurate parser. However `html5ever` is being under active development and **may be unstable**.
By default Floki uses a patched version of `mochiweb_html` for parsing fragments
because it is easy to install (it's written in Erlang and has no outside dependencies).

Since it's written in Rust, we need to install Rust and compile the project.
Luckily we have the [html5ever Elixir NIF](https://github.com/hansihe/html5ever_elixir) that makes the integration very easy.
However one might want to use an alternative parser due to the following
concerns:

You still need to install Rust in your system. To do that, please
[follow the instruction](https://www.rust-lang.org/en-US/install.html) presented in the official page.
- Performance - `mochiweb_html` can be [up to 20 times slower than the alternatives](https://hexdocs.pm/fast_html/readme.html#benchmarks) on big HTML
documents.
- Correctness - in some cases `mochiweb_html` will produce different results
from what is specified in the [HTML5 specification](https://html.spec.whatwg.org/).
For example, a correct parser would parse `<title> <b> bold </b> text </title>`
as `{"title", [], [" <b> bold </b> text "]}` since content inside `<title>` is
to be [treated as plaintext](https://html.spec.whatwg.org/#the-title-element).
`mochiweb_html`, however, would parse it as `{"title", [], [{"b", [], [" bold "]}, " text "]}`.
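
To make the difference concrete, here is a minimal sketch that walks the two tree shapes quoted above and collects the tag names found in each. `TreeSketch` and `tag_names/1` are hypothetical helpers written only for this illustration; they are not part of Floki's API, and the trees are copied from the example rather than produced by a real parser.

```elixir
defmodule TreeSketch do
  # Hypothetical helper: collects tag names from a {tag, attrs, children} tree.
  def tag_names({tag, _attrs, children}) do
    [tag | Enum.flat_map(children, &tag_names/1)]
  end

  # Text nodes are plain binaries and contribute no tags.
  def tag_names(text) when is_binary(text), do: []
end

# Shape the HTML5 specification calls for: <title> content stays plain text.
spec_tree = {"title", [], [" <b> bold </b> text "]}

# Shape attributed to mochiweb_html above: <b> becomes a child element.
mochiweb_tree = {"title", [], [{"b", [], [" bold "]}, " text "]}

IO.inspect(TreeSketch.tag_names(spec_tree))
IO.inspect(TreeSketch.tag_names(mochiweb_tree))
```

Code that searches for `b` elements would find one in the second tree but none in the first, which is why the choice of parser can change query results.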

Floki supports the following alternative parsers:

#### Installing html5ever
- `fast_html` - A wrapper for lexborisov's [myhtml](https://github.com/lexborisov/myhtml/), a pure C HTML parser.
- `html5ever` - A wrapper for [html5ever](https://github.com/servo/html5ever) written in Rust, developed as a part of the Servo project.

After setup Rust, you need to add `html5ever` NIF to your dependency list:
`fast_html` is generally faster, according to the
[benchmarks](https://hexdocs.pm/fast_html/readme.html#benchmarks) conducted by
its developers. However, `html5ever` does have an advantage on really small
(~4kb) fragments because it is implemented as a NIF.

#### Using `html5ever` as the HTML parser

Rust needs to be installed on the system in order to compile html5ever. To do that, please
[follow the instructions](https://www.rust-lang.org/en-US/install.html) on the official page.

After Rust is set up, you need to add the `html5ever` NIF to your dependency list:

```elixir
defp deps do
@@ -121,10 +139,38 @@ Then you need to configure your app to use `html5ever`:
config :floki, :html_parser, Floki.HTMLParser.Html5ever
```

After that you are able to use `html5ever` as your HTML parser with Floki.

For more info, check the article [Rustler - Safe Erlang and Elixir NIFs in Rust](http://hansihe.com/2017/02/05/rustler-safe-erlang-elixir-nifs-in-rust.html).

#### Using `fast_html` as the HTML parser

A C compiler and GNU Make need to be installed on the system in order to
compile myhtml. It's likely that your machine has them already.

Note that `epmd` must also be running, or available to start, because `fast_html` relies on a
C-Node worker. It is usually started automatically, but some distributions
(e.g. Gentoo Linux) only allow starting it as a service.

First, add `fast_html` to your dependencies:

```elixir
defp deps do
[
{:floki, "~> 0.23.0"},
{:fast_html, "~> 1.0"}
]
end
```

Run `mix deps.get` and compile the project with `mix compile` to make sure it works.

Then you need to configure your app to use `fast_html`:

```elixir
# in config/config.exs

config :floki, :html_parser, Floki.HTMLParser.FastHtml
```
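
The configuration above only stores a module name under the `:html_parser` key. How Floki reads that key internally is not shown in this diff, so the following is a sketch of one plausible lookup, under stated assumptions: the `ParserLookupSketch` module, its two stub parsers, and the `:parser_lookup_sketch` application name are all made up for illustration.

```elixir
defmodule ParserLookupSketch do
  # Hypothetical stand-ins for real parser modules; each only tags its input.
  defmodule DefaultParser do
    def parse(html), do: {:default, html}
  end

  defmodule AlternativeParser do
    def parse(html), do: {:alternative, html}
  end

  # Reads the configured module, falling back to the default when unset.
  # The :parser_lookup_sketch app name is invented for this example.
  def parse(html) do
    parser = Application.get_env(:parser_lookup_sketch, :html_parser, DefaultParser)
    parser.parse(html)
  end
end

# With nothing configured, the default parser handles the input.
IO.inspect(ParserLookupSketch.parse("<p>hi</p>"))

# Equivalent in effect to a `config :app, :html_parser, Module` line.
Application.put_env(:parser_lookup_sketch, :html_parser, ParserLookupSketch.AlternativeParser)
IO.inspect(ParserLookupSketch.parse("<p>hi</p>"))
```

Because the lookup happens at call time, a misspelled module name in the config would only surface as an error when the first document is parsed.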

## More about Floki API

To parse a HTML document, try:
16 changes: 16 additions & 0 deletions lib/floki/html_parser/fast_html.ex
@@ -0,0 +1,16 @@
defmodule Floki.HTMLParser.FastHtml do
  @moduledoc false

  def parse(html) do
    case Code.ensure_loaded(:fast_html) do
      {:module, module} ->
        case module.decode(html) do
          {:ok, result} -> result
          {:error, _message} = error -> error
        end

      {:error, _reason} ->
        raise "Expected module :fast_html to be available."
    end
  end
end
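
The module above relies on `Code.ensure_loaded/1` to treat `fast_html` as an optional dependency. A minimal sketch of that pattern in isolation follows; `OptionalDepSketch` and `call_if_available/3` are hypothetical names, and `:lists` stands in for the optional module only because it ships with Erlang/OTP.

```elixir
defmodule OptionalDepSketch do
  # Hypothetical helper: calls module.fun(args) only if the module can be loaded.
  def call_if_available(module, fun, args) do
    case Code.ensure_loaded(module) do
      {:module, loaded} -> {:ok, apply(loaded, fun, args)}
      {:error, reason} -> {:error, {:not_available, module, reason}}
    end
  end
end

# :lists ships with Erlang/OTP, so it is always loadable.
IO.inspect(OptionalDepSketch.call_if_available(:lists, :reverse, [[1, 2, 3]]))

# A module name that is assumed not to exist on the code path.
IO.inspect(OptionalDepSketch.call_if_available(:no_such_parser_module, :decode, ["<p></p>"]))
```

Unlike the commit's code, this sketch returns an error tuple instead of raising, purely to keep the example easy to run; either choice fits the pattern.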
45 changes: 44 additions & 1 deletion mix.exs
@@ -13,6 +13,7 @@ defmodule Floki.Mixfile do
elixir: "~> 1.5",
package: package(),
deps: deps(),
aliases: aliases(),
source_url: "https://github.com/philss/floki",
docs: [extras: ["README.md"], main: "Floki", assets: "assets"]
]
@@ -23,13 +24,55 @@ defmodule Floki.Mixfile do
end

defp deps do
# Needed to avoid installing unnecessary deps on the CI
parsers =
case System.get_env("PARSER") do
nil -> [:fast_html, :html5ever]
parser -> [String.to_atom(parser)]
end
|> Enum.map(fn name -> {name, ">= 0.0.0", optional: true, only: [:dev, :test]} end)

[
{:html_entities, "~> 0.5.0"},
{:earmark, "~> 1.2", only: :dev},
{:ex_doc, "~> 0.18", only: :dev},
{:credo, ">= 0.0.0", only: [:dev, :test]},
{:inch_ex, ">= 0.0.0", only: :docs}
]
] ++ parsers
end

defp aliases do
# Hardcoded because we can't load the floki application and get the module list at this point.
parsers = [Floki.HTMLParser.Mochiweb, Floki.HTMLParser.FastHtml, Floki.HTMLParser.Html5ever]

{aliases, cli_names} =
Enum.map_reduce(parsers, [], fn parser, acc ->
cli_name =
parser
|> Module.split()
|> List.last()
|> Macro.underscore()

{{:"test.#{cli_name}", &test_with_parser(parser, &1)}, [cli_name | acc]}
end)

aliases
|> Keyword.put(:test, &test_with_parser(cli_names, &1))
end

defp test_with_parser(parser_cli_names, args) when is_list(parser_cli_names) do
Enum.each(parser_cli_names, fn cli_name ->
Mix.shell().cmd("mix test.#{cli_name} --color #{Enum.join(args, " ")}",
env: [{"MIX_ENV", "test"}]
)
end)
end

defp test_with_parser(parser, args) do
Mix.shell().info("Running tests with #{parser}")
Application.put_env(:floki, :html_parser, parser, persistent: true)
Mix.env(:test)
Mix.Tasks.Test.run(args)
end

defp package do
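
The alias names in `aliases/0` are derived from the parser module names. The sketch below isolates that derivation so the resulting task names are easy to see; `AliasNameSketch` is a hypothetical module, while the `Module.split/1`, `List.last/1` and `Macro.underscore/1` pipeline is the same one used in the diff.

```elixir
defmodule AliasNameSketch do
  # Turns a parser module name into the suffix used for its mix alias.
  def cli_name(parser) do
    parser
    |> Module.split()
    |> List.last()
    |> Macro.underscore()
  end

  # Builds the full alias name, such as "test.fast_html".
  def task_name(parser), do: "test." <> cli_name(parser)
end

# Same module names as hardcoded in the aliases above. Module.split/1 works
# on the names alone, so the modules do not need to be compiled here.
parsers = [Floki.HTMLParser.Mochiweb, Floki.HTMLParser.FastHtml, Floki.HTMLParser.Html5ever]

IO.inspect(Enum.map(parsers, &AliasNameSketch.task_name/1))
```

These names line up with the `mix test.$PARSER` step in the CI workflow, where `PARSER` is one of `fast_html`, `html5ever` or `mochiweb`.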
3 changes: 3 additions & 0 deletions mix.lock
@@ -3,6 +3,8 @@
"credo": {:hex, :credo, "1.1.5", "caec7a3cadd2e58609d7ee25b3931b129e739e070539ad1a0cd7efeeb47014f4", [:mix], [{:bunt, "~> 0.2.0", [hex: :bunt, repo: "hexpm", optional: false]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm"},
"earmark": {:hex, :earmark, "1.4.3", "364ca2e9710f6bff494117dbbd53880d84bebb692dafc3a78eb50aa3183f2bfd", [:mix], [], "hexpm"},
"ex_doc": {:hex, :ex_doc, "0.21.2", "caca5bc28ed7b3bdc0b662f8afe2bee1eedb5c3cf7b322feeeb7c6ebbde089d6", [:mix], [{:earmark, "~> 1.3.3 or ~> 1.4", [hex: :earmark, repo: "hexpm", optional: false]}, {:makeup_elixir, "~> 0.14", [hex: :makeup_elixir, repo: "hexpm", optional: false]}], "hexpm"},
"fast_html": {:hex, :fast_html, "1.0.1", "5bc7df4dc4607ec2c314c16414e4111d79a209956c4f5df96602d194c61197f9", [:make, :mix], [], "hexpm"},
"html5ever": {:hex, :html5ever, "0.7.0", "9f63ec1c783b2dc9f326840fcc993c01e926dbdef4e51ba1bbe5355993c258b4", [:mix], [{:rustler, "~> 0.18.0", [hex: :rustler, repo: "hexpm", optional: false]}], "hexpm"},
"html_entities": {:hex, :html_entities, "0.5.0", "40f5c5b9cbe23073b48a4e69c67b6c11974f623a76165e2b92d098c0e88ccb1d", [:mix], [], "hexpm"},
"inch_ex": {:hex, :inch_ex, "2.0.0", "24268a9284a1751f2ceda569cd978e1fa394c977c45c331bb52a405de544f4de", [:mix], [{:bunt, "~> 0.2", [hex: :bunt, repo: "hexpm", optional: false]}, {:jason, "~> 1.0", [hex: :jason, repo: "hexpm", optional: false]}], "hexpm"},
"jason": {:hex, :jason, "1.1.2", "b03dedea67a99223a2eaf9f1264ce37154564de899fd3d8b9a21b1a6fd64afe7", [:mix], [{:decimal, "~> 1.0", [hex: :decimal, repo: "hexpm", optional: true]}], "hexpm"},
@@ -11,4 +13,5 @@
"mochiweb": {:hex, :mochiweb, "2.18.0", "eb55f1db3e6e960fac4e6db4e2db9ec3602cc9f30b86cd1481d56545c3145d2e", [:rebar3], [], "hexpm"},
"nimble_parsec": {:hex, :nimble_parsec, "0.5.1", "c90796ecee0289dbb5ad16d3ad06f957b0cd1199769641c961cfe0b97db190e0", [:mix], [], "hexpm"},
"poison": {:hex, :poison, "3.1.0", "d9eb636610e096f86f25d9a46f35a9facac35609a7591b3be3326e99a0484665", [:mix], [], "hexpm"},
"rustler": {:hex, :rustler, "0.18.0", "db4bd0c613d83a1badc31be90ddada6f9821de29e4afd15c53a5da61882e4f2d", [:mix], [], "hexpm"},
}
6 changes: 6 additions & 0 deletions test/test_helper.exs
@@ -1 +1,7 @@
# fast_html uses a C-Node worker for parsing, so starting the application
# is necessary for it to work
if Application.get_env(:floki, :html_parser) == Floki.HTMLParser.FastHtml do
  Application.ensure_all_started(:fast_html)
end

ExUnit.start()
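
The conditional start in the test helper can be sketched in isolation as follows. `ConditionalStartSketch`, `maybe_start/3`, `ParserA` and `ParserB` are hypothetical names, and `:logger` stands in for `:fast_html` only because it ships with Elixir and can be started without extra dependencies.

```elixir
defmodule ConditionalStartSketch do
  # Hypothetical helper: starts `app` only when the configured parser is the
  # one that needs a running application.
  def maybe_start(configured, needs_app, app) do
    if configured == needs_app do
      Application.ensure_all_started(app)
    else
      :skipped
    end
  end
end

# The configured parser matches, so the application is started if needed.
IO.inspect(ConditionalStartSketch.maybe_start(ParserA, ParserA, :logger))

# A different parser is configured, so nothing is started.
IO.inspect(ConditionalStartSketch.maybe_start(ParserB, ParserA, :logger))
```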
