hermes

A Haskell interface over the simdjson C++ library for decoding JSON documents. Hermes, messenger of the gods, was the maternal great-grandfather of Jason, son of Aeson.

Overview

This library exposes functions that can be used to write decoders for JSON documents using the simdjson On Demand API. From the simdjson On Demand design documentation:

Good applications for the On Demand API might be:

You are working from pre-existing large JSON files that have been vetted. You expect them to be well formed according to a known JSON dialect and to have a consistent layout. For example, you might be doing biomedical research or machine learning on top of static data dumps in JSON.

Both the generation and the consumption of JSON data are within your system. Your team controls both the software that produces the JSON and the software that parses it, and your team knows and controls the hardware. Thus you can fully test your system.

You are working with stable JSON APIs which have a consistent layout and JSON dialect.

With this in mind, Data.Hermes parsers can decode Haskell types faster than traditional Data.Aeson.FromJSON instances, especially in cases where you only need to decode a subset of the document. This is because Data.Aeson.FromJSON converts the entire document into a Data.Aeson.Value, which means memory usage increases linearly with the input size. The simdjson::ondemand API does not have this constraint because it iterates over the JSON string in memory without constructing an intermediate tree. This means decoders are truly lazy and you only pay for what you use.
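As a rough illustration of the "pay for what you use" point, the sketch below decodes a single field from each object in an array and lets simdjson skip the rest. It uses only functions that appear elsewhere in this README (object, atKey, list, text, decodeEither); the sample input and the names namesDecoder and sampleInput are made up for this example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Hermes as H
import Data.Text (Text)

-- Decode only the "name" field of each object. The "tags" arrays are
-- skipped by the iterator instead of being turned into Haskell values.
namesDecoder :: H.Decoder [Text]
namesDecoder = H.list . H.object $ H.atKey "name" H.text

-- Hypothetical sample document, not taken from the hermes test suite.
sampleInput :: BS.ByteString
sampleInput = "[{\"name\":\"a\",\"tags\":[1,2,3]},{\"name\":\"b\",\"tags\":[]}]"

main :: IO ()
main = print (H.decodeEither namesDecoder sampleInput)
```

An equivalent aeson decoder would typically build a Value for every "tags" array before discarding it.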

Hermes requires the entire document in memory. For an incremental JSON parser that supports streaming, see json-stream.

Usage

This library does not offer a Haskell API over the entire simdjson On Demand API. It currently binds only to what is needed for defining and running a Decoder. You can see the tests and benchmarks for example usage. simdjson::ondemand exceptions will be caught and re-thrown with enough information to troubleshoot. In the worst case you may run into a segmentation fault that is not caught, which you are encouraged to report as a bug.

Decoders

{-# LANGUAGE OverloadedStrings #-}

import qualified Data.ByteString as BS
import qualified Data.Hermes as H

personDecoder :: H.Decoder Person
personDecoder = H.object $
  Person
    <$> H.atKey "_id" H.text
    <*> H.atKey "index" H.int
    <*> H.atKey "guid" H.text
    <*> H.atKey "isActive" H.bool
    <*> H.atKey "balance" H.text
    <*> H.atKey "picture" (H.nullable H.text)
    <*> H.atKey "latitude" H.scientific

-- Decode a strict ByteString.
decodePersons :: BS.ByteString -> Either H.HermesException [Person]
decodePersons = H.decodeEither $ H.list personDecoder
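The decoder above assumes a Person record that is not shown here. A minimal sketch of what it might look like follows; the field names are hypothetical, and only the field order and types are implied by personDecoder (text yields Text, int yields Int, nullable yields Maybe, scientific yields Scientific).

```haskell
module Person (Person (..)) where

import Data.Scientific (Scientific)
import Data.Text (Text)

-- Hypothetical record matching personDecoder. The constructor fields must
-- appear in the same order as the atKey calls, because the decoder fills
-- them in applicatively.
data Person = Person
  { personId       :: Text        -- "_id"
  , personIndex    :: Int         -- "index"
  , personGuid     :: Text        -- "guid"
  , personIsActive :: Bool        -- "isActive"
  , personBalance  :: Text        -- "balance"
  , personPicture  :: Maybe Text  -- "picture", may be null
  , personLatitude :: Scientific  -- "latitude"
  } deriving (Eq, Show)
```

The string literals passed to atKey are Text values, so the module that defines the decoder needs the OverloadedStrings extension.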

Aeson Integration

While it is not recommended to use hermes if you need the full DOM, we still provide a performant interface to decode aeson Values. See an example of this in the hermes-aeson subpackage. You could use hermes to selectively decode aeson Values on demand, for example:

> decodeEither (atPointer "/statuses/99/user/screen_name" hValueToAeson) twitter
Right (String "2no38mae")
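If the aeson Value is not needed, the same pointer-based access can decode straight to a Haskell type. The sketch below is an assumption-laden variant of the example above: it swaps hValueToAeson for text and runs against a small made-up document rather than the Twitter fixture. The names screenNameDecoder and sampleDoc are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Hermes as H
import Data.Text (Text)

-- Jump to one nested string with a JSON pointer and decode it as Text.
screenNameDecoder :: H.Decoder Text
screenNameDecoder = H.atPointer "/statuses/0/user/screen_name" H.text

-- Hypothetical stand-in for the much larger Twitter document.
sampleDoc :: BS.ByteString
sampleDoc = "{\"statuses\":[{\"user\":{\"screen_name\":\"example\"}}]}"

main :: IO ()
main = print (H.decodeEither screenNameDecoder sampleDoc)
```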

Exceptions

When decoding fails for a known reason, you will get a Left HermesException indicating whether the error came from simdjson or from an internal hermes call.

> decodeEither (object . atKey "hello" $ list text) "{ \"hello\": [\"world\", false] }"
Left (SIMDException (DocumentError {path = "/hello/1", errorMsg = "Error while getting value of type text. INCORRECT_TYPE: The JSON element does not have the requested type."}))
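One way to act on the error source is to pattern match on the exception. The following is a sketch, assuming the SIMDException constructor shown in the output above is exported from Data.Hermes; every other case is handled with a catch-all so the example does not depend on constructor names that are not shown here. The helper describeError is hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Hermes as H
import Data.Text (Text)

-- Hypothetical helper: label the error by where it came from.
describeError :: H.HermesException -> String
describeError (H.SIMDException err) = "simdjson error: " <> show err
describeError other                 = "hermes error: " <> show other

helloDecoder :: H.Decoder [Text]
helloDecoder = H.object . H.atKey "hello" $ H.list H.text

-- Same input as the example above: the second array element is not a string.
badInput :: BS.ByteString
badInput = "{ \"hello\": [\"world\", false] }"

main :: IO ()
main =
  case H.decodeEither helloDecoder badInput of
    Left e   -> putStrLn (describeError e)
    Right xs -> print xs
```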

Benchmarks

We benchmark the following operations using both hermes-json and aeson strict ByteString decoders:

Decode a small array of 3-element arrays of doubles
Full decoding of a large-ish (12 MB) JSON array of Person objects
Partial decoding of Twitter status objects to highlight the on-demand benefits
Decoding entire documents into Data.Aeson.Value

Please be aware that GHC does not report C-allocated memory. simdjson does actually allocate more memory than appears here, but we still strive to keep our Haskell memory footprint as small as possible.

Specs

GHC 9.8.2 w/ -O1
aeson-2.2 with text > 2.0
Apple M1 Pro

Name	Mean (ps)	2*Stdev (ps)	Allocated	Copied	Peak Memory
All.Hermes Arrays	1176636718	112205192	4021162	42064	94371840
All.Aeson Arrays	17213496875	475332150	71210605	1894420	94371840
All.Hermes Persons	44490500000	2894869504	128146854	23270453	134217728
All.Aeson Persons	129197600000	10025634382	338386171	119316159	227540992
All.Hermes Partial Twitter	280497070	13556162	276618	281	227540992
All.Aeson Partial Twitter	2791673437	155280944	11964607	187669	227540992
All.JsonStream Partial Twitter	2446430468	157827966	15090720	13248	227540992
All.Hermes Persons (Aeson Value)	118637050000	8397106074	276664184	105937104	227540992
All.Aeson Persons (Aeson Value)	104844225000	3647809494	269685425	96421514	250609664
All.Hermes Twitter (Aeson Value)	2933007812	223414634	11107284	189841	250609664
All.Aeson Twitter (Aeson Value)	2863515625	212391518	11652717	187635	250609664

Performance Tips

Decode to Text instead of String wherever possible!
Decode to Int or Double instead of Scientific if you can.
Decode your object fields in order. If encoding with aeson, you can leverage toEncoding to enforce ordering.

If you need to decode in tight loops or long-running processes (like a server), consider using the withHermesEnv/mkHermesEnv and parseByteString functions instead of decodeEither. This ensures the simdjson instances are not re-created on each decode. See the simdjson performance docs for more info. Please ensure that you use one HermesEnv per thread, as simdjson is single-threaded by default.
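A sketch of reusing one environment across many decodes is shown below. The function names come from the paragraph above, but the signatures are assumptions: mkHermesEnv is taken to accept an optional parser capacity (Maybe Int) and return IO HermesEnv, and parseByteString is taken to accept the environment, a decoder, and a strict ByteString. Check the Haddocks for the version you are using before relying on this.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import qualified Data.ByteString as BS
import qualified Data.Hermes as H

-- Hypothetical inputs standing in for a stream of requests.
inputs :: [BS.ByteString]
inputs = ["[1,2,3]", "[4,5]", "[]"]

main :: IO ()
main = do
  -- Assumed signature: mkHermesEnv :: Maybe Int -> IO HermesEnv.
  -- Nothing is assumed to mean "use the default capacity".
  env <- H.mkHermesEnv Nothing
  -- Assumed signature:
  -- parseByteString :: HermesEnv -> Decoder a -> ByteString -> Either HermesException a
  let decodeInts = H.parseByteString env (H.list H.int)
  -- The simdjson instances held by env are reused for every input.
  mapM_ (print . decodeInts) inputs
```

In a multi-threaded server, each worker thread would create its own env rather than sharing this one.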

Limitations

Because the On Demand API in simdjson uses a forward-only iterator (except for object fields), it is possible to introduce unsafe iteration. Hermes tries to prevent this as much as possible with the type system.

The On Demand API does not validate the entire document upon creating the iterator (besides UTF-8 validation and basic well-formed checks). It is possible to parse an invalid JSON document but not realize it until later. If you need the entire document to be validated up front then a DOM parser is a better fit for you.

The On Demand approach is less safe than DOM: we only validate the components of the JSON document that are used and it is possible to begin ingesting an invalid document only to find out later that the document is invalid. Are you fine ingesting a large JSON document that starts with well formed JSON but ends with invalid JSON content?

Other limitations inherited from simdjson:

Cannot decode scalar documents, e.g. a single string, number, boolean, or null as a JSON document.
4 GB is the maximum document size that simdjson supports.

Portability

Per the simdjson documentation:

A recent compiler (LLVM clang 6 or better, GNU GCC 7.4 or better, Xcode 11 or better) on a 64-bit (PPC, ARM or x64 Intel/AMD) POSIX system such as macOS, FreeBSD or Linux. We require that the compiler supports the C++11 standard or better.

However, this library relies on std::string_view without a shim, so C++17 or later is required.

The native_comp cabal flag enables passing -march=native to the C compiler.

Passing -march=native to the compiler may make On Demand faster by allowing it to use optimizations specific to your machine. You cannot do this, however, if you are compiling code that might be run on less advanced machines. That is, be mindful that when compiling with the -march=native flag, the resulting binary will run on the current system but may not run on other systems (e.g., on an old processor).

If you are compiling on an ARM or POWER system, you do not need to be concerned with CPU selection during compilation. The -march=native flag is useful for best performance on x64 (e.g., Intel) systems but it is generally unsupported on some platforms such as ARM (aarch64) or POWER.
