
Streams of JSON documents Large files (>4GB) (#350) #364

Merged
piotte13 merged 1 commit into master on Nov 8, 2019
Conversation

@piotte13 (Collaborator) commented Nov 7, 2019

Hello everyone,
We have a working prototype that solves this issue and allows us to support large files (>4GB)!

Example usage:

#include "simdjson/parsedjson.h"
#include "simdjson/jsonstream.h"

simdjson::padded_string string;
simdjson::get_corpus(filename).swap(string); // filename: path to the input file

// Alternative constructors:
//simdjson::JsonStream js{string};
//simdjson::JsonStream js{string, 1000000};
//simdjson::JsonStream js{string.data(), string.size()};
simdjson::JsonStream js{string.data(), string.size(), 1000000};
// Where 1000000 is the batch_size in bytes.
// We load the data in batches to speed things up a bit (default = 1 MB).


int parse_res = simdjson::SUCCESS_AND_HAS_MORE;
simdjson::ParsedJson pj;

//We re-use the same ParsedJson as much as possible, to save allocation time.
//This loop will run as long as the file contains valid JSON documents.
while (parse_res == simdjson::SUCCESS_AND_HAS_MORE) {
    parse_res = js.json_parse(pj);
}
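A note on how the loop above ends: once json_parse stops returning SUCCESS_AND_HAS_MORE, the last return code tells you whether the stream finished cleanly. Here is a minimal sketch of checking it, assuming the error_message() helper available in simdjson at the time (an illustration, not code from this PR):

#include <iostream>

// After the while loop above:
if (parse_res == simdjson::SUCCESS) {
    // The final document in the stream parsed correctly.
    std::cout << "parsed the whole stream" << std::endl;
} else {
    // Any other code is an error; error_message() maps it to readable text.
    std::cerr << "parse failed: " << simdjson::error_message(parse_res) << std::endl;
}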

Implementation

The current implementation lets us parse large files faster than the "simple" std::getline() alternative. We still have many ideas to make it even faster, but it already feels like a good starting point.
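For context, here is a minimal sketch of the kind of line-by-line baseline being compared against, assuming a line-delimited (ndjson) input and the build_parsed_json() call from simdjson's API at the time; it is an illustration, not the benchmark code from this PR:

#include <fstream>
#include <iostream>
#include <string>
#include "simdjson/jsonparser.h"

// Baseline: read one JSON document per line and parse each line on its own.
int main(int argc, char *argv[]) {
    if (argc < 2) { std::cerr << "usage: baseline <file.ndjson>" << std::endl; return 1; }
    std::ifstream in(argv[1]);
    std::string line;
    size_t count = 0;
    while (std::getline(in, line)) {
        // Allocates and parses in one call; no reuse of the ParsedJson between lines.
        simdjson::ParsedJson pj = simdjson::build_parsed_json(line.data(), line.size());
        if (!pj.is_valid()) {
            std::cerr << "parse error on document " << count << std::endl;
            return 1;
        }
        count++;
    }
    std::cout << "parsed " << count << " documents" << std::endl;
    return 0;
}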

What kind of file is supported?

We opted for the more general case: a file may contain multiple JSON documents in any layout, as long as each document is valid JSON. We are therefore not limited to ndjson or JSON Lines; we support any file that contains a sequence of valid JSON documents.
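To make that concrete, here is a minimal sketch of the intended usage on documents that are not line-delimited, reusing the JsonStream API shown above on a small hypothetical in-memory input (not a file from this PR):

#include <string>
#include "simdjson/parsedjson.h"
#include "simdjson/jsonstream.h"

// Two JSON documents separated only by a space, on a single line.
std::string json = "{\"a\":1} {\"b\":[2,3]}";
simdjson::padded_string two_docs(json);
simdjson::JsonStream stream{two_docs.data(), two_docs.size()};

simdjson::ParsedJson pj;
int res = simdjson::SUCCESS_AND_HAS_MORE;
while (res == simdjson::SUCCESS_AND_HAS_MORE) {
    res = stream.json_parse(pj); // pj holds the current document after each call
}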

API

I would greatly appreciate some feedback. From my point of view, it seems simple and easy to use, but I'm not an end user!

  • Does it meet your needs?
  • Is it clear, simple and complete enough?
  • Suggestions?

Cheers

* rough prototype working. Needs more tests and fine tuning.
* prototype working on large files.
* Adding benchmarks
* jsonstream API adjustment
* typo
* minor fixes and cleaning.
* removing warnings
* removing some copies
* runtime dispatch error fix
* makefile linking src/jsonstream.cpp
* fixing arm stage 1 headers
* fixing stage 2 headers
* fixing stage 1 arm header
* making jsonstream portable
* cleaning imports
* including <algorithms> for windows compiler
* cleaning benchmark imports
* adding jsonstream to amalgamation
* merged main into branch
* bug fix where JsonStream would fail in rare cases.
* Adding a JsonStream Demo to Amalgamation
* Fix for #345
* Follow-up test and fix for #345 (#347)
* Final (?) fix for #345
* Verbose basictest
* Being more forgiving of powers of ten.
* Let us zero the tail end.
* add basic fuzzers (#348)
* add basic fuzzing using libFuzzer
* let cmake respect cflags, otherwise the fuzzer flags go unnoticed (also, integrates badly with oss-fuzz)
* add new fuzzer for minification, simplify the old one
* add fuzzer for the dump example
* adding Paul Dreik
* clang format
* Fixing issue 351 (#352)
* Fixing issues 351 and 353
* Fix ARM compile errors on g++ 7.4 (#354)
* Fix ARM compilation errors
* Update singleheader
* fix integer overflow in subnormal_power10 (#355), detected by oss-fuzz: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=18714
* Adding new test file, following #355
* merging main
* make file fix

@lgtm-com (bot) commented Nov 7, 2019

This pull request introduces 2 alerts when merging 02dc670 into c4f1baa - view on LGTM.com

new alerts:

  • 1 for Missing header guard
  • 1 for Use of goto

@lemire (Member) commented Nov 8, 2019

The failing test refers to a minor performance regression under ARM which, as far as I know, is not related to this PR.

I think we can merge this PR as soon as the author recommends it.

@piotte13 (Collaborator, Author) commented Nov 8, 2019

I will merge this for now and open another pull request for the documentation.

piotte13 merged commit bdc2b07 into master on Nov 8, 2019
@rbnx commented Nov 10, 2019

@piotte13 I tested this new functionality after it was merged, and it only seems to work with line-delimited JSON files. Are there plans to add support for >4GB files that are not line-delimited?

I'm on macOS Mojave. This is the code I ran:
https://github.com/lemire/simdjson/pull/364/files#diff-58125dd7898378f560eac16c69fa5198R152-R167

@lemire (Member) commented Nov 10, 2019

@rbnx It would be best to open a new issue so we can track the problem.

I think the intention is definitely to provide broad support and not to require line delimitation.

cc @piotte13
