schema_analysis

Universal-ish Schema Analysis

Ever wished you could figure out what was in that json file? Or maybe it was xml... Ehr, yaml? It was definitely toml.

Alas, many great tools will only work with one of those formats, and the internet is not so nice a place as to finally understand that no, xml is not an acceptable document format.

Enter this neat little tool, a single interface to any self-describing format supported by our gymnast friend, serde.

Features

Works with any self-describing format with a Serde implementation.
Suitable for large files.
Keeps track of some useful info for each type.
Keeps track of null/normal/missing/duplicate values separately.
Integrates with Schemars and json_typegen to produce types and json schema if needed.
There's a demo website here.

Usage

let data: &[u8] = b"true";

// Just pick your format, and deserialize InferredSchema as if it were a normal type.
let inferred: InferredSchema = serde_json::from_slice(data)?;
// let inferred: InferredSchema = serde_yaml::from_slice(data)?;
// let inferred: InferredSchema = serde_cbor::from_slice(data)?;
// let inferred: InferredSchema = toml::from_slice(data)?;
// let inferred: InferredSchema = rawbson::de::from_bytes(data)?;
// let inferred: InferredSchema = quick_xml::de::from_reader(data)?;

// InferredSchema is a wrapper around Schema
let schema: Schema = inferred.schema;
let expected: Schema = Schema::Boolean(Default::default());
assert!(schema.structural_eq(&expected));

// The wrapper is there so we can both do the magic above, and also store the data for later
let serialized_schema: String = serde_json::to_string_pretty(&schema)?;

That's it.

Check Schema to see what info you get, and targets to see the available integrations (which include code and json schema generation).

Advanced Usage

I know, I know, the internet is evil and has decided to plague you with not one, but thousands, maybe even millions, of files.

Unfortunately Serde relies on type information to work, so ~~there is nothing we can do about it~~ we can bring out the big guns: DeserializeSeed. It's everything you love about Serde, but with runtime state.

let a_lot_of_json_files: &[&str] = &[ "1", "2", "1000" ];
let mut iter = a_lot_of_json_files.iter();

if let Some(file) = iter.next() {
    // We use the first file to generate a new schema to work with.
    let mut inferred: InferredSchema = serde_json::from_str(file)?;

    // Then we iterate over the rest to expand the schema.
    for file in iter {
        let mut json_deserializer = serde_json::Deserializer::from_str(file);
        // DeserializeSeed is implemented on &mut InferredSchema
        // So here it borrows the data mutably and runs it against the deserializer.
        let () = inferred.deserialize(&mut json_deserializer)?;
    }

    // The result in this case would be a simple integer schema
    // that 'has met' the numbers 1, 2, and 100.
    let mut context: NumberContext<i128> = Default::default();
    context.aggregate(&1);
    context.aggregate(&2);
    context.aggregate(&1000);

    assert_eq!(inferred.schema, Schema::Integer(context));
}

Furthermore, if you need to generate separate schemas (for example to run the analysis on multiple threads) you can use the Coalesce trait to merge them after-the-fact.

I really wish I could convert that Schema in something, you know, actually useful.

You are in luck! You can check out here the integrations with json_typegen and Schemars to convert the analysis into useful files like Rust types and json schemas. You can also find a demo website here.

How does this work?

For the short story long go here, the juicy bit is that Serde is kind enough to let the format tell us what it is working with, we take it from there and construct a nice schema from that info.

Performance

These are not proper benchmarks, but should give a vague idea of the performance on a i7-7700HQ laptop (2017) laptop with the raw data already loaded into memory.

Size	wasm (MB/s)	native (MB/s)	Format	File #
~180MB	~20s (9)	~5s (36)	json	1
~650MB	~150s (4.3)	~50s (13)	json	1
~1.7GB	~470s (3.6)	~145s (11.7)	json	1
~2.1GB	^a	~182s (11.5)	json	1
~13.3GB^b		~810s (16.4)	xml	~200k

^a This one seems to go over some kind of browser limit when fetching the data in the Web Worker, I believe I would have to split large files to handle it.

^b ~2.7GB compressed. This one seems like it would be a worst-case scenario because it includes decompression overhead and the files had a section that was formatted text which resulted in crazy schemas. (The json pretty printed schema was almost 0.5GB!)

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github/workflows		.github/workflows
schema_analysis		schema_analysis
website		website
.gitignore		.gitignore
Cargo.toml		Cargo.toml
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Repository files navigation

schema_analysis

Universal-ish Schema Analysis

Features

Usage

Advanced Usage

I really wish I could convert that Schema in something, you know, actually useful.

How does this work?

Performance

About

Licenses found

Contributors 2

Languages

License

Licenses found

QuartzLibrary/schema_analysis

Folders and files

Latest commit

History

Repository files navigation

schema_analysis

Universal-ish Schema Analysis

Features

Usage

Advanced Usage

I really wish I could convert that Schema in something, you know, actually useful.

How does this work?

Performance

About

Resources

License

Licenses found

Stars

Watchers

Forks

Contributors 2

Languages