Csv Schema


Csv Schema is a library that helps you build Ecto.Schema-like modules that use a csv file as their data source.

The idea behind this library is to let you create, at compile time, a self-contained module exposing functions to retrieve data read from a CSV.

Installation

The package can be installed by adding :csv_schema to your list of dependencies in mix.exs:

def deps do
  [
    {:csv_schema, "~> 0.2.8"}
  ]
end

Usage

Suppose you have a csv file that looks like this:

| id  | first_name | last_name  | email             | gender | ip_address      | date_of_birth |
| --- | ---------- | ---------- | ----------------- | ------ | --------------- | ------------- |
| 1   | Ivory      | Overstreet | [email protected] | Female | 30.138.91.62    | 10/22/2018    |
| 2   | Ulick      | Vasnev     | [email protected] | Male   | 35.15.164.70    | 01/19/2018    |
| 3   | Chloe      | Freemantle | [email protected] | Female | 133.133.113.255 | 08/13/2018    |
| ... | ...        | ...        | ...               | ...    | ...             | ...           |

It is possible to create an Ecto.Schema-like repository using the Csv.Schema macro:

defmodule Person do
  use Csv.Schema
  alias Csv.Schema.Parser

  schema path: "path/to/person.csv" do
    field :id, "id"
    field :first_name, "first_name", filter_by: true
    field :last_name, "last_name", sort: :asc
    field :identifier, ["first_name", "last_name"], key: true, join: " "
    field :email, "email", unique: true
    field :gender, "gender", filter_by: true, sort: :desc
    field :ip_address, "ip_address"
    field :date_of_birth, "date_of_birth", parser: &Parser.date!(&1, "{0M}/{0D}/{0YYYY}")
  end
end

It is also possible to define the schema with the data: param, in order to generate content directly from a string:

@data """
id,first_name,last_name,email,gender,ip_address,date_of_birth
1,Ivory,Overstreet,[email protected],Female,30.138.91.62,10/22/2018
2,Ulick,Vasnev,[email protected],Male,35.15.164.70,01/19/2018
3,Chloe,Freemantle,[email protected],Female,133.133.113.255,08/13/2018
"""

schema data: @data do
...
end
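
Wrapped in a module, the data: option could be used like this (a minimal sketch with a made-up inline csv):

defmodule Country do
  use Csv.Schema

  # Hypothetical inline csv, only meant to illustrate the data: option
  @data """
  code,name
  IT,Italy
  FR,France
  """

  schema data: @data do
    field :code, "code", key: true
    field :name, "name", filter_by: true
  end
end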

Note that it is not required to map all fields, but every mapped field must have a corresponding column in the csv file. For example, the following field configuration will result in a compilation error:

field :id, "non_existing_id", ...

The schema can be configured with a custom separator (the default is ?,):

use Csv.Schema, separator: ?,
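
For instance, a semicolon-separated file could be read like this (a sketch; the file path and columns are hypothetical):

defmodule Product do
  # product.csv is assumed to use ";" between columns, e.g. "id;name"
  use Csv.Schema, separator: ?;

  schema path: "path/to/product.csv" do
    field :id, "id", key: true
    field :name, "name"
  end
end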

Moreover, it's possible to configure whether or not the csv file has a header. Depending on the header param value, the field configuration changes:

# Default header value is `true`
use Csv.Schema
# Csv with header
schema path: "path/to/person.csv" do
  field :id, "id", key: true
  ...
end

# Csv without header. Note that field 1 is bound to the first csv column.
use Csv.Schema, header: false
# Index goes from 1 to N
schema path: "path/to/person.csv" do
  field :id, 1, key: true
  ...
end

The Person module now defines a struct like this:

defmodule Person do
  defstruct id: nil,
            first_name: nil,
            last_name: nil,
            email: nil,
            gender: nil,
            ip_address: nil,
            date_of_birth: nil
end

The macro also creates the following functions inside the Person module:

def by_identifier(string_value), do: ...

def filter_by_first_name(string_value), do: ...

def by_email(string_value), do: ...

def filter_by_gender(string_value), do: ...

def get_all, do: ...

Where:

  • by_identifier returns a %Person{} or nil if the given key is not present in the csv
  • filter_by_first_name returns a list like [%Person{}, %Person{}, ...], or [] if no person matches the given first name
  • by_email returns a %Person{} or nil if no person in the csv has the given email
  • filter_by_gender returns a list like [%Person{}, %Person{}, ...], or [] if no person matches the given gender
  • get_all returns all csv rows as a Stream (see the usage sketch below)
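
A minimal usage sketch, assuming the Person module above was compiled against the example csv (return values are illustrative):

Person.by_identifier("Ivory Overstreet")
#=> %Person{first_name: "Ivory", last_name: "Overstreet", gender: "Female", ...}

Person.by_identifier("Missing Person")
#=> nil

Person.filter_by_gender("Female")
#=> [%Person{...}, %Person{...}]

Person.filter_by_first_name("Nobody")
#=> []

Person.get_all() |> Enum.count()
#=> total number of rows in the csv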

Field configuration

Every field declaration has this form:

field {struct_field}, {csv_header}, {opts}

where:

  • {struct_field} is the struct field name. It can be given as a string or as an atom
  • {csv_header} is the csv column the values are read from. It must be a string when the csv has a header, or a 1-based column index when header: false
  • {opts} is a keyword list containing extra configuration options

opts:

  • :key : boolean. At most one field can be marked as key. If set to true, the by_{name} function is created for you.
  • :unique : boolean. If set to true, the by_{name} function is created for you. All csv values for this field must be unique, otherwise an exception is raised.
  • :filter_by : boolean. If set to true, the filter_by_{name} function is created.
  • :parser : function. An arity-1 function used to map values from string to a custom type (see the sketch after this list).
  • :sort : :asc or :desc. Sorts values according to Erlang's term ordering, with the exception that nil is treated as the greatest value (number < atom < reference < fun < port < pid < tuple < list < bit-string < nil).
  • :join : string. If present, the values of the given columns are joined into a single binary using this string as separator.

Note that every configuration option is optional.
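
For example, any arity-1 function can be used as a :parser. The declarations below are an illustrative sketch (these parsers are not part of the library):

# Hypothetical field declarations showing custom parsers:
# the raw id string is converted to an integer, the gender value is normalized to lowercase.
field :id, "id", parser: &String.to_integer/1
field :gender, "gender", filter_by: true, parser: &String.downcase/1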

Keep in mind

Compilation time increases quickly if the csv contains many rows and you configure multiple fields that generate functions (key, unique and/or filter_by flags set to true).

Because "without data you're just another person with an opinion" here some data:

Compilation time

| csv rows | key   | unique | filter_by | compile time  |
| -------- | ----- | ------ | --------- | ------------- |
| 1_000    | false | 0      | 0         | 301_727 µs    |
| 1_000    | false | 2      | 0         | 352_522 µs    |
| 1_000    | false | 0      | 4         | 318_225 µs    |
| 1_000    | true  | 0      | 0         | 334_240 µs    |
| 1_000    | true  | 1      | 1         | 348_697 µs    |
| 1_000    | true  | 2      | 0         | 406_367 µs    |
| 1_000    | true  | 0      | 4         | 385_850 µs    |
| 1_000    | true  | 2      | 2         | 414_617 µs    |
| 1_000    | true  | 2      | 4         | 446_155 µs    |
| 5_000    | false | 0      | 0         | 2_734_565 µs  |
| 5_000    | false | 2      | 0         | 3_450_438 µs  |
| 5_000    | false | 0      | 4         | 3_464_593 µs  |
| 5_000    | true  | 0      | 0         | 3_084_923 µs  |
| 5_000    | true  | 1      | 1         | 3_795_718 µs  |
| 5_000    | true  | 2      | 0         | 3_752_112 µs  |
| 5_000    | true  | 0      | 4         | 3_387_067 µs  |
| 5_000    | true  | 2      | 2         | 3_839_068 µs  |
| 5_000    | true  | 2      | 4         | 4_113_228 µs  |
| 10_000   | false | 0      | 0         | 6_889_505 µs  |
| 10_000   | false | 2      | 0         | 8_667_683 µs  |
| 10_000   | false | 0      | 4         | 8_606_961 µs  |
| 10_000   | true  | 0      | 0         | 7_892_421 µs  |
| 10_000   | true  | 1      | 1         | 8_449_838 µs  |
| 10_000   | true  | 2      | 0         | 9_507_693 µs  |
| 10_000   | true  | 0      | 4         | 10_339_080 µs |
| 10_000   | true  | 2      | 2         | 10_518_744 µs |
| 10_000   | true  | 2      | 4         | 10_480_884 µs |

Execution time

| csv rows | key  | unique | filter_by | by avg     | by tot    | filter_by avg | filter_by tot |
| -------- | ---- | ------ | --------- | ---------- | --------- | ------------- | ------------- |
| 1_000    | true | 1      | 1         | 0.74 µs/op | 74_412 µs | 0.89 µs/op    | 89_275 µs     |
| 5_000    | true | 1      | 1         | 0.79 µs/op | 79_776 µs | 1.18 µs/op    | 118_786 µs    |
| 10_000   | true | 1      | 1         | 0.78 µs/op | 78_908 µs | 1.83 µs/op    | 183_642 µs    |

Execution details

Executed on my machine:

Lenovo Thinkpad T480
CPU: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
RAM: 32GB

Try yourself

If you'd like to run the compilation benchmarks yourself:

iex -S mix
c "benchmark/timings.exs"

Copyright and License

Copyright (c) 2019 PrimaIt

This work is free. You can redistribute it and/or modify it under the terms of the MIT License. See the LICENSE.md file for more details.