Skip to content

gerrymanoim/libpy_simdjson

Folders and files

NameName
Last commit message
Last commit date

Latest commit

ย 

History

85 Commits
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 
ย 

Repository files navigation

libpy simdjson

On Master Merge PyPI version

Python bindings for simdjson using libpy.

Requirements

  • OS: macOS>10.15, linux.
  • Compiler: gcc>=9, clang >= 10 (C 17 code)
  • Python: libpy>=0.2.3, numpy.

Install

pip install libpy-simdjson

Note: The installation of libpy (required by libpy_simdjson) will use the python executable to figure out information about your environment. If you are not using a virtual environment or python does not point to the Python installation you want to use (checked with which python and python --version) you must point to your Python executable using the PYTHON environment variable, i.e. PYTHON=python3 make or PYTHON=python3 pip3 install libpy. Additionally, make sure that your CC and CXX environment variables point to the correct compilers.

Usage

from pathlib import Path
import libpy_simdjson as json
doc = json.load(Path("twitter.json"))
# or json.load(b"twitter.json")
# or json.load("twitter.json")
# we also support `loads` for strings.

doc is an Object. Objects act as python dicts with special methods.

isinstance(doc, json.Object)
True

We can grab keys, get the length, grab items, and access specific keys:

len(doc)
2
doc.keys()
[b'statuses', b'search_metadata']
doc[b'search_metadata'].items()
[(b'completed_in', 0.087),
 (b'max_id', 505874924095815700),
 (b'max_id_str', b'505874924095815681'),
 (b'next_results',
  b'?max_id=505874847260352512&q=ไธ€&count=100&include_entities=1'),
 (b'query', b'ไธ€'),
 (b'refresh_url',
  b'?since_id=505874924095815681&q=ไธ€&include_entities=1'),
 (b'count', 100),
 (b'since_id', 0),
 (b'since_id_str', b'0')]

If you every want an actual python dictionary, use as_dict:

doc[b'search_metadata'].as_dict()
{b'completed_in': 0.087,
 b'max_id': 505874924095815700,
 b'max_id_str': b'505874924095815681',
 b'next_results': b'?max_id=505874847260352512&q=ไธ€&count=100&include_entities=1',
 b'query': b'ไธ€',
 b'refresh_url': b'?since_id=505874924095815681&q=ไธ€&include_entities=1',
 b'count': 100,
 b'since_id': 0,
 b'since_id_str': b'0'}

However, we also support JSON Pointer sytnax via at_pointer. This will be much faster if you know what you're looking for:

doc.at_pointer(b"/statuses/50/created_at")
b'Sun Aug 31 00:29:04  0000 2014'
doc.at_pointer(b"/statuses/50/text").decode()
'RT @Ang_Angel73: ้€ขๅ‚ใ€Œใใฃโ€ฆๅƒ•ใฎ็ง˜ใ‚ใ‚‰ใ‚Œใ—ๅณ็›ฎใŒโ€ฆ๏ผใ€\nไธ€ๅŒใ€Œโ€ฆโ€ฆโ€ฆโ€ฆโ€ฆใ€‚ใ€'

Let's look at statuses

statuses = doc[b'statuses']

statuses is an Array. Arrays act like python lists with special methods.

Note: statuses and doc share a single parser instance. We cannot parse a new document while these objects are alive (though we can create new parsers via libpy_simdjson.Parser.load.

isinstance(statuses, json.Array)
True

Arrays support length, indexing, iteration:

len(statuses)
100
statuses[0][b'text'].decode()
'@aym0566x \n\nๅๅ‰:ๅ‰็”ฐใ‚ใ‚†ใฟ\n็ฌฌไธ€ๅฐ่ฑก:ใชใ‚“ใ‹ๆ€–ใฃ๏ผ\nไปŠใฎๅฐ่ฑก:ใจใ‚Šใ‚ใˆใšใ‚ญใƒขใ„ใ€‚ๅ™›ใฟๅˆใ‚ใชใ„\nๅฅฝใใชใจใ“ใ‚:ใถใ™ใงใ‚ญใƒขใ„ใจใ“๐Ÿ˜‹โœจโœจ\nๆ€ใ„ๅ‡บ:ใ‚“ใƒผใƒผใƒผใ€ใ‚ใ‚Šใ™ใŽ๐Ÿ˜Šโค๏ธ\nLINEไบคๆ›ใงใใ‚‹๏ผŸ:ใ‚ใโ€ฆโ€ฆใ”ใ‚ใ‚“โœ‹\nใƒˆใƒ—็”ปใ‚’ใฟใฆ:็…งใ‚Œใพใ™ใŒใช๐Ÿ˜˜โœจ\nไธ€่จ€:ใŠๅ‰ใฏไธ€็”Ÿใ‚‚ใ‚“ใฎใƒ€ใƒ๐Ÿ’–'
for status in statuses:
    # this is a bad example but you get the picture
    if status[b'id'] % 2 == 0:
        print(status[b"text"].decode())
        break
else:
    print("no even ids?")
@aym0566x

ๅๅ‰:ๅ‰็”ฐใ‚ใ‚†ใฟ
็ฌฌไธ€ๅฐ่ฑก:ใชใ‚“ใ‹ๆ€–ใฃ๏ผ
ไปŠใฎๅฐ่ฑก:ใจใ‚Šใ‚ใˆใšใ‚ญใƒขใ„ใ€‚ๅ™›ใฟๅˆใ‚ใชใ„
ๅฅฝใใชใจใ“ใ‚:ใถใ™ใงใ‚ญใƒขใ„ใจใ“๐Ÿ˜‹โœจโœจ
ๆ€ใ„ๅ‡บ:ใ‚“ใƒผใƒผใƒผใ€ใ‚ใ‚Šใ™ใŽ๐Ÿ˜Šโค๏ธ
LINEไบคๆ›ใงใใ‚‹๏ผŸ:ใ‚ใโ€ฆโ€ฆใ”ใ‚ใ‚“โœ‹
ใƒˆใƒ—็”ปใ‚’ใฟใฆ:็…งใ‚Œใพใ™ใŒใช๐Ÿ˜˜โœจ
ไธ€่จ€:ใŠๅ‰ใฏไธ€็”Ÿใ‚‚ใ‚“ใฎใƒ€ใƒ๐Ÿ’–

If you need to you can convert and Array to a list using as_list:

statuses.as_list()[1][b'metadata']
{b'result_type': b'recent', b'iso_language_code': b'ja'}

However, just like for Objects, we support JSON Pointers via at_pointer, which is much faster:

statuses.at_pointer(b"/33/created_at")
b'Sun Aug 31 00:29:06  0000 2014'

Benchmarks

Note - unlike most other python JSON parsers, libpy_simdjson will, by design, avoid converting to native python types until as late as possible, providing you with Object and Array objects instead. libpy allows you to work with these proxy objects as if they were actual python objects without incurring the cost of object conversion until actually needed. Because the C simdjson library is so effficient, converting to Python objects is by far the slowest part of parsing, so we strive to do this as late and on as few fields as possible.

See the (still WIP) "overhead over python dict access" benchmarks for object conversion overhead.

For the sake of comparison, we also benchmark a full Python object conversion in libpy_simdjson_as_py_obj though this is very much not the intended use case.


------------------------------------------------------------ benchmark 'Load /home/runner/work/libpy_simdjson/libpy_simdjson/libpy_simdjson/tests/jsonexamples/canada.json': 8 tests -------------------------------------------------------------
Name (time in ms)                                                                    Min                Max               Mean            StdDev             Median                IQR            Outliers       OPS            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_load[path0-libpy_simdjson-loads]                                   3.5037 (1.0)      10.9591 (1.0)       4.3319 (1.0)      0.6166 (1.0)       4.2607 (1.0)       0.2292 (1.0)         11;15  230.8454 (1.0)         162           1
test_benchmark_load[path0-pysimdjson-parse]                                       3.6885 (1.05)     11.1368 (1.02)      4.4029 (1.02)     0.7765 (1.26)      4.2611 (1.00)      0.5017 (2.19)          8;5  227.1254 (0.98)        118           1
test_benchmark_load[path0-pysimdjson_as_py_obj-loads]                            13.5282 (3.86)     37.4092 (3.41)     24.2264 (5.59)     6.5177 (10.57)    27.0374 (6.35)     11.9384 (52.08)        16;0   41.2773 (0.18)         40           1
test_benchmark_load[path0-libpy_simdjson_as_py_obj-libpy_simdjson_as_py_obj]     13.5544 (3.87)     44.5382 (4.06)     22.4503 (5.18)     7.1879 (11.66)    25.0067 (5.87)     12.0174 (52.43)        12;0   44.5427 (0.19)         35           1
test_benchmark_load[path0-orjson-loads]                                          16.1693 (4.61)     37.2228 (3.40)     25.2505 (5.83)     6.8427 (11.10)    27.4105 (6.43)     12.5579 (54.79)        19;0   39.6032 (0.17)         41           1
test_benchmark_load[path0-ujson-loads]                                           22.0310 (6.29)     45.6815 (4.17)     32.3445 (7.47)     7.0874 (11.49)    35.0020 (8.22)     12.6422 (55.15)        12;0   30.9171 (0.13)         27           1
test_benchmark_load[path0-python_json-loads]                                     49.6505 (14.17)    72.4977 (6.62)     62.0533 (14.32)    7.2277 (11.72)    64.0998 (15.04)    12.7639 (55.68)         8;0   16.1152 (0.07)         19           1
test_benchmark_load[path0-rapidjson-loads]                                       50.3836 (14.38)    76.2291 (6.96)     61.1768 (14.12)    7.7522 (12.57)    63.4637 (14.90)    12.2982 (53.65)         6;0   16.3461 (0.07)         17           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



--------------------------------------------------------------------- benchmark 'Load /home/runner/work/libpy_simdjson/libpy_simdjson/libpy_simdjson/tests/jsonexamples/citm_catalog.json': 8 tests ---------------------------------------------------------------------
Name (time in us)                                                                        Min                    Max                   Mean                StdDev                 Median                   IQR            Outliers       OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_load[path3-libpy_simdjson-loads]                                     869.1500 (1.0)       5,930.8430 (2.49)      1,123.0447 (1.0)        203.5337 (1.52)      1,112.6140 (1.0)         81.2050 (1.02)        29;35  890.4365 (1.0)         722           1
test_benchmark_load[path3-pysimdjson-parse]                                         875.8530 (1.01)      2,384.6440 (1.0)       1,127.0993 (1.00)       133.4820 (1.0)       1,119.4185 (1.01)        79.8040 (1.0)         72;38  887.2333 (1.00)        786           1
test_benchmark_load[path3-libpy_simdjson_as_py_obj-libpy_simdjson_as_py_obj]      6,227.8560 (7.17)     23,751.5410 (9.96)     10,109.4607 (9.00)     5,149.2095 (38.58)     7,506.5770 (6.75)     1,209.3185 (15.15)       27;29   98.9172 (0.11)        120           1
test_benchmark_load[path3-orjson-loads]                                           6,445.2870 (7.42)     23,814.3440 (9.99)     10,414.0837 (9.27)     4,872.0612 (36.50)     7,897.2230 (7.10)     2,016.7280 (25.27)       26;26   96.0238 (0.11)        114           1
test_benchmark_load[path3-pysimdjson_as_py_obj-loads]                             7,940.2930 (9.14)     24,984.5340 (10.48)    11,900.6726 (10.60)    5,518.1670 (41.34)     8,878.9950 (7.98)     3,686.7190 (46.20)       21;21   84.0289 (0.09)         90           1
test_benchmark_load[path3-ujson-loads]                                            8,323.1930 (9.58)     23,228.4760 (9.74)     12,104.0124 (10.78)    5,363.4790 (40.18)     9,344.6530 (8.40)       802.3722 (10.05)       19;20   82.6172 (0.09)         83           1
test_benchmark_load[path3-python_json-loads]                                     12,697.9430 (14.61)    31,279.0080 (13.12)    16,691.7376 (14.86)    5,163.8324 (38.69)    14,104.4160 (12.68)    2,390.4543 (29.95)       17;17   59.9099 (0.07)         75           1
test_benchmark_load[path3-rapidjson-loads]                                       14,025.9210 (16.14)    29,509.2470 (12.37)    17,785.5409 (15.84)    5,061.8217 (37.92)    15,456.7050 (13.89)    1,756.3833 (22.01)         7;7   56.2254 (0.06)         35           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------



--------------------------------------------------------------- benchmark 'Load /home/runner/work/libpy_simdjson/libpy_simdjson/libpy_simdjson/tests/jsonexamples/github_events.json': 8 tests ---------------------------------------------------------------
Name (time in us)                                                                     Min                   Max                Mean              StdDev              Median                IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_load[path2-libpy_simdjson-loads]                                   25.4010 (1.0)      1,960.0230 (1.28)      36.2258 (1.0)       24.5956 (1.0)       34.6020 (1.0)       2.9500 (1.04)      72;1589       27.6046 (1.0)       12642           1
test_benchmark_load[path2-pysimdjson-parse]                                       25.8010 (1.02)     1,528.8010 (1.0)       37.9300 (1.05)      26.3332 (1.07)      37.3020 (1.08)      2.8010 (1.0)      189;1856       26.3644 (0.96)      16103           1
test_benchmark_load[path2-libpy_simdjson_as_py_obj-libpy_simdjson_as_py_obj]     142.1090 (5.59)     3,436.0130 (2.25)     207.5540 (5.73)      88.5276 (3.60)     204.8130 (5.92)     17.6010 (6.28)       58;289        4.8180 (0.17)       4034           1
test_benchmark_load[path2-orjson-loads]                                          175.7120 (6.92)     5,736.6740 (3.75)     254.9315 (7.04)     162.0033 (6.59)     239.3160 (6.92)     20.8005 (7.43)       26;226        3.9226 (0.14)       2872           1
test_benchmark_load[path2-pysimdjson_as_py_obj-loads]                            224.7150 (8.85)     2,283.7510 (1.49)     321.5699 (8.88)      72.4103 (2.94)     324.0220 (9.36)     26.0020 (9.28)       79;180        3.1097 (0.11)       2632           1
test_benchmark_load[path2-ujson-loads]                                           301.7190 (11.88)    7,409.5770 (4.85)     375.2140 (10.36)    180.4753 (7.34)     363.8240 (10.51)    29.0010 (10.35)      18;175        2.6651 (0.10)       2269           1
test_benchmark_load[path2-python_json-loads]                                     330.5200 (13.01)    2,521.5590 (1.65)     459.9910 (12.70)     86.1825 (3.50)     455.8290 (13.17)    38.7277 (13.83)     119;146        2.1740 (0.08)       1909           1
test_benchmark_load[path2-rapidjson-loads]                                       380.6250 (14.98)    2,082.2340 (1.36)     533.6130 (14.73)     91.0874 (3.70)     529.9340 (15.32)    43.7030 (15.60)       76;95        1.8740 (0.07)       1709           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------




----------------------------------------------------------------------- benchmark 'Load /home/runner/work/libpy_simdjson/libpy_simdjson/libpy_simdjson/tests/jsonexamples/mesh.json': 8 tests ------------------------------------------------------------------------
Name (time in us)                                                                       Min                    Max                   Mean                StdDev                 Median                 IQR            Outliers       OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_load[path4-libpy_simdjson-loads]                                    994.5520 (1.0)       4,240.8210 (1.06)      1,202.4742 (1.0)        194.6273 (1.04)      1,190.2620 (1.0)       93.9050 (1.31)        80;42  831.6187 (1.0)         772           1
test_benchmark_load[path4-pysimdjson-parse]                                      1,027.3570 (1.03)      4,001.3210 (1.0)       1,257.5495 (1.05)       187.9555 (1.0)       1,253.6190 (1.05)      71.9545 (1.0)         37;50  795.1973 (0.96)        804           1
test_benchmark_load[path4-pysimdjson_as_py_obj-loads]                            2,782.4560 (2.80)     16,753.7350 (4.19)      3,907.4828 (3.25)     2,316.1314 (12.32)     3,456.5430 (2.90)     264.9645 (3.68)         9;18  255.9192 (0.31)        252           1
test_benchmark_load[path4-libpy_simdjson_as_py_obj-libpy_simdjson_as_py_obj]     2,936.7520 (2.95)     19,099.6830 (4.77)      4,066.0482 (3.38)     2,579.4705 (13.72)     3,459.0780 (2.91)     291.3400 (4.05)        10;21  245.9390 (0.30)        207           1
test_benchmark_load[path4-orjson-loads]                                          3,848.3120 (3.87)     16,809.8280 (4.20)      4,853.2711 (4.04)     2,387.4255 (12.70)     4,348.8370 (3.65)     282.5400 (3.93)         8;13  206.0466 (0.25)        203           1
test_benchmark_load[path4-ujson-loads]                                           4,224.0310 (4.25)     27,094.9610 (6.77)      5,393.1933 (4.49)     2,651.7927 (14.11)     4,906.3150 (4.12)     368.6210 (5.12)         7;13  185.4189 (0.22)        196           1
test_benchmark_load[path4-python_json-loads]                                     7,946.7190 (7.99)     21,914.9540 (5.48)      9,216.2188 (7.66)     2,458.6174 (13.08)     8,729.1600 (7.33)     425.9220 (5.92)          4;8  108.5044 (0.13)        108           1
test_benchmark_load[path4-rapidjson-loads]                                       8,958.0800 (9.01)     26,303.9020 (6.57)     10,523.4513 (8.75)     2,727.4495 (14.51)    10,081.1370 (8.47)     748.3645 (10.40)         4;4   95.0259 (0.11)        101           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------- benchmark 'Load /home/runner/work/libpy_simdjson/libpy_simdjson/libpy_simdjson/tests/jsonexamples/twitter.json': 8 tests ----------------------------------------------------------------------
Name (time in us)                                                                       Min                    Max                  Mean                StdDev                Median                 IQR            Outliers         OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_load[path1-libpy_simdjson-loads]                                    329.3220 (1.0)       3,357.5250 (2.67)       440.8892 (1.0)         84.7542 (1.65)       441.1300 (1.0)       37.4280 (1.22)        64;75  2,268.1435 (1.0)        1637           1
test_benchmark_load[path1-pysimdjson-parse]                                        348.7250 (1.06)      1,258.3910 (1.0)        451.6477 (1.02)        51.2426 (1.0)        450.8825 (1.02)      30.8030 (1.0)       374;155  2,214.1149 (0.98)       2006           1
test_benchmark_load[path1-libpy_simdjson_as_py_obj-libpy_simdjson_as_py_obj]     2,206.2470 (6.70)     14,217.4510 (11.30)    2,759.9331 (6.26)     1,386.3121 (27.05)    2,551.0705 (5.78)     312.4220 (10.14)         2;3    362.3276 (0.16)         74           1
test_benchmark_load[path1-orjson-loads]                                          2,639.8850 (8.02)     15,075.4730 (11.98)    3,420.6474 (7.76)     1,657.3556 (32.34)    3,183.0750 (7.22)     215.1150 (6.98)         6;15    292.3423 (0.13)        270           1
test_benchmark_load[path1-ujson-loads]                                           3,304.1320 (10.03)    18,557.3880 (14.75)    4,286.1597 (9.72)     1,868.0725 (36.46)    4,021.4820 (9.12)     283.4200 (9.20)         5;14    233.3091 (0.10)        239           1
test_benchmark_load[path1-pysimdjson_as_py_obj-loads]                            3,319.1400 (10.08)    16,355.6780 (13.00)    4,237.6133 (9.61)     1,642.9272 (32.06)    3,982.8870 (9.03)     306.1198 (9.94)         6;16    235.9819 (0.10)        259           1
test_benchmark_load[path1-python_json-loads]                                     4,154.6810 (12.62)    17,041.0680 (13.54)    5,474.9570 (12.42)    1,520.8511 (29.68)    5,262.9090 (11.93)    484.4330 (15.73)         6;8    182.6498 (0.08)        190           1
test_benchmark_load[path1-rapidjson-loads]                                       5,184.5590 (15.74)    19,217.2170 (15.27)    6,598.9679 (14.97)    1,949.8404 (38.05)    6,315.0330 (14.32)    682.2923 (22.15)         5;7    151.5388 (0.07)        149           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


-------------------------------------------------------------------------------------------- benchmark 'Random attribute access': 2 tests --------------------------------------------------------------------------------------------
Name (time in ns)                                  Min                       Max                  Mean                StdDev                Median                 IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_at[python_json-loads]          800.0000 (1.0)      1,560,881.0000 (6.79)     1,190.4413 (1.0)      6,373.7290 (3.79)     1,100.0000 (1.0)      100.0000 (1.0)      122;3966      840.0246 (1.0)       88496           1
test_benchmark_at[libpy_simdjson-loads]     1,300.0000 (1.63)       229,812.0000 (1.0)      1,848.6849 (1.55)     1,683.4805 (1.0)      1,800.0000 (1.64)     400.0000 (4.00)      143;234      540.9251 (0.64)      57801           1
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------


---------------------------------------------------------------------------------------------------- benchmark 'Random list access': 2 tests ----------------------------------------------------------------------------------------------------
Name (time in ns)                                         Min                       Max                   Mean                StdDev                 Median                   IQR            Outliers  OPS (Kops/s)            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_benchmark_list_access[python_json-loads]        700.0000 (1.0)        258,213.0000 (1.0)       1,061.7637 (1.0)      1,165.4286 (1.0)       1,000.0000 (1.0)        200.0000 (1.0)      570;5297      941.8292 (1.0)      149232           1
test_benchmark_list_access[libpy_simdjson-loads]     700.0000 (1.00)     1,340,168.0000 (5.19)     10,658.8974 (10.04)    9,211.5111 (7.90)     10,301.0000 (10.30)    9,300.0000 (46.50)    4637;493       93.8183 (0.10)      81961           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Legend:
  Outliers: 1 Standard Deviation from Mean; 1.5 IQR (InterQuartile Range) from 1st Quartile and 3rd Quartile.
  OPS: Operations Per Second, computed as 1 / Mean