Python implementation of Gunnar's 1 billion row challenge:
First install the Python requirements:
python3 -m pip install -r requirements.txt
The script createMeasurements.py creates the measurement file:
usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (Default is measurements.txt)
  -r RECORDS, --records RECORDS
                        Number of records to create (Default is 1000000000)
Example:
% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
- Wrote 10,000,000 measurements in 8.92 seconds
- Wrote 20,000,000 measurements in 17.82 seconds
- Wrote 30,000,000 measurements in 26.73 seconds
- Wrote 40,000,000 measurements in 35.54 seconds
- Wrote 50,000,000 measurements in 44.36 seconds
- Wrote 60,000,000 measurements in 53.07 seconds
.
.
.
- Wrote 980,000,000 measurements in 880.98 seconds
- Wrote 990,000,000 measurements in 889.99 seconds
Created file 'measurements.txt' with 1,000,000,000 measurements in 898.92 seconds
Be patient: generating the file can take more than 15 minutes.
Speeding up the generation of the measurements file could be another challenge 🙂
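One possible direction for that challenge is to build lines in batches and write each batch with a single call instead of issuing one write per record. The sketch below is not taken from createMeasurements.py: it assumes the `station;temperature` line format of the original challenge, and the station list, the `write_measurements` helper, and its `batch_size` and `seed` parameters are hypothetical names chosen for illustration.

```python
import os
import random
import tempfile

# Hypothetical station list; the real script uses its own set of stations.
STATIONS = ["Hamburg", "Lisbon", "Nairobi", "Oslo", "Sydney"]


def write_measurements(path, records, batch_size=10_000, seed=None):
    """Write `records` lines of 'station;temperature' to `path` in batches."""
    rng = random.Random(seed)
    written = 0
    with open(path, "w", encoding="utf-8") as f:
        while written < records:
            n = min(batch_size, records - written)
            # Pick all station names for this batch in one call.
            names = rng.choices(STATIONS, k=n)
            # Build the whole batch in memory, then write it with one call.
            lines = [f"{name};{rng.uniform(-99.9, 99.9):.1f}\n" for name in names]
            f.write("".join(lines))
            written += n
    return written


if __name__ == "__main__":
    out = os.path.join(tempfile.gettempdir(), "measurements_sample.txt")
    count = write_measurements(out, 100_000, seed=1)
    print(f"Wrote {count:,} measurements to {out}")
```

Whether this helps, and by how much, depends on the interpreter and the disk, so it is worth timing against the existing script before drawing conclusions.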
Interpreter | Script | user | system | cpu | total |
---|---|---|---|---|---|
python3 | calculateAveragePolars.py | 77.84 | 3.64 | 703% | 11.585 |
pypy3 | calculateAveragePypy.py | 135.25 | 2.92 | 735% | 18.782 |
python3 | calculateAverageDuckDB.py | 186.78 | 4.21 | 806% | 23.673 |
pypy3 | calculateAverage.py | 242.89 | 6.28 | 780% | 31.926 |
python3 | calculateAverage.py | 329.20 | 3.77 | 793% | 41.941 |
python3 | calculateAveragePypy.py | 510.93 | 1.88 | 793% | 64.660 |
The script calculateAveragePolars.py was suggested by Taufan in this post.
The script calculateAveragePypy.py was created by donalm. It is an improved version of the initial script (calculateAverage.py) that is roughly 2x faster when run under pypy3, and it even beats the DuckDB implementation (calculateAverageDuckDB.py).
Olivier Scalbert made a simple but remarkably effective suggestion that improved performance by 15% on average (the table above has been updated). Thank you 🙂
His suggestion was to change from:
if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1
to:
_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1
Python can be surprising sometimes.
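A plausible explanation is that every `result[location]` expression is a separate dictionary lookup, so binding the inner list to a local name once per row saves several lookups. The sketch below is one way to measure the effect in isolation. The function names and sample data are made up for illustration, and the `[min, max, sum, count]` layout of each entry, with the sum and count accumulated via `+=`, is an assumption inferred from the snippet above. Timings will vary by interpreter and machine.

```python
import timeit


def update_repeated_lookup(result, rows):
    """Update [min, max, sum, count] with a dict lookup on every access."""
    for location, measurement in rows:
        if measurement < result[location][0]:
            result[location][0] = measurement
        if measurement > result[location][1]:
            result[location][1] = measurement
        result[location][2] += measurement
        result[location][3] += 1


def update_local_alias(result, rows):
    """Same update, but look the entry up only once per row."""
    for location, measurement in rows:
        _result = result[location]
        if measurement < _result[0]:
            _result[0] = measurement
        if measurement > _result[1]:
            _result[1] = measurement
        _result[2] += measurement
        _result[3] += 1


def fresh_result(locations):
    # Assumed layout per station: [min, max, sum, count].
    return {name: [float("inf"), float("-inf"), 0.0, 0] for name in locations}


if __name__ == "__main__":
    locations = ["Hamburg", "Lisbon", "Nairobi"]  # hypothetical sample stations
    rows = [(locations[i % 3], (i % 200) / 10 - 10.0) for i in range(200_000)]
    for func in (update_repeated_lookup, update_local_alias):
        seconds = timeit.timeit(
            lambda: func(fresh_result(locations), rows), number=5
        )
        print(f"{func.__name__}: {seconds:.3f} s")
```

Both functions produce identical results; only the number of dictionary lookups per row differs, which is what the timing comparison isolates.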