Data for research on automatic cloud configuration. The benchmarks are part of the HiBench benchmark suite. Please refer to the benchmark repository for more information about the benchmarks and the input datasets.
This is the dataset accompanying the GitHub repository for our research work:
Do the Best Cloud Configurations Grow on Trees? An Experimental Evaluation of Black Box Algorithms for Optimizing Cloud Workloads.
Muhammad Bilal, Marco Serafini, Marco Canini and Rodrigo Rodrigues.
Proceedings of the VLDB Endowment, 13(11).
The directories in the dataset folder use the following naming scheme:

`<num of instances>_<instance type>_<benchmark>_<distributed framework>_<input size>_1`
For example, `1_c5.4xlarge_linear_spark_huge_1` is a directory that contains data from the run of the linear regression Spark benchmark using the huge input dataset on a single instance of type c5.4xlarge.
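The naming scheme above can be parsed programmatically. Below is a minimal sketch; `parse_run_dir` is a hypothetical helper (not part of the dataset), and it assumes that only the benchmark name may contain underscores, so the remaining fields are taken from the two ends of the name:

```python
def parse_run_dir(name):
    """Parse a run directory name of the form
    <num of instances>_<instance type>_<benchmark>_<framework>_<input size>_1."""
    parts = name.split("_")
    return {
        "num_instances": int(parts[0]),
        "instance_type": parts[1],
        # The benchmark name may itself contain underscores, so join the middle.
        "benchmark": "_".join(parts[2:-3]),
        "framework": parts[-3],
        "input_size": parts[-2],
    }

info = parse_run_dir("1_c5.4xlarge_linear_spark_huge_1")
# info["benchmark"] is "linear", info["instance_type"] is "c5.4xlarge"
```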
Each directory contains the following types of files:

- `bench.log`: contains the Spark output logs of the benchmark run
- `error.log`: contains either `1` or `0`, indicating whether or not there is an error in the Spark output logs
- `log`: contains the Spark output log as well as some other information from the script that ran the benchmark
- `report.json`: contains the following fields:
  - `completed`: whether the benchmark completed successfully
  - `datasize`: input data size
  - `elapsed_time`: execution time of the benchmark on this cloud configuration
  - `wall_time`: same as `elapsed_time` in this case
  - `framework`: distributed framework used for the benchmark
  - `workload`: benchmark name
- `sar_node*.csv`: the low-level metrics collected using the `sar` utility. Each node of the cloud configuration has its own metrics file; the files are enumerated from `sar_node1.csv` to `sar_node<n>.csv`, where `n` is the total number of nodes in the configuration.
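A run's summary can be loaded from these files with the standard library. The sketch below uses only the field names documented above; `load_run` is a hypothetical helper, and the synthetic files it is demonstrated on use illustrative values, not values from the actual dataset:

```python
import json
import tempfile
from pathlib import Path

def load_run(run_dir):
    """Read report.json and error.log from a single run directory."""
    run_dir = Path(run_dir)
    report = json.loads((run_dir / "report.json").read_text())
    # error.log holds "1" if the Spark output logs contained an error, else "0".
    has_error = (run_dir / "error.log").read_text().strip() == "1"
    return report, has_error

# Demonstration with synthetic files; in practice run_dir would be a
# directory such as 1_c5.4xlarge_linear_spark_huge_1.
tmp = Path(tempfile.mkdtemp())
(tmp / "report.json").write_text(json.dumps({
    "completed": True, "datasize": "huge", "elapsed_time": 123.4,
    "wall_time": 123.4, "framework": "spark", "workload": "linear",
}))
(tmp / "error.log").write_text("0\n")

report, has_error = load_run(tmp)
# report["framework"] is "spark"; has_error is False
```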
The dataset has been compressed in order to upload it to GitHub.

- Clone the GitHub repository: `git clone https://github.com/MBtech/bbo-arena-dataset.git`
- Uncompress the dataset: `./uncompress.sh`