An Empirical Study of Container Image Configurations and Their Impact on Start Times (Container Image Data)

DOI 10.5281/zenodo.7602500 (Zenodo), MaRDI QID Q6722291

Authors Nikolas Herbst, Kyle Chard, Ian Foster, Robert Leppich, Samuel Kounev, Martin Straesser, André Bauer

Publication date 3 February 2023

Copyright license Creative Commons Attribution 4.0 International

Dataset with the container image metadata used for our IEEE/ACM CCGRID 2023 paper "An Empirical Study of Container Image Configurations and Their Impact on Start Times".

Abstract of the paper: A core selling point of application containers is their fast start times compared to other virtualization approaches like virtual machines. Predictable and fast container start times are crucial for improving and guaranteeing the performance of containerized cloud, serverless, and edge applications. While previous work has investigated container starts, there remains a lack of understanding of how start times may vary across container configurations. We address this shortcoming by presenting and analyzing a dataset of approximately 200,000 open-source Docker Hub images featuring different image configurations (e.g., image size and exposed ports). Leveraging this dataset, we investigate the start times of containers in two environments and identify the most influential features. Our experiments show that container start times can vary between hundreds of milliseconds and tens of seconds in the same environment. Moreover, we conclude that no single dominant configuration feature determines a container's start time and that hardware and software parameters must be considered together for an accurate assessment.

Dataset description: Our image dataset contains 200,986 entries with 21 features associated with each container image. In the following, we describe the meaning of each feature. Further information is available in the OCI Image Specification and the Docker Run Documentation. Besides the 20 features grouped in the five categories below, each dataset entry has an image_id, which uniquely identifies the entry.
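As a quick consistency check on the column layout, the following sketch groups the feature names given in this description by their prefix and confirms that they add up to 20 features in five categories, plus image_id. It assumes that the columns in the data file carry exactly these names; the file format and the helper name category_counts are assumptions made for illustration, not part of the dataset.

```python
from collections import Counter

# Feature names as listed in this dataset description.
# Assumption: the data file uses exactly these column names.
COLUMNS = [
    "image_id",
    "meta_repo_digest", "meta_architecture", "meta_os", "meta_docker_version",
    "io_attach_stdin", "io_attach_stdout", "io_attach_stderr",
    "io_tty", "io_open_std_in", "io_std_in_once",
    "cmd_args", "cmd_envvars", "cmd_additional_args",
    "fs_volumes", "fs_size", "fs_virtual_size",
    "fs_graph_driver_name", "fs_root_fs_type", "fs_layers",
    "net_ports",
]


def category_counts(columns):
    """Count features per category prefix, ignoring the image_id key column."""
    prefixes = (name.split("_", 1)[0] for name in columns if name != "image_id")
    return Counter(prefixes)


counts = category_counts(COLUMNS)
print(dict(counts))
print("features including image_id:", sum(counts.values()) + 1)
```

The same function can be applied to the header of a loaded table to spot missing or renamed columns before any analysis.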
Features

Metadata features (prefix: meta)
meta_repo_digest: The repo digest is a SHA-256 hash which is used to uniquely identify and pull the image from Docker Hub
meta_architecture: The CPU architecture which the binaries in the image are built to run on
meta_os: The name of the operating system which the image is built to run on
meta_docker_version: The Docker version used to build this image

I/O stream features (prefix: io)
io_attach_stdin: Boolean setting to determine whether the console should be attached to the process stdin stream
io_attach_stdout: Boolean setting to determine whether the console should be attached to the process stdout stream
io_attach_stderr: Boolean setting to determine whether the console should be attached to the process stderr stream
io_tty: Boolean setting to determine whether the console should pretend to be a TTY when attached
io_open_std_in: Boolean setting to determine whether the process stdin stream should be kept open even if the console is not attached
io_std_in_once: Boolean setting to determine whether the process retrieved input from the stdin stream at least once

Start command features (prefix: cmd)
cmd_args: Length of the list of arguments to use as the command to execute when the container starts
cmd_envvars: Environment variables set by default when the container starts
cmd_additional_args: Length of the list of additional arguments to the container's entrypoint

File system features (prefix: fs)
fs_volumes: Number of volumes to create/use by default
fs_size: Size of this image in bytes
fs_virtual_size: Virtual size of this image in bytes (equals size)
fs_graph_driver_name: Name of the image's graph driver
fs_root_fs_type: Name of the file system type used in the image
fs_layers: Number of root file system layers

Networking features (prefix: net)
net_ports: Number of ports to expose by default

Dataset acquisition: The dataset was acquired from Docker Hub using a web crawler. We used substring matches with the Docker Hub Explore function. As search strings, we used all letter combinations of length 1 to 3, meaning that our first search string was "a" and our last was "zzz". We included results from both the recently updated and the most popular selection. This yielded an initial list of 286,294 image names. We then tested whether each of these images could be pulled and started once. These tests were conducted from April to June 2022. We sorted out all images that were either not pullable or not startable and retained a total of 200,986 valid images.

In the following, we describe the error types that we encountered and that led to the removal of the affected image from the dataset:
The image manifest was unknown when we tried to download it, meaning that the image had been renamed or deleted since our web crawler ran
The entrypoint command required a dependency that was missing in the image, and therefore the container could not be started
The image did not specify an entrypoint command and could therefore not be started
The image declared an invalid root file system type
The image had a malformed root file system
The image configuration was incomplete, and therefore not all required data could be obtained

See also our CodeOcean capsule with the processing scripts for our paper: https://doi.org/10.24433/CO.4595026.v2
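The search-string enumeration described under dataset acquisition can be illustrated with a short sketch. This is not the crawler used for the dataset (the processing scripts are in the CodeOcean capsule); it only shows how many queries such an enumeration produces. It assumes lowercase ASCII letters and an ordering from shorter to longer strings, neither of which is stated explicitly above, and the function name search_strings is hypothetical.

```python
from itertools import product
from string import ascii_lowercase


def search_strings(max_len=3):
    """Yield all lowercase letter combinations of length 1..max_len.

    Assumption: only the 26 ASCII lowercase letters are used, and shorter
    strings come first, so the sequence runs from 'a' to 'zzz'.
    """
    for length in range(1, max_len + 1):
        for letters in product(ascii_lowercase, repeat=length):
            yield "".join(letters)


queries = list(search_strings())
print(queries[0], queries[-1], len(queries))

# Share of initially listed image names that remained after the
# pull/start test, using the counts reported in this description.
initial_names, valid_images = 286_294, 200_986
retained_percent = round(100 * valid_images / initial_names, 1)
print(f"{retained_percent}% of the initial image names were retained")
```

Under these assumptions the enumeration yields 26 + 26^2 + 26^3 = 18,278 search strings, and about 70.2% of the initially listed image names passed the pull and start test.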
