F-DATA: A Fugaku Workload Dataset for Job-centric Predictive Modelling in HPC Systems

Dataset: 6725029



DOI: 10.5281/zenodo.11467483 · Zenodo: 11467483 · MaRDI QID: Q6725029 · FDO: Q6725029

Dataset published in the Zenodo repository.

Jens Domke, Andrea Bartolini, Francesco Antici, Zeynep Kiziltan, Keiji Yamamoto

Publication date: 5 June 2024

Copyright license: Creative Commons Attribution 4.0 International



F-DATA is a novel workload dataset containing the data of around 24 million jobs executed on Supercomputer Fugaku over three years of public system usage (March 2021 – April 2024). Each job record contains an extensive set of features, such as exit code, duration, power consumption and performance metrics (e.g. #flops, memory bandwidth, operational intensity and memory-/compute-bound label), which enables the prediction of a multitude of job characteristics. The full list of features can be found in the file feature_list.csv. Sensitive data appears in both anonymized and encoded versions. The encoding is based on a Natural Language Processing model and retains sensitive but useful job information for prediction purposes without violating data privacy. The scripts used to generate the dataset are available in the F-DATA GitHub repository, along with a series of plots and instructions on how to load the data.

F-DATA is composed of 38 files, with each YY_MM.parquet file containing the data of the jobs submitted in month MM of year YY. The files are saved in the .parquet format and can be loaded as dataframes through the pandas API after installing pyarrow (pip install pyarrow). A single file can be read with the following Python instructions:

# Importing pandas library
import pandas as pd

# Read the 21_01.parquet file into a dataframe
df = pd.read_parquet("21_01.parquet")
df.head()
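To work with the full trace rather than a single month, the monthly files can be concatenated into one dataframe. The following is a minimal sketch, not part of the official F-DATA scripts; it assumes the YY_MM.parquet files have been downloaded into the working directory and that the machine has enough memory to hold the combined data.

# Minimal sketch: load every monthly file and concatenate into one dataframe.
# Assumes all YY_MM.parquet files are in the current directory; adjust the
# glob pattern to your local path if needed.
import glob
import pandas as pd

files = sorted(glob.glob("*.parquet"))          # e.g. 21_03.parquet ... 24_04.parquet
monthly = [pd.read_parquet(f) for f in files]   # one dataframe per month
df_all = pd.concat(monthly, ignore_index=True)  # full multi-year workload
print(df_all.shape)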






