Portal/rdm/examples/weather prediction


Example: RDM in a hypothetical project on weather prediction

Project description

This project plan provides a blueprint for how data can be managed in a weather prediction project that uses publicly available data sources. It aims to ensure data quality and accessibility and to enable contributions to future research.

1. Data description

Existing Data Reuse: Leverage publicly available historical weather data from a source such as the National Oceanic and Atmospheric Administration (NOAA) or the European Centre for Medium-Range Weather Forecasts (ECMWF).
Data Types: Measurement data (temperature, precipitation, humidity, wind speed/direction, pressure).
Data Processing: Three algorithms are applied.
Moving Average Filter: This algorithm can smooth out short-term fluctuations in the data, potentially revealing underlying trends relevant to weather prediction (Alerskans and Kaas, 2021).
Standard Deviation Filter: This can identify outliers that deviate significantly from the average, potentially indicating errors or unusual weather events (Grönquist et al., 2021).
Regularised Online Forecaster: This algorithm is based on regularised prediction schemes that return non-parametric prediction rules (Jézéquel et al., 2019) in (possibly) infinite-dimensional spaces.
Feature Engineering: Create new variables relevant to prediction (e.g., temperature difference from the previous day, dew point); a sketch of these steps is given below.
Data Volume: Public weather datasets can be quite large, depending on the chosen timeframe and spatial resolution. We may need to download and process the data in chunks.
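The following Python sketch (using Pandas and NumPy, as listed under Software Tools below) illustrates the moving average filter, the standard deviation filter, and the two feature-engineering steps named above. The column names temperature and relative_humidity, the window length, and the 3-sigma threshold are illustrative assumptions about the downloaded data, and the dew point uses the standard Magnus approximation.

  import numpy as np
  import pandas as pd

  def smooth_and_flag(df, column="temperature", window=7, n_std=3.0):
      """Moving average filter plus a simple standard deviation filter.

      `df` is assumed to hold one row per observation time; `column`,
      `window`, and `n_std` are illustrative choices, not project defaults.
      """
      out = df.copy()
      # Moving average filter: smooth short-term fluctuations.
      out[column + "_smooth"] = out[column].rolling(window, center=True).mean()
      # Standard deviation filter: flag points far from the local mean.
      local_std = out[column].rolling(window, center=True).std()
      out[column + "_outlier"] = (out[column] - out[column + "_smooth"]).abs() > n_std * local_std
      return out

  def add_features(df):
      """Feature engineering: day-over-day temperature difference and an
      approximate dew point via the Magnus formula (hypothetical columns)."""
      out = df.copy()
      out["temp_diff_prev_day"] = out["temperature"].diff()
      a, b = 17.27, 237.7  # Magnus coefficients, temperatures in deg C
      gamma = a * out["temperature"] / (b + out["temperature"]) + np.log(out["relative_humidity"] / 100.0)
      out["dew_point"] = b * gamma / (a - gamma)
      return out

When the data are processed in chunks, these functions can be applied file by file and the results concatenated, which keeps memory usage bounded for large downloads.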

2. Documentation and data quality

Metadata: Record metadata such as the source agency, data format (e.g., CSV), time period covered, spatial resolution (e.g., zip code, city), and any data processing steps applied.
Quality Control: Perform data quality checks for outliers, inconsistencies, and missing values. Statistical methods such as q-q plots and analysis of the interquartile range (IQR) can identify potential anomalies; a sketch of the IQR check is given below.
Software Tools: Python, together with the Pandas, NumPy, and scikit-learn libraries, will be used for data analysis and model building.
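A minimal sketch of the IQR-based quality check, using only Pandas. Tukey's factor of 1.5 and the restriction to numeric columns are assumptions for illustration; q-q plots would be produced separately (for example with scipy.stats.probplot).

  import pandas as pd

  def iqr_outlier_mask(series, k=1.5):
      """Return a boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
      q1, q3 = series.quantile(0.25), series.quantile(0.75)
      iqr = q3 - q1
      return (series < q1 - k * iqr) | (series > q3 + k * iqr)

  def quality_report(df):
      """Count missing values and IQR outliers for every numeric column."""
      rows = {}
      for col in df.select_dtypes(include="number").columns:
          values = df[col]
          rows[col] = {
              "missing": int(values.isna().sum()),
              "iqr_outliers": int(iqr_outlier_mask(values.dropna()).sum()),
          }
      return pd.DataFrame(rows).T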

3. Storage and technical archiving of the project

Storage: The downloaded data will be stored in a secure, version-controlled repository such as Zenodo. Processed data (cleaned, engineered) will be saved as separate files for clarity.
Data Security: Access to the repository will be restricted to project members, while the results of the experiments will be made openly accessible on the repository. Downloaded data might be compressed to save storage space; a sketch of compressed storage is given below.
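A small sketch of how processed files might be written in compressed form before being deposited; the file paths and the choice of gzip-compressed CSV are assumptions, not fixed project decisions.

  import pandas as pd

  def save_processed(df, path="processed/cleaned_weather.csv.gz"):
      # gzip-compressed CSV keeps the file human-readable after decompression
      # while reducing storage in the repository (the directory is assumed to exist).
      df.to_csv(path, index=False, compression="gzip")

  def load_processed(path="processed/cleaned_weather.csv.gz"):
      # pandas infers the gzip compression from the ".gz" suffix.
      return pd.read_csv(path)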

4. Legal obligations and conditions

Data Source: Comply with the chosen source's data usage policies and attribution requirements.
Publication: Datasets are often publicly available, but publications should acknowledge the source and take into account any specific licenses associated with the data.
Copyright: Public weather data is typically not copyrighted, but it is important to check the source's specific terms.

5. Data exchange and long-term data accessibility

Data Sharing: We will consider sharing the processed data (cleaned, potentially with additional features) along with the code used for analysis in a public repository (GitHub, with a GitHub release archived on Zenodo).
Retention: The raw data and processed data will be retained for at least ten years to facilitate potential model improvements or future research (which is supported, for example, by the zenodo.org repository).
Accessibility: Shared data and code will be accompanied by clear documentation explaining the data format, processing steps, and model details to ensure usability by others; a sketch of a minimal machine-readable metadata record is given below.
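One way to make shared files self-describing is to place a small machine-readable metadata record next to each processed dataset. The sketch below uses a hypothetical JSON schema; the field names are illustrative and not a MaRDI or Zenodo requirement.

  import json

  def write_metadata(path, source, time_period, spatial_resolution, processing_steps):
      """Store a minimal description of a shared data file (hypothetical schema)."""
      record = {
          "source": source,                      # e.g. the NOAA or ECMWF dataset name
          "time_period": time_period,            # e.g. "2010-01-01 to 2023-12-31"
          "spatial_resolution": spatial_resolution,
          "processing_steps": processing_steps,  # ordered list of applied steps
          "format": "CSV (gzip-compressed)",
      }
      with open(path, "w", encoding="utf-8") as fh:
          json.dump(record, fh, indent=2)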

6. Responsibilities and resources

Data Acquisition Person: One team member will be responsible for downloading data from the chosen source, managing data quality checks, and ensuring compliance with data usage policies.
Statistical Modeling Person: Another team member will be responsible for data analysis, feature engineering, model development, and evaluation.
Resources: Time for data acquisition, cleaning, analysis, and model development. Computational resources for statistical analysis might require additional allocation, depending on data volume and model complexity.
Data Curation: After project completion, designated team members will be responsible for uploading processed data and code to the chosen repositories, maintaining access for the defined retention period, and potentially updating documentation based on the final model selection.


If we need to publish additional data, we will follow the MaRDI guidelines on data management and on how to choose an appropriate repository.


Notes