Example: RDM in a hypothetical project on weather prediction
Project description
This project plan demonstrates a blueprint for how data can be managed in a weather prediction project using publicly available data sources. It ensures data quality and accessibility and supports potential contributions to future research.
1. Data description
Existing Data Reuse: Leverage publicly available historical weather data from a source such as the National Oceanic and Atmospheric Administration (NOAA) or the European Centre for Medium-Range Weather Forecasts (ECMWF).
Data Types: Measurement data (temperature, precipitation, humidity, wind speed/direction, pressure).
Data Processing: The following algorithms will be used (a sketch of the first two appears after this list).
Moving Average Filter: This algorithm can smooth out short-term fluctuations in the data, potentially revealing underlying trends relevant to weather prediction [1].
Standard Deviation Filter: This can identify outliers that deviate significantly from the average, potentially indicating errors or unusual weather events [2].
Regularised Online Forecaster: This algorithm is based on regularised prediction schemes that return non-parametric prediction rules [3] in (possibly) infinite-dimensional spaces.
Feature Engineering: Create new variables relevant to prediction (e.g., temperature difference from the previous day, dew point).
Data Volume: Public weather datasets can be quite large, depending on the chosen timeframe and spatial resolution, so the data may need to be downloaded and processed in chunks.
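To make the filtering and feature-engineering steps concrete, here is a minimal sketch in Python with pandas, assuming daily data have already been loaded into a DataFrame with a temperature column; the column names, window size, and threshold are illustrative assumptions, not part of the project plan.

import pandas as pd

# Hypothetical daily series; in the project this would come from NOAA/ECMWF files.
df = pd.DataFrame(
    {"temperature": [12.1, 13.0, 12.7, 25.9, 13.2, 12.8, 13.5]},
    index=pd.date_range("2024-01-01", periods=7, freq="D"),
)

# Moving average filter: smooth short-term fluctuations (3-day window is illustrative).
df["temp_smooth"] = df["temperature"].rolling(window=3, center=True).mean()

# Standard deviation filter: flag values more than 2 standard deviations
# from the series mean as potential errors or unusual events.
mean, std = df["temperature"].mean(), df["temperature"].std()
df["is_outlier"] = (df["temperature"] - mean).abs() > 2 * std

# Feature engineering: temperature difference from the previous day.
df["temp_diff_prev_day"] = df["temperature"].diff()

print(df)

On this toy series the 25.9 reading is flagged as an outlier, while the moving average column dampens it.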
2. Documentation and data quality
Metadata: Record metadata such as the source agency, data format (e.g., CSV), time period covered, spatial resolution (e.g., zip code, city), and any other data processing steps applied.
Quality Control: Perform data quality checks for outliers, inconsistencies, and missing values. Statistical methods such as Q-Q plots and analysis of the interquartile range (IQR) can identify potential anomalies; a sketch follows below.
Software Tools: The Python programming language with the Pandas, NumPy, and scikit-learn libraries will be used for data analysis and model building.
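As an illustration of the IQR-based check, the following sketch flags values outside the conventional 1.5 × IQR fences and counts missing values; the series is a toy assumption, not project data.

import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside the [Q1 - k*IQR, Q3 + k*IQR] fences."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

temps = pd.Series([12.1, 13.0, 12.7, 25.9, 13.2, 12.8, 13.5])
print(temps[iqr_outliers(temps)])  # the 25.9 reading is flagged
print(temps.isna().sum())          # count of missing values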
3. Storage and technical archiving of the project
Storage: The downloaded data will be stored in a secure, version-controlled repository such as Zenodo. Processed data (cleaned, engineered) will be saved as separate files for clarity.
Data Security: Access to the repository will be restricted to project members, while the results of the experiments will be made openly accessible there. Downloaded data may be compressed to save storage space, as in the sketch below.
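A short sketch of the compression step, assuming pandas handles file I/O; the file names are placeholders, not actual project paths.

import pandas as pd

# Placeholder path for the raw download.
df = pd.read_csv("raw_weather.csv")

# pandas infers gzip compression from the .gz suffix, saving storage space.
df.to_csv("raw_weather.csv.gz", index=False)

# Processed (cleaned, engineered) data would be saved as separate files, e.g.:
# df_clean.to_csv("processed_weather.csv.gz", index=False)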
4. Legal obligations and conditions
Data Source: Comply with the chosen source's data usage policies and attribution requirements.
Publication: Datasets are often publicly available, but publications should acknowledge the source and take into account any specific licenses associated with the data.
Copyright: Public weather data is typically not copyrighted, but it is important to check the source's specific terms.
5. Data exchange and long-term data accessibility
Data Sharing: Consider sharing the processed data (cleaned, potentially with additional features) along with the code used for analysis in a public repository (GitHub, or a GitHub release archived on Zenodo).
Retention: The raw and processed data will be retained for at least ten years to facilitate potential model improvements or future research (a retention period supported, for example, by the zenodo.org repository).
Accessibility: Shared data and code will be accompanied by clear documentation explaining the data format, processing steps, and model details to ensure usability by others.
6. Responsibilities and resources
Data Acquisition Person: One team member will be responsible for downloading data from the chosen source, managing data quality checks, and ensuring compliance with data usage policies.
Statistical Modeling Person: Another team member will be responsible for data analysis, feature engineering, model development, and evaluation.
Resources: Time is needed for data acquisition, cleaning, analysis, and model development. Computational resources for statistical analysis might require additional allocation, depending on data volume and model complexity.
Data Curation: After project completion, designated team members will be responsible for uploading processed data and code to the chosen repositories, maintaining access for the defined retention period, and potentially updating documentation based on the final model selection.
In case we need to publish additional data, we will follow the MaRDI (https://www.mardi4nfdi.de) guidelines on data management and how to choose an appropriate repository.
Author
This example was authored by Oleksandr Zadorozhnyi from Technische Universität München. If you have questions, please do not hesitate to get in touch with us at the MaRDI helpdesk.