Revision as of 13:50, 20 April 2022 by Larissa (talk | contribs)



Abstract base class for reading data from external sources.


  • write_data_dump(self)
  • process_data(self)
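
The two-method contract above can be sketched with Python's abc module. This is only an illustration: the class names DataSource and InMemorySource are hypothetical (the page does not name the classes), and the toy subclass exists only to show how the two abstract methods are overridden.

```python
from abc import ABC, abstractmethod


class DataSource(ABC):
    """Hypothetical name for the abstract base class described above."""

    @abstractmethod
    def write_data_dump(self):
        """Fetch raw data from the external source and store it."""

    @abstractmethod
    def process_data(self):
        """Turn the stored raw data into processed output."""


class InMemorySource(DataSource):
    """Toy subclass (an assumption, not part of the documented code)."""

    def __init__(self, rows):
        self.rows = rows
        self.dump = None

    def write_data_dump(self):
        # A real reader would query an external API and write to disk here.
        self.dump = list(self.rows)

    def process_data(self):
        # Normalise each raw entry; a real reader would write a file instead.
        return [row.strip().lower() for row in self.dump]
```

Because both methods are declared with @abstractmethod, instantiating DataSource directly raises a TypeError, so every concrete reader has to provide both steps.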

Class for reading data from the zbMath API using the OAI-PMH ListRecords request.


  • __init__(self, out_dir, tags, from_date=None, until_date=None, raw_dump_path=None)
    • parameters:
      • out_dir: output directory for data dump and processed data
      • tags: XML tags to extract from each record
      • from_date: earliest publication date; default is None
      • until_date: latest publication date; default is None
      • raw_dump_path: path to an existing raw data dump; required if the dump has already been created and only process_data is to be called; default is None
  • write_data_dump(self)
    • overrides abstract method
    • uses Sickle to query the zbMath API with the oai_zb_preview metadata prefix and writes a complete raw data dump
  • process_data(self)
    • overrides abstract method
    • reads the data dump and writes a file with the processed data in CSV format
    • processes each record from the zbMath API response separately to reduce memory requirements
    • if a record has no value for one of the tags author, document_title, language, keywords, publication_year or serial, the Crossref API is queried with the record's DOI (using the habanero package) to retrieve the missing information; if nothing is found, the value is set to None
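
The record-by-record processing with a DOI fallback described for process_data can be sketched as follows. This is a minimal sketch under stated assumptions, not the actual implementation: it assumes a simplified dump with flat, namespace-free record elements, the function names record_to_row and process_dump are hypothetical, and the Crossref query is passed in as a callable (lookup_by_doi) rather than calling habanero directly, so the example runs with the standard library only.

```python
import csv
import xml.etree.ElementTree as ET

# Tags for which the page describes a Crossref fallback.
FALLBACK_TAGS = ("author", "document_title", "language",
                 "keywords", "publication_year", "serial")


def record_to_row(record, tags, lookup_by_doi):
    """Build one output row from a single record element."""
    row = {}
    for tag in tags:
        node = record.find(tag)
        text = node.text.strip() if node is not None and node.text else None
        row[tag] = text or None
    # Only query by DOI when a fallback tag is empty and a DOI is known.
    missing = [t for t in tags if t in FALLBACK_TAGS and row[t] is None]
    if missing and row.get("doi"):
        found = lookup_by_doi(row["doi"]) or {}
        for tag in missing:
            row[tag] = found.get(tag)  # stays None if nothing is found
    return row


def process_dump(xml_source, csv_target, tags, lookup_by_doi):
    """Stream records from the dump and write one CSV row per record."""
    writer = csv.DictWriter(csv_target, fieldnames=list(tags))
    writer.writeheader()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "record":
            writer.writerow(record_to_row(elem, tags, lookup_by_doi))
            elem.clear()  # drop the record's children to limit memory use
```

Passing the lookup in as a parameter keeps the streaming logic independent of the network; in the class described above, that role would be played by a Crossref query through habanero. Values left as None are written as empty fields by csv.DictWriter.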

Data sources