Project:ImporterDocumentation: Difference between revisions

From MaRDI portal

Revision as of 12:34, 20 April 2022

Classes

ADataSource

Abstract base class for reading data from external sources.

Methods:

  • write_data_dump(self)
  • process_data(self)
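
The two methods above form the contract that every data source has to fulfil. The following is a minimal sketch of how such an abstract base class could be written with Python's abc module. Whether the importer actually uses abc is an assumption, and InMemorySource is a hypothetical toy subclass that exists only for illustration.

```python
from abc import ABC, abstractmethod


class ADataSource(ABC):
    """Abstract base class for reading data from external sources."""

    @abstractmethod
    def write_data_dump(self):
        """Fetch the raw data from the source and store it as a dump."""

    @abstractmethod
    def process_data(self):
        """Read the raw dump and write the processed data."""


class InMemorySource(ADataSource):
    """Hypothetical toy subclass, used only to illustrate the contract."""

    def __init__(self):
        self.dump = []
        self.processed = []

    def write_data_dump(self):
        # A real source would query an external API here.
        self.dump = [" a ", " b "]

    def process_data(self):
        # A real source would clean the dump and write it to a file.
        self.processed = [item.strip() for item in self.dump]
```

With this layout, instantiating ADataSource directly raises a TypeError, so a subclass that forgets to override one of the two methods fails early rather than at call time.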

ZBMathSource(ADataSource)

Class for reading data from the ZBMath API.

Methods:

  • __init__(self, out_dir, tags, from_date=None, until_date=None, raw_dump_path=None)
    • parameters:
      • out_dir: output directory for data dump and processed data
      • tags: tags to look for in the XML
      • from_date: earliest publication date; default is None
      • until_date: latest publication date; default is None
      • raw_dump_path: path to an existing raw data dump; default is None. Required only if a data dump has already been created, in which case only process_data will be called
  • write_data_dump(self)
    • overrides abstract method
    • uses sickle to query the ZBMath API and fetch a complete data dump with the oai_zb_preview metadata prefix
  • process_data(self)
    • overrides abstract method
    • reads the data dump and outputs a file with the processed data in CSV format
    • processes each record from the ZBMath API response separately to reduce memory requirements
    • where there is no information for the tags author, document_title, language, keywords, publication_year or serial, the DOI is queried to retrieve this information; if nothing is found, the value is set to None
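
The record-by-record processing described for process_data can be sketched as follows. This is not the importer's actual code: the function names, the shape of the input (an iterable of XML strings, one per record) and the lookup_by_doi callback are assumptions made for illustration. The sketch only shows the general idea of handling one record at a time, falling back to a DOI lookup for missing tags, and writing None when nothing is found.

```python
import csv
import xml.etree.ElementTree as ET

# Tags for which the documentation describes a DOI-based fallback.
FALLBACK_TAGS = {"author", "document_title", "language",
                 "keywords", "publication_year", "serial"}


def local_name(element):
    """Return the tag name without an XML namespace prefix."""
    return element.tag.rsplit("}", 1)[-1]


def extract_record(xml_text, tags, lookup_by_doi=None):
    """Turn one XML record into a dict with one entry per requested tag.

    lookup_by_doi is a hypothetical callable (doi, tag) -> value or None
    standing in for whatever service is queried with the DOI.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for element in root.iter():
        name = local_name(element)
        text = (element.text or "").strip()
        # Simplification: keep only the first non-empty value per tag.
        if text and name not in found:
            found[name] = text
    row = {}
    for tag in tags:
        value = found.get(tag)
        doi = found.get("doi")
        if value is None and tag in FALLBACK_TAGS and lookup_by_doi and doi:
            value = lookup_by_doi(doi, tag)
        row[tag] = value  # stays None if nothing was found
    return row


def process_records(xml_records, tags, out_file, lookup_by_doi=None):
    """Write one CSV row per record and return the number of rows written.

    xml_records may be a generator, so only one record is held in
    memory at a time.
    """
    writer = csv.DictWriter(out_file, fieldnames=list(tags))
    writer.writeheader()
    count = 0
    for xml_text in xml_records:
        writer.writerow(extract_record(xml_text, tags, lookup_by_doi))
        count += 1
    return count
```

Because process_records consumes an iterator and writes each row immediately, memory use depends on the size of a single record rather than on the size of the dump. Note that the csv module writes None as an empty field.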

Data sources

Design decisions