importer

Import target catalog dumps into a SQL database.

base_dump_extractor

Base class for catalog dumps extraction.

class soweego.importer.base_dump_extractor.BaseDumpExtractor

Method definitions to download catalog dumps, extract data, and populate a database instance.

extract_and_populate(dump_file_paths, resolve)

Extract relevant data and populate SQLAlchemy ORM entities accordingly. The entities are then persisted to a database instance.

Parameters
  • dump_file_paths (List[str]) – paths to downloaded catalog dumps

  • resolve (bool) – whether to resolve URLs found in catalog dumps or not

Return type

None

get_dump_download_urls()

Get the dump download URLs of a target catalog. Useful if there is a way to compute the latest URLs.

Return type

Optional[List[str]]

Returns

the latest dump URLs
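The two methods above form the contract that every extractor implements. The following is a minimal sketch of that contract, not the soweego code: DumpExtractorInterface is a local stand-in that only mirrors the documented signatures, ToyDumpExtractor and its URL are hypothetical, and a plain list stands in for the database and the SQLAlchemy ORM entities.

```python
from abc import ABC, abstractmethod
from typing import List, Optional


class DumpExtractorInterface(ABC):
    """Local stand-in mirroring the two documented methods (an assumption,
    not the actual soweego BaseDumpExtractor class)."""

    @abstractmethod
    def extract_and_populate(self, dump_file_paths: List[str], resolve: bool) -> None:
        """Read the dumps and persist the extracted entities."""

    @abstractmethod
    def get_dump_download_urls(self) -> Optional[List[str]]:
        """Return the latest dump URLs, or None if they cannot be computed."""


class ToyDumpExtractor(DumpExtractorInterface):
    """Hypothetical extractor for a dump holding one name per line."""

    def __init__(self) -> None:
        # A plain list stands in for the database instance.
        self.rows: List[str] = []

    def get_dump_download_urls(self) -> Optional[List[str]]:
        # Hypothetical URL, for illustration only.
        return ["https://example.org/dumps/latest.txt"]

    def extract_and_populate(self, dump_file_paths: List[str], resolve: bool) -> None:
        # 'resolve' is accepted to match the signature; this toy dump has no URLs.
        for path in dump_file_paths:
            with open(path, encoding="utf-8") as dump:
                for line in dump:
                    name = line.strip()
                    if name:  # skip blank lines
                        self.rows.append(name)
```

A real extractor would replace the list with ORM entities and a database session, and would use resolve to decide whether to check the URLs it finds.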

discogs_dump_extractor

Discogs dump extractor.

class soweego.importer.discogs_dump_extractor.DiscogsDumpExtractor

Download Discogs dumps, extract data, and populate a database instance.

extract_and_populate(dump_file_paths, resolve)

Extract relevant data from the artists (people) and masters (works) Discogs dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.

See discogs_entity for the ORM definitions.

Parameters
  • dump_file_paths (List[str]) – paths to downloaded catalog dumps

  • resolve (bool) – whether to resolve URLs found in catalog dumps or not

Return type

None

get_dump_download_urls()

Get the dump download URLs of a target catalog. Useful if there is a way to compute the latest URLs.

Return type

Optional[List[str]]

Returns

the latest dump URLs

imdb_dump_extractor

IMDb dump extractor.

class soweego.importer.imdb_dump_extractor.IMDbDumpExtractor

Download IMDb dumps, extract data, and populate a database instance.

extract_and_populate(dump_file_paths, resolve)

Extract relevant data from the name (people) and title (works) IMDb dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.

See imdb_entity for the ORM definitions.

Parameters
  • dump_file_paths (List[str]) – paths to downloaded catalog dumps

  • resolve (bool) – whether to resolve URLs found in catalog dumps or not

Return type

None

get_dump_download_urls()

Get the dump download URLs of a target catalog. Useful if there is a way to compute the latest URLs.

Return type

List[str]

Returns

the latest dump URLs

musicbrainz_dump_extractor

MusicBrainz dump extractor.

class soweego.importer.musicbrainz_dump_extractor.MusicBrainzDumpExtractor

Download MusicBrainz dumps, extract data, and populate a database instance.

extract_and_populate(dump_file_paths, resolve)

Extract relevant data from the artist (people) and release group (works) MusicBrainz dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.

See musicbrainz_entity for the ORM definitions.

Parameters
  • dump_file_paths (List[str]) – paths to downloaded catalog dumps

  • resolve (bool) – whether to resolve URLs found in catalog dumps or not

get_dump_download_urls()

Get the dump download URLs of a target catalog. Useful if there is a way to compute the latest URLs.

Return type

List[str]

Returns

the latest dump URLs

importer

Download, extract, and import a supported catalog.

class soweego.importer.importer.Importer

Handle a catalog dump: check its freshness and dispatch the appropriate extractor.

refresh_dump(output_folder, extractor, resolve)

Download the latest dump if needed, and call the corresponding extractor.

Parameters
  • output_folder (str) – a path where the downloaded dumps will be stored

  • extractor (BaseDumpExtractor) – BaseDumpExtractor implementation to process the dump

  • resolve (bool) – whether to resolve URLs found in catalog dumps or not
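
The download-then-dispatch flow described above can be sketched as follows. This is an illustration under stated assumptions, not the Importer implementation: refresh_dump_sketch and its download and extract callables are hypothetical names, and the freshness check is reduced to "the file already exists in the output folder".

```python
import os
from typing import Callable, List


def refresh_dump_sketch(
    output_folder: str,
    urls: List[str],
    download: Callable[[str, str], None],
    extract: Callable[[List[str], bool], None],
    resolve: bool,
) -> List[str]:
    """Download each dump unless it is already on disk, then run the extractor.

    Hypothetical helper: 'download(url, path)' fetches one dump and
    'extract(paths, resolve)' plays the role of extract_and_populate.
    """
    os.makedirs(output_folder, exist_ok=True)
    paths = []
    for url in urls:
        # Name the local file after the last URL segment.
        path = os.path.join(output_folder, url.rsplit("/", 1)[-1])
        if not os.path.isfile(path):  # simplified freshness check
            download(url, path)
        paths.append(path)
    # Dispatch all dump paths to the extractor in one call.
    extract(paths, resolve)
    return paths
```

Passing the download and extraction steps as callables keeps the sketch runnable without network access; in practice an extractor's get_dump_download_urls() would supply urls and its extract_and_populate would serve as extract.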