importer
Import target catalog dumps into a SQL database.
base_dump_extractor
Base class for catalog dumps extraction.
class soweego.importer.base_dump_extractor.BaseDumpExtractor
Method definitions to download catalog dumps, extract data, and populate a database instance.
extract_and_populate(dump_file_paths, resolve)
Extract relevant data and populate SQLAlchemy ORM entities accordingly. The entities are then persisted to a database instance.
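The extractor contract described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `BaseDumpExtractor` below is a local stand-in that mirrors only the documented method signature, not soweego's actual class, and `ToyDumpExtractor`, its `entities` list, and the tab-separated dump layout are hypothetical. Plain dictionaries kept in memory take the place of SQLAlchemy ORM entities and the database.

```python
import csv
import os
import tempfile
from abc import ABC, abstractmethod


class BaseDumpExtractor(ABC):
    """Local stand-in mirroring the documented method signature only."""

    @abstractmethod
    def extract_and_populate(self, dump_file_paths, resolve):
        """Extract data from the given dump files and persist it."""


class ToyDumpExtractor(BaseDumpExtractor):
    """Hypothetical extractor for a tab-separated toy dump."""

    def __init__(self):
        # In-memory list standing in for ORM entities and the database.
        self.entities = []

    def extract_and_populate(self, dump_file_paths, resolve):
        # 'resolve' is accepted to match the documented signature,
        # but URL resolution is out of scope for this sketch.
        for path in dump_file_paths:
            with open(path, newline="", encoding="utf-8") as dump:
                for row in csv.DictReader(dump, delimiter="\t"):
                    self.entities.append(dict(row))


def demo():
    """Write a two-row toy dump, run the extractor, return the entities."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "toy_dump.tsv")
        with open(path, "w", newline="", encoding="utf-8") as dump:
            dump.write("id\tname\n1\tAda\n2\tGrace\n")
        extractor = ToyDumpExtractor()
        extractor.extract_and_populate([path], resolve=False)
        return extractor.entities
```

A real implementation would build ORM entities and commit them through a database session instead of appending dictionaries to a list.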
discogs_dump_extractor
Discogs dump extractor.
class soweego.importer.discogs_dump_extractor.DiscogsDumpExtractor
Download Discogs dumps, extract data, and populate a database instance.
extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the artists (people) and masters (works) Discogs dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See discogs_entity for the ORM definitions.
imdb_dump_extractor
IMDb dump extractor.
class soweego.importer.imdb_dump_extractor.IMDbDumpExtractor
Download IMDb dumps, extract data, and populate a database instance.
extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the name (people) and title (works) IMDb dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See imdb_entity for the ORM definitions.
musicbrainz_dump_extractor
MusicBrainz dump extractor.
class soweego.importer.musicbrainz_dump_extractor.MusicBrainzDumpExtractor
Download MusicBrainz dumps, extract data, and populate a database instance.
extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the artist (people) and release group (works) MusicBrainz dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See musicbrainz_entity for the ORM definitions.
importer
Download, extract, and import a supported catalog.
class soweego.importer.importer.Importer
Handle a catalog dump: check its freshness and dispatch the appropriate extractor.
refresh_dump(output_folder, extractor, resolve)
Download the latest dump if needed, and call the corresponding extractor.
Parameters
output_folder (str) – a path where the downloaded dumps will be stored
extractor (BaseDumpExtractor) – BaseDumpExtractor implementation to process the dump
resolve (bool) – whether to resolve URLs found in catalog dumps or not
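The control flow described for refresh_dump (check whether a dump is already available, download it if not, then hand it to the extractor) might look roughly like the toy sketch below. It is not soweego's implementation: `ToyImporter`, `RecordingExtractor`, the `fetch` callable, the `dump_name` attribute, and the "file already exists" freshness test are all hypothetical simplifications introduced here.

```python
import os
import tempfile


class RecordingExtractor:
    """Hypothetical stand-in for a BaseDumpExtractor implementation."""

    def __init__(self):
        self.calls = []

    def extract_and_populate(self, dump_file_paths, resolve):
        # Record the arguments instead of touching a database.
        self.calls.append((list(dump_file_paths), resolve))


class ToyImporter:
    """Toy stand-in for the documented Importer, for illustration only."""

    def __init__(self, dump_name, fetch):
        self.dump_name = dump_name  # hypothetical dump file name
        self.fetch = fetch          # hypothetical callable: fetch(path)
        self.downloads = 0

    def refresh_dump(self, output_folder, extractor, resolve):
        path = os.path.join(output_folder, self.dump_name)
        # Simplistic freshness check (assumption): a dump that is
        # already on disk is considered up to date.
        if not os.path.exists(path):
            self.fetch(path)
            self.downloads += 1
        extractor.extract_and_populate([path], resolve)
        return path


def fake_fetch(path):
    """Pretend to download a dump by writing a placeholder file."""
    with open(path, "w", encoding="utf-8") as dump:
        dump.write("placeholder dump\n")


def demo():
    """Refresh twice and return (number of downloads, extractor calls)."""
    with tempfile.TemporaryDirectory() as folder:
        extractor = RecordingExtractor()
        importer = ToyImporter("dump.txt", fake_fetch)
        importer.refresh_dump(folder, extractor, resolve=False)
        importer.refresh_dump(folder, extractor, resolve=True)
        return importer.downloads, extractor.calls
```

In this sketch the second call reuses the file written by the first one, so the extractor runs twice while the download happens once. A real freshness check would more likely compare the local dump against the latest one published by the catalog.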