importer
Import target catalog dumps into a SQL database.
base_dump_extractor
Base class for catalog dump extraction.
- class soweego.importer.base_dump_extractor.BaseDumpExtractor
Method definitions to download catalog dumps, extract data, and populate a database instance.
- extract_and_populate(dump_file_paths, resolve)
Extract relevant data and populate SQLAlchemy ORM entities accordingly. The entities are then persisted to a database instance.
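The contract described above can be illustrated with a minimal sketch. It does not import soweego: `BaseDumpExtractorSketch` and `InMemoryExtractor` are hypothetical stand-ins, and a plain Python list replaces the SQLAlchemy session, so only the shape of `extract_and_populate(dump_file_paths, resolve)` is taken from the documentation.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class BaseDumpExtractorSketch(ABC):
    """Hypothetical stand-in for BaseDumpExtractor (not the soweego class)."""

    @abstractmethod
    def extract_and_populate(self, dump_file_paths: Iterable[str], resolve: bool) -> None:
        """Extract data from the given dump files and persist it."""


class InMemoryExtractor(BaseDumpExtractorSketch):
    """Toy extractor: one record per non-empty line, 'persisted' to a list."""

    def __init__(self) -> None:
        # Assumption: a list stands in for the database instance.
        self.persisted: List[Dict[str, object]] = []

    def extract_and_populate(self, dump_file_paths, resolve):
        for path in dump_file_paths:
            with open(path, encoding="utf-8") as dump:
                for line in dump:
                    name = line.strip()
                    if name:
                        # A real extractor would build an ORM entity here.
                        self.persisted.append({"name": name, "resolved": resolve})
```

A concrete extractor for a real catalog would replace the line-based parsing with the catalog's dump format and the list with ORM entities committed to the database.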
discogs_dump_extractor
Discogs dump extractor.
- class soweego.importer.discogs_dump_extractor.DiscogsDumpExtractor
Download Discogs dumps, extract data, and populate a database instance.
- extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the artists (people) and masters (works) Discogs dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See discogs_entity for the ORM definitions.
imdb_dump_extractor
IMDb dump extractor.
- class soweego.importer.imdb_dump_extractor.IMDbDumpExtractor
Download IMDb dumps, extract data, and populate a database instance.
- extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the name (people) and title (works) IMDb dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See imdb_entity for the ORM definitions.
musicbrainz_dump_extractor
MusicBrainz dump extractor.
- class soweego.importer.musicbrainz_dump_extractor.MusicBrainzDumpExtractor
Download MusicBrainz dumps, extract data, and populate a database instance.
- extract_and_populate(dump_file_paths, resolve)
Extract relevant data from the artist (people) and release group (works) MusicBrainz dumps, preprocess them, populate SQLAlchemy ORM entities, and persist them to a database instance.
See musicbrainz_entity for the ORM definitions.
importer
Download, extract, and import a supported catalog.
- class soweego.importer.importer.Importer
Handle a catalog dump: check its freshness and dispatch the appropriate extractor.
- refresh_dump(output_folder, extractor, resolve)
Download the latest dump if needed, and call the corresponding extractor.
- Parameters
output_folder (str) – a path where the downloaded dumps will be stored
extractor (BaseDumpExtractor) – BaseDumpExtractor implementation to process the dump
resolve (bool) – whether to resolve URLs found in catalog dumps or not
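The freshness check and dispatch performed by refresh_dump might look roughly like the sketch below. This is not the soweego implementation: the extra parameters `dump_names`, `download`, and `max_age_seconds` are hypothetical, and treating a file as stale when it is missing or older than a given age is an assumption made only for illustration.

```python
import os
import time


def refresh_dump_sketch(output_folder, extractor, resolve,
                        dump_names, download, max_age_seconds=86400.0):
    """Download stale or missing dumps, then hand all paths to the extractor.

    dump_names, download and max_age_seconds are hypothetical parameters,
    not part of the documented refresh_dump signature.
    """
    paths = []
    for name in dump_names:
        path = os.path.join(output_folder, name)
        # Assumed freshness rule: missing, or modified too long ago.
        stale = (not os.path.exists(path)
                 or time.time() - os.path.getmtime(path) > max_age_seconds)
        if stale:
            download(name, path)
        paths.append(path)
    # Dispatch to the extractor, mirroring extract_and_populate's signature.
    extractor.extract_and_populate(paths, resolve)
    return paths
```

Passing the download step in as a callable keeps the sketch free of network access; calling the function twice in a row with a fresh file on disk triggers only one download but two extractor runs.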