Import a new catalog
Five steps:
set up the Development environment
declare the SQLAlchemy Object Relational Mapper (ORM)
implement the catalog Extractor
set up the CLI to import your catalog
run the importer
Note
You will encounter some variables while reading this page:
set ${PROJECT_ROOT} to the root directory where soweego lives
set ${CATALOG} to the name of the catalog you want to import, like IMDb
set ${ENTITY} to what the catalog is about, like Musician or Book
the other ones should be self-explanatory
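As a quick illustration of how these placeholders resolve, here is a minimal sketch that uses Python's standard `string.Template`, which happens to share the `${...}` syntax. The values below are hypothetical examples, not defaults shipped with soweego.

```python
from string import Template

# Hypothetical example values: replace them with your own
variables = {
    'PROJECT_ROOT': '/path/to/soweego',
    'CATALOG': 'imdb',
    'ENTITY': 'musician',
}

# Path of the ORM module, as used later on this page
orm_path = Template(
    '${PROJECT_ROOT}/soweego/importer/models/${CATALOG}_entity.py'
).substitute(variables)

# Name of the database table, as used in the ORM snippet
table_name = Template('${CATALOG}_${ENTITY}').substitute(variables)

print(orm_path)
print(table_name)
```

`substitute()` raises a `KeyError` when a placeholder has no value, which is a handy way to spot a variable you forgot to set.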
ORM
create a Python file in: ${PROJECT_ROOT}/soweego/importer/models/${CATALOG}_entity.py
paste the code snippet below
set the ${...} variables accordingly
optional: define catalog-specific attributes
    see TODO in the code snippet
    just remember that attribute names must be different from BaseEntity ones, otherwise you would override them
    don’t forget their documentation!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""`${CATALOG} <${CATALOG_HOME_URL}>`_
`SQLAlchemy <https://www.sqlalchemy.org/>`_ ORM entities."""

__author__ = '${YOUR_NAME_HERE}'
__email__ = '${YOUR_EMAIL_HERE}'
__version__ = '1.0'
__license__ = 'GPL-3.0'
__copyright__ = 'Copyleft ${YEAR}, ${YOUR_NAME_HERE}'

from sqlalchemy import Column, String

from soweego.importer.models.base_entity import BaseEntity

${ENTITY}_TABLE = '${CATALOG}_${ENTITY}'


class ${CATALOG}${ENTITY}Entity(BaseEntity):
    """A ${CATALOG} ${ENTITY}.
    It comes from the ${CATALOG_DUMP_FILE} dataset.
    See the `download page <${CATALOG_DOWNLOAD_URL}>`_.

    **Attributes:**

    - **birth_place** (string(255)) - a birth place

    """

    __tablename__ = ${ENTITY}_TABLE
    __mapper_args__ = {
        'polymorphic_identity': __tablename__,
        'concrete': True
    }

    # TODO Optional: define catalog-specific attributes here
    # For instance:
    birth_place = Column(String(255))
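Since catalog-specific attribute names must not clash with BaseEntity ones, a quick check before adding columns can help. The sketch below is only an illustration: `BaseEntityStandIn` and its attribute names are assumptions made for this example, so point the helper at the real `BaseEntity` in practice.

```python
class BaseEntityStandIn:
    """Stand-in for BaseEntity. The attribute names here are assumptions."""

    catalog_id = None
    name = None


def overridden_names(base: type, candidate_names: list) -> list:
    """Return the candidate names that already exist on the base class."""
    return sorted(name for name in candidate_names if hasattr(base, name))


# 'name' already exists on the stand-in, 'birth_place' does not
clashes = overridden_names(BaseEntityStandIn, ['birth_place', 'name'])
print(clashes)
```

An empty list means that none of the candidate names would override an attribute of the base class.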
Extractor
create a Python file in: ${PROJECT_ROOT}/soweego/importer/${CATALOG}_dump_extractor.py
paste the code snippet below
set the ${...} variables accordingly
implement BaseDumpExtractor methods:
    extract_and_populate() should extract instances of your ${CATALOG}${ENTITY}Entity from relevant catalog dumps and store them in a database. The extract step is up to you. For the populate step, see Populate the SQL database
    get_dump_download_urls() should compute the latest list of URLs to download catalog dumps. Typically, there will be only one, but you never know
still tortured by doubts? Check out DiscogsDumpExtractor, IMDbDumpExtractor, or MusicBrainzDumpExtractor. They should dispel them
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""`${CATALOG} <${CATALOG_HOME_URL}>`_ dump extractor."""

__author__ = '${YOUR_NAME_HERE}'
__email__ = '${YOUR_EMAIL_HERE}'
__version__ = '1.0'
__license__ = 'GPL-3.0'
__copyright__ = 'Copyleft ${YEAR}, ${YOUR_NAME_HERE}'

from typing import List, Optional

from soweego.importer.base_dump_extractor import BaseDumpExtractor


class ${CATALOG}DumpExtractor(BaseDumpExtractor):
    """Download ${CATALOG} dumps, extract data, and
    populate a database instance.
    """

    def extract_and_populate(
        self, dump_file_paths: List[str], resolve: bool
    ) -> None:
        # TODO implement!
        raise NotImplementedError

    def get_dump_download_urls(self) -> Optional[List[str]]:
        # TODO implement!
        raise NotImplementedError
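The extract step depends entirely on the dump format of your catalog. As a minimal sketch, assume a gzipped, tab-separated dump with a header row. That format is an assumption, and `read_tsv_dump` and `extract_rows` are hypothetical helpers, not part of soweego.

```python
import csv
import gzip
from typing import Dict, Iterator, List


def read_tsv_dump(dump_file_path: str) -> Iterator[Dict[str, str]]:
    """Yield one dict per row of a gzipped TSV dump with a header row."""
    with gzip.open(
        dump_file_path, mode='rt', encoding='utf-8', newline=''
    ) as dump:
        # The header row provides the dict keys
        yield from csv.DictReader(dump, delimiter='\t')


def extract_rows(dump_file_paths: List[str]) -> List[Dict[str, str]]:
    """Collect the rows of all dump files into a single list."""
    rows = []
    for path in dump_file_paths:
        rows.extend(read_tsv_dump(path))
    return rows
```

Each row could then be turned into an instance of your entity class by copying the relevant fields onto its attributes. For large dumps, consider processing rows in batches instead of building one big list.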
Populate the SQL database
from typing import List

from sqlalchemy.exc import SQLAlchemyError

from soweego.commons.db_manager import DBManager
from soweego.importer.base_dump_extractor import BaseDumpExtractor
from soweego.importer.models.${CATALOG}_entity import ${CATALOG}${ENTITY}Entity


class ${CATALOG}DumpExtractor(BaseDumpExtractor):

    def extract_and_populate(
        self, dump_file_paths: List[str], resolve: bool
    ) -> None:
        # The `extract` step should build a list of entities
        # For instance (`_extract_from` is a helper you write yourself):
        entities = _extract_from(dump_file_paths)

        # 1. Get a `DBManager` instance
        db_manager = DBManager()

        # 2. Drop & recreate database tables,
        # using the entity class declared in the ORM step
        db_manager.drop(${CATALOG}${ENTITY}Entity)
        db_manager.create(${CATALOG}${ENTITY}Entity)

        # 3. Create a session, AKA a database transaction
        session = db_manager.new_session()

        try:
            # 4. Add a list of entities to the session
            session.bulk_save_objects(entities)

            # 5. Commit the session
            session.commit()

        except SQLAlchemyError as error:
            # 6. Handle transaction errors
            # For instance: (are you serious? Don't do this, use a logger)
            print(f'There was an error: {error}')
            session.rollback()

        finally:
            session.close()
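The commit, rollback, and error handling pattern above can be tried out without a soweego database. The sketch below swaps SQLAlchemy for the standard `sqlite3` module and `print` for a logger. The `musician` table and the `populate` function are hypothetical names used only for illustration.

```python
import logging
import sqlite3
from typing import List, Tuple

LOGGER = logging.getLogger(__name__)


def populate(
    connection: sqlite3.Connection, rows: List[Tuple[str, str]]
) -> bool:
    """Insert all rows in one transaction; roll back and log on failure."""
    try:
        connection.executemany(
            'INSERT INTO musician (catalog_id, name) VALUES (?, ?)', rows
        )
        connection.commit()
        return True
    except sqlite3.Error as error:
        # Log instead of printing, then undo the partial insert
        LOGGER.error('Failed to populate the table: %s', error)
        connection.rollback()
        return False


connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE musician (catalog_id TEXT PRIMARY KEY, name TEXT)'
)
first = populate(connection, [('1', 'Ada'), ('2', 'Bob')])
# The duplicate primary key makes the whole second batch fail
second = populate(connection, [('3', 'Cy'), ('1', 'Duplicate')])
count = connection.execute('SELECT COUNT(*) FROM musician').fetchone()[0]
connection.close()
```

The second call fails on the duplicate key and is rolled back as a whole, so the table still holds only the rows from the first batch.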
Set up the CLI to import your catalog
add your catalog keys in ${PROJECT_ROOT}/soweego/commons/keys.py
# Supported catalogs
MUSICBRAINZ = 'musicbrainz'
...
${CATALOG} = '${CATALOG}'
# Supported entities
# People
ACTOR = 'actor'
...
${ENTITY} = '${ENTITY}'
include your extractor in the DUMP_EXTRACTOR dictionary of ${PROJECT_ROOT}/soweego/importer/importer.py
DUMP_EXTRACTOR = {
    keys.MUSICBRAINZ: MusicBrainzDumpExtractor,
    ...
    keys.${CATALOG}: ${CATALOG}DumpExtractor
}
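To see why the dictionary entry matters, here is a minimal sketch of how a catalog key might be dispatched to an extractor class. It is not the actual importer code: the stand-in classes, the `example` key, and the `build_extractor` function are assumptions made for illustration.

```python
class BaseDumpExtractorStandIn:
    """Stand-in for BaseDumpExtractor, for illustration only."""

    def get_dump_download_urls(self):
        raise NotImplementedError


class ExampleDumpExtractor(BaseDumpExtractorStandIn):
    """Hypothetical extractor of an `example` catalog."""

    def get_dump_download_urls(self):
        # Hypothetical URL
        return ['https://example.org/dump.tsv.gz']


# Maps a catalog key to the class that knows how to import it
DUMP_EXTRACTOR = {'example': ExampleDumpExtractor}


def build_extractor(catalog: str) -> BaseDumpExtractorStandIn:
    """Instantiate the extractor registered for the given catalog key."""
    try:
        extractor_class = DUMP_EXTRACTOR[catalog]
    except KeyError:
        raise ValueError(f'Unsupported catalog: {catalog}') from None
    return extractor_class()
```

With a lookup of this kind, a catalog without an entry in the dictionary cannot be reached from the command line, no matter how complete its extractor is.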
add the Wikidata class QID corresponding to your entity in ${PROJECT_ROOT}/soweego/wikidata/vocabulary.py
# Class QID of supported entities
# People
ACTOR_QID = 'Q33999'
...
${ENTITY}_QID = '${QID}'
include your catalog mapping in the TARGET_CATALOGS dictionary of ${PROJECT_ROOT}/soweego/commons/constants.py
keys.MUSICBRAINZ: {
    keys.MUSICIAN: {
        keys.CLASS_QID: vocabulary.MUSICIAN_QID,
        keys.MAIN_ENTITY: MusicBrainzArtistEntity,
        keys.LINK_ENTITY: MusicBrainzArtistLinkEntity,
        keys.NLP_ENTITY: None,
        keys.RELATIONSHIP_ENTITY: MusicBrainzReleaseGroupArtistRelationship,
        keys.WORK_TYPE: keys.MUSICAL_WORK,
    },
    ...
},
keys.${CATALOG}: {
    keys.${ENTITY}: {
        keys.CLASS_QID: vocabulary.${ENTITY}_QID,
        keys.MAIN_ENTITY: ${CATALOG}${ENTITY}Entity,
        keys.LINK_ENTITY: None,
        keys.NLP_ENTITY: None,
        keys.RELATIONSHIP_ENTITY: None,
        keys.WORK_TYPE: None,
    },
},
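Every entity mapping above carries the same six keys, with None for the parts a catalog does not provide. A small check along these lines can catch a forgotten key early. The string values below are hypothetical stand-ins for the constants in `keys.py`, and `missing_keys` is not part of soweego.

```python
# Hypothetical stand-ins for keys.CLASS_QID, keys.MAIN_ENTITY, and so on
REQUIRED_KEYS = {
    'class_qid',
    'main_entity',
    'link_entity',
    'nlp_entity',
    'relationship_entity',
    'work_type',
}


def missing_keys(entity_mapping: dict) -> set:
    """Return the required keys that the entity mapping does not define."""
    return REQUIRED_KEYS - set(entity_mapping)


# A mapping that forgot the last two keys
incomplete = {
    'class_qid': 'Q33999',
    'main_entity': object,
    'link_entity': None,
    'nlp_entity': None,
}
print(sorted(missing_keys(incomplete)))
```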
Run the importer
:/app/soweego# python -m soweego importer import ${CATALOG}