Import a new catalog
Five steps:
set up the Development environment
declare the SQLAlchemy Object Relational Mapper (ORM)
implement the catalog Extractor
set up the CLI to import your catalog
run the importer
Note
you will encounter some variables while reading this page:
set ${PROJECT_ROOT} to the root directory where soweego lives
set ${CATALOG} to the name of the catalog you want to import, like IMDb
set ${ENTITY} to what the catalog is about, like Musician or Book
the other ones should be self-explanatory
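The `${...}` placeholders happen to share the syntax of Python's standard `string.Template`, so the snippets on this page could be filled in mechanically rather than by hand. The following is a minimal sketch under that assumption: the values `IMDb` and `Musician` are only illustrative, and nothing on this page suggests soweego ships such a helper.

```python
from string import Template

# A fragment of the ORM snippet; `${...}` placeholders are kept verbatim
SNIPPET = Template(
    "class ${CATALOG}${ENTITY}Entity(BaseEntity):\n"
    "    \"\"\"A ${CATALOG} ${ENTITY}.\n"
    "    It comes from the ${CATALOG_DUMP_FILE} dataset.\"\"\"\n"
)

# Hypothetical values, for illustration only
values = {'CATALOG': 'IMDb', 'ENTITY': 'Musician'}

# safe_substitute() leaves unknown placeholders untouched instead of raising,
# so ${CATALOG_DUMP_FILE} stays visible as a reminder to fill it in
rendered = SNIPPET.safe_substitute(values)
print(rendered)
```

Placeholders left in the rendered text are easy to spot with a search for `${`.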
ORM
create a Python file in: ${PROJECT_ROOT}/soweego/importer/models/${CATALOG}_entity.py
paste the code snippet below
set the ${...} variables accordingly
optional: define catalog-specific attributes
    see TODO in the code snippet
    just remember that attribute names must be different from BaseEntity ones, otherwise you would override them
    don’t forget their documentation!
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""`${CATALOG} <${CATALOG_HOME_URL}>`_
`SQLAlchemy <https://www.sqlalchemy.org/>`_ ORM entities."""

__author__ = '${YOUR_NAME_HERE}'
__email__ = '${YOUR_EMAIL_HERE}'
__version__ = '1.0'
__license__ = 'GPL-3.0'
__copyright__ = 'Copyleft ${YEAR}, ${YOUR_NAME_HERE}'

from sqlalchemy import Column, String

from soweego.importer.models.base_entity import BaseEntity

${ENTITY}_TABLE = '${CATALOG}_${ENTITY}'


class ${CATALOG}${ENTITY}Entity(BaseEntity):
    """A ${CATALOG} ${ENTITY}.
    It comes from the ${CATALOG_DUMP_FILE} dataset.
    See the `download page <${CATALOG_DOWNLOAD_URL}>`_.

    **Attributes:**

    - **birth_place** (string(255)) - a birth place

    """

    __tablename__ = ${ENTITY}_TABLE
    __mapper_args__ = {
        'polymorphic_identity': __tablename__,
        'concrete': True
    }

    # TODO Optional: define catalog-specific attributes here
    # For instance:
    birth_place = Column(String(255))
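The warning about attribute names can be checked mechanically. The sketch below uses plain Python classes as stand-ins, so it runs without SQLAlchemy: `FakeBaseEntity` and its attribute names are hypothetical and do not reflect the real `BaseEntity` columns, and `overridden_attributes` is an illustrative helper, not part of soweego.

```python
def overridden_attributes(base: type, sub: type) -> set:
    """Return public attribute names that `sub` defines and `base` already has."""
    # vars() only sees what the subclass itself declares
    own = {name for name in vars(sub) if not name.startswith('_')}
    # dir() also includes everything the base class inherits
    inherited = {name for name in dir(base) if not name.startswith('_')}
    return own & inherited


# Stand-in for BaseEntity: these attribute names are hypothetical
class FakeBaseEntity:
    catalog_id = None
    name = None


class GoodEntity(FakeBaseEntity):
    birth_place = None  # new name, no clash


class BadEntity(FakeBaseEntity):
    name = None  # same name as a base attribute: silently overrides it


print(overridden_attributes(FakeBaseEntity, GoodEntity))
print(overridden_attributes(FakeBaseEntity, BadEntity))
```

An empty result means the catalog-specific attributes do not shadow anything from the base class.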
Extractor
create a Python file in: ${PROJECT_ROOT}/soweego/importer/${CATALOG}_dump_extractor.py
paste the code snippet below
set the ${...} variables accordingly
implement BaseDumpExtractor methods:
    extract_and_populate() should extract instances of your ${CATALOG}${ENTITY}Entity from relevant catalog dumps and store them in a database. The extract step is up to you. For the populate step, see Populate the SQL database
    get_dump_download_urls() should compute the latest list of URLs to download catalog dumps. Typically, there will be only one, but you never know
still tortured by doubts? Check out DiscogsDumpExtractor, IMDbDumpExtractor, or MusicBrainzDumpExtractor. Your doubts should now be gone
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

"""`${CATALOG} <${CATALOG_HOME_URL}>`_ dump extractor."""

__author__ = '${YOUR_NAME_HERE}'
__email__ = '${YOUR_EMAIL_HERE}'
__version__ = '1.0'
__license__ = 'GPL-3.0'
__copyright__ = 'Copyleft ${YEAR}, ${YOUR_NAME_HERE}'

from typing import List, Optional

from soweego.importer.base_dump_extractor import BaseDumpExtractor


class ${CATALOG}DumpExtractor(BaseDumpExtractor):
    """Download ${CATALOG} dumps, extract data, and
    populate a database instance.
    """

    def extract_and_populate(
        self, dump_file_paths: List[str], resolve: bool
    ) -> None:
        # TODO implement!
        raise NotImplementedError

    def get_dump_download_urls(self) -> Optional[List[str]]:
        # TODO implement!
        raise NotImplementedError
Populate the SQL database
from typing import List

from sqlalchemy.exc import SQLAlchemyError

from soweego.commons.db_manager import DBManager
from soweego.importer.base_dump_extractor import BaseDumpExtractor
from soweego.importer.models.${CATALOG}_entity import ${CATALOG}${ENTITY}Entity


class ${CATALOG}DumpExtractor(BaseDumpExtractor):

    def extract_and_populate(
        self, dump_file_paths: List[str], resolve: bool
    ) -> None:
        # The `extract` step should build a list of entities
        # For instance, with a helper of your own (not defined here):
        entities = _extract_from(dump_file_paths)

        # 1. Get a `DBManager` instance
        db_manager = DBManager()

        # 2. Drop & recreate database tables
        db_manager.drop(${CATALOG}${ENTITY}Entity)
        db_manager.create(${CATALOG}${ENTITY}Entity)

        # 3. Create a session, AKA a database transaction
        session = db_manager.new_session()

        try:
            # 4. Add a list of entities to the session
            session.bulk_save_objects(entities)
            # 5. Commit the session
            session.commit()
        except SQLAlchemyError as error:
            # 6. Handle transaction errors
            # For instance: (are you serious? Don't do this)
            print(f'There was an error: {error}')
            session.rollback()
        finally:
            session.close()
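As the comment in step 6 hints, `print` is a poor way to report a failed transaction. One alternative is the standard `logging` module. The sketch below repeats the same commit, rollback, and close pattern with a hypothetical `FakeSession` stand-in, so that it runs without a database. `save_all` is an illustrative name, not part of soweego, and with a real session you would catch `SQLAlchemyError` rather than `RuntimeError`.

```python
import logging

LOGGER = logging.getLogger(__name__)


class FakeSession:
    """Hypothetical stand-in that records which session methods were called."""

    def __init__(self, fail_on_commit: bool = False):
        self.fail_on_commit = fail_on_commit
        self.calls = []

    def bulk_save_objects(self, objects):
        self.calls.append('bulk_save_objects')

    def commit(self):
        self.calls.append('commit')
        if self.fail_on_commit:
            raise RuntimeError('simulated database failure')

    def rollback(self):
        self.calls.append('rollback')

    def close(self):
        self.calls.append('close')


def save_all(session, entities) -> bool:
    """Save entities in one transaction; return True on success."""
    try:
        session.bulk_save_objects(entities)
        session.commit()
        return True
    # With a real session, catch sqlalchemy.exc.SQLAlchemyError instead
    except RuntimeError as error:
        LOGGER.error('Failed to populate the database: %s', error)
        session.rollback()
        return False
    finally:
        # Runs on both paths, so the session is always released
        session.close()
```

Returning a boolean lets the caller decide whether to stop or to carry on with the next dump.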
Set up the CLI to import your catalog
add your catalog keys in ${PROJECT_ROOT}/soweego/commons/keys.py
# Supported catalogs
MUSICBRAINZ = 'musicbrainz'
...
${CATALOG} = '${CATALOG}'
# Supported entities
# People
ACTOR = 'actor'
...
${ENTITY} = '${ENTITY}'
include your extractor in the DUMP_EXTRACTOR dictionary of ${PROJECT_ROOT}/soweego/importer/importer.py
DUMP_EXTRACTOR = {
    keys.MUSICBRAINZ: MusicBrainzDumpExtractor,
    ...
    keys.${CATALOG}: ${CATALOG}DumpExtractor
}
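Registering the extractor matters because a dictionary like `DUMP_EXTRACTOR` lets the importer pick a class from the catalog name given on the command line. Here is a minimal sketch of such a lookup, assuming that is roughly how the dictionary is used. The real importer code may differ, and both the stand-in classes and `extractor_for` are hypothetical.

```python
class MusicBrainzDumpExtractor:
    """Stand-in with no behavior, only used as a dictionary value."""


class MyCatalogDumpExtractor:
    """Stand-in for a newly added catalog extractor."""


# Plain strings replace the keys.* constants in this sketch
DUMP_EXTRACTOR = {
    'musicbrainz': MusicBrainzDumpExtractor,
    'mycatalog': MyCatalogDumpExtractor,
}


def extractor_for(catalog: str):
    """Instantiate the extractor registered for a catalog name."""
    try:
        extractor_class = DUMP_EXTRACTOR[catalog]
    except KeyError:
        supported = ', '.join(sorted(DUMP_EXTRACTOR))
        raise ValueError(
            f'Unknown catalog {catalog!r}. Supported: {supported}'
        ) from None
    return extractor_class()
```

With this kind of lookup, a catalog that is missing from the dictionary fails early with a readable message instead of a bare `KeyError`.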
add the Wikidata class QID corresponding to your entity in ${PROJECT_ROOT}/soweego/wikidata/vocabulary.py
# Class QID of supported entities
# People
ACTOR_QID = 'Q33999'
...
${ENTITY}_QID = '${QID}'
include your catalog mapping in the TARGET_CATALOGS dictionary of ${PROJECT_ROOT}/soweego/commons/constants.py
TARGET_CATALOGS = {
    keys.MUSICBRAINZ: {
        keys.MUSICIAN: {
            keys.CLASS_QID: vocabulary.MUSICIAN_QID,
            keys.MAIN_ENTITY: MusicBrainzArtistEntity,
            keys.LINK_ENTITY: MusicBrainzArtistLinkEntity,
            keys.NLP_ENTITY: None,
            keys.RELATIONSHIP_ENTITY: MusicBrainzReleaseGroupArtistRelationship,
            keys.WORK_TYPE: keys.MUSICAL_WORK,
        },
        ...
    },
    keys.${CATALOG}: {
        keys.${ENTITY}: {
            keys.CLASS_QID: vocabulary.${ENTITY}_QID,
            keys.MAIN_ENTITY: ${CATALOG}${ENTITY}Entity,
            keys.LINK_ENTITY: None,
            keys.NLP_ENTITY: None,
            keys.RELATIONSHIP_ENTITY: None,
            keys.WORK_TYPE: None,
        },
    },
}
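Each entity entry in the mapping above carries the same six keys. A small check like the following could catch a forgotten key before running the importer. It is a sketch only: plain strings stand in for the `keys.*` constants, `missing_keys` is a hypothetical helper, and the example mapping is made up.

```python
from typing import Dict, List

# String stand-ins for keys.CLASS_QID, keys.MAIN_ENTITY, and so on
REQUIRED = (
    'class_qid', 'main_entity', 'link_entity',
    'nlp_entity', 'relationship_entity', 'work_type',
)


def missing_keys(target_catalogs: Dict[str, dict]) -> Dict[str, List[str]]:
    """Map 'catalog/entity' to the required keys that its entry lacks."""
    problems = {}
    for catalog, entities in target_catalogs.items():
        for entity, mapping in entities.items():
            missing = [key for key in REQUIRED if key not in mapping]
            if missing:
                problems[f'{catalog}/{entity}'] = missing
    return problems


# Hypothetical mapping with `work_type` left out by mistake
example = {
    'mycatalog': {
        'musician': {
            'class_qid': 'Q0',  # placeholder, not a real class QID
            'main_entity': object,
            'link_entity': None,
            'nlp_entity': None,
            'relationship_entity': None,
        },
    },
}
print(missing_keys(example))
```

Note that a key set to `None` counts as present: only keys that are absent altogether are reported.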
Run the importer
:/app/soweego# python -m soweego importer import ${CATALOG}