Run the pipeline

soweego is a pipeline of Python modules by design. Each module can be used alone or combined with others at will.

This page walks you through the typical workflow:

  1. import the dumps of a given target catalog into a SQL database

  2. link the imported catalog to Wikidata

  3. sync Wikidata to the imported catalog
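Each step maps to a soweego command-line module. A quick way to orient yourself is the built-in help; the first command is the generic entry point, while the importer subcommand on the second line is an assumption, so double-check the actual help output:

$ python -m soweego --help
$ python -m soweego importer --help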

Get set

  1. install Docker

  2. install MariaDB (see the Docker sketch after this section)

  3. create a credentials JSON file like this:

    "DB_ENGINE": "mysql+pymysql",
    "HOST": "${DB_IP_ADDRESS}",
    "USER": "${DB_USER}",
    "TEST_DB": "soweego",
    "PROD_DB": "${DB_NAME}",

WIKIDATA_API_USER and WIKIDATA_API_PASSWORD are optional: set them to run authenticated requests against the Wikidata Web API. If you have a Wikidata bot account, processing will speed up.
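If you do not have a MariaDB server at hand, one convenient option is the stock mariadb Docker image. The commands below are plain Docker and MySQL client usage, not something soweego mandates; the placeholders match the credentials file above, and ${DB_USER} must already exist on the server:

$ docker run -d --name soweego-db -p 3306:3306 \
      -e MYSQL_ROOT_PASSWORD=${DB_PASSWORD} mariadb
$ mysql -h ${DB_IP_ADDRESS} -u ${DB_USER} -p -e 'SHOW DATABASES;'

The second command is a quick sanity check that the credentials actually work before you launch the pipeline.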

soweego’s favourite food is disk space, so make sure you have enough: 20 GB should sate its appetite.


$ git clone https://github.com/Wikidata/soweego.git
$ cd soweego

${OUTPUT_FOLDER} is a path to a folder on your local filesystem: this is where all soweego output goes. Pick ${CATALOG} from discogs, imdb, or musicbrainz.
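The launch command itself depends on your checkout. Purely as a hypothetical illustration (the script name below is made up; look for the real one in the repository), an invocation passes both values along:

$ ./scripts/launch_pipeline.sh ${CATALOG} ${OUTPUT_FOLDER}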

This script not only runs soweego, but also takes care of some side tasks, sketched below:

  • backs up the output folder in a tarball

  • keeps at most 3 backups

  • empties the output folder

  • pulls the latest soweego master branch. N.B.: this will erase any pending edits in the local git repository
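For illustration only, here is a rough shell sketch of those side tasks; the actual script shipped in the repository is authoritative:

# back up the output folder in a tarball
tar -czf "backup-$(date +%Y%m%d).tar.gz" -C "${OUTPUT_FOLDER}" .
# keep at most 3 backups: drop all but the 3 newest
ls -t backup-*.tar.gz | tail -n +4 | xargs -r rm --
# empty the output folder
rm -rf "${OUTPUT_FOLDER:?}"/*
# pull the latest master, discarding any pending local edits
git fetch origin && git reset --hard origin/master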

The pipeline steps can be enabled or disabled with the following flags:

--importer / --no-importer
    enable / disable the importer

--linker / --no-linker
    enable / disable the linker

--validator / --no-validator
    enable / disable the validator

--upload / --no-upload
    enable / disable the upload of results to Wikidata
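For instance, to skip the import and validation phases and send the linker results to Wikidata, an invocation could look like this (assuming the flags above are accepted by the run entry point described below, and that it takes the catalog name as an argument):

$ python -m soweego run ${CATALOG} --no-importer --no-validator --upload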

Under the hood

The actual pipeline is implemented in the soweego/ package, so you can also launch it with

python -m soweego run

See The command line and Pipeline for more details.

Cron jobs

soweego periodically runs the pipeline for each supported catalog via cron jobs. You can find crontab-ready scripts in the scripts/cron folder. Feel free to reuse them! Just remember to set the appropriate paths.
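For reference, a crontab entry wiring up one of those scripts might look like the following; the path, log file, and schedule are placeholders to adapt:

# run the pipeline at 02:00 on the first day of every month
0 2 1 * * /path/to/soweego/scripts/cron/${SCRIPT} >> /var/log/soweego.log 2>&1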