Run the pipeline¶

soweego is a pipeline of Python modules by design. Each module can be used alone or combined with others at will.

This page walks you through the typical workflow:

import the dumps of a given target catalog into a SQL database
link the imported catalog to Wikidata
sync Wikidata to the imported catalog

Get set¶

Install Docker
Install MariaDB
Create a credentials JSON file like this:

{
    "DB_ENGINE": "mysql+pymysql",
    "HOST": "${DB_IP_ADDRESS}",
    "USER": "${DB_USER}",
    "PASSWORD": "${DB_PASSWORD}",
    "TEST_DB": "soweego",
    "PROD_DB": "${DB_NAME}",
    "WIKIDATA_API_USER": "${WIKI_USER_NAME}",
    "WIKIDATA_API_PASSWORD": "${WIKI_PASSWORD}"
}

WIKIDATA_API_USER and WIKIDATA_API_PASSWORD are optional: set them to run authenticated requests against the Wikidata Web API. If you have a Wikidata bot account, processing will speed up.
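As a convenience, the credentials file can be generated rather than written by hand. The sketch below is not part of soweego: it assumes the placeholders in the template (DB_IP_ADDRESS, DB_USER and so on) are available as environment variables, and the function names `build_credentials` and `write_credentials` are hypothetical.

```python
import json
import os
from pathlib import Path


def build_credentials(env):
    """Build the credentials mapping from an environment-like dict.

    Assumption: the template placeholders are provided as environment variables.
    """
    creds = {
        "DB_ENGINE": "mysql+pymysql",
        "HOST": env["DB_IP_ADDRESS"],
        "USER": env["DB_USER"],
        "PASSWORD": env["DB_PASSWORD"],
        "TEST_DB": "soweego",
        "PROD_DB": env["DB_NAME"],
    }
    # The Wikidata API credentials are optional: add them only when both are set
    if env.get("WIKI_USER_NAME") and env.get("WIKI_PASSWORD"):
        creds["WIKIDATA_API_USER"] = env["WIKI_USER_NAME"]
        creds["WIKIDATA_API_PASSWORD"] = env["WIKI_PASSWORD"]
    return creds


def write_credentials(path, env=os.environ):
    """Write the credentials mapping to `path` as indented JSON."""
    creds = build_credentials(env)
    Path(path).write_text(json.dumps(creds, indent=4) + "\n", encoding="utf-8")
    return creds
```

Calling `write_credentials("credentials.json")` with the variables exported produces a file in the shape shown above. A missing mandatory variable raises `KeyError`, which surfaces the problem before the pipeline starts.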

soweego’s favourite food is disk space, so make sure you have enough: 20 GB should sate its appetite.
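If you want to check the available space before launching, a minimal sketch follows. It is not part of soweego, the helper name `has_enough_space` is hypothetical, and it reads "20 GB" as 20 × 10^9 bytes.

```python
import shutil

# 20 GB, the figure quoted above, read here as decimal gigabytes (an assumption)
REQUIRED_BYTES = 20 * 10**9


def has_enough_space(folder, required=REQUIRED_BYTES):
    """Return True if the filesystem holding `folder` has at least `required` free bytes."""
    free = shutil.disk_usage(folder).free
    return free >= required
```

Point `folder` at the location where the data will actually be stored, since free space is reported per filesystem.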

Go¶

$ git clone https://github.com/Wikidata/soweego.git
$ cd soweego
$ ./docker/pipeline.sh -c ${CREDENTIALS_FILE} -s ${OUTPUT_FOLDER} ${CATALOG}

${CREDENTIALS_FILE} is the path to the credentials JSON file you created above. ${OUTPUT_FOLDER} is a path to a folder on your local filesystem: this is where all soweego output goes. Pick ${CATALOG} from discogs, imdb, or musicbrainz.
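To launch the script from Python, for instance once per catalog, the command line can be assembled as in the sketch below. The helper `pipeline_command` is hypothetical. It only mirrors the invocation shown above and rejects catalogs outside the three listed.

```python
SUPPORTED_CATALOGS = ("discogs", "imdb", "musicbrainz")


def pipeline_command(credentials_file, output_folder, catalog):
    """Return the argument list for ./docker/pipeline.sh, as invoked above."""
    if catalog not in SUPPORTED_CATALOGS:
        raise ValueError(
            f"Unsupported catalog {catalog!r}: pick one of {', '.join(SUPPORTED_CATALOGS)}"
        )
    return [
        "./docker/pipeline.sh",
        "-c", str(credentials_file),
        "-s", str(output_folder),
        catalog,
    ]
```

The resulting list can be passed to `subprocess.run(cmd, check=True, cwd=...)`, with `cwd` set to the cloned soweego folder so that the relative script path resolves.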

`pipeline.sh`¶

Besides running soweego, this script takes care of some side tasks:

backs up the output folder in a tarball
keeps at most 3 backups
empties the output folder
pulls the latest soweego master branch. N.B.: this erases any pending edits in the local git repository, so commit or stash them first
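The first three side tasks amount to a simple backup rotation. The sketch below illustrates the idea in Python. It is not the script's actual implementation, and the function name, the archive naming scheme and the `stamp` parameter are all assumptions made for the example.

```python
import shutil
import tarfile
import time
from pathlib import Path

MAX_BACKUPS = 3  # the limit stated in the list above


def backup_and_empty(output_folder, backup_folder, max_backups=MAX_BACKUPS, stamp=None):
    """Archive `output_folder`, prune old archives, then empty the folder.

    `backup_folder` must live outside `output_folder`, or it would be emptied too.
    """
    if max_backups < 1:
        raise ValueError("max_backups must be at least 1")
    output = Path(output_folder)
    backups = Path(backup_folder)
    backups.mkdir(parents=True, exist_ok=True)

    # 1. Back up the output folder in a tarball (hypothetical naming scheme:
    #    a zero-padded timestamp keeps name order equal to creation order)
    stamp = time.time_ns() if stamp is None else stamp
    archive = backups / f"{output.name}_{stamp:020d}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(output, arcname=output.name)

    # 2. Keep at most `max_backups` archives, dropping the oldest first
    archives = sorted(backups.glob(f"{output.name}_*.tar.gz"))
    for old in archives[:-max_backups]:
        old.unlink()

    # 3. Empty the output folder, leaving the folder itself in place
    for child in output.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()
    return archive
```

Pruning after archiving means the newest backup always survives, even when the limit is reached.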

Flag	Default	Description
`--importer` / `--no-importer`	enabled	enable / disable the importer
`--linker` / `--no-linker`	enabled	enable / disable the linker
`--validator` / `--no-validator`	enabled	enable / disable the validator
`--upload` / `--no-upload`	disabled	enable / disable the upload of results to Wikidata
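When scripting runs, it can help to derive the flags from a set of booleans. The sketch below encodes the defaults from the table and emits only the flags that differ from them. The helper `step_flags` is hypothetical, and which command the flags are appended to is left to the caller.

```python
# Defaults taken from the table above
DEFAULTS = {"importer": True, "linker": True, "validator": True, "upload": False}


def step_flags(**overrides):
    """Return the flags needed to deviate from the defaults, in table order."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown step(s): {', '.join(sorted(unknown))}")
    flags = []
    for step, default in DEFAULTS.items():
        enabled = overrides.get(step, default)
        if enabled != default:
            # Enabled steps use --step, disabled ones use --no-step
            flags.append(f"--{step}" if enabled else f"--no-{step}")
    return flags
```

For example, `step_flags(validator=False, upload=True)` yields the two flags that skip validation and turn on the upload to Wikidata, while `step_flags()` yields an empty list.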

Under the hood¶

The actual pipeline is implemented in soweego/pipeline.py, so you can also launch it directly with:

python -m soweego run

See "The command line" and "Pipeline" for more details.

Cron jobs¶

soweego periodically runs pipelines for each supported catalog via cron jobs. You can find crontab-ready scripts in the scripts/cron folder. Feel free to reuse them! Just remember to set the appropriate paths.
