The command line

Note

Start exploring the command line interface (CLI) with

$ python -m soweego

As a reminder, make sure you are inside soweego:

$ cd soweego && ./docker/run.sh
Building soweego

...

root@70c9b4894a30:/app/soweego#

Importer

$ python -m soweego importer
Usage: soweego importer [OPTIONS] COMMAND [ARGS]...

  Import target catalog dumps into a SQL database.

Options:
  --help  Show this message and exit.

Commands:
  check_urls  Check for rotten URLs of an imported catalog.
  import      Download, extract, and import a supported catalog.

check_urls

Check for rotten URLs of an imported catalog.

For every catalog entity, dump rotten URLs to a file. CSV format: URL,catalog_ID

Use ‘-d’ to drop rotten URLs from the DB on the fly.
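The dump described above is a two-column CSV, so it is easy to post-process. The following is a minimal sketch, not part of soweego, that groups rotten URLs by catalog ID. The function name and the sample rows are made up for illustration, and the text is read from a string because the output file name is not stated here.

```python
import csv
import io
from collections import defaultdict


def group_rotten_urls(csv_text):
    """Group rotten URLs by catalog ID from 'URL,catalog_ID' rows."""
    by_id = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        url, catalog_id = row
        by_id[catalog_id].append(url)
    return dict(by_id)


# Made-up sample rows, for illustration only
sample = (
    "http://example.org/gone,123\n"
    "http://example.org/dead,123\n"
    "http://example.net/404,456\n"
)
grouped = group_rotten_urls(sample)
print(grouped)
```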

check_urls [OPTIONS] {discogs|imdb|musicbrainz}

Options

-d, --drop

Drop rotten URLs from the DB.

--dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

import

Download, extract, and import a supported catalog.

import [OPTIONS] {discogs|imdb|musicbrainz}

Options

--url-check

Check for rotten URLs while importing. Default: no. WARNING: this will dramatically increase the import time.

--dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

Ingester

$ python -m soweego ingester
Usage: soweego ingester [OPTIONS] COMMAND [ARGS]...

  Take soweego output into Wikidata items.

Options:
  --help  Show this message and exit.

Commands:
  delete       Delete invalid identifiers.
  deprecate    Deprecate invalid identifiers.
  identifiers  Add identifiers.
  mnm          Upload matches to the Mix'n'match tool.
  people       Add statements to Wikidata people.
  works        Add statements to Wikidata works.

delete

Delete invalid identifiers.

INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
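As a rough illustration of the expected shape, the sketch below builds such a JSON document from (catalog_identifier, QID) pairs. The helper name and the sample pairs are placeholders, not real invalid identifiers.

```python
import json
from collections import defaultdict


def build_invalid_identifiers(pairs):
    """Group (catalog_identifier, QID) pairs into {catalog_identifier: [QIDs]}."""
    grouped = defaultdict(list)
    for catalog_id, qid in pairs:
        if qid not in grouped[catalog_id]:  # avoid duplicate QIDs
            grouped[catalog_id].append(qid)
    return dict(grouped)


# Placeholder values, for illustration only
pairs = [("id-1", "Q1"), ("id-1", "Q2"), ("id-2", "Q3"), ("id-1", "Q1")]
payload = json.dumps(build_invalid_identifiers(pairs), indent=2)
print(payload)
```

Writing `payload` to a file gives something that should fit the INVALID_IDENTIFIERS argument, assuming no further constraints beyond the format stated above.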

delete [OPTIONS] {twitter|imdb|discogs|musicbrainz}
       {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
       INVALID_IDENTIFIERS

Options

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

Arguments

CATALOG

Required argument

ENTITY

Required argument

INVALID_IDENTIFIERS

Required argument

deprecate

Deprecate invalid identifiers.

INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }

deprecate [OPTIONS] {twitter|imdb|discogs|musicbrainz}
          {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
          INVALID_IDENTIFIERS

Options

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

Arguments

CATALOG

Required argument

ENTITY

Required argument

INVALID_IDENTIFIERS

Required argument

identifiers

Add identifiers.

IDENTIFIERS must be a JSON file. Format: { QID: catalog_identifier }

If the identifier already exists, just add a reference.

Example:

$ echo '{ "Q446627": "266995" }' > rhell.json

$ python -m soweego ingester identifiers discogs musician rhell.json

Result:

claim (Richard Hell, Discogs artist ID, 266995)

reference (based on heuristic, artificial intelligence), (retrieved, today)
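The same file can be produced from Python. The sketch below is an illustration only: the helper name and the validation rules (QIDs look like Q followed by digits, catalog identifiers are non-empty strings) are assumptions inferred from the example above.

```python
import json
import re

# Assumption: a QID is 'Q' followed by a positive integer
QID_PATTERN = re.compile(r"Q[1-9]\d*")


def dump_identifiers(mapping):
    """Serialize {QID: catalog_identifier} to JSON after basic sanity checks."""
    for qid, catalog_id in mapping.items():
        if not QID_PATTERN.fullmatch(qid):
            raise ValueError(f"not a QID: {qid!r}")
        if not isinstance(catalog_id, str) or not catalog_id:
            raise ValueError(f"bad catalog identifier for {qid}: {catalog_id!r}")
    return json.dumps(mapping)


# Same pair as the shell example above
rhell = dump_identifiers({"Q446627": "266995"})
print(rhell)
```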

identifiers [OPTIONS] {twitter|imdb|discogs|musicbrainz}
            {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
            IDENTIFIERS

Options

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

Arguments

CATALOG

Required argument

ENTITY

Required argument

IDENTIFIERS

Required argument

mnm

Upload matches to the Mix’n’match tool.

CONFIDENCE_RANGE must be a pair of floats that indicate the minimum and maximum confidence scores.

MATCHES must be a CSV file path. Format: QID, catalog_identifier, confidence_score

The CSV file can be compressed.

Example:

$ echo Q446627,266995,0.666 > rhell.csv

$ python -m soweego ingester mnm discogs musician 0.3 0.7 rhell.csv

Result: see ‘Latest catalogs’ at https://tools.wmflabs.org/mix-n-match/
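Since MATCHES can be compressed, here is a minimal sketch of preparing such a file in Python. It rests on two assumptions that the text does not confirm: that gzip is among the accepted compression formats, and that CONFIDENCE_RANGE acts as an inclusive filter on the score. The helper name is hypothetical, and the second sample row carries a made-up score.

```python
import csv
import gzip
import io


def compress_matches(matches, low, high):
    """Return gzip-compressed CSV bytes with matches scoring in [low, high]."""
    text = io.StringIO()
    writer = csv.writer(text, lineterminator="\n")
    for qid, catalog_id, score in matches:
        if low <= score <= high:  # assumption: inclusive bounds
            writer.writerow([qid, catalog_id, score])
    return gzip.compress(text.getvalue().encode("utf-8"))


# First row comes from the example above; the second score is made up
matches = [("Q446627", "266995", 0.666), ("Q312387", "264375", 0.95)]
blob = compress_matches(matches, 0.3, 0.7)
print(gzip.decompress(blob).decode("utf-8"))
```

Writing `blob` to a file such as rhell.csv.gz would give a compressed MATCHES candidate.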

mnm [OPTIONS] {discogs|twitter|imdb|musicbrainz}
    {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
    CONFIDENCE_RANGE... MATCHES

Arguments

CATALOG

Required argument

ENTITY

Required argument

CONFIDENCE_RANGE

Required argument(s)

MATCHES

Required argument

people

Add statements to Wikidata people.

STATEMENTS must be a CSV file. Format: person_QID, PID, value, person_catalog_ID

If the claim already exists, just add a reference.

Example:

$ echo Q312387,P463,Q483407,264375 > joey.csv

$ python -m soweego ingester people discogs joey.csv

Result:

claim (Joey Ramone, member of, Ramones)

reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 264375), (retrieved, today)
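For more than a handful of statements, generating the CSV programmatically is less error-prone than echo. The sketch below assumes, as the example above suggests, that the file has no header row. The helper name is hypothetical.

```python
import csv
import io

# Column order as stated in the STATEMENTS format above
FIELDS = ("person_QID", "PID", "value", "person_catalog_ID")


def statements_to_csv(statements):
    """Serialize 4-tuples to headerless CSV text, one statement per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for statement in statements:
        if len(statement) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields: {statement!r}")
        writer.writerow(statement)
    return buffer.getvalue()


# Same statement as the shell example above
joey = statements_to_csv([("Q312387", "P463", "Q483407", "264375")])
print(joey, end="")
```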

people [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS

Options

-c, --criterion <criterion>

Validation criterion used to generate STATEMENTS. Same as the command passed to python -m soweego sync.

Options

links | bio

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

Arguments

CATALOG

Required argument

STATEMENTS

Required argument

works

Add statements to Wikidata works.

STATEMENTS must be a CSV file. Format: work_QID, PID, person_QID, person_target_ID

If the claim already exists, just add a reference.

Example:

$ echo Q4354548,P175,Q5969,139984 > cmon.csv

$ python -m soweego ingester works discogs cmon.csv

Result:

claim (C’mon Everybody, performer, Eddie Cochran)

reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 139984), (retrieved, today)
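Going the other way, a STATEMENTS file for works can be read back into typed records before uploading, for instance to review it. This is only a sketch: the class and function names are hypothetical, and the field names simply mirror the format line above.

```python
import csv
import io
from typing import NamedTuple


class WorkStatement(NamedTuple):
    work_qid: str
    pid: str
    person_qid: str
    person_target_id: str


def parse_work_statements(csv_text):
    """Parse headerless 'work_QID,PID,person_QID,person_target_ID' rows."""
    return [
        WorkStatement(*(field.strip() for field in row))
        for row in csv.reader(io.StringIO(csv_text))
        if row  # ignore blank lines
    ]


# Same statement as the shell example above
statements = parse_work_statements("Q4354548,P175,Q5969,139984\n")
print(statements[0])
```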

works [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS

Options

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

Arguments

CATALOG

Required argument

STATEMENTS

Required argument

Linker

$ python -m soweego linker
Usage: soweego linker [OPTIONS] COMMAND [ARGS]...

  Link Wikidata items to target catalog identifiers.

Options:
  --help  Show this message and exit.

Commands:
  baseline  Run a rule-based linker.
  evaluate  Evaluate the performance of a supervised linker.
  extract   Extract Wikidata links from a target catalog dump.
  link      Run a supervised linker.
  train     Train a supervised linker.

baseline

Run a rule-based linker.

Available rules:

‘perfect’ = perfect match on names

‘links’ = similar match on link tokens

‘names’ = similar match on name tokens

Run all of them by default.
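To make the difference between a perfect and a similar match concrete, here is a toy sketch. It is not soweego's implementation: the tokenizer, the Jaccard overlap measure, and the 0.5 threshold are all assumptions chosen for illustration.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def perfect_match(name_a, name_b):
    """'perfect' rule, read here as exact string equality."""
    return name_a == name_b


def similar_match(text_a, text_b, threshold=0.5):
    """'names'/'links' rules, read here as Jaccard overlap of token sets."""
    tokens_a, tokens_b = tokenize(text_a), tokenize(text_b)
    if not tokens_a or not tokens_b:
        return False
    overlap = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
    return overlap >= threshold


print(perfect_match("Joey Ramone", "Joey Ramone"))
print(similar_match("Joey Ramone", "Ramone, Joey"))
print(similar_match("Joey Ramone", "Eddie Cochran"))
```

Under this reading, the same `similar_match` function would serve both the names rule (applied to names) and the links rule (applied to URLs).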

baseline [OPTIONS] {discogs|imdb|musicbrainz}
         {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-r, --rule <rule>

Activate a specific rule or all of them. Default: all.

Options

perfect | links | names | all

-u, --upload

Upload links to Wikidata.

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q4115189.

-d, --dir-io <dir_io>

Input/output directory, default: work.

--dates, --no-dates

Check if dates match, when applicable. Default: yes.

Arguments

CATALOG

Required argument

ENTITY

Required argument

evaluate

Evaluate the performance of a supervised linker.

By default, run 5-fold cross-validation and return averaged performance scores.
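For readers unfamiliar with the procedure, the sketch below shows k-fold cross-validation with averaged precision, recall, and F1 in plain Python. It is a generic illustration, not soweego's code: the function names, the contiguous fold layout, and the toy threshold classifier are assumptions.

```python
from statistics import mean


def k_fold_indices(n_samples, k=5):
    """Yield (train, test) index lists for k contiguous folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def precision_recall_f1(y_true, y_pred):
    """Binary scores, with 0.0 whenever a denominator is zero."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def cross_validate(samples, labels, train_and_predict, k=5):
    """Score each fold separately, then average the k score triples."""
    scores = []
    for train, test in k_fold_indices(len(samples), k):
        predictions = train_and_predict(
            [samples[i] for i in train],
            [labels[i] for i in train],
            [samples[i] for i in test],
        )
        scores.append(precision_recall_f1([labels[i] for i in test], predictions))
    return tuple(mean(column) for column in zip(*scores))


def threshold_classifier(train_x, train_y, test_x):
    """Toy stand-in for a real classifier: ignores training data."""
    return [1 if x >= 0.5 else 0 for x in test_x]


# Made-up data: one positive and one negative sample per fold
samples = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.95, 0.05]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
averaged = cross_validate(samples, labels, threshold_classifier, k=5)
print(averaged)
```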

evaluate [OPTIONS]
         {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc}
         {discogs|imdb|musicbrainz}
         {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-k, --k-folds <k_folds>

Number of folds, default: 5.

-s, --single

Compute a single evaluation over all k folds, instead of k evaluations.

-n, --nested

Compute a nested cross-validation with hyperparameters tuning via grid search. WARNING: this will take a lot of time.

-m, --metric <metric>

Performance metric for nested cross-validation. Use with ‘--nested’. Default: f1.

Options

precision | recall | f1

-d, --dir-io <dir_io>

Input/output directory, default: work.

Arguments

CLASSIFIER

Required argument

CATALOG

Required argument

ENTITY

Required argument

extract

Extract Wikidata links from a target catalog dump.

extract [OPTIONS] {discogs|imdb|musicbrainz}
        {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-u, --upload

Upload links to Wikidata.

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q4115189.

-d, --dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

ENTITY

Required argument

train

Train a supervised linker.

Build the training set relevant to the given catalog and entity, then train a model with the given classification algorithm.

train [OPTIONS]
      {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc}
      {discogs|imdb|musicbrainz}
      {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-t, --tune

Run grid search for hyperparameters tuning.

-k, --k-folds <k_folds>

Number of folds for hyperparameters tuning. Use with ‘--tune’. Default: 5.

-d, --dir-io <dir_io>

Input/output directory, default: work.
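The grid search behind the tuning option can be pictured as an exhaustive loop over every combination of candidate hyperparameter values. The sketch below is a generic illustration under stated assumptions: the parameter names, the grid, and the scoring function are made up, and in a real run the score would presumably come from k-fold cross-validation rather than a formula.

```python
from itertools import product


def grid_search(param_grid, score):
    """Try every combination in param_grid and keep the best-scoring one."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


def toy_score(params):
    """Made-up score that peaks at c=1 and depth=3."""
    return -((params["c"] - 1) ** 2) - (params["depth"] - 3) ** 2


# Hypothetical hyperparameter grid, for illustration only
grid = {"c": [0.1, 1, 10], "depth": [1, 3, 5]}
best_params, best_score = grid_search(grid, toy_score)
print(best_params, best_score)
```

The number of evaluations grows as the product of the grid sizes, times the number of folds, which is why tuning takes noticeably longer than plain training.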

Arguments

CLASSIFIER

Required argument

CATALOG

Required argument

ENTITY

Required argument

Pipeline

$ python -m soweego run

run

Launch the whole pipeline.

run [OPTIONS] {discogs|imdb|musicbrainz}

Options

--validator, --no-validator

Sync Wikidata to the target catalog. Default: no.

--importer, --no-importer

Import the target catalog dump into the database. Default: yes.

--linker, --no-linker

Link Wikidata items to target catalog identifiers. Default: yes.

--upload, --no-upload

Upload results to Wikidata. Default: yes.

Arguments

CATALOG

Required argument
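When driving the pipeline from a script, it can help to spell out every step flag explicitly instead of relying on defaults. The sketch below only assembles the argument list from the options documented above and does not launch anything. The function name is hypothetical, and its keyword defaults mirror the defaults listed in this section.

```python
import shlex

# Catalogs accepted by 'run', as listed in the usage line above
CATALOGS = ("discogs", "imdb", "musicbrainz")


def build_run_command(catalog, validator=False, importer=True,
                      linker=True, upload=True):
    """Build the argv list for 'python -m soweego run' with explicit flags."""
    if catalog not in CATALOGS:
        raise ValueError(f"unsupported catalog: {catalog!r}")
    steps = {
        "validator": validator,
        "importer": importer,
        "linker": linker,
        "upload": upload,
    }
    command = ["python", "-m", "soweego", "run"]
    for name, enabled in steps.items():
        command.append(f"--{name}" if enabled else f"--no-{name}")
    command.append(catalog)
    return command


# Dry run: print the command instead of executing it
print(shlex.join(build_run_command("discogs", upload=False)))
```

Passing the returned list to `subprocess.run` inside the soweego container would start the pipeline. Keeping `upload=False` while experimenting avoids edits to Wikidata.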

Validator AKA Sync

$ python -m soweego sync
Usage: soweego sync [OPTIONS] COMMAND [ARGS]...

  Sync Wikidata to target catalogs.

Options:
  --help  Show this message and exit.

Commands:
  bio    Validate identifiers against biographical data.
  ids    Check if identifiers are still alive.
  links  Validate identifiers against links.
  works  Generate statements about works by people.

bio

Validate identifiers against biographical data.

Look for birth/death dates, birth/death places, gender.

Dump 4 output files:

1. catalog IDs to be deprecated. JSON format: {catalog_ID: [list of QIDs]}

2. statements to be added. CSV format: QID,PID,value,catalog_ID

3. shared statements to be referenced. Same format as file #2

4. statements found in Wikidata but not in the target catalog. CSV format: catalog_ID,PID_URL,value,QID_URL

You can pass the ‘-u’ flag to upload the output to Wikidata.
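Output file #4 stores PIDs and QIDs as URLs, so a small parsing step is needed before comparing it with the other files. The sketch below assumes the URLs end with the bare identifier, as Wikidata page URLs do. The exact URL form soweego writes is not stated here, and the helper names and the sample row are illustrative, not real soweego output.

```python
import csv
import io
import re

# Assumption: the URL ends with the identifier, e.g. '.../Property:P569'
ENTITY_ID = re.compile(r"[PQ][1-9]\d*$")


def trailing_id(url):
    """Return the PID or QID at the end of a Wikidata URL."""
    match = ENTITY_ID.search(url)
    if match is None:
        raise ValueError(f"no PID or QID at the end of {url!r}")
    return match.group(0)


def parse_wikidata_only(csv_text):
    """Parse 'catalog_ID,PID_URL,value,QID_URL' rows into plain identifiers."""
    rows = []
    for catalog_id, pid_url, value, qid_url in csv.reader(io.StringIO(csv_text)):
        rows.append((catalog_id, trailing_id(pid_url), value, trailing_id(qid_url)))
    return rows


# Illustrative row, not real soweego output
sample = (
    "264375,https://www.wikidata.org/wiki/Property:P569,"
    "1951-05-19,https://www.wikidata.org/wiki/Q312387\n"
)
parsed = parse_wikidata_only(sample)
print(parsed)
```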

bio [OPTIONS] {discogs|imdb|musicbrainz}
    {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-u, --upload

Upload the output to Wikidata.

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q13406268.

--dump-wikidata

Dump biographical data gathered from Wikidata to a Python pickle.

--dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

ENTITY

Required argument

ids

Check if identifiers are still alive.

Dump a JSON file of dead ones. Format: { identifier: [ list of QIDs ] }

Dead identifiers should get a deprecated rank in Wikidata: you can pass the ‘-d’ flag to do so.

ids [OPTIONS] {discogs|imdb|musicbrainz}
    {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-d, --deprecate

Deprecate dead identifiers: this changes their rank in Wikidata.

-s, --sandbox

Perform all deprecations on the Wikidata sandbox item Q13406268.

--dump-wikidata

Dump identifiers gathered from Wikidata to a Python pickle.

--dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

ENTITY

Required argument

works

Generate statements about works by people.

Dump a CSV file of statements. Format: work_QID,PID,person_QID,person_catalog_ID

You can pass the ‘-u’ flag to upload the statements to Wikidata.

works [OPTIONS] {discogs|imdb|musicbrainz}
      {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}

Options

-u, --upload

Upload statements to Wikidata.

-s, --sandbox

Perform all edits on the Wikidata sandbox item Q4115189.

-d, --dir-io <dir_io>

Input/output directory, default: work.

Arguments

CATALOG

Required argument

ENTITY

Required argument