The command line

Note

Start exploring the command line interface (CLI) with:

$ python -m soweego

As a reminder, make sure you are inside soweego:

$ cd soweego && ./docker/run.sh
Building soweego

...

root@<container_id>:/app/soweego#

Importer

$ python -m soweego importer
Usage: soweego importer [OPTIONS] COMMAND [ARGS]...

  Import target catalog dumps into a SQL database.

Options:
  --help  Show this message and exit.

Commands:
  check_urls  Check for rotten URLs of an imported catalog.
  import      Download, extract, and import a supported catalog.

check_urls

Check for rotten URLs of an imported catalog.

check_urls [OPTIONS] [discogs|imdb|musicbrainz]

Arguments

CATALOG

Required argument
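
The idea behind a rotten-URL check can be sketched in a few lines of Python. This is not soweego's implementation: the function name url_is_alive, the use of a HEAD request, and the rule that any status below 400 counts as alive are assumptions made for illustration.

```python
import urllib.error
import urllib.request


def url_is_alive(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return True if the URL answers with a non-error HTTP status.

    Hypothetical helper, not part of soweego. The opener parameter only
    exists so that the network call can be replaced, for example in tests.
    """
    try:
        # HEAD avoids downloading the body; some servers reject it,
        # so a real checker may want to fall back to GET.
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, ValueError, OSError):
        # Covers HTTP errors, DNS failures, timeouts, and malformed URLs.
        return False


# Usage (performs a real network request):
# url_is_alive("https://www.wikidata.org/")
```

A URL for which the function returns False would be a candidate rotten URL under these assumptions.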

import

Download, extract, and import a supported catalog.

import [OPTIONS] [discogs|imdb|musicbrainz]

Options

--url-check

Check for rotten URLs while importing. Default: no. WARNING: this will dramatically increase the import time.

-d, --dir-io <dir_io>

Input/output directory, default: /app/shared/.

Arguments

CATALOG

Required argument

Ingester

$ python -m soweego ingester
Usage: soweego ingester [OPTIONS] COMMAND [ARGS]...

  Take soweego output into Wikidata items.

Options:
  --help  Show this message and exit.

Commands:
  delete       Delete invalid identifiers.
  deprecate    Deprecate invalid identifiers.
  identifiers  Add identifiers.
  mnm          Upload matches to the Mix'n'match tool.
  people       Add statements to Wikidata people.
  works        Add statements to Wikidata works.

mnm

Upload matches to the Mix’n’match tool.

CONFIDENCE_RANGE must be a pair of floats that indicate the minimum and maximum confidence scores.

MATCHES must be a CSV file path. Format: QID, catalog_identifier, confidence_score

The CSV file can be compressed.

Example:

echo Q446627,266995,0.666 > rhell.csv

python -m soweego ingester mnm discogs musician 0.3 0.7 rhell.csv

Result: see ‘Latest catalogs’ at https://tools.wmflabs.org/mix-n-match/
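
The MATCHES format and the CONFIDENCE_RANGE pair described above can be illustrated with a minimal Python sketch. It is not how soweego reads the file: the function name read_matches, the inclusive bounds, and the handling of compression as gzip with a ".gz" suffix are all assumptions.

```python
import csv
import gzip
import os
import tempfile


def read_matches(path, min_confidence, max_confidence):
    """Yield (qid, catalog_id, score) rows whose score lies in the range.

    Hypothetical helper, not part of soweego. Assumes inclusive bounds
    and that a compressed file is gzip with a ".gz" suffix.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", newline="") as handle:
        for row in csv.reader(handle):
            if not row:  # skip blank lines
                continue
            qid, catalog_id, score = (field.strip() for field in row)
            score = float(score)
            if min_confidence <= score <= max_confidence:
                yield qid, catalog_id, score


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rhell.csv")
    with open(path, "w", newline="") as handle:
        handle.write("Q446627,266995,0.666\n")  # row from the example above
        handle.write("Q1,42,0.95\n")  # made-up row outside the range
    matches = list(read_matches(path, 0.3, 0.7))

print(matches)  # [('Q446627', '266995', 0.666)]
```

With the range 0.3 to 0.7, only the first row is kept, which mirrors the command in the example.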

mnm [OPTIONS] [imdb|discogs|musicbrainz|twitter]
    [band|writer|producer|director|musician|audiovisual_work|musical_work|actor]
    CONFIDENCE_RANGE... MATCHES

Arguments

CATALOG

Required argument

ENTITY

Required argument

CONFIDENCE_RANGE

Required argument(s)

MATCHES

Required argument

Linker

$ python -m soweego linker
Usage: soweego linker [OPTIONS] COMMAND [ARGS]...

  Link Wikidata items to target catalog identifiers.

Options:
  --help  Show this message and exit.

Commands:
  baseline  Run a rule-based linker.
  evaluate  Evaluate the performance of a supervised linker.
  extract   Extract Wikidata links from a target catalog dump.
  link      Run a supervised linker.
  train     Train a supervised linker.

evaluate

Evaluate the performance of a supervised linker.

By default, run 5-fold cross-validation and return averaged performance scores.

evaluate [OPTIONS]
         [naive_bayes|logistic_regression|support_vector_machines|
          linear_support_vector_machines|random_forest|single_layer_perceptron|
          multi_layer_perceptron|voting_classifier|gated_classifier|
          stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc]
         [discogs|imdb|musicbrainz]
         [band|writer|producer|director|musician|audiovisual_work|musical_work|actor]

Options

-k, --k-folds <k_folds>

Number of folds, default: 5.

-s, --single

Compute a single evaluation over all k folds, instead of k evaluations.

-n, --nested

Run nested cross-validation with hyperparameter tuning via grid search. WARNING: this will take a lot of time.

-m, --metric <metric>

Performance metric for nested cross-validation. Use with ‘--nested’. Default: f1.

Options: precision|recall|f1

-d, --dir-io <dir_io>

Input/output directory, default: /app/shared/.

Arguments

CLASSIFIER

Required argument

CATALOG

Required argument

ENTITY

Required argument
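
The difference between the default behavior (k evaluations, averaged) and --single (one evaluation over all k folds) can be sketched in plain Python. This is an illustration only, not soweego's code: the helper names, the contiguous fold split, the toy threshold classifier, and the reading of --single as one score over the pooled predictions of all folds are assumptions.

```python
from statistics import mean


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous, near-equal test folds."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def cross_validate(features, labels, train, predict, k=5, single=False):
    """Return the mean of k per-fold F1 scores, or one pooled F1 if single."""
    per_fold, pooled_true, pooled_pred = [], [], []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = train([features[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        y_true = [labels[i] for i in test_idx]
        y_pred = [predict(model, features[i]) for i in test_idx]
        per_fold.append(f1_score(y_true, y_pred))
        pooled_true += y_true
        pooled_pred += y_pred
    return f1_score(pooled_true, pooled_pred) if single else mean(per_fold)


def train_threshold(xs, ys):
    """Toy classifier: threshold halfway between the two class means."""
    positives = [x for x, y in zip(xs, ys) if y]
    negatives = [x for x, y in zip(xs, ys) if not y]
    return (mean(positives) + mean(negatives)) / 2


def predict_threshold(threshold, x):
    return x > threshold


# Made-up, perfectly separable data: positives are high, negatives are low.
features = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.95, 0.05]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

averaged = cross_validate(features, labels, train_threshold, predict_threshold)
pooled = cross_validate(features, labels, train_threshold, predict_threshold,
                        single=True)
print(averaged, pooled)
```

On separable data both numbers coincide. They can differ on noisy data, because averaging per-fold scores weights every fold equally, while a pooled score weights every sample equally.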

train

Train a supervised linker.

Build the training set relevant to the given catalog and entity, then train a model with the given classification algorithm.

train [OPTIONS]
      [naive_bayes|logistic_regression|support_vector_machines|
       linear_support_vector_machines|random_forest|single_layer_perceptron|
       multi_layer_perceptron|voting_classifier|gated_classifier|
       stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc]
      [discogs|imdb|musicbrainz]
      [band|writer|producer|director|musician|audiovisual_work|musical_work|actor]

Options

-t, --tune

Run grid search for hyperparameter tuning.

-k, --k-folds <k_folds>

Number of folds for hyperparameter tuning. Use with ‘--tune’. Default: 5.

-d, --dir-io <dir_io>

Input/output directory, default: /app/shared/.

Arguments

CLASSIFIER

Required argument

CATALOG

Required argument

ENTITY

Required argument
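
Grid search, as enabled by --tune, tries every combination of candidate hyperparameter values and keeps the best-scoring one. The sketch below shows the general technique only. The grid, the parameter names alpha and binarize, and the toy scoring function are made up, and in a real setting the score would presumably be a k-fold validation score.

```python
from itertools import product


def grid_search(param_grid, score):
    """Try every combination in param_grid, return (best_params, best_score).

    Hypothetical helper, not part of soweego. Ties keep the first
    combination encountered.
    """
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Made-up grid: 3 x 2 = 6 combinations are evaluated.
grid = {"alpha": [0.01, 0.1, 1.0], "binarize": [0.0, 0.5]}


def toy_score(params):
    # Stand-in for a validation score; it peaks at alpha=0.1, binarize=0.5.
    return -abs(params["alpha"] - 0.1) - abs(params["binarize"] - 0.5)


best_params, best_score = grid_search(grid, toy_score)
print(best_params)  # {'alpha': 0.1, 'binarize': 0.5}
```

The number of evaluations grows with the product of the grid sizes, and each evaluation is itself repeated over the folds.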

Pipeline

$ python -m soweego run

Validator AKA Sync

$ python -m soweego sync
Usage: soweego sync [OPTIONS] COMMAND [ARGS]...

  Sync Wikidata to target catalogs.

Options:
  --help  Show this message and exit.

Commands:
  bio    Validate identifiers against biographical data.
  ids    Check if identifiers are still alive.
  links  Validate identifiers against links.
  works  Generate statements about works by people.