The command line¶
Note
Start exploring the command line interface (CLI) with
$ python -m soweego
As a reminder, make sure you are inside soweego:
$ cd soweego && ./docker/run.sh
Building soweego
...
root@70c9b4894a30:/app/soweego#
Importer¶
python -m soweego importer
Usage: soweego importer [OPTIONS] COMMAND [ARGS]...
Import target catalog dumps into a SQL database.
Options:
--help Show this message and exit.
Commands:
check_urls Check for rotten URLs of an imported catalog.
import Download, extract, and import a supported catalog.
check_urls¶
Check for rotten URLs of an imported catalog.
For every catalog entity, dump rotten URLs to a file. CSV format: URL,catalog_ID
Use '-d' to drop rotten URLs from the DB on the fly.
check_urls [OPTIONS] {discogs|imdb|musicbrainz}
Options
- -d, --drop¶
Drop rotten URLs from the DB.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
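The rotten-URL dump is a plain two-column CSV (URL,catalog_ID), so it can be post-processed without soweego itself. A minimal Python sketch, with made-up sample rows:

```python
import csv
import io

# Hypothetical sample of a check_urls dump: one rotten URL per row,
# columns are URL,catalog_ID (no header).
sample = "http://example.org/dead-page,266995\nhttp://example.org/gone,139984\n"

# Parse the dump into (URL, catalog ID) pairs.
rotten = [tuple(row) for row in csv.reader(io.StringIO(sample))]
```

In practice you would read the file dumped under the '--dir-io' directory instead of an in-memory string.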
import¶
Download, extract, and import a supported catalog.
import [OPTIONS] {discogs|imdb|musicbrainz}
Options
- --url-check¶
Check for rotten URLs while importing. Default: no. WARNING: this will dramatically increase the import time.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
Ingester¶
python -m soweego ingester
Usage: soweego ingester [OPTIONS] COMMAND [ARGS]...
Take soweego output into Wikidata items.
Options:
--help Show this message and exit.
Commands:
delete Delete invalid identifiers.
deprecate Deprecate invalid identifiers.
identifiers Add identifiers.
mnm Upload matches to the Mix'n'match tool.
people Add statements to Wikidata people.
works Add statements to Wikidata works.
delete¶
Delete invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
delete [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} INVALID_IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- INVALID_IDENTIFIERS¶
Required argument
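The INVALID_IDENTIFIERS file is easy to produce programmatically. A minimal sketch that writes the expected { catalog_identifier: [ list of QIDs ] } shape (the identifier and QID below are made-up examples):

```python
import json
from pathlib import Path

# Hypothetical mapping: a catalog identifier to the QIDs that wrongly hold it.
invalid = {"266995": ["Q446627"]}

# Serialize in the expected { catalog_identifier: [ list of QIDs ] } format.
Path("invalid.json").write_text(json.dumps(invalid))
</antml>```

The file can then be passed as the last argument, e.g. `python -m soweego ingester delete discogs musician invalid.json`.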
deprecate¶
Deprecate invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
deprecate [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} INVALID_IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- INVALID_IDENTIFIERS¶
Required argument
identifiers¶
Add identifiers.
IDENTIFIERS must be a JSON file. Format: { QID: catalog_identifier }
If the identifier already exists, just add a reference.
Example:
$ echo '{ "Q446627": "266995" }' > rhell.json
$ python -m soweego ingester identifiers discogs musician rhell.json
Result:
claim (Richard Hell, Discogs artist ID, 266995)
reference (based on heuristic, artificial intelligence), (retrieved, today)
identifiers [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- IDENTIFIERS¶
Required argument
mnm¶
Upload matches to the Mix’n’match tool.
CONFIDENCE_RANGE must be a pair of floats that indicate the minimum and maximum confidence scores.
MATCHES must be a CSV file path. Format: QID, catalog_identifier, confidence_score
The CSV file can be compressed.
Example:
$ echo Q446627,266995,0.666 > rhell.csv
$ python -m soweego ingester mnm discogs musician 0.3 0.7 rhell.csv
Result: see 'Latest catalogs' at https://tools.wmflabs.org/mix-n-match/
mnm [OPTIONS] {discogs|twitter|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} CONFIDENCE_RANGE... MATCHES
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- CONFIDENCE_RANGE¶
Required argument(s)
- MATCHES¶
Required argument
people¶
Add statements to Wikidata people.
STATEMENTS must be a CSV file. Format: person_QID, PID, value, person_catalog_ID
If the claim already exists, just add a reference.
Example:
$ echo Q312387,P463,Q483407,264375 > joey.csv
$ python -m soweego ingester people discogs joey.csv
Result:
claim (Joey Ramone, member of, Ramones)
reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 264375), (retrieved, today)
people [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS
Options
- -c, --criterion <criterion>¶
Validation criterion used to generate STATEMENTS. Same as the command passed to python -m soweego sync
- Options
links | bio
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- STATEMENTS¶
Required argument
works¶
Add statements to Wikidata works.
STATEMENTS must be a CSV file. Format: work_QID, PID, person_QID, person_target_ID
If the claim already exists, just add a reference.
Example:
$ echo Q4354548,P175,Q5969,139984 > cmon.csv
$ python -m soweego ingester works discogs cmon.csv
Result:
claim (C’mon Everybody, performer, Eddie Cochran)
reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 139984), (retrieved, today)
works [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- STATEMENTS¶
Required argument
Linker¶
python -m soweego linker
Usage: soweego linker [OPTIONS] COMMAND [ARGS]...
Link Wikidata items to target catalog identifiers.
Options:
--help Show this message and exit.
Commands:
baseline Run a rule-based linker.
evaluate Evaluate the performance of a supervised linker.
extract Extract Wikidata links from a target catalog dump.
link Run a supervised linker.
train Train a supervised linker.
baseline¶
Run a rule-based linker.
Available rules:
'perfect' = perfect match on names
'links' = similar match on link tokens
'names' = similar match on name tokens
Run all of them by default.
baseline [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -r, --rule <rule>¶
Activate a specific rule or all of them. Default: all.
- Options
perfect | links | names | all
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
- --dates, --no-dates¶
Check if dates match, when applicable. Default: yes.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
evaluate¶
Evaluate the performance of a supervised linker.
By default, run 5-fold cross-validation and return averaged performance scores.
evaluate [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -k, --k-folds <k_folds>¶
Number of folds, default: 5.
- -s, --single¶
Compute a single evaluation over all k folds, instead of k evaluations.
- -n, --nested¶
Compute a nested cross-validation with hyperparameters tuning via grid search. WARNING: this will take a lot of time.
- -m, --metric <metric>¶
Performance metric for nested cross-validation. Use with '--nested'. Default: f1.
- Options
precision | recall | f1
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
extract¶
Extract Wikidata links from a target catalog dump.
extract [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
link¶
Run a supervised linker.
Build the classification set relevant to the given catalog and entity, then generate links between Wikidata items and catalog identifiers.
Output a gzipped CSV file, format: QID,catalog_ID,confidence_score
You can pass the '-u' flag to upload the output to Wikidata.
A trained model must exist for the given classifier, catalog, and entity. To train one, use:
$ python -m soweego linker train
link [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -t, --threshold <threshold>¶
Probability score threshold, default: 0.5.
- -n, --name-rule¶
Activate post-classification rule on full names: links with different full names will be filtered.
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
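Since the output is a gzipped CSV in the QID,catalog_ID,confidence_score format, it can be consumed with the standard library alone. A minimal sketch, using an in-memory made-up sample instead of a real output file:

```python
import csv
import gzip
import io

# Hypothetical slice of a linker output file, gzip-compressed as documented.
raw = gzip.compress(b"Q446627,266995,0.91\nQ5969,139984,0.87\n")

# Decompress and parse; confidence scores become floats.
with gzip.open(io.BytesIO(raw), "rt") as fin:
    links = [(qid, cid, float(score)) for qid, cid, score in csv.reader(fin)]

# Keep only links at or above the default 0.5 threshold.
confident = [link for link in links if link[2] >= 0.5]
```

With a real run, replace the in-memory buffer with the gzipped file path under the '--dir-io' directory.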
train¶
Train a supervised linker.
Build the training set relevant to the given catalog and entity, then train a model with the given classification algorithm.
train [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -t, --tune¶
Run grid search for hyperparameters tuning.
- -k, --k-folds <k_folds>¶
Number of folds for hyperparameters tuning. Use with '--tune'. Default: 5.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
Pipeline¶
python -m soweego run
run¶
Launch the whole pipeline.
run [OPTIONS] {discogs|imdb|musicbrainz}
Options
- --validator, --no-validator¶
Sync Wikidata to the target catalog. Default: no.
- --importer, --no-importer¶
Import the target catalog dump into the database. Default: yes.
- --linker, --no-linker¶
Link Wikidata items to target catalog identifiers. Default: yes.
- --upload, --no-upload¶
Upload results to Wikidata. Default: yes.
Arguments
- CATALOG¶
Required argument
Validator AKA Sync¶
python -m soweego sync
Usage: soweego sync [OPTIONS] COMMAND [ARGS]...
Sync Wikidata to target catalogs.
Options:
--help Show this message and exit.
Commands:
bio Validate identifiers against biographical data.
ids Check if identifiers are still alive.
links Validate identifiers against links.
works Generate statements about works by people.
bio¶
Validate identifiers against biographical data.
Look for birth/death dates, birth/death places, gender.
Dump 4 output files:
1. catalog IDs to be deprecated. JSON format: {catalog_ID: [list of QIDs]}
2. statements to be added. CSV format: QID,PID,value,catalog_ID
3. shared statements to be referenced. Same format as file #2
4. statements found in Wikidata but not in the target catalog. CSV format: catalog_ID,PID_URL,value,QID_URL
You can pass the '-u' flag to upload the output to Wikidata.
bio [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload the output to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump biographical data gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
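Output files #2 and #3 share the QID,PID,value,catalog_ID layout, so one parser covers both. A minimal sketch with a single made-up row (P569 is Wikidata's date of birth property):

```python
import csv
import io

# Hypothetical row from output file #2 (statements to be added):
# QID,PID,value,catalog_ID.
sample = "Q312387,P569,1951-05-19,264375\n"

# Parse the row into its four documented columns.
rows = list(csv.reader(io.StringIO(sample)))
qid, pid, value, catalog_id = rows[0]
```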
ids¶
Check if identifiers are still alive.
Dump a JSON file of dead ones. Format: { identifier: [ list of QIDs ] }
Dead identifiers should get a deprecated rank in Wikidata: you can pass the '-d' flag to do so.
ids [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -d, --deprecate¶
Deprecate dead identifiers: this changes their rank in Wikidata.
- -s, --sandbox¶
Perform all deprecations on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump identifiers gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
links¶
Validate identifiers against links.
Dump 6 output files:
1. catalog IDs to be deprecated. JSON format: {catalog_ID: [list of QIDs]}
2. third-party IDs to be added. CSV format: QID,third-party_PID,third-party_ID,catalog_ID
3. URLs to be added. CSV format: QID,P2888,URL,catalog_ID
4. third-party IDs to be referenced. Same format as file #2
5. URLs to be referenced. Same format as file #3
6. URLs found in Wikidata but not in the target catalog. CSV format: catalog_ID,URL,QID_URL
You can pass the '-u' flag to upload the output to Wikidata.
The '-b' flag applies a URL blacklist of low-quality Web domains to file #3.
links [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -b, --blacklist¶
Filter low-quality URLs through a blacklist.
- -u, --upload¶
Upload the output to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump URLs gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
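File #3 attaches plain URLs through Wikidata's P2888 (exact match) property. A minimal parsing sketch over one made-up row:

```python
import csv
import io

# Hypothetical row from output file #3 (URLs to be added):
# QID,P2888,URL,catalog_ID. P2888 is Wikidata's "exact match" property.
sample = "Q446627,P2888,https://www.discogs.com/artist/266995,266995\n"

# The sample holds exactly one row; unpack its four columns.
(qid, pid, url, catalog_id), = csv.reader(io.StringIO(sample))
```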
works¶
Generate statements about works by people.
Dump a CSV file of statements. Format: work_QID,PID,person_QID,person_catalog_ID
You can pass the '-u' flag to upload the statements to Wikidata.
works [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload statements to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument