The command line
Note
Start your exploration of the command line interface (CLI) with:
$ python -m soweego
As a reminder, make sure you are inside soweego:
$ cd soweego && ./docker/run.sh
Building soweego
...
root@70c9b4894a30:/app/soweego#
Importer
python -m soweego importer
Usage: soweego importer [OPTIONS] COMMAND [ARGS]...
Import target catalog dumps into a SQL database.
Options:
--help Show this message and exit.
Commands:
check_urls Check for rotten URLs of an imported catalog.
import Download, extract, and import a supported catalog.
check_urls
Check for rotten URLs of an imported catalog.
check_urls [OPTIONS] [discogs|imdb|musicbrainz]
Arguments
- CATALOG: Required argument
import
Download, extract, and import a supported catalog.
import [OPTIONS] [discogs|imdb|musicbrainz]
Options
- --url-check: Check for rotten URLs while importing. Default: no. WARNING: this will dramatically increase the import time.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
Ingester
python -m soweego ingester
Usage: soweego ingester [OPTIONS] COMMAND [ARGS]...
Take soweego output into Wikidata items.
Options:
--help Show this message and exit.
Commands:
delete Delete invalid identifiers.
deprecate Deprecate invalid identifiers.
identifiers Add identifiers.
mnm Upload matches to the Mix'n'match tool.
people Add statements to Wikidata people.
works Add statements to Wikidata works.
delete
Delete invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
delete [OPTIONS] [twitter|imdb|musicbrainz|discogs] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work] INVALID_IDENTIFIERS
Options
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
- INVALID_IDENTIFIERS: Required argument
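The expected INVALID_IDENTIFIERS layout can be sketched in a few lines of Python. The identifiers, QIDs, and file name below are made up for illustration:

```python
import json

# Hypothetical invalid identifiers: each catalog identifier maps to the
# list of Wikidata QIDs it was wrongly linked to.
invalid = {
    "266995": ["Q446627"],
    "139984": ["Q5969"],
}

# Write the file in the format the `delete` command expects:
# { catalog_identifier: [ list of QIDs ] }
with open("invalid.json", "w") as f:
    json.dump(invalid, f)

# Reading it back yields the same mapping
with open("invalid.json") as f:
    loaded = json.load(f)
```

Such a file would then be passed as the last argument, e.g. `python -m soweego ingester delete discogs musician invalid.json`.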
deprecate
Deprecate invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
deprecate [OPTIONS] [twitter|imdb|musicbrainz|discogs] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work] INVALID_IDENTIFIERS
Options
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
- INVALID_IDENTIFIERS: Required argument
identifiers
Add identifiers.
IDENTIFIERS must be a JSON file. Format: { QID: catalog_identifier }
If the identifier already exists, just add a reference.
Example:
$ echo '{ "Q446627": "266995" }' > rhell.json
$ python -m soweego ingester identifiers discogs musician rhell.json
Result:
claim (Richard Hell, Discogs artist ID, 266995)
- reference (based on heuristic, artificial intelligence),
(retrieved, today)
identifiers [OPTIONS] [twitter|imdb|musicbrainz|discogs] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work] IDENTIFIERS
Options
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
- IDENTIFIERS: Required argument
mnm
Upload matches to the Mix'n'match tool.
CONFIDENCE_RANGE must be a pair of floats that indicate the minimum and maximum confidence scores.
MATCHES must be a CSV file path. Format: QID, catalog_identifier, confidence_score
The CSV file can be compressed.
Example:
$ echo Q446627,266995,0.666 > rhell.csv
$ python -m soweego ingester mnm discogs musician 0.3 0.7 rhell.csv
Result: see 'Latest catalogs' at https://tools.wmflabs.org/mix-n-match/
mnm [OPTIONS] [twitter|discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work] CONFIDENCE_RANGE... MATCHES
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
- CONFIDENCE_RANGE: Required argument(s)
- MATCHES: Required argument
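The CONFIDENCE_RANGE pair acts as an inclusive filter over the third CSV column. A minimal sketch of that idea (rows are hypothetical, and soweego's actual filtering code may differ):

```python
import csv
import io

# Hypothetical matches in the documented format:
# QID, catalog_identifier, confidence_score
rows = "Q446627,266995,0.666\nQ5969,139984,0.21\nQ312387,99999,0.95\n"

low, high = 0.3, 0.7  # the CONFIDENCE_RANGE pair
kept = [
    (qid, cid, float(score))
    for qid, cid, score in csv.reader(io.StringIO(rows))
    if low <= float(score) <= high
]
# Only the 0.666 row falls inside [0.3, 0.7]
```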
people
Add statements to Wikidata people.
STATEMENTS must be a CSV file. Format: person_QID, PID, value
If the claim already exists, just add a reference.
Example:
$ echo Q312387,P463,Q483407 > joey.csv
$ python -m soweego ingester people joey.csv
Result:
claim (Joey Ramone, member of, Ramones)
- reference (based on heuristic, artificial intelligence),
(retrieved, today)
people [OPTIONS] STATEMENTS
Options
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
Arguments
- STATEMENTS: Required argument
works
Add statements to Wikidata works.
STATEMENTS must be a CSV file. Format: work_QID, PID, person_QID, person_target_ID
If the claim already exists, just add a reference.
Example:
$ echo Q4354548,P175,Q5969,139984 > cmon.csv
$ python -m soweego ingester works discogs cmon.csv
Result:
claim (C’mon Everybody, performer, Eddie Cochran)
- reference (based on heuristic, artificial intelligence),
(Discogs artist ID, 139984), (retrieved, today)
works [OPTIONS] [twitter|imdb|musicbrainz|discogs] STATEMENTS
Options
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
Arguments
- CATALOG: Required argument
- STATEMENTS: Required argument
Linker
python -m soweego linker
Usage: soweego linker [OPTIONS] COMMAND [ARGS]...
Link Wikidata items to target catalog identifiers.
Options:
--help Show this message and exit.
Commands:
baseline Run a rule-based linker.
evaluate Evaluate the performance of a supervised linker.
extract Extract Wikidata links from a target catalog dump.
link Run a supervised linker.
train Train a supervised linker.
baseline
Run a rule-based linker.
Available rules:
'perfect' = perfect match on names
'links' = similar match on link tokens
'names' = similar match on name tokens
By default, all rules run.
baseline [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -r, --rule <rule>: Activate a specific rule or all of them. Default: all. Options: perfect|links|names|all.
- -u, --upload: Upload links to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
- --dates, --no-dates: Check if dates match, when applicable. Default: yes.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
evaluate
Evaluate the performance of a supervised linker.
By default, run 5-fold cross-validation and return averaged performance scores.
evaluate [OPTIONS] [naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -k, --k-folds <k_folds>: Number of folds. Default: 5.
- -s, --single: Compute a single evaluation over all k folds, instead of k evaluations.
- -n, --nested: Compute a nested cross-validation with hyperparameter tuning via grid search. WARNING: this will take a lot of time.
- -m, --metric <metric>: Performance metric for nested cross-validation. Use with '--nested'. Default: f1. Options: precision|recall|f1.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CLASSIFIER: Required argument
- CATALOG: Required argument
- ENTITY: Required argument
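Averaged k-fold scores simply mean computing the metric on each of the k held-out folds and taking the mean. A toy illustration with hypothetical per-fold F1 scores:

```python
# Hypothetical F1 scores from 5-fold cross-validation, one per held-out fold
fold_f1 = [0.81, 0.79, 0.83, 0.80, 0.82]

# The reported score is the average across folds
average_f1 = sum(fold_f1) / len(fold_f1)
```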
extract
Extract Wikidata links from a target catalog dump.
extract [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -u, --upload: Upload links to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
link
Run a supervised linker.
Build the classification set relevant to the given catalog and entity, then generate links between Wikidata items and catalog identifiers.
Output a gzipped CSV file, format: QID,catalog_ID,confidence_score
You can pass the '-u' flag to upload the output to Wikidata.
A trained model must exist for the given classifier, catalog, and entity. To train one, use:
$ python -m soweego linker train
link [OPTIONS] [naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -t, --threshold <threshold>: Probability score threshold. Default: 0.5.
- -n, --name-rule: Activate post-classification rule on full names: links with different full names will be filtered.
- -u, --upload: Upload links to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CLASSIFIER: Required argument
- CATALOG: Required argument
- ENTITY: Required argument
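Since the output is a gzipped CSV in the QID,catalog_ID,confidence_score format, post-processing it is straightforward. A sketch with made-up rows (the file name and contents are hypothetical):

```python
import csv
import gzip

# Write a small file mimicking the documented output format:
# QID,catalog_ID,confidence_score
with gzip.open("linker_output.csv.gz", "wt") as f:
    f.write("Q446627,266995,0.92\nQ5969,139984,0.41\n")

# Keep only links at or above the default probability threshold of 0.5
threshold = 0.5
with gzip.open("linker_output.csv.gz", "rt") as f:
    confident = [row for row in csv.reader(f) if float(row[2]) >= threshold]
```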
train
Train a supervised linker.
Build the training set relevant to the given catalog and entity, then train a model with the given classification algorithm.
train [OPTIONS] [naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -t, --tune: Run grid search for hyperparameter tuning.
- -k, --k-folds <k_folds>: Number of folds for hyperparameter tuning. Use with '--tune'. Default: 5.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CLASSIFIER: Required argument
- CATALOG: Required argument
- ENTITY: Required argument
Pipeline
python -m soweego run
run
Launch the whole pipeline.
run [OPTIONS] [discogs|imdb|musicbrainz]
Options
- --validator, --no-validator: Sync Wikidata to the target catalog. Default: no.
- --importer, --no-importer: Import the target catalog dump into the database. Default: yes.
- --linker, --no-linker: Link Wikidata items to target catalog identifiers. Default: yes.
- --upload, --no-upload: Upload results to Wikidata. Default: yes.
Arguments
- CATALOG: Required argument
Validator AKA Sync
python -m soweego sync
Usage: soweego sync [OPTIONS] COMMAND [ARGS]...
Sync Wikidata to target catalogs.
Options:
--help Show this message and exit.
Commands:
bio Validate identifiers against biographical data.
ids Check if identifiers are still alive.
links Validate identifiers against links.
works Generate statements about works by people.
bio
Validate identifiers against biographical data.
Look for birth/death dates, birth/death places, gender.
Dump 2 output files:
- target identifiers to be deprecated. Format (JSON): { identifier: [ list of QIDs ] }
- statements to be added. Format (CSV): QID,metadata_PID,value
You can pass the '-u' flag to upload the output to Wikidata.
bio [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -u, --upload: Upload the output to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- --dump-wikidata: Dump biographical data gathered from Wikidata to a JSON file.
- --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
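Both dump formats above can be consumed with the standard library. The contents below are made up for illustration (P569 is Wikidata's 'date of birth' property):

```python
import csv
import io
import json

# Hypothetical deprecation dump, format: { identifier: [ list of QIDs ] }
deprecations = json.loads('{"nm0000001": ["Q446627"]}')

# Hypothetical statements dump, format: QID,metadata_PID,value
statements = list(csv.reader(io.StringIO("Q446627,P569,1949-10-02\n")))
```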
ids
Check if identifiers are still alive.
Dump a JSON file of dead ones. Format: { identifier: [ list of QIDs ] }
Dead identifiers should get a deprecated rank in Wikidata: you can pass the '-d' flag to do so.
ids [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -d, --deprecate: Deprecate dead identifiers: this changes their rank in Wikidata.
- -s, --sandbox: Perform all deprecations on the Wikidata sandbox item Q4115189.
- --dump-wikidata: Dump identifiers gathered from Wikidata to a JSON file.
- --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
links
Validate identifiers against links.
Dump 3 output files:
- target identifiers to be deprecated. Format (JSON): { identifier: [ list of QIDs ] }
- third-party identifiers to be added. Format (CSV): QID,identifier_PID,identifier
- URLs to be added. Format (CSV): QID,P973,URL
You can pass the '-u' flag to upload the output to Wikidata.
links [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -u, --upload: Upload the output to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- --dump-wikidata: Dump URLs gathered from Wikidata to a JSON file.
- --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument
works
Generate statements about works by people.
Dump a CSV file of statements. Format: work_QID,PID,person_QID,person_catalog_ID
You can pass the '-u' flag to upload the statements to Wikidata.
works [OPTIONS] [discogs|imdb|musicbrainz] [writer|actor|band|musician|audiovisual_work|producer|director|musical_work]
Options
- -u, --upload: Upload statements to Wikidata.
- -s, --sandbox: Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>: Input/output directory. Default: /app/shared/.
Arguments
- CATALOG: Required argument
- ENTITY: Required argument