The command line¶
Note
Start exploring the command line interface (CLI) with
$ python -m soweego
As a reminder, make sure you are inside soweego:
$ cd soweego && ./docker/run.sh
Building soweego
...
root@70c9b4894a30:/app/soweego#
Importer¶
python -m soweego importer
Usage: soweego importer [OPTIONS] COMMAND [ARGS]...
Import target catalog dumps into a SQL database.
Options:
--help Show this message and exit.
Commands:
check_urls Check for rotten URLs of an imported catalog.
import Download, extract, and import a supported catalog.
check_urls¶
Check for rotten URLs of an imported catalog.
For every catalog entity, dump rotten URLs to a file. CSV format: URL,catalog_ID
Use '-d' to drop rotten URLs from the DB on the fly.
check_urls [OPTIONS] {discogs|imdb|musicbrainz}
Options
- -d, --drop¶
Drop rotten URLs from the DB.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
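The rotten-URL dump is a plain two-column CSV (URL,catalog_ID), so it can be post-processed without soweego itself. A minimal Python sketch, with made-up sample rows:

```python
import csv
import io

# Hypothetical sample of a check_urls dump: one rotten URL per row,
# columns are URL,catalog_ID (no header).
sample = "http://example.org/dead-page,266995\nhttp://example.org/gone,139984\n"

# Parse the dump into (URL, catalog ID) pairs.
rotten = [tuple(row) for row in csv.reader(io.StringIO(sample))]
```

In practice you would read the file dumped under the '--dir-io' directory instead of an in-memory string.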
import¶
Download, extract, and import a supported catalog.
import [OPTIONS] {discogs|imdb|musicbrainz}
Options
- --url-check¶
Check for rotten URLs while importing. Default: no. WARNING: this will dramatically increase the import time.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
Ingester¶
python -m soweego ingester
Usage: soweego ingester [OPTIONS] COMMAND [ARGS]...
Take soweego output into Wikidata items.
Options:
--help Show this message and exit.
Commands:
delete Delete invalid identifiers.
deprecate Deprecate invalid identifiers.
identifiers Add identifiers.
mnm Upload matches to the Mix'n'match tool.
people Add statements to Wikidata people.
works Add statements to Wikidata works.
delete¶
Delete invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
delete [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} INVALID_IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- INVALID_IDENTIFIERS¶
Required argument
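The INVALID_IDENTIFIERS file is easy to produce programmatically. A minimal sketch that writes the expected { catalog_identifier: [ list of QIDs ] } shape (the identifier and QID below are made-up examples):

```python
import json
from pathlib import Path

# Hypothetical mapping: a catalog identifier to the QIDs that wrongly hold it.
invalid = {"266995": ["Q446627"]}

# Serialize in the expected { catalog_identifier: [ list of QIDs ] } format.
Path("invalid.json").write_text(json.dumps(invalid))
</antml>```

The file can then be passed as the last argument, e.g. `python -m soweego ingester delete discogs musician invalid.json`.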
deprecate¶
Deprecate invalid identifiers.
INVALID_IDENTIFIERS must be a JSON file. Format: { catalog_identifier: [ list of QIDs ] }
deprecate [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} INVALID_IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- INVALID_IDENTIFIERS¶
Required argument
identifiers¶
Add identifiers.
IDENTIFIERS must be a JSON file. Format: { QID: catalog_identifier }
If the identifier already exists, just add a reference.
Example:
$ echo '{ "Q446627": "266995" }' > rhell.json
$ python -m soweego ingester identifiers discogs musician rhell.json
Result:
claim (Richard Hell, Discogs artist ID, 266995)
reference (based on heuristic, artificial intelligence), (retrieved, today)
identifiers [OPTIONS] {twitter|imdb|discogs|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} IDENTIFIERS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- IDENTIFIERS¶
Required argument
mnm¶
Upload matches to the Mix’n’match tool.
CONFIDENCE_RANGE must be a pair of floats that indicate the minimum and maximum confidence scores.
MATCHES must be a CSV file path. Format: QID, catalog_identifier, confidence_score
The CSV file can be compressed.
Example:
$ echo Q446627,266995,0.666 > rhell.csv
$ python -m soweego ingester mnm discogs musician 0.3 0.7 rhell.csv
Result: see 'Latest catalogs' at https://tools.wmflabs.org/mix-n-match/
mnm [OPTIONS] {discogs|twitter|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director} CONFIDENCE_RANGE... MATCHES
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
- CONFIDENCE_RANGE¶
Required argument(s)
- MATCHES¶
Required argument
people¶
Add statements to Wikidata people.
STATEMENTS must be a CSV file. Format: person_QID, PID, value, person_catalog_ID
If the claim already exists, just add a reference.
Example:
$ echo Q312387,P463,Q483407,264375 > joey.csv
$ python -m soweego ingester people discogs joey.csv
Result:
claim (Joey Ramone, member of, Ramones)
reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 264375), (retrieved, today)
people [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS
Options
- -c, --criterion <criterion>¶
Validation criterion used to generate STATEMENTS. Same as the command passed to python -m soweego sync
- Options
links | bio
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- STATEMENTS¶
Required argument
works¶
Add statements to Wikidata works.
STATEMENTS must be a CSV file. Format: work_QID, PID, person_QID, person_target_ID
If the claim already exists, just add a reference.
Example:
$ echo Q4354548,P175,Q5969,139984 > cmon.csv
$ python -m soweego ingester works discogs cmon.csv
Result:
claim (C’mon Everybody, performer, Eddie Cochran)
reference (based on heuristic, record linkage), (stated in, Discogs), (Discogs artist ID, 139984), (retrieved, today)
works [OPTIONS] {twitter|imdb|discogs|musicbrainz} STATEMENTS
Options
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
Arguments
- CATALOG¶
Required argument
- STATEMENTS¶
Required argument
Linker¶
python -m soweego linker
Usage: soweego linker [OPTIONS] COMMAND [ARGS]...
Link Wikidata items to target catalog identifiers.
Options:
--help Show this message and exit.
Commands:
baseline Run a rule-based linker.
evaluate Evaluate the performance of a supervised linker.
extract Extract Wikidata links from a target catalog dump.
link Run a supervised linker.
train Train a supervised linker.
baseline¶
Run a rule-based linker.
Available rules:
'perfect' = perfect match on names
'links' = similar match on link tokens
'names' = similar match on name tokens
Run all of them by default.
baseline [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -r, --rule <rule>¶
Activate a specific rule or all of them. Default: all.
- Options
perfect | links | names | all
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
- --dates, --no-dates¶
Check if dates match, when applicable. Default: yes.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
evaluate¶
Evaluate the performance of a supervised linker.
By default, run 5-fold cross-validation and return averaged performance scores.
evaluate [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -k, --k-folds <k_folds>¶
Number of folds, default: 5.
- -s, --single¶
Compute a single evaluation over all k folds, instead of k evaluations.
- -n, --nested¶
Compute a nested cross-validation with hyperparameters tuning via grid search. WARNING: this will take a lot of time.
- -m, --metric <metric>¶
Performance metric for nested cross-validation. Use with '--nested'. Default: f1.
- Options
precision | recall | f1
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
extract¶
Extract Wikidata links from a target catalog dump.
extract [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
link¶
Run a supervised linker.
Build the classification set relevant to the given catalog and entity, then generate links between Wikidata items and catalog identifiers.
Output a gzipped CSV file, format: QID,catalog_ID,confidence_score
You can pass the '-u' flag to upload the output to Wikidata.
A trained model must exist for the given classifier, catalog, and entity. To train one, use:
$ python -m soweego linker train
link [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -t, --threshold <threshold>¶
Probability score threshold, default: 0.5.
- -n, --name-rule¶
Activate post-classification rule on full names: links with different full names will be filtered.
- -u, --upload¶
Upload links to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
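Since the output is a gzipped CSV in the QID,catalog_ID,confidence_score format, it can be consumed with the standard library alone. A minimal sketch, using an in-memory made-up sample instead of a real output file:

```python
import csv
import gzip
import io

# Hypothetical slice of a linker output file, gzip-compressed as documented.
raw = gzip.compress(b"Q446627,266995,0.91\nQ5969,139984,0.87\n")

# Decompress and parse; confidence scores become floats.
with gzip.open(io.BytesIO(raw), "rt") as fin:
    links = [(qid, cid, float(score)) for qid, cid, score in csv.reader(fin)]

# Keep only links at or above the default 0.5 threshold.
confident = [link for link in links if link[2] >= 0.5]
```

With a real run, replace the in-memory buffer with the gzipped file path under the '--dir-io' directory.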
train¶
Train a supervised linker.
Build the training set relevant to the given catalog and entity, then train a model with the given classification algorithm.
train [OPTIONS] {naive_bayes|logistic_regression|support_vector_machines|linear_support_vector_machines|random_forest|single_layer_perceptron|multi_layer_perceptron|voting_classifier|gated_classifier|stacked_classifier|nb|lr|svm|lsvm|rf|slp|mlp|vc|gc|sc} {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -t, --tune¶
Run grid search for hyperparameters tuning.
- -k, --k-folds <k_folds>¶
Number of folds for hyperparameters tuning. Use with '--tune'. Default: 5.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CLASSIFIER¶
Required argument
- CATALOG¶
Required argument
- ENTITY¶
Required argument
Pipeline¶
python -m soweego run
run¶
Launch the whole pipeline.
run [OPTIONS] {discogs|imdb|musicbrainz}
Options
- --validator, --no-validator¶
Sync Wikidata to the target catalog. Default: no.
- --importer, --no-importer¶
Import the target catalog dump into the database. Default: yes.
- --linker, --no-linker¶
Link Wikidata items to target catalog identifiers. Default: yes.
- --upload, --no-upload¶
Upload results to Wikidata. Default: yes.
Arguments
- CATALOG¶
Required argument
Validator AKA Sync¶
python -m soweego sync
Usage: soweego sync [OPTIONS] COMMAND [ARGS]...
Sync Wikidata to target catalogs.
Options:
--help Show this message and exit.
Commands:
bio Validate identifiers against biographical data.
ids Check if identifiers are still alive.
links Validate identifiers against links.
works Generate statements about works by people.
bio¶
Validate identifiers against biographical data.
Look for birth/death dates, birth/death places, gender.
Dump 4 output files:
1. catalog IDs to be deprecated. JSON format: {catalog_ID: [list of QIDs]}
2. statements to be added. CSV format: QID,PID,value,catalog_ID
3. shared statements to be referenced. Same format as file #2
4. statements found in Wikidata but not in the target catalog. CSV format: catalog_ID,PID_URL,value,QID_URL
You can pass the '-u' flag to upload the output to Wikidata.
bio [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload the output to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump biographical data gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
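Output files #2 and #3 share the QID,PID,value,catalog_ID layout, so one parser covers both. A minimal sketch with a single made-up row (P569 is Wikidata's date of birth property):

```python
import csv
import io

# Hypothetical row from output file #2 (statements to be added):
# QID,PID,value,catalog_ID.
sample = "Q312387,P569,1951-05-19,264375\n"

# Parse the row into its four documented columns.
rows = list(csv.reader(io.StringIO(sample)))
qid, pid, value, catalog_id = rows[0]
```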
ids¶
Check if identifiers are still alive.
Dump a JSON file of dead ones. Format: { identifier: [ list of QIDs ] }
Dead identifiers should get a deprecated rank in Wikidata: you can pass the '-d' flag to do so.
ids [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -d, --deprecate¶
Deprecate dead identifiers: this changes their rank in Wikidata.
- -s, --sandbox¶
Perform all deprecations on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump identifiers gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
links¶
Validate identifiers against links.
Dump 6 output files:
1. catalog IDs to be deprecated. JSON format: {catalog_ID: [list of QIDs]}
2. third-party IDs to be added. CSV format: QID,third-party_PID,third-party_ID,catalog_ID
3. URLs to be added. CSV format: QID,P2888,URL,catalog_ID
4. third-party IDs to be referenced. Same format as file #2
5. URLs to be referenced. Same format as file #3
6. URLs found in Wikidata but not in the target catalog. CSV format: catalog_ID,URL,QID_URL
You can pass the '-u' flag to upload the output to Wikidata.
The '-b' flag applies a URL blacklist of low-quality Web domains to file #3.
links [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -b, --blacklist¶
Filter low-quality URLs through a blacklist.
- -u, --upload¶
Upload the output to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q13406268.
- --dump-wikidata¶
Dump URLs gathered from Wikidata to a Python pickle.
- --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument
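File #3 attaches plain URLs through Wikidata's P2888 (exact match) property. A minimal parsing sketch over one made-up row:

```python
import csv
import io

# Hypothetical row from output file #3 (URLs to be added):
# QID,P2888,URL,catalog_ID. P2888 is Wikidata's "exact match" property.
sample = "Q446627,P2888,https://www.discogs.com/artist/266995,266995\n"

# The sample holds exactly one row; unpack its four columns.
(qid, pid, url, catalog_id), = csv.reader(io.StringIO(sample))
```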
works¶
Generate statements about works by people.
Dump a CSV file of statements. Format: work_QID,PID,person_QID,person_catalog_ID
You can pass the '-u' flag to upload the statements to Wikidata.
works [OPTIONS] {discogs|imdb|musicbrainz} {writer|actor|musical_work|band|producer|musician|audiovisual_work|director}
Options
- -u, --upload¶
Upload statements to Wikidata.
- -s, --sandbox¶
Perform all edits on the Wikidata sandbox item Q4115189.
- -d, --dir-io <dir_io>¶
Input/output directory, default: work.
Arguments
- CATALOG¶
Required argument
- ENTITY¶
Required argument