soweego: link Wikidata to large catalogs

soweego is a pipeline that connects Wikidata to large-scale third-party catalogs.

soweego is the only system that makes statisticians, epidemiologists, historians, and computer scientists agree. Why? Because it performs record linkage, data matching, and entity resolution at the same time. Too easy, they all seem to be synonyms!

Oh, soweego also embeds Machine Learning and advocates for Linked Data.

Official Project Pages

soweego is made possible thanks to the Wikimedia Foundation:

  • https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego

  • https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2

Highlights

  • Run the whole pipeline, or

  • use the command line;

  • import large catalogs into a SQL database;

  • gather live Wikidata datasets;

  • connect them to target catalogs via rule-based and supervised linkers;

  • upload links to Wikidata and Mix’n’match;

  • synchronize Wikidata to imported catalogs;

  • enrich Wikidata items with relevant statements.

Get Ready

Install Docker and Compose, then enter soweego:

$ git clone -b v1.1 https://github.com/Wikidata/soweego.git
$ cd soweego
$ ./docker/run.sh
Building soweego
...

root@70c9b4894a30:/app/soweego#

Now it’s too late to get out!

Run the Pipeline

Piece of cake:

:/app/soweego# python -m soweego run CATALOG

Pick CATALOG from discogs, imdb, or musicbrainz.

These steps are executed by default:

  1. import the target catalog into a local database;

  2. link Wikidata to the target with a supervised linker;

  3. synchronize Wikidata to the target.

Results are in /app/shared/results.
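
To run the pipeline over several catalogs in a row, a small wrapper can build the command shown above. This is a minimal sketch, not part of soweego: the helper names build_run_command and run_pipeline are hypothetical, and the code relies only on the python -m soweego run CATALOG invocation and the three catalog names listed above.

```python
import subprocess

# Catalog names accepted by the `run` command, as listed above
CATALOGS = ("discogs", "imdb", "musicbrainz")


def build_run_command(catalog):
    """Return the argv list for `python -m soweego run CATALOG`.

    Hypothetical helper: it is not provided by soweego itself.
    """
    if catalog not in CATALOGS:
        raise ValueError(f"Unknown catalog {catalog!r}, pick one of {CATALOGS}")
    return ["python", "-m", "soweego", "run", catalog]


def run_pipeline(catalog):
    """Launch the whole pipeline for one catalog and fail loudly on errors.

    Assumption: this is called from inside the soweego container,
    in the /app/soweego working directory.
    """
    subprocess.run(build_run_command(catalog), check=True)
```

Inside the container, calling run_pipeline("imdb") and then run_pipeline("discogs") would process the two catalogs one after the other and stop at the first failing run.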

Use the Command Line

You can launch every single soweego action with CLI commands:

:/app/soweego# python -m soweego
Usage: soweego [OPTIONS] COMMAND [ARGS]...

  Link Wikidata to large catalogs.

Options:
  -l, --log-level <TEXT CHOICE>...
                           Module name followed by one of [DEBUG, INFO,
                           WARNING, ERROR, CRITICAL]. Multiple pairs
                           allowed.
  --help                   Show this message and exit.

Commands:
  importer  Import target catalog dumps into a SQL database.
  ingester  Take soweego output into Wikidata items.
  linker    Link Wikidata items to target catalog identifiers.
  run       Launch the whole pipeline.
  sync      Sync Wikidata to target catalogs.

Just two things to remember:

  1. you can always get --help;

  2. each command may have sub-commands.

Find all details in the CLI Documentation.
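
The usage message above places global options such as --log-level before the command, and each --log-level takes a module name followed by a level. The sketch below assembles such an argument list. It assumes nothing beyond that usage message: build_cli_args is a hypothetical helper, and the module name soweego.linker in the usage note is only a guess at what the option accepts.

```python
# Log levels listed in the usage message above
LEVELS = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def build_cli_args(command, *args, log_levels=None):
    """Assemble argv for `python -m soweego [OPTIONS] COMMAND [ARGS]...`.

    Hypothetical helper: `log_levels` maps a module name to a level,
    and each entry becomes one `--log-level MODULE LEVEL` pair.
    """
    argv = ["python", "-m", "soweego"]
    # Global options go before the command, following the usage line
    for module, level in (log_levels or {}).items():
        if level not in LEVELS:
            raise ValueError(f"Unknown log level {level!r}, pick one of {LEVELS}")
        argv += ["--log-level", module, level]
    argv.append(command)
    argv.extend(args)
    return argv
```

For instance, build_cli_args("linker", "--help") yields the argument list that asks the linker command for its help message, while build_cli_args("run", "imdb", log_levels={"soweego.linker": "DEBUG"}) adds one module and level pair ahead of the run command.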

How-tos

  • Run the pipeline
  • Import a new catalog
  • Development and production environments

CLI Documentation

  • The command line
    • Importer
    • Ingester
    • Linker
    • Pipeline
    • Validator AKA Sync

API Documentation

  • importer
    • base_dump_extractor
    • discogs_dump_extractor
    • imdb_dump_extractor
    • musicbrainz_dump_extractor
    • importer
  • models
    • base_entity
    • base_link_entity
    • base_nlp_entity
    • discogs_entity
    • imdb_entity
    • musicbrainz_entity
    • mix_n_match
  • ingester
    • wikidata_bot
    • mix_n_match_client
  • linker
    • workflow
    • blocking
    • features
    • classifiers
    • train
    • link
    • evaluate
  • validator
    • checks
    • enrichment
  • wikidata
    • api_requests
    • sparql_queries

Contribute

Note

The best way to contribute is to import a new catalog.

Please also have a look here:

  • Contribution guidelines
    • Workflow
    • Coding

Experiments & notes

  • Experiments
  • Evaluations
  • Notes on the recordlinkage library

License

The source code is under the terms of the GNU General Public License, version 3.

©MMXIX-present, Marco Fossati. A Wikimedia Foundation project.