.. soweego documentation master file, created by
   sphinx-quickstart on Mon Jun 3 13:12:22 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

soweego: link Wikidata to large catalogs
========================================

.. image:: https://results.pre-commit.ci/badge/github/Wikidata/soweego/master.svg
   :target: https://results.pre-commit.ci/latest/github/Wikidata/soweego/master
   :alt: pre-commit CI status

.. image:: https://readthedocs.org/projects/soweego/badge/?version=latest
   :target: https://soweego.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation status

.. image:: https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336
   :target: https://pycqa.github.io/isort/
   :alt: isort imports

.. image:: https://img.shields.io/github/license/Wikidata/soweego.svg
   :target: https://www.gnu.org/licenses/gpl-3.0.html
   :alt: License

*soweego* is a pipeline that connects `Wikidata `_ to large-scale third-party catalogs.

*soweego* is the only system that makes *statisticians, epidemiologists, historians,* and *computer scientists* agree.
Why? Because it performs *record linkage, data matching,* and *entity resolution* at the same time.
Too easy, they all seem to be `synonyms `_!

Oh, *soweego* also embeds `Machine Learning `_ and advocates for `Linked Data `_.

Official Project Pages
----------------------

*soweego* is made possible thanks to the `Wikimedia Foundation `_:

- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego
- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2

Highlights
----------

- Run the whole :ref:`pipeline `, or
- use the :ref:`command line `;
- :mod:`import ` large catalogs into a SQL database;
- :mod:`gather ` live Wikidata datasets;
- :mod:`connect ` them to target catalogs via *rule-based* and *supervised* linkers;
- :mod:`upload ` links to Wikidata and `Mix'n'match `_;
- :mod:`synchronize ` Wikidata to imported catalogs;
- :mod:`enrich ` Wikidata items with relevant statements.

Get Ready
---------

Install `Docker `_ and `Compose `_, then enter *soweego*::

   $ git clone -b v1.1 https://github.com/Wikidata/soweego.git
   $ cd soweego
   $ ./docker/run.sh
   Building soweego
   ...

   root@70c9b4894a30:/app/soweego#

Now it's too late to get out!

.. _run-the-pipeline:

Run the Pipeline
----------------

Piece of cake:

.. code-block:: text

   :/app/soweego# python -m soweego run CATALOG

Pick ``CATALOG`` from ``discogs``, ``imdb``, or ``musicbrainz``.

These steps are executed by default:

1. import the target catalog into a local database;
2. link Wikidata to the target with a supervised linker;
3. synchronize Wikidata to the target.

Results are in ``/app/shared/results``.

.. _use-the-command-line:

Use the Command Line
--------------------

You can launch every single *soweego* action with CLI commands:

.. code-block:: text

   :/app/soweego# python -m soweego
   Usage: soweego [OPTIONS] COMMAND [ARGS]...

     Link Wikidata to large catalogs.

   Options:
     -l, --log-level ...  Module name followed by one of
                          [DEBUG, INFO, WARNING, ERROR,
                          CRITICAL]. Multiple pairs allowed.
     --help               Show this message and exit.

   Commands:
     importer  Import target catalog dumps into a SQL database.
     ingester  Take soweego output into Wikidata items.
     linker    Link Wikidata items to target catalog identifiers.
     run       Launch the whole pipeline.
     sync      Sync Wikidata to target catalogs.

Just two things to remember:

1. you can always get ``--help``;
2. each command may have sub-commands.

Find all details in the :ref:`cli_docs`.

How-tos
-------

.. toctree::
   :maxdepth: 1

   pipeline
   new_catalog
   dev_prod

.. _cli_docs:

CLI Documentation
-----------------

.. toctree::
   :maxdepth: 2

   cli

API Documentation
-----------------

.. toctree::
   :maxdepth: 2

   importer
   models
   ingester
   linker
   validator
   wikidata

Contribute
----------

.. note:: the best way is to :ref:`new`.

Please also have a look here:

.. toctree::
   :maxdepth: 2

   contribute

Experiments & notes
-------------------

.. toctree::
   :maxdepth: 1

   experiments
   evaluations
   recordlinkage

License
-------

The source code is under the terms of the `GNU General Public License, version 3 `_.