.. soweego documentation master file, created by
   sphinx-quickstart on Mon Jun 3 13:12:22 2019.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

soweego: link Wikidata to large catalogs
========================================
.. image:: https://results.pre-commit.ci/badge/github/Wikidata/soweego/master.svg
   :target: https://results.pre-commit.ci/latest/github/Wikidata/soweego/master
   :alt: pre-commit CI status

.. image:: https://readthedocs.org/projects/soweego/badge/?version=latest
   :target: https://soweego.readthedocs.io/en/latest/?badge=latest
   :alt: Documentation status

.. image:: https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336
   :target: https://pycqa.github.io/isort/
   :alt: isort imports

.. image:: https://img.shields.io/github/license/Wikidata/soweego.svg
   :target: https://www.gnu.org/licenses/gpl-3.0.html
   :alt: License
*soweego* is a pipeline that connects Wikidata to large-scale third-party catalogs.

*soweego* is the only system that makes *statisticians, epidemiologists, historians,* and *computer scientists* agree.
Why? Because it performs *record linkage, data matching,* and *entity resolution* at the same time.
Too easy, they all seem to be synonyms!

Oh, *soweego* also embeds Machine Learning and advocates for Linked Data.
Official Project Pages
----------------------
*soweego* is made possible thanks to the Wikimedia Foundation:

- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego
- https://meta.wikimedia.org/wiki/Grants:Project/Hjfocs/soweego_2
Highlights
----------
- Run the whole :ref:`pipeline <run-the-pipeline>`, or
- use the :ref:`command line <use-the-command-line>`;
- import large catalogs into a SQL database;
- gather live Wikidata datasets;
- connect them to target catalogs via *rule-based* and *supervised* linkers;
- upload links to Wikidata and Mix'n'match;
- synchronize Wikidata to imported catalogs;
- enrich Wikidata items with relevant statements.
Get Ready
---------
Install Docker and Compose, then enter *soweego*::

   $ git clone -b v1.1 https://github.com/Wikidata/soweego.git
   $ cd soweego
   $ ./docker/run.sh
   Building soweego
   ...
   root@70c9b4894a30:/app/soweego#

Now it's too late to get out!
.. _run-the-pipeline:
Run the Pipeline
----------------
Piece of cake:

.. code-block:: text

   :/app/soweego# python -m soweego run CATALOG

Pick ``CATALOG`` from ``discogs``, ``imdb``, or ``musicbrainz``.

These steps are executed by default:

1. import the target catalog into a local database;
2. link Wikidata to the target with a supervised linker;
3. synchronize Wikidata to the target.

Results are in ``/app/shared/results``.
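
To run the pipeline over several catalogs in a row, the command above can be wrapped in a short script.
The following is a minimal sketch and not part of *soweego*: ``SUPPORTED_CATALOGS``, ``build_run_command``, and ``run_all`` are hypothetical names, and the script is assumed to be started from ``/app/soweego`` inside the container.

.. code-block:: python

   import subprocess
   import sys

   # Catalogs listed in this section
   SUPPORTED_CATALOGS = ('discogs', 'imdb', 'musicbrainz')


   def build_run_command(catalog):
       """Return the argument list for ``python -m soweego run CATALOG``."""
       if catalog not in SUPPORTED_CATALOGS:
           raise ValueError(
               f'Unknown catalog {catalog!r}, pick one of {SUPPORTED_CATALOGS}'
           )
       # sys.executable is the interpreter that runs this script
       return [sys.executable, '-m', 'soweego', 'run', catalog]


   def run_all(catalogs=SUPPORTED_CATALOGS):
       """Run the pipeline on each catalog, one after the other."""
       for catalog in catalogs:
           # check=True raises CalledProcessError as soon as one run fails
           subprocess.run(build_run_command(catalog), check=True)

Calling ``run_all()`` launches one pipeline run per catalog and stops at the first failure, so a broken import does not go unnoticed.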
.. _use-the-command-line:
Use the Command Line
--------------------
You can launch every single *soweego* action with CLI commands:

.. code-block:: text

   :/app/soweego# python -m soweego
   Usage: soweego [OPTIONS] COMMAND [ARGS]...

     Link Wikidata to large catalogs.

   Options:
     -l, --log-level ...
                                     Module name followed by one of [DEBUG, INFO,
                                     WARNING, ERROR, CRITICAL]. Multiple pairs
                                     allowed.
     --help                          Show this message and exit.

   Commands:
     importer  Import target catalog dumps into a SQL database.
     ingester  Take soweego output into Wikidata items.
     linker    Link Wikidata items to target catalog identifiers.
     run       Launch the whole pipeline.
     sync      Sync Wikidata to target catalogs.

Just two things to remember:

1. you can always get ``--help``;
2. each command may have sub-commands.

Find all details in the :ref:`cli_docs`.
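
The help message above implies a fixed argument order: global options such as ``--log-level`` go before the command, and ``--help`` can be appended at any level.
The following sketch assembles such calls programmatically. ``build_command`` is a hypothetical helper, and the module name ``soweego.linker`` in the usage lines is an assumption made only for illustration.

.. code-block:: python

   # Levels listed in the help message
   LOG_LEVELS = ('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL')


   def build_command(*args, log_levels=None, show_help=False):
       """Return the argument list for a soweego CLI call.

       ``args`` holds the command, sub-commands, and their arguments.
       ``log_levels`` maps module names to log levels.
       """
       command = ['python', '-m', 'soweego']
       # Global options precede the command, as in the usage line
       for module, level in (log_levels or {}).items():
           if level not in LOG_LEVELS:
               raise ValueError(f'{level!r} is not one of {LOG_LEVELS}')
           # One (module, level) pair per option; multiple pairs are allowed
           command += ['--log-level', module, level]
       command += list(args)
       if show_help:
           command.append('--help')
       return command


   # Ask for the help message of the linker command
   build_command('linker', show_help=True)
   # → ['python', '-m', 'soweego', 'linker', '--help']

   # Run the pipeline with debug logs for one (assumed) module
   build_command('run', 'imdb', log_levels={'soweego.linker': 'DEBUG'})
   # → ['python', '-m', 'soweego', '--log-level', 'soweego.linker', 'DEBUG', 'run', 'imdb']

The returned list can be passed directly to ``subprocess.run``.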
How-tos
-------

.. toctree::
   :maxdepth: 1

   pipeline
   new_catalog
   dev_prod
.. _cli_docs:

CLI Documentation
-----------------

.. toctree::
   :maxdepth: 2

   cli
API Documentation
-----------------

.. toctree::
   :maxdepth: 2

   importer
   models
   ingester
   linker
   validator
   wikidata
Contribute
----------

.. note:: the best way is to :ref:`new`.

Please also have a look here:

.. toctree::
   :maxdepth: 2

   contribute
Experiments & notes
-------------------

.. toctree::
   :maxdepth: 1

   experiments
   evaluations
   recordlinkage
License
-------
The source code is released under the terms of the `GNU General Public License, version 3 <https://www.gnu.org/licenses/gpl-3.0.html>`_.