Notes on the *recordlinkage* library
====================================

https://recordlinkage.readthedocs.io/


General
-------

-  uses ``pandas`` for data structures, typically the ``DataFrame``,
   ``Series``, and ``MultiIndex`` classes
-  https://pandas.pydata.org/pandas-docs/stable/dsintro.html
-  https://pandas.pydata.org/pandas-docs/stable/advanced.html
-  uses ``jellyfish`` under the hood for edit distances and phonetic
   algorithms.


Data format
-----------

-  https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
-  https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
-  uses ``pandas.DataFrame`` to represent datasets. It's basically a
   table with column headers
-  conversion from a ``dict`` is easy: key = column header, value =
   cell
-  a value is a list, so ``defaultdict(list)`` is helpful

::

   dataset = pandas.DataFrame(
     {
       'catalog_id': [666, 777, 888],
       'name': ['huey', 'dewey', 'louie'],
       ...
     }
   )

-  remember the order of values, i.e., ``666`` -> ``'huey'``


Cleaning
--------

-  AKA **pre-processing** AKA **normalization** AKA **standardization**;
-  https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html
-  uses ``pandas.Series``, a list-like object
-  the ``clean`` function seems interesting at a first glimpse
-  by default, it **removes text inside brackets**.
   Might be useful, but also trivial to re-implement
-  terrible default regex, **removes everything that is not an ASCII
   letter!** Non-ASCII strings are just deleted! Use a custom regex or
   ``None`` in ``replace_by_none=`` kwarg to avoid this
-  nice ASCII folding via ``strip_accents='ascii'``, **not done** by
   default
-  ``strip_accents='unicode'`` keeps intact some Unicode chars, e.g.,
   ``œ``
-  non-latin scripts are just not handled
-  the ``phonetic`` function has the same problems as in ``jellyfish``,
   see `#79 <https://github.com/Wikidata/soweego/issues/79>`_.

>>> import pandas
>>> from recordlinkage.preprocessing import clean
>>> names = pandas.Series(
>>>  [
>>>    'хартшорн, чарльз',
>>>    'charles hartshorne',
>>>    'チャールズ・ハートショーン',
>>>    'تشارلز هارتشورن',
>>>    '찰스 하츠혼',
>>>    'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
>>>  ]
>>> )
>>> clean(names)
0
1    charles hartshorne
2
3
4
5
dtype: object
>>> clean(names, replace_by_none=None, strip_accents='ascii')
0                                  ,
1                 charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object


Indexing
--------

-  AKA **blocking** AKA **candidate acquisition**
-  https://recordlinkage.readthedocs.io/en/latest/ref-index.html
-  make pairs of records to reduce the space complexity (quadratic)
-  a simple call to the ``Index.block(FIELD)`` function is not enough
   for names, as it makes pairs that **exactly** agree, i.e., **like an
   exact match**

>>> import recordlinkage
>>> index = recordlinkage.Index()
>>> index.block('name')
>>> candidate_pairs = index.index(source_dataset, target_dataset)

-  we could inject the MariaDB full-text index
   `#126 <https://github.com/Wikidata/soweego/issues/126>`_ as a
   **user-defined algorithm**;
-  https://recordlinkage.readthedocs.io/en/latest/ref-index.html#user-defined-algorithms
-  https://recordlinkage.readthedocs.io/en/latest/ref-index.html#examples


Comparing
---------

-  AKA **feature extraction**
-  https://recordlinkage.readthedocs.io/en/latest/ref-compare.html
-  probably useful for
   `#143 <https://github.com/Wikidata/soweego/issues/143>`_;
-  the ``Compare.date`` function can be useful for dates:
   https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.Date
-  the ``Compare.string`` function implements ``jellyfish`` string edit
   distances and others:
   https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.String
-  the string **edit distance feature** is **binary**, not **scalar**:
   ``feature_vectors.sum(1).value_counts()`` below shows that
-  the ``threshold`` kwarg gives a binary score for pairs above or below
   its value, i.e., ``1`` or ``0``. **It's not really a threshold**
-  not clear how the feature is fired by default, i.e.,
   ``threshold=None``
-  better always use the ``threshold`` kwarg then, typically ``3`` for
   Levenshtein and ``0.85`` for Jaro-Winkler

>>> import recordlinkage
>>> comp = recordlinkage.Compare()
>>> comp.string('name', 'label', threshold=3)
>>> feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
>>> print(feature_vectors.sum(1).value_counts())


Classification
--------------

-  train with ``fit(training_feature_vectors, match_index)``
-  classify with ``predict(classification_feature_vectors)``
-  we could give SVM a try:
   https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html#Support-Vector-Machines
-  adapters are especially useful:
   https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#adapters
-  **it is possible to inject a neural network with ``keras``**:
   https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#recordlinkage.adapters.KerasAdapter
-  remember to set comparison of fields with missing values to ``0``,
   i.e., pair disagreement:

   -  *Most classifiers can not handle comparison vectors with missing
      values.*
   -  no worries, ``compare.string`` does that by default


Training workflow
-----------------

INPUT = training set = existing QIDs with target IDs = dict
``{ QID: target_ID }``.

1. get the QID statements from Wikidata
2. query MariaDB for target ID data
3. load both into 2 ``pandas.DataFrame``
4. pre-process
5. make the index with blocking -> ``match_index`` arg
6. feature extraction with comparison -> ``training_feature_vectors``
   arg.


Naïve Bayes
-----------

-  https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#recordlinkage.NaiveBayesClassifier
-  https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html
-  **code example** at
   https://github.com/J535D165/recordlinkage/blob/master/examples/supervised_learning_prob.py
-  ``recordlinkage.NaiveBayesClassifier`` class
-  works with **binary features**, also explains why the edit distance
   feature is binary
-  the ``binarize`` kwarg translates into a threshold:
   features above and below this value become ``1`` and ``0`` respectively
-  the code example uses ``binary_vectors`` and sets toy ``m`` and ``u``
   probabilities:

   1. are comparison vectors (point 6 of the training workflow) the
      expected input?
   2. should we compute ``m`` and ``u`` on our own as well?