Notes on the recordlinkage library

https://recordlinkage.readthedocs.io/

General

Data format

dataset = pandas.DataFrame(
  {
    'catalog_id': [666, 777, 888],
    'name': ['huey', 'dewey', 'louie'],
    ...
  }
)
  • remember that values align by position across columns, i.e., 666 -> 'huey'

Cleaning

  • AKA pre-processing AKA normalization AKA standardization

  • https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html

  • uses pandas.Series, a list-like object

  • the clean function seems interesting at first glance

  • by default, it removes text inside brackets. Might be useful, but also trivial to re-implement

  • terrible default regex: it removes everything that is not an ASCII letter, so non-ASCII strings are just deleted! Pass a custom regex or None via the replace_by_none= kwarg to avoid this

  • nice ASCII folding via strip_accents='ascii', not done by default

  • strip_accents='unicode' keeps some Unicode chars intact, e.g., œ

  • non-Latin scripts are just not handled

  • the phonetic function has the same problems as in jellyfish, see #79 (a usage sketch follows the clean example below).

>>> import pandas
>>> from recordlinkage.preprocessing import clean
>>> names = pandas.Series(
...     [
...         'хартшорн, чарльз',
...         'charles hartshorne',
...         'チャールズ・ハートショーン',
...         'تشارلز هارتشورن',
...         '찰스 하츠혼',
...         'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
...     ]
... )
>>> clean(names)
0
1    charles hartshorne
2
3
4
5
dtype: object
>>> clean(names, replace_by_none=None, strip_accents='ascii')
0                                  ,
1                 charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object

Indexing

  • AKA blocking AKA candidate acquisition

  • https://recordlinkage.readthedocs.io/en/latest/ref-index.html

  • makes pairs of records to reduce the search space, which is otherwise quadratic in the number of records

  • a simple call to the Index.block(FIELD) function is not enough for names, as it only pairs records whose FIELD values agree exactly, i.e., it behaves like an exact match (a fuzzier alternative is sketched after the snippet below)

>>> import recordlinkage
>>> index = recordlinkage.Index()
>>> index.block('name')  # candidate pairs must agree exactly on 'name'
>>> candidate_pairs = index.index(source_dataset, target_dataset)
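A fuzzier alternative for name fields, as noted above: a minimal sketch of sorted neighbourhood indexing, where the window size is an assumption to tune:

>>> import recordlinkage
>>> index = recordlinkage.Index()
>>> # pairs records whose 'name' values fall inside a sliding window over the
>>> # sorted values, instead of requiring exact agreement
>>> index.sortedneighbourhood('name', window=9)
>>> candidate_pairs = index.index(source_dataset, target_dataset)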

Comparing

>>> import recordlinkage
>>> comp = recordlinkage.Compare()
>>> # string similarity is normalized to [0, 1], so the binarization
>>> # threshold must be <= 1 (the original note had threshold=3)
>>> comp.string('name', 'label', threshold=0.85)
>>> feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
>>> print(feature_vectors.sum(axis=1).value_counts())
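The summed feature vector doubles as a crude score per candidate pair; a rule-of-thumb cut, with the cutoff an assumption to adapt (only one comparison was added above, so the sum is 0 or 1):

>>> matches = feature_vectors[feature_vectors.sum(axis=1) >= 1]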

Classification

Training workflow

INPUT = training set = existing QIDs with target IDs = dict { QID: target_ID }.

  1. get the QID statements from Wikidata

  2. query MariaDB for target ID data

  3. load both into 2 pandas.DataFrame

  4. pre-process

  5. make the index with blocking -> match_index arg

  6. feature extraction with comparison -> training_feature_vectors arg (see the sketch below).
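A minimal sketch of steps 4-6 plus classifier fitting (see the Naïve Bayes heading below), assuming hypothetical DataFrame names wikidata_df and target_df from steps 1-3, each with a 'name' column, and match_index as a pandas.MultiIndex of known matching pairs built from the { QID: target_ID } dict:

>>> import recordlinkage
>>> from recordlinkage.preprocessing import clean
>>> # 4. pre-process: custom regex to keep non-ASCII, see Cleaning above
>>> wikidata_df['name'] = clean(wikidata_df['name'], replace_by_none=None, strip_accents='ascii')
>>> target_df['name'] = clean(target_df['name'], replace_by_none=None, strip_accents='ascii')
>>> # 5. blocking -> candidate pairs
>>> index = recordlinkage.Index()
>>> index.block('name')
>>> candidate_pairs = index.index(wikidata_df, target_df)
>>> # 6. feature extraction -> training_feature_vectors
>>> comp = recordlinkage.Compare()
>>> comp.string('name', 'name', threshold=0.85)
>>> training_feature_vectors = comp.compute(candidate_pairs, wikidata_df, target_df)
>>> # fit and apply a supervised classifier
>>> classifier = recordlinkage.NaiveBayesClassifier()
>>> classifier.fit(training_feature_vectors, match_index)
>>> predictions = classifier.predict(training_feature_vectors)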

Naïve Bayes