Notes on the *recordlinkage* library
====================================

General
-------

- uses ``pandas`` for data structures, typically the ``DataFrame``, ``Series``, and ``MultiIndex`` classes
- uses ``jellyfish`` under the hood for edit distances and phonetic algorithms

Data format
-----------

- uses ``pandas.DataFrame`` to represent datasets. It's basically a table with column headers
- conversion from a ``dict`` is easy: key = column header, value = the cells of that column
- each value is a list, so ``defaultdict(list)`` is helpful when building the ``dict``

::

    import pandas

    dataset = pandas.DataFrame(
        {
            'catalog_id': [666, 777, 888],
            'name': ['huey', 'dewey', 'louie'],
            # ...
        }
    )

- remember that values are aligned by position across columns, i.e., ``666`` -> ``'huey'``

Cleaning
--------

- AKA **pre-processing** AKA **normalization** AKA **standardization**
- uses ``pandas.Series``, a list-like object
- the ``clean`` function seems interesting at first glance:

  - by default, it **removes text inside brackets**. Might be useful, but also trivial to re-implement
  - the default regex is far too aggressive: it **removes everything that is not an ASCII alphanumeric, space, hyphen, or underscore**, so non-ASCII strings are deleted entirely! Use a custom regex or ``None`` in the ``replace_by_none=`` kwarg to avoid this
  - nice ASCII folding via ``strip_accents='ascii'``, **not done** by default
  - ``strip_accents='unicode'`` keeps intact some Unicode chars, e.g., ``œ``
  - non-Latin scripts are just not handled

- the ``phonetic`` function has the same problems as in ``jellyfish``, see #79
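The caveats above suggest that a small custom normalizer may be safer than ``clean`` for multilingual names. What follows is a minimal sketch using only the standard library, not something ``recordlinkage`` provides: ``normalize_name`` and ``fold_latin_accents`` are hypothetical helpers, and the choice to fold accents only on Latin letters is an assumption made so that Cyrillic, Japanese, and Korean strings survive untouched.

```python
import re
import unicodedata

# Text inside (), [] or {} is dropped, mirroring the bracket removal of clean().
BRACKETED = re.compile(r'[(\[{][^)\]}]*[)\]}]')


def fold_latin_accents(text):
    """Strip combining marks that follow a Latin base letter, keep all others."""
    kept = []
    base_is_latin = False
    for ch in unicodedata.normalize('NFD', text):
        if unicodedata.combining(ch):
            if base_is_latin:
                continue  # accent on a Latin letter: drop it
        else:
            base_is_latin = unicodedata.name(ch, '').startswith('LATIN')
        kept.append(ch)
    # Recompose what is left, e.g., Cyrillic or kana letters with their marks
    return unicodedata.normalize('NFC', ''.join(kept))


def normalize_name(name):
    """Hypothetical script-preserving alternative to clean()."""
    text = BRACKETED.sub(' ', name)
    text = fold_latin_accents(text)
    # Replace punctuation (P*) and symbols (S*) with spaces, whatever the script
    text = ''.join(
        ' ' if unicodedata.category(ch)[0] in 'PS' else ch for ch in text
    )
    # Lowercase and collapse runs of whitespace
    return ' '.join(text.lower().split())
```

Such a function could be applied to a column with ``names.map(normalize_name)``; letters without a canonical decomposition, like ``œ``, are left as they are.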

Behaviour of ``clean`` on the same name in several scripts, first with the defaults, then with ASCII folding and no regex:

>>> import pandas
>>> from recordlinkage.preprocessing import clean
>>> names = pandas.Series(
...     [
...         'хартшорн, чарльз',
...         'charles hartshorne',
...         'チャールズ・ハートショーン',
...         'تشارلز هارتشورن',
...         '찰스 하츠혼',
...         'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
...     ]
... )
>>> clean(names)
0
1    charles hartshorne
2
3
4
5
dtype: object
>>> clean(names, replace_by_none=None, strip_accents='ascii')
0                                  ,
1                 charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object

Indexing
--------

- AKA **blocking** AKA **candidate acquisition**
- makes pairs of candidate records, so that we avoid comparing every record with every other one (quadratic space complexity)
- a simple call to the ``Index.block(FIELD)`` function is not enough for names, as it only makes pairs that **exactly** agree on that field, i.e., **like an exact match**

>>> import recordlinkage
>>> index = recordlinkage.Index()
>>> index.block('name')
>>> candidate_pairs = index.index(source_dataset, target_dataset)

- we could inject the MariaDB full-text index #126 as a **user-defined algorithm**

Comparing
---------

- AKA **feature extraction**
- probably useful for #143
- the ```` function can be useful for dates
- the ``Compare.string`` function implements ``jellyfish`` string edit distances and others:

  - the string **edit distance feature** is **binary**, not **scalar**: ``feature_vectors.sum(1).value_counts()`` below shows that
  - the ``threshold`` kwarg gives a binary score for pairs above or below its value, i.e., ``1`` or ``0``. **It's not really a threshold**
  - not clear how the feature is fired by default, i.e., ``threshold=None``
  - better to always use the ``threshold`` kwarg then, typically ``3`` for Levenshtein and ``0.85`` for Jaro-Winkler

>>> import recordlinkage
>>> comp = recordlinkage.Compare()
>>> comp.string('name', 'label', threshold=3)
>>> feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
>>> print(feature_vectors.sum(1).value_counts())

Classification
--------------

- train with ``fit(training_feature_vectors, match_index)``
- classify with ``predict(classification_feature_vectors)``
- we could give SVM a try
- adapters are especially useful
- **it is possible to inject a neural network with** ``keras``
- remember to set comparison of fields with missing values to ``0``, i.e., pair disagreement:

  - *Most classifiers can not handle comparison vectors with missing values.*

- no worries, ``Compare.string`` does that by default

Training workflow
-----------------

INPUT = training set = existing QIDs with target IDs = dict ``{ QID: target_ID }``.

1. get the QID statements from Wikidata
2. query MariaDB for target ID data
3. load both into 2 ``pandas.DataFrame``
4. pre-process
5. make the index with blocking -> ``match_index`` arg
6. feature extraction with comparison -> ``training_feature_vectors`` arg

Naïve Bayes
-----------

- **code example** at
- ``recordlinkage.NaiveBayesClassifier`` class
- works with **binary features**, which also explains why the edit distance feature is binary
- the ``binarize`` kwarg translates into a threshold: features above and below this value become ``1`` and ``0`` respectively
- the code example uses ``binary_vectors`` and sets toy ``m`` and ``u`` probabilities:

  1. are comparison vectors (point 6 of the training workflow) the expected input?
  2. should we compute ``m`` and ``u`` on our own as well?
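
Regarding question 2: in the usual record linkage terminology, ``m`` is the probability that a feature agrees given that the pair is a true match, and ``u`` is the same probability given a non-match. Should we need to compute them ourselves, they could be estimated from labelled pairs as in the sketch below. It is plain Python and does not use the library: ``estimate_m_u`` is a hypothetical helper, and the dict of binary vectors keyed by ``(QID, target_ID)`` pairs is an assumed stand-in for the comparison vectors of point 6.

```python
def estimate_m_u(binary_vectors, match_pairs):
    """Estimate per-feature m and u probabilities from labelled pairs.

    binary_vectors: dict mapping a (QID, target_ID) pair to a tuple of 0/1 features
    match_pairs: set of pairs known to be true matches
    """
    n_features = len(next(iter(binary_vectors.values())))
    agreements = {True: [0] * n_features, False: [0] * n_features}
    totals = {True: 0, False: 0}

    for pair, vector in binary_vectors.items():
        is_match = pair in match_pairs
        totals[is_match] += 1
        for i, value in enumerate(vector):
            agreements[is_match][i] += value

    if totals[True] == 0 or totals[False] == 0:
        raise ValueError('need at least one match and one non-match')

    # m = P(feature agrees | match), u = P(feature agrees | non-match)
    m = [count / totals[True] for count in agreements[True]]
    u = [count / totals[False] for count in agreements[False]]
    return m, u


# Toy data: two binary features per candidate pair
binary_vectors = {
    ('Q1', 'a'): (1, 1),
    ('Q1', 'b'): (0, 1),
    ('Q2', 'b'): (1, 0),
    ('Q2', 'c'): (0, 0),
    ('Q3', 'c'): (1, 1),
    ('Q3', 'a'): (0, 0),
}
match_pairs = {('Q1', 'a'), ('Q2', 'b'), ('Q3', 'c')}
m, u = estimate_m_u(binary_vectors, match_pairs)
```

With this toy data the first feature agrees on all three matches and on no non-match, while the second agrees on two of three matches and on one of three non-matches.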