Notes on the recordlinkage library¶
https://recordlinkage.readthedocs.io/
General¶
- uses pandas for data structures, typically the DataFrame, Series, and MultiIndex classes
- uses jellyfish under the hood for edit distances and phonetic algorithms
Data format¶
https://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
- uses pandas.DataFrame to represent datasets. It's basically a table with column headers
- conversion from a dict is easy: key = column header, value = cell
- a value is a list, so defaultdict(list) is helpful
dataset = pandas.DataFrame(
    {
        'catalog_id': [666, 777, 888],
        'name': ['huey', 'dewey', 'louie'],
        ...
    }
)
- remember the order of values, i.e., 666 -> 'huey'
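Building such a DataFrame from streamed records is where defaultdict(list) helps; a minimal sketch with the same toy records (the loop is purely illustrative):

```python
from collections import defaultdict

import pandas

# Accumulate cells column by column: each key becomes a column header,
# each value the list of cells in input order.
columns = defaultdict(list)
for catalog_id, name in [(666, 'huey'), (777, 'dewey'), (888, 'louie')]:
    columns['catalog_id'].append(catalog_id)
    columns['name'].append(name)

dataset = pandas.DataFrame(columns)
# positional order is preserved: row 0 pairs 666 with 'huey'
print(dataset.loc[0, 'name'])
```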
Cleaning¶
AKA pre-processing AKA normalization AKA standardization;
https://recordlinkage.readthedocs.io/en/latest/ref-preprocessing.html
- uses pandas.Series, a list-like object
- the clean function seems interesting at first glance
- by default, it removes text inside brackets. Might be useful, but also trivial to re-implement
- terrible default regex: it removes everything that is not an ASCII letter! Non-ASCII strings are just deleted! Use a custom regex or None in the replace_by_none= kwarg to avoid this
- nice ASCII folding via strip_accents='ascii', not done by default
- strip_accents='unicode' keeps some Unicode chars intact, e.g., œ
- non-Latin scripts are just not handled
- the phonetic function has the same problems as in jellyfish, see #79
>>> import pandas
>>> from recordlinkage.preprocessing import clean
>>> names = pandas.Series(
...     [
...         'хартшорн, чарльз',
...         'charles hartshorne',
...         'チャールズ・ハートショーン',
...         'تشارلز هارتشورن',
...         '찰스 하츠혼',
...         'àáâäæãåāèéêëēėęîïíīįìôöòóœøōõûüùúū'
...     ]
... )
>>> clean(names)
0
1    charles hartshorne
2
3
4
5
dtype: object
>>> clean(names, replace_by_none=None, strip_accents='ascii')
0    ,
1    charles hartshorne
2
3
4
5    aaaaaaaeeeeeeeiiiiiioooooouuuuu
dtype: object
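Since the default clean() regex mangles non-ASCII input, re-implementing the useful parts with plain pandas string methods is easy. clean_keep_unicode below is a hypothetical helper, not part of the library; it only covers lowercasing, bracket removal, and whitespace squashing, and keeps non-ASCII characters intact:

```python
import pandas


def clean_keep_unicode(series: pandas.Series) -> pandas.Series:
    """Lowercase, drop bracketed text, squash whitespace.

    Unlike recordlinkage's default clean(), non-ASCII characters
    survive untouched.
    """
    return (
        series.str.lower()
        .str.replace(r'[\(\[].*?[\)\]]', '', regex=True)  # text in () or []
        .str.replace(r'\s+', ' ', regex=True)             # collapse whitespace
        .str.strip()
    )


names = pandas.Series(['Charles  Hartshorne (philosopher)', 'хартшорн, чарльз'])
print(clean_keep_unicode(names).tolist())
# -> ['charles hartshorne', 'хартшорн, чарльз']
```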
Indexing¶
AKA blocking AKA candidate acquisition
https://recordlinkage.readthedocs.io/en/latest/ref-index.html
- makes pairs of records to reduce the otherwise quadratic comparison space
- a simple call to the Index.block(FIELD) function is not enough for names: it only pairs records that agree exactly on that field, i.e., it behaves like an exact match
>>> import recordlinkage
>>> index = recordlinkage.Index()
>>> index.block('name')
>>> candidate_pairs = index.index(source_dataset, target_dataset)
we could inject the MariaDB full-text index #126 as a user-defined algorithm;
https://recordlinkage.readthedocs.io/en/latest/ref-index.html#user-defined-algorithms
https://recordlinkage.readthedocs.io/en/latest/ref-index.html#examples
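What block('name') produces can be reproduced by hand with a pandas merge on the blocking key; this standalone sketch (no recordlinkage involved, toy data made up) also shows why exact agreement is too strict for names:

```python
import pandas

source = pandas.DataFrame({'name': ['huey', 'dewey', 'louie']})
target = pandas.DataFrame({'label': ['huey', 'Dewey', 'louie']})

# Blocking on exact equality: join on the key, keep the row-index pairs.
merged = pandas.merge(
    source.reset_index().rename(columns={'index': 'source_id', 'name': 'key'}),
    target.reset_index().rename(columns={'index': 'target_id', 'label': 'key'}),
    on='key',
)
candidate_pairs = pandas.MultiIndex.from_frame(merged[['source_id', 'target_id']])
print(list(candidate_pairs))
# 'Dewey' differs from 'dewey' only in case, yet the pair is not generated
```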
Comparing¶
AKA feature extraction
https://recordlinkage.readthedocs.io/en/latest/ref-compare.html
probably useful for #143;
- the Compare.date function can be useful for dates: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.Date
- the Compare.string function implements jellyfish string edit distances and others: https://recordlinkage.readthedocs.io/en/latest/ref-compare.html#recordlinkage.compare.String
- the string edit distance feature is binary, not scalar: feature_vectors.sum(1).value_counts() below shows that
- the threshold kwarg gives a binary score for pairs above or below its value, i.e., 1 or 0. It's not really a threshold
- not clear how the feature is fired by default, i.e., threshold=None
- better always use the threshold kwarg then, typically 3 for Levenshtein and 0.85 for Jaro-Winkler
>>> import recordlinkage
>>> comp = recordlinkage.Compare()
>>> comp.string('name', 'label', threshold=3)
>>> feature_vectors = comp.compute(candidate_pairs, source_dataset, target_dataset)
>>> print(feature_vectors.sum(1).value_counts())
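The binary behaviour of the threshold kwarg is easy to mimic outside the library: compute an edit distance, then emit 1 for pairs within the threshold and 0 otherwise. The Levenshtein implementation below is the standard dynamic-programming one, not recordlinkage's or jellyfish's:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]


def string_feature(a: str, b: str, threshold: int = 3) -> int:
    """1 if the edit distance is within the threshold, else 0."""
    return int(levenshtein(a, b) <= threshold)


print(string_feature('hartshorne', 'hartshorn'))  # distance 1 -> feature 1
print(string_feature('hartshorne', 'feynman'))    # far apart -> feature 0
```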
Classification¶
- train with fit(training_feature_vectors, match_index)
- classify with predict(classification_feature_vectors)
- we could give SVM a try: https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html#Support-Vector-Machines
- adapters are especially useful: https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#adapters
- it is possible to inject a neural network with ``keras``: https://recordlinkage.readthedocs.io/en/latest/ref-classifiers.html#recordlinkage.adapters.KerasAdapter
- remember to set the comparison of fields with missing values to 0, i.e., pair disagreement: "Most classifiers can not handle comparison vectors with missing values."
- no worries, compare.string does that by default
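Setting missing comparisons to 0 by hand is a one-liner on the feature-vector DataFrame; the feature table below is hypothetical:

```python
import pandas

# Hypothetical feature vectors: NaN marks a comparison on a missing value.
feature_vectors = pandas.DataFrame({
    'name_sim': [1.0, 0.0, None],
    'date_sim': [None, 1.0, 1.0],
})

# Classifiers can't handle NaN, so treat missing values as disagreement (0).
feature_vectors = feature_vectors.fillna(0)
print(feature_vectors.values.tolist())
# -> [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
```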
Training workflow¶
INPUT = training set = existing QIDs with target IDs = dict { QID: target_ID }
1. get the QID statements from Wikidata
2. query MariaDB for target ID data
3. load both into 2 pandas.DataFrame
4. pre-process
5. make the index with blocking -> match_index arg
6. feature extraction with comparison -> training_feature_vectors arg
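The { QID: target_ID } dict from the input maps directly onto the match_index argument, a pandas.MultiIndex over the two datasets' row labels; a sketch assuming both DataFrames are indexed by those IDs (the QIDs and catalog IDs here are made up):

```python
import pandas

# Hypothetical training set: known QID -> catalog ID links.
training_links = {'Q1': 'cat1', 'Q2': 'cat2'}

# match_index pairs row labels of the source and target DataFrames.
match_index = pandas.MultiIndex.from_tuples(
    list(training_links.items()), names=['qid', 'catalog_id']
)
print(list(match_index))
# -> [('Q1', 'cat1'), ('Q2', 'cat2')]
```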
Naïve Bayes¶
https://recordlinkage.readthedocs.io/en/latest/notebooks/classifiers.html
code example at https://github.com/J535D165/recordlinkage/blob/master/examples/supervised_learning_prob.py
- the recordlinkage.NaiveBayesClassifier class works with binary features, which also explains why the edit distance feature is binary
- the binarize kwarg translates into a threshold: features above and below this value become 1 and 0 respectively
- the code example uses binary_vectors and sets the m and u probabilities: are comparison vectors (point 6 of the training workflow) the expected input?
- should we compute m and u on our own as well?
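The binarize behaviour can be reproduced with plain pandas: threshold scalar features into 0/1. The similarity scores below are made up, and whether the library uses >= or > at the exact boundary is an assumption here:

```python
import pandas

# Hypothetical scalar similarity scores in [0, 1].
scores = pandas.Series([0.92, 0.40, 0.85, 0.10])

# binarize-style thresholding: scores at or above the value -> 1, below -> 0.
binarize = 0.8
binary_features = (scores >= binarize).astype(int)
print(binary_features.tolist())
# -> [1, 0, 1, 0]
```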