linker

This is soweego’s core, where Wikidata items get linked to target catalog identifiers.

workflow

Record linkage workflow. It is a pipeline composed of the following main steps:

  1. build the Wikidata (build_wikidata()) and target (build_target()) datasets

  2. preprocess both (preprocess_wikidata() and preprocess_target())

  3. extract features by comparing pairs of Wikidata and target values (extract_features())
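
A minimal sketch of how the first two steps chain together (the catalog, entity, and I/O directory values are examples):

>>> from soweego.linker import workflow
>>> wikidata_reader = workflow.build_wikidata(
...     'training', 'discogs', 'musician', '/tmp/soweego'
... )
>>> for chunk in workflow.preprocess_wikidata('training', wikidata_reader):
...     print(chunk.shape)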

soweego.linker.workflow.build_target(goal, catalog, entity, identifiers)[source]

Build a target catalog dataset for training or classification purposes: workflow step 1.

Data is gathered by querying the s51434__mixnmatch_large_catalogs_p database. This is where the importer inserts processed catalog dumps.

The database is located in ToolsDB under the Wikimedia Toolforge infrastructure. See how to connect.

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • identifiers (Set[str]) – a set of catalog IDs to gather data for

Return type

Iterator[DataFrame]

Returns

the generator yielding pandas.DataFrame chunks
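
A usage sketch, assuming a connection to the ToolsDB database and a set of hypothetical Discogs IDs:

>>> from soweego.linker import workflow
>>> identifiers = {'123', '456'}  # hypothetical catalog IDs
>>> for chunk in workflow.build_target(
...     'training', 'discogs', 'musician', identifiers
... ):
...     print(chunk.shape)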

soweego.linker.workflow.build_wikidata(goal, catalog, entity, dir_io)[source]

Build a Wikidata dataset for training or classification purposes: workflow step 1.

Data is gathered from the SPARQL endpoint and the Web API.

How it works:

  1. gather relevant Wikidata items that hold (for training) or lack (for classification) identifiers of the given catalog

  2. gather relevant item data

  3. dump the dataset to a gzipped JSON Lines file

  4. read the dataset into a generator of pandas.DataFrame chunks for memory-efficient processing

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

JsonReader

Returns

the generator yielding pandas.DataFrame chunks

soweego.linker.workflow.extract_features(candidate_pairs, wikidata, target, path_io)[source]

Extract feature vectors by comparing pairs of (Wikidata, target catalog) records.

Main features:

  • exact match on full names and URLs

  • match on tokenized names, URLs, and genres

  • Levenshtein distance on name tokens

  • string kernel similarity on name tokens

  • weighted intersection on name tokens

  • match on dates by maximum shared precision

  • cosine similarity on textual descriptions

  • match on occupation QIDs

See features for more details.

This function uses multithreaded parallel processing.

Parameters
  • candidate_pairs (MultiIndex) – an index of (QID, target ID) pairs that should undergo comparison

  • wikidata (DataFrame) – a preprocessed Wikidata dataset (typically a chunk)

  • target (DataFrame) – a preprocessed target catalog dataset (typically a chunk)

  • path_io (str) – input/output path to an extracted feature file

Return type

DataFrame

Returns

the feature vectors dataset
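
A usage sketch, where candidate_pairs comes from blocking, both chunks come from the preprocessing steps, and the output path is hypothetical:

>>> from soweego.linker import workflow
>>> feature_vectors = workflow.extract_features(
...     candidate_pairs, wikidata_chunk, target_chunk, '/tmp/features.pkl.gz'
... )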

soweego.linker.workflow.preprocess_target(goal, target_reader)[source]

Preprocess a target catalog dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_target().

Preprocessing actions:

  1. drop unneeded columns holding target DB primary keys

  2. rename non-null catalog ID columns & drop others

  3. drop columns with null values only

  4. pair dates with their precision and drop precision columns when applicable

  5. aggregate denormalized data on target ID

  6. (shared with preprocess_wikidata()) normalize columns holding names, occupations, and dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • target_reader (Iterator[DataFrame]) – a dataset reader as returned by build_target()

Return type

DataFrame

Returns

the preprocessed target dataset

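A usage sketch, pipelined after build_target() (identifiers is a set of catalog IDs, as above):

>>> from soweego.linker import workflow
>>> reader = workflow.build_target('training', 'discogs', 'musician', identifiers)
>>> target = workflow.preprocess_target('training', reader)
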
soweego.linker.workflow.preprocess_wikidata(goal, wikidata_reader)[source]

Preprocess a Wikidata dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_wikidata().

Preprocessing actions:

  1. set QIDs as pandas.core.indexes.base.Index of the chunk

  2. drop columns with null values only

  3. (training) ensure one target ID per QID

  4. tokenize names, URLs, genres, when applicable

  5. (shared with preprocess_target()) normalize columns holding names, occupations, and dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • wikidata_reader (JsonReader) – a dataset reader as returned by build_wikidata()

Return type

Iterator[DataFrame]

Returns

the generator yielding preprocessed pandas.DataFrame chunks
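
Since blocking operates chunk by chunk, a typical pattern is to enumerate the generator, so that every chunk gets a number (see find_samples() below):

>>> from soweego.linker import workflow
>>> reader = workflow.build_wikidata('training', 'discogs', 'musician', '/tmp/soweego')
>>> chunks = workflow.preprocess_wikidata('training', reader)
>>> for chunk_number, chunk in enumerate(chunks):
...     pass  # block and extract features on each chunk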

blocking

Custom blocking technique for the Record Linkage Toolkit, where blocking stands for record pairs indexing.

In a nutshell, blocking means finding candidate pairs suitable for comparison: this is essential to avoid blind comparison of all records, thus reducing the overall complexity of the task. In a supervised learning scenario, this translates into finding relevant training and classification samples.

Given a Wikidata pandas.Series (dataset column), this technique finds samples through full-text search in natural language mode against the target catalog database.

Target catalog identifiers of the output pandas.MultiIndex are also passed to build_target() for building the actual target dataset.

soweego.linker.blocking.find_samples(goal, catalog, wikidata_column, chunk_number, target_db_entity, dir_io)[source]

Build a blocking index by looking up target catalog identifiers given a Wikidata dataset column. A meaningful column should hold strings.

Under the hood, run full-text search in natural language mode against the target catalog database.

This function uses multithreaded parallel processing.

Parameters
  • goal (str) – {'training', 'classification'}. Whether the samples are for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • wikidata_column (Series) – a Wikidata dataset column holding values suitable for full-text search against the target database

  • chunk_number (int) – which Wikidata chunk will undergo blocking. Typically returned by calling enumerate() over preprocess_wikidata()

  • target_db_entity (~DB_ENTITY) – an ORM entity (AKA table) of the target catalog database that full-text search should aim at

  • dir_io (str) – input/output directory where index chunks will be read/written

Return type

MultiIndex

Returns

the blocking index holding candidate pairs
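
A usage sketch: the 'name' column label is an assumption, and DiscogsMusicianEntity stands for a hypothetical ORM entity of the target catalog database:

>>> from soweego.linker import blocking
>>> samples = blocking.find_samples(
...     'training', 'discogs', wikidata_chunk['name'], 0,
...     DiscogsMusicianEntity, '/tmp/soweego'
... )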

features

A set of custom features suitable for the Record Linkage Toolkit, where feature extraction stands for record pairs comparison.

Input: pairs of list objects coming from Wikidata and target catalog pandas.DataFrame columns as per preprocess_wikidata() and preprocess_target() output.

Output: a feature vector pandas.Series.

All classes in this module share the following constructor parameters:

  • left_on (str) - a Wikidata column label

  • right_on (str) - a target catalog column label

  • missing_value - (optional) a score to fill null values

  • label - (optional) a label for the output feature Series

Specific parameters are documented in the __init__ method of each class.

All classes in this module implement recordlinkage.base.BaseCompareFeature, and can be added to the feature extractor object recordlinkage.Compare.

Usage:

>>> import recordlinkage as rl
>>> from soweego.linker import features
>>> extractor = rl.Compare()
>>> source_column, target_column = 'birth_name', 'fullname'
>>> feature = features.ExactMatch(source_column, target_column)
>>> extractor.add(feature)
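
Once all features are added, the extractor computes the feature vectors from a candidate pairs index and the two preprocessed datasets. The candidate_pairs, wikidata, and target variables below are assumptions, as per the blocking and preprocessing output:

>>> feature_vectors = extractor.compute(candidate_pairs, wikidata, target)
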
class soweego.linker.features.ExactMatch(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]

Compare pairs of lists through exact match on each pair of elements.

__init__(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • match_value (float) – (optional) a score when element pairs match

  • non_match_value (float) – (optional) a score when element pairs do not match

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SimilarStrings(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]

Compare pairs of lists holding strings through similarity measures on each pair of elements.

__init__(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • algorithm (str) – (optional) {'cosine', 'levenshtein'}. A string similarity algorithm: respectively, the cosine similarity or the Levenshtein distance

  • threshold (Optional[float]) – (optional) a threshold to filter features with a lower or equal score

  • missing_value (float) – (optional) a score to fill null values

  • analyzer (Optional[str]) –

    (optional, only applies when algorithm='cosine') {'soweego', 'word', 'char', 'char_wb'}. A text analyzer to preprocess input. It is passed to the analyzer parameter of sklearn.feature_extraction.text.CountVectorizer.

    • 'soweego' is soweego.commons.text_utils.tokenize()

    • {'word', 'char', 'char_wb'} are scikit-learn built-ins. See the scikit-learn documentation for more details

    • None is str.split(), and means the input is already preprocessed

  • ngram_range (Tuple[int, int]) – (optional, only applies when algorithm='cosine' and analyzer is not 'soweego'). Lower and upper boundaries for n-gram extraction, passed to CountVectorizer

  • label (Optional[str]) – (optional) a label for the output feature Series
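
For instance, a sketch of a cosine similarity feature over tokenized names (the column labels are assumptions):

>>> from soweego.linker import features
>>> feature = features.SimilarStrings(
...     'name_tokens', 'name_tokens', algorithm='cosine', analyzer='soweego'
... )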

class soweego.linker.features.SimilarDates(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding dates through match by maximum shared precision.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokens(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedOccupations(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding occupation QIDs (ontology classes) through expansion of the class hierarchy, plus intersection of values.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokensPlus(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

This feature is similar to SharedTokens, but has extra functionality:

  • handles arbitrary stop words

  • accepts nested lists of tokens

  • the output score is the percentage of tokens in the smaller set that are shared by both sets

__init__(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

  • stop_words (Optional[Set[~T]]) – (optional) a set of stop words to be filtered from input pairs
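
For instance, a sketch with a custom stop words set (the column labels are assumptions):

>>> from soweego.linker import features
>>> feature = features.SharedTokensPlus(
...     'name_tokens', 'name_tokens', stop_words={'junior', 'senior'}
... )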

classifiers

A set of custom supervised classifiers suitable for the Record Linkage Toolkit. It includes neural networks and support-vector machines.

All classes implement recordlinkage.base.BaseClassifier: typically, you will use its fit(), predict(), and prob() methods.
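
A minimal sketch of the typical call sequence, where feature_vectors and positive_samples are assumptions, e.g., as returned by build_training_set() below:

>>> from soweego.linker import classifiers
>>> classifier = classifiers.RandomForest()
>>> classifier.fit(feature_vectors, positive_samples)
>>> predictions = classifier.prob(feature_vectors)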

class soweego.linker.classifiers.GatedEnsembleClassifier(num_features, **kwargs)[source]

Ensemble of classifiers whose predictions are combined by a meta-learner, which decides the final output based on the base classifiers' predictions.

This classifier uses mlens.ensemble.SuperLearner to implement the gating functionality.

The parameters and their default values are:

  • meta_layer: name of the classifier to use as a meta-layer. Defaults to single_layer_perceptron

  • folds: number of cross-validation folds used when generating the training set for the meta_layer. Defaults to 2. For a deeper explanation of this parameter, see Polley, Eric C. and van der Laan, Mark J., "Super Learner In Prediction" (May 2010), U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 266, https://biostats.bepress.com/ucbbiostat/paper266/

class soweego.linker.classifiers.MultiLayerPerceptron(num_features, **kwargs)[source]

A multi-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • Dense layer 1, with a 128-dimensional output and relu activation function

  • BatchNormalization layer

  • Dense layer 2, with a 32-dimensional output and relu activation function

  • BatchNormalization layer

  • Dense layer 3, with a 1-dimensional output and sigmoid activation function

  • adadelta optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.

class soweego.linker.classifiers.RandomForest(*args, **kwargs)[source]

A Random Forest classifier.

This class implements sklearn.ensemble.RandomForestClassifier, and receives the same parameters.

It fits multiple decision trees on sub-samples of the dataset and averages the results, thus improving accuracy and reducing over-fitting.

The default parameters are:

  • n_estimators: 500

  • criterion: entropy

  • max_features: None

  • bootstrap: True

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

class soweego.linker.classifiers.SVCClassifier(*args, **kwargs)[source]

A support-vector machine classifier.

This class implements sklearn.svm.SVC, which is based on the libsvm library.

This classifier differs from recordlinkage.classifiers.SVMClassifier, which implements sklearn.svm.LinearSVC, based on the liblinear library.

Main highlights:

  • outputs probability scores

  • supports non-linear kernels

  • higher training time, quadratic in the number of samples

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

class soweego.linker.classifiers.SingleLayerPerceptron(num_features, **kwargs)[source]

A single-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • single Dense layer

  • sigmoid activation function

  • adam optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.

class soweego.linker.classifiers.StackedEnsembleClassifier(num_features, **kwargs)[source]

Ensemble of stacked classifiers: classifiers are arranged in layers, and each layer takes the output of the previous one as input. The predictions of the final layer are merged by a meta-learner (as in GatedEnsembleClassifier), which decides the final output based on the base classifiers' predictions.

This classifier uses mlens.ensemble.SuperLearner to implement the stacking functionality.

The parameters and their default values are:

  • meta_layer: name of the classifier to use as a meta-layer. Defaults to single_layer_perceptron

  • folds: number of cross-validation folds used when generating the training set for the meta_layer. Defaults to 2. For a deeper explanation of this parameter, see Polley, Eric C. and van der Laan, Mark J., "Super Learner In Prediction" (May 2010), U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 266, https://biostats.bepress.com/ucbbiostat/paper266/

class soweego.linker.classifiers.VotingClassifier(num_features, **kwargs)[source]

A basic ensemble classifier which uses a voting procedure to decide the final outcome of a prediction.

This class implements sklearn.ensemble.VotingClassifier.

It combines a set of classifiers and uses majority vote or averaged predicted probabilities to pick the final prediction. See the scikit-learn user guide for more details.

The voting parameter accepts either 'hard' or 'soft':

  • 'hard' – the label predicted by the majority of base classifiers is used as the final prediction. Note that this does not return probabilities, only the final label

  • 'soft' – the probability that a pair is a match is taken from all base classifiers and then averaged. This average is what the classifier returns

By default, voting='soft'.

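For instance, a sketch of a hard-voting ensemble (passing the number of feature columns as num_features is an assumption):

>>> from soweego.linker import classifiers
>>> classifier = classifiers.VotingClassifier(
...     feature_vectors.shape[1], voting='hard'
... )
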
prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

train

Train supervised linking algorithms.

soweego.linker.train.build_training_set(catalog, entity, dir_io)[source]

Build a training set.

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

Tuple[DataFrame, MultiIndex]

Returns

the feature vectors and positive samples pair. Features are computed by comparing (QID, catalog ID) pairs. Positive samples are catalog IDs available in Wikidata
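
A usage sketch (the I/O directory is an example):

>>> from soweego.linker import train
>>> feature_vectors, positive_samples = train.build_training_set(
...     'discogs', 'musician', '/tmp/soweego'
... )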

soweego.linker.train.execute(classifier, catalog, entity, tune, k, dir_io, **kwargs)[source]

Train a supervised linker.

  1. build the training set relevant to the given catalog and entity

  2. train a model with the given classifier

Parameters
  • classifier (str) – {'naive_bayes', 'linear_support_vector_machines', 'support_vector_machines', 'single_layer_perceptron', 'multi_layer_perceptron'}. A supported classifier

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • tune (bool) – whether to run grid search for hyperparameter tuning

  • k (int) – the number of folds for hyperparameter tuning. Only used when tune=True

  • dir_io (str) – input/output directory where working files will be read/written

  • kwargs – extra keyword arguments that will be passed to the model initialization

Return type

BaseClassifier

Returns

the trained model
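
A usage sketch, training a single-layer perceptron with no hyperparameter tuning (the I/O directory is an example):

>>> from soweego.linker import train
>>> model = train.execute(
...     'single_layer_perceptron', 'discogs', 'musician',
...     False, 5, '/tmp/soweego'
... )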

evaluate

Evaluate supervised linking algorithms.