linker

This is soweego’s core, where Wikidata items get linked to target catalog identifiers.

workflow

Record linkage workflow. It is a pipeline composed of the following main steps:

  1. build the Wikidata (build_wikidata()) and target (build_target()) datasets

  2. preprocess both (preprocess_wikidata() and preprocess_target())

  3. extract features by comparing pairs of Wikidata and target values (extract_features())
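
A minimal sketch of how these steps may be chained, assuming a hypothetical working directory; target_chunk and pairs stand for a preprocessed target chunk and a candidate pairs index from the blocking module documented below:

>>> from soweego.linker import workflow
>>> wd_reader = workflow.build_wikidata('training', 'discogs', 'musician', '/tmp/soweego')  # step 1
>>> wd_chunks = workflow.preprocess_wikidata('training', wd_reader)  # step 2
>>> feature_vectors = workflow.extract_features(  # step 3
...     pairs, next(wd_chunks), target_chunk, '/tmp/features.pkl'
... )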

soweego.linker.workflow.build_target(goal, catalog, entity, identifiers)[source]

Build a target catalog dataset for training or classification purposes: workflow step 1.

Data is gathered by querying the s51434__mixnmatch_large_catalogs_p database. This is where the importer inserts processed catalog dumps.

The database is located in ToolsDB under the Wikimedia Toolforge infrastructure. See the Toolforge documentation for how to connect.

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • identifiers (Set[str]) – a set of catalog IDs to gather data for

Return type

Iterator[DataFrame]

Returns

the generator yielding pandas.DataFrame chunks
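
A minimal usage sketch with made-up Discogs identifiers; a live ToolsDB connection is required:

>>> from soweego.linker import workflow
>>> identifiers = {'135345', '226157'}  # hypothetical Discogs musician IDs
>>> for chunk in workflow.build_target('classification', 'discogs', 'musician', identifiers):
...     print(len(chunk))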

soweego.linker.workflow.build_wikidata(goal, catalog, entity, dir_io)[source]

Build a Wikidata dataset for training or classification purposes: workflow step 1.

Data is gathered from the Wikidata SPARQL endpoint and the Wikidata Web API.

How it works:

  1. gather relevant Wikidata items that hold (for training) or lack (for classification) identifiers of the given catalog

  2. gather relevant item data

  3. dump the dataset to a gzipped JSON Lines file

  4. read the dataset into a generator of pandas.DataFrame chunks for memory-efficient processing
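
Step 4 boils down to chunked JSON Lines reading in pandas; a minimal sketch of the equivalent call, with a hypothetical dump path:

>>> import pandas as pd
>>> reader = pd.read_json(
...     '/tmp/soweego/wikidata_training_samples.jsonl.gz',  # hypothetical path
...     lines=True, chunksize=1000, compression='gzip',
... )
>>> for chunk in reader:  # each chunk is a pandas.DataFrame
...     pass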

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

JsonReader

Returns

the generator yielding pandas.DataFrame chunks

soweego.linker.workflow.extract_features(candidate_pairs, wikidata, target, path_io)[source]

Extract feature vectors by comparing pairs of (Wikidata, target catalog) records.

Main features:

  • exact match on full names and URLs

  • match on tokenized names, URLs, and genres

  • Levenshtein distance on name tokens

  • string kernel similarity on name tokens

  • weighted intersection on name tokens

  • match on dates by maximum shared precision

  • cosine similarity on textual descriptions

  • match on occupation QIDs

See features for more details.

This function uses multithreaded parallel processing.

Parameters
  • candidate_pairs (MultiIndex) – an index of (QID, target ID) pairs that should undergo comparison

  • wikidata (DataFrame) – a preprocessed Wikidata dataset (typically a chunk)

  • target (DataFrame) – a preprocessed target catalog dataset (typically a chunk)

  • path_io (str) – input/output path to an extracted feature file

Return type

DataFrame

Returns

the feature vectors dataset
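
A minimal sketch with a single hypothetical candidate pair; wikidata_chunk and target_chunk stand for preprocessed dataset chunks:

>>> import pandas as pd
>>> from soweego.linker import workflow
>>> pairs = pd.MultiIndex.from_tuples(
...     [('Q123456', '789012')],  # hypothetical (QID, target ID) pair
...     names=['qid', 'tid'],  # illustrative level names
... )
>>> fv = workflow.extract_features(pairs, wikidata_chunk, target_chunk, '/tmp/features.pkl')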

soweego.linker.workflow.preprocess_target(goal, target_reader)[source]

Preprocess a target catalog dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_target().

Preprocessing actions:

  1. drop unneeded columns holding target DB primary keys

  2. rename non-null catalog ID columns & drop others

  3. drop columns with null values only

  4. pair dates with their precision and drop precision columns when applicable

  5. aggregate denormalized data on target ID

  6. (shared with preprocess_wikidata()) normalize columns with names, occupations, dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • target_reader (Iterator[DataFrame]) – a dataset reader as returned by build_target()

Return type

Iterator[DataFrame]

Returns

the generator yielding preprocessed pandas.DataFrame chunks
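
A minimal pipelining sketch, reusing the hypothetical identifiers from the build_target() example above:

>>> import pandas as pd
>>> from soweego.linker import workflow
>>> reader = workflow.build_target('training', 'discogs', 'musician', identifiers)
>>> preprocessed = workflow.preprocess_target('training', reader)
>>> dataset = pd.concat(preprocessed)  # materialize all chunks, memory permitting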

soweego.linker.workflow.preprocess_wikidata(goal, wikidata_reader)[source]

Preprocess a Wikidata dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_wikidata().

Preprocessing actions:

  1. set QIDs as pandas.core.indexes.base.Index of the chunk

  2. drop columns with null values only

  3. (training) ensure one target ID per QID

  4. tokenize names, URLs, genres, when applicable

  5. (shared with preprocess_target()) normalize columns with names, occupations, dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • wikidata_reader (JsonReader) – a dataset reader as returned by build_wikidata()

Return type

Iterator[DataFrame]

Returns

the generator yielding preprocessed pandas.DataFrame chunks
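
A minimal sketch; note that enumerate() yields the chunk numbers expected by blocking.find_samples(), documented below:

>>> from soweego.linker import workflow
>>> reader = workflow.build_wikidata('classification', 'musicbrainz', 'band', '/tmp/soweego')
>>> for chunk_number, chunk in enumerate(workflow.preprocess_wikidata('classification', reader)):
...     print(chunk_number, chunk.shape)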

blocking

Custom blocking technique for the Record Linkage Toolkit, where blocking stands for record pairs indexing.

In a nutshell, blocking means finding candidate pairs suitable for comparison: this is essential to avoid blind comparison of all records, thus reducing the overall complexity of the task. In a supervised learning scenario, this translates into finding relevant training and classification samples.

Given a Wikidata pandas.Series (dataset column), this technique finds samples through full-text search in natural language mode against the target catalog database.

Target catalog identifiers of the output pandas.MultiIndex are also passed to build_target() for building the actual target dataset.

soweego.linker.blocking.find_samples(goal, catalog, wikidata_column, chunk_number, target_db_entity, dir_io)[source]

Build a blocking index by looking up target catalog identifiers given a Wikidata dataset column. A meaningful column should hold strings.

Under the hood, run full-text search in natural language mode against the target catalog database.

This function uses multithreaded parallel processing.

Parameters
  • goal (str) – {'training', 'classification'}. Whether the samples are for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • wikidata_column (Series) – a Wikidata dataset column holding values suitable for full-text search against the target database

  • chunk_number (int) – which Wikidata chunk will undergo blocking. Typically returned by calling enumerate() over preprocess_wikidata()

  • target_db_entity (DB_ENTITY) – an ORM entity (AKA table) of the target catalog database that full-text search should aim at

  • dir_io (str) – input/output directory where index chunks will be read/written

Return type

MultiIndex

Returns

the blocking index holding candidate pairs
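
A minimal sketch, assuming a preprocessed Wikidata chunk and a hypothetical ORM entity class for Discogs musicians:

>>> from soweego.linker import blocking
>>> pairs = blocking.find_samples(
...     'classification', 'discogs',
...     wikidata_chunk['name'],   # illustrative column label
...     0,                        # first chunk
...     DiscogsMusicianEntity,    # hypothetical ORM entity
...     '/tmp/soweego',
... )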

features

A set of custom features suitable for the Record Linkage Toolkit, where feature extraction stands for record pairs comparison.

Input: pairs of list objects coming from Wikidata and target catalog pandas.DataFrame columns as per preprocess_wikidata() and preprocess_target() output.

Output: a feature vector pandas.Series.

All classes in this module share the following constructor parameters:

  • left_on (str) - a Wikidata column label

  • right_on (str) - a target catalog column label

  • missing_value - (optional) a score to fill null values

  • label - (optional) a label for the output feature Series

Specific parameters are documented in the __init__ method of each class.

All classes in this module implement recordlinkage.base.BaseCompareFeature, and can be added to the feature extractor object recordlinkage.Compare.

Usage:

>>> import recordlinkage as rl
>>> from soweego.linker import features
>>> extractor = rl.Compare()
>>> source_column, target_column = 'birth_name', 'fullname'
>>> feature = features.ExactMatch(source_column, target_column)
>>> extractor.add(feature)

class soweego.linker.features.ExactMatch(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]

Compare pairs of lists through exact match on each pair of elements.

__init__(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • match_value (float) – (optional) a score when element pairs match

  • non_match_value (float) – (optional) a score when element pairs do not match

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SimilarStrings(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]

Compare pairs of lists holding strings through similarity measures on each pair of elements.

__init__(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • algorithm (str) –

    (optional) {'cosine', 'levenshtein'}. A string similarity algorithm: cosine similarity or the Levenshtein distance, respectively

  • threshold (Optional[float]) – (optional) a threshold to filter features with a lower or equal score

  • missing_value (float) – (optional) a score to fill null values

  • analyzer (Optional[str]) –

    (optional, only applies when algorithm='cosine') {'soweego', 'word', 'char', 'char_wb'}. A text analyzer to preprocess input. It is passed to the analyzer parameter of sklearn.feature_extraction.text.CountVectorizer.

    • 'soweego' is soweego.commons.text_utils.tokenize()

    • {'word', 'char', 'char_wb'} are scikit-learn built-ins; see the CountVectorizer documentation for more details

    • None is str.split(), meaning the input is already preprocessed

  • ngram_range (Tuple[int, int]) – (optional, only applies when algorithm='cosine' and analyzer is not 'soweego'). Lower and upper boundary for n-gram extraction, passed to CountVectorizer

  • label (Optional[str]) – (optional) a label for the output feature Series
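
For instance, a cosine similarity feature over textual descriptions with soweego's own tokenizer; column labels are illustrative:

>>> feature = features.SimilarStrings(
...     'description', 'description', algorithm='cosine', analyzer='soweego'
... )
>>> extractor.add(feature)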

class soweego.linker.features.SimilarDates(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding dates through match by maximum shared precision.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokens(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedOccupations(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding occupation QIDs (ontology classes) through expansion of the class hierarchy, plus intersection of values.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokensPlus(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

This feature is similar to SharedTokens, but has extra functionality:

  • handles arbitrary stop words

  • accepts nested lists of tokens

  • the output score is the percentage of tokens in the smaller set that are shared by both sets

__init__(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

  • stop_words (Optional[Set[T]]) – (optional) a set of stop words to be filtered from input pairs
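
Putting it all together, a minimal sketch that wires several features into one extractor; column labels are illustrative, and compute() is inherited from recordlinkage.Compare:

>>> import recordlinkage as rl
>>> from soweego.linker import features
>>> extractor = rl.Compare()
>>> extractor.add(features.ExactMatch('url', 'url'))
>>> extractor.add(features.SimilarDates('born', 'born'))
>>> extractor.add(features.SharedOccupations('occupations', 'occupations'))
>>> feature_vectors = extractor.compute(candidate_pairs, wikidata_chunk, target_chunk)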

classifiers

A set of custom supervised classifiers suitable for the Record Linkage Toolkit. It includes neural networks and support-vector machines.

All classes implement recordlinkage.base.BaseClassifier: typically, you will use its fit(), predict(), and prob() methods.

class soweego.linker.classifiers.MultiLayerPerceptron(input_dimension, **kwargs)[source]

A multi-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • Dense layer 1, with output dimension 128 and relu activation function

  • BatchNormalization layer

  • Dense layer 2, with output dimension 32 and relu activation function

  • BatchNormalization layer

  • Dense layer 3, with output dimension 1 and sigmoid activation function

  • adadelta optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.
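
A minimal training sketch following the recordlinkage fit/predict API; variable names are illustrative:

>>> from soweego.linker import classifiers
>>> mlp = classifiers.MultiLayerPerceptron(training_vectors.shape[1])
>>> mlp.fit(training_vectors, positive_samples)  # positive_samples is a pandas.MultiIndex
>>> predictions = mlp.predict(classification_vectors)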

class soweego.linker.classifiers.RandomForest(*args, **kwargs)[source]

A Random Forest classifier.

This class implements sklearn.ensemble.RandomForestClassifier.

It fits multiple decision trees on sub-samples (i.e., parts) of the dataset and averages the results, improving accuracy and reducing over-fitting.

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

DataFrame

Returns

the classification results
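
For instance, assuming training data as in the MultiLayerPerceptron sketch above:

>>> rf = classifiers.RandomForest()
>>> rf.fit(training_vectors, positive_samples)
>>> scored = rf.prob(classification_vectors)  # adds a match probability per pair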

class soweego.linker.classifiers.SVCClassifier(*args, **kwargs)[source]

A support-vector machine classifier.

This class implements sklearn.svm.SVC, which is based on the libsvm library.

This classifier differs from recordlinkage.classifiers.SVMClassifier, which implements sklearn.svm.LinearSVC, based on the liblinear library.

Main highlights:

  • outputs probability scores

  • can use non-linear kernels

  • higher training time (quadratic in the number of samples)

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

DataFrame

Returns

the classification results

class soweego.linker.classifiers.SingleLayerPerceptron(input_dimension, **kwargs)[source]

A single-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • single Dense layer

  • sigmoid activation function

  • adam optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.

class soweego.linker.classifiers.VoteClassifier(num_features, **kwargs)[source]

A basic ensemble classifier that picks the final prediction either via majority voting over its member classifiers (aka 'hard' voting), or by choosing the label with the highest total probability, i.e., the argmax of the sum of predictions (aka 'soft' voting).
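
A toy illustration of the two strategies, given three hypothetical classifiers' match probabilities for one record pair:

>>> scores = [0.9, 0.35, 0.6]  # hypothetical member predictions
>>> hard = sum(s > 0.5 for s in scores) > len(scores) / 2  # majority of binary votes
>>> soft = sum(scores) / len(scores) > 0.5  # label with the highest total probability
>>> hard, soft
(True, True)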

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

DataFrame

Returns

the classification results

train

Train supervised linking algorithms.

soweego.linker.train.build_training_set(catalog, entity, dir_io)[source]

Build a training set.

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

Tuple[DataFrame, MultiIndex]

Returns

the feature vectors and positive samples pair. Features are computed by comparing (QID, catalog ID) pairs. Positive samples are catalog IDs available in Wikidata
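
A minimal usage sketch with a hypothetical working directory:

>>> from soweego.linker import train
>>> feature_vectors, positive_samples = train.build_training_set(
...     'discogs', 'musician', '/tmp/soweego'
... )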

soweego.linker.train.execute(classifier, catalog, entity, tune, k, dir_io, **kwargs)[source]

Train a supervised linker.

  1. build the training set relevant to the given catalog and entity

  2. train a model with the given classifier

Parameters
  • classifier (str) – {'naive_bayes', 'linear_support_vector_machines', 'support_vector_machines', 'single_layer_perceptron', 'multi_layer_perceptron'}. A supported classifier

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • tune (bool) – whether to run grid search for hyperparameter tuning

  • k (int) – number of folds for hyperparameter tuning; only used when tune=True

  • dir_io (str) – input/output directory where working files will be read/written

  • kwargs – extra keyword arguments that will be passed to the model initialization

Return type

BaseClassifier

Returns

the trained model
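
For instance, a minimal sketch that trains a support-vector machine without hyperparameter tuning; the working directory is hypothetical:

>>> from soweego.linker import train
>>> model = train.execute(
...     'support_vector_machines', 'discogs', 'musician',
...     False,           # tune
...     5,               # k, ignored when tune=False
...     '/tmp/soweego',
... )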