linker

This is soweego’s core, where Wikidata items get linked to target catalog identifiers.

workflow

Record linkage workflow. It is a pipeline composed of the following main steps:

  1. build the Wikidata (build_wikidata()) and target (build_target()) datasets

  2. preprocess both (preprocess_wikidata() and preprocess_target())

  3. extract features by comparing pairs of Wikidata and target values (extract_features())
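
A minimal sketch of how the first two steps chain together (the catalog, entity, and I/O directory values are examples):

>>> from soweego.linker import workflow
>>> wikidata_reader = workflow.build_wikidata(
...     'training', 'discogs', 'musician', '/tmp/soweego'
... )
>>> for chunk in workflow.preprocess_wikidata('training', wikidata_reader):
...     print(chunk.shape)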

soweego.linker.workflow.build_target(goal, catalog, entity, identifiers)[source]

Build a target catalog dataset for training or classification purposes: workflow step 1.

Data is gathered by querying the s51434__mixnmatch_large_catalogs_p database. This is where the importer inserts processed catalog dumps.

The database is located in ToolsDB under the Wikimedia Toolforge infrastructure. See how to connect.

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • identifiers (Set[str]) – a set of catalog IDs to gather data for

Return type

Iterator[DataFrame]

Returns

the generator yielding pandas.DataFrame chunks
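
A usage sketch, assuming a connection to the ToolsDB database and a set of hypothetical Discogs IDs:

>>> from soweego.linker import workflow
>>> identifiers = {'123', '456'}  # hypothetical catalog IDs
>>> for chunk in workflow.build_target(
...     'training', 'discogs', 'musician', identifiers
... ):
...     print(chunk.shape)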

soweego.linker.workflow.build_wikidata(goal, catalog, entity, dir_io)[source]

Build a Wikidata dataset for training or classification purposes: workflow step 1.

Data is gathered from the SPARQL endpoint and the Web API.

How it works:

  1. gather relevant Wikidata items that hold (for training) or lack (for classification) identifiers of the given catalog

  2. gather relevant item data

  3. dump the dataset to a gzipped JSON Lines file

  4. read the dataset into a generator of pandas.DataFrame chunks for memory-efficient processing

Parameters
  • goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

JsonReader

Returns

the generator yielding pandas.DataFrame chunks

soweego.linker.workflow.extract_features(candidate_pairs, wikidata, target, path_io)[source]

Extract feature vectors by comparing pairs of (Wikidata, target catalog) records.

Main features:

  • exact match on full names and URLs

  • match on tokenized names, URLs, and genres

  • Levenshtein distance on name tokens

  • string kernel similarity on name tokens

  • weighted intersection on name tokens

  • match on dates by maximum shared precision

  • cosine similarity on textual descriptions

  • match on occupation QIDs

See features for more details.

This function uses multithreaded parallel processing.

Parameters
  • candidate_pairs (MultiIndex) – an index of (QID, target ID) pairs that should undergo comparison

  • wikidata (DataFrame) – a preprocessed Wikidata dataset (typically a chunk)

  • target (DataFrame) – a preprocessed target catalog dataset (typically a chunk)

  • path_io (str) – input/output path to an extracted feature file

Return type

DataFrame

Returns

the feature vectors dataset
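
A usage sketch, where candidate_pairs comes from blocking, both chunks come from the preprocessing steps, and the output path is hypothetical:

>>> from soweego.linker import workflow
>>> feature_vectors = workflow.extract_features(
...     candidate_pairs, wikidata_chunk, target_chunk, '/tmp/features.pkl.gz'
... )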

soweego.linker.workflow.preprocess_target(goal, target_reader)[source]

Preprocess a target catalog dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_target().

Preprocessing actions:

  1. drop unneeded columns holding target DB primary keys

  2. rename non-null catalog ID columns & drop others

  3. drop columns with null values only

  4. pair dates with their precision and drop precision columns when applicable

  5. aggregate denormalized data on target ID

  6. (shared with preprocess_wikidata()) normalize columns holding names, occupations, and dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • target_reader (Iterator[DataFrame]) – a dataset reader as returned by build_target()

Return type

DataFrame

Returns

the preprocessed target dataset

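A usage sketch, pipelined after build_target() (identifiers is a set of catalog IDs, as above):

>>> from soweego.linker import workflow
>>> reader = workflow.build_target('training', 'discogs', 'musician', identifiers)
>>> target = workflow.preprocess_target('training', reader)
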
soweego.linker.workflow.preprocess_wikidata(goal, wikidata_reader)[source]

Preprocess a Wikidata dataset: workflow step 2.

This function consumes pandas.DataFrame chunks and should be pipelined after build_wikidata().

Preprocessing actions:

  1. set QIDs as pandas.core.indexes.base.Index of the chunk

  2. drop columns with null values only

  3. (training) ensure one target ID per QID

  4. tokenize names, URLs, genres, when applicable

  5. (shared with preprocess_target()) normalize columns holding names, occupations, and dates, when applicable

Parameters
  • goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification

  • wikidata_reader (JsonReader) – a dataset reader as returned by build_wikidata()

Return type

Iterator[DataFrame]

Returns

the generator yielding preprocessed pandas.DataFrame chunks
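
Since blocking operates chunk by chunk, a typical pattern is to enumerate the generator, so that every chunk gets a number (see find_samples() below):

>>> from soweego.linker import workflow
>>> reader = workflow.build_wikidata('training', 'discogs', 'musician', '/tmp/soweego')
>>> chunks = workflow.preprocess_wikidata('training', reader)
>>> for chunk_number, chunk in enumerate(chunks):
...     pass  # block and extract features on each chunk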

blocking

Custom blocking technique for the Record Linkage Toolkit, where blocking stands for record pairs indexing.

In a nutshell, blocking means finding candidate pairs suitable for comparison: this is essential to avoid blind comparison of all records, thus reducing the overall complexity of the task. In a supervised learning scenario, this translates into finding relevant training and classification samples.

Given a Wikidata pandas.Series (dataset column), this technique finds samples through full-text search in natural language mode against the target catalog database.

Target catalog identifiers of the output pandas.MultiIndex are also passed to build_target() for building the actual target dataset.

soweego.linker.blocking.find_samples(goal, catalog, wikidata_column, chunk_number, target_db_entity, dir_io)[source]

Build a blocking index by looking up target catalog identifiers given a Wikidata dataset column. A meaningful column should hold strings.

Under the hood, run full-text search in natural language mode against the target catalog database.

This function uses multithreaded parallel processing.

Parameters
  • goal (str) – {'training', 'classification'}. Whether the samples are for training or classification

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • wikidata_column (Series) – a Wikidata dataset column holding values suitable for full-text search against the target database

  • chunk_number (int) – which Wikidata chunk will undergo blocking. Typically returned by calling enumerate() over preprocess_wikidata()

  • target_db_entity (~DB_ENTITY) – an ORM entity (AKA table) of the target catalog database that full-text search should aim at

  • dir_io (str) – input/output directory where index chunks will be read/written

Return type

MultiIndex

Returns

the blocking index holding candidate pairs
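
A usage sketch: the 'name' column label is an assumption, and DiscogsMusicianEntity stands for a hypothetical ORM entity of the target catalog database:

>>> from soweego.linker import blocking
>>> samples = blocking.find_samples(
...     'training', 'discogs', wikidata_chunk['name'], 0,
...     DiscogsMusicianEntity, '/tmp/soweego'
... )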

features

A set of custom features suitable for the Record Linkage Toolkit, where feature extraction stands for record pairs comparison.

Input: pairs of list objects coming from Wikidata and target catalog pandas.DataFrame columns as per preprocess_wikidata() and preprocess_target() output.

Output: a feature vector pandas.Series.

All classes in this module share the following constructor parameters:

  • left_on (str) - a Wikidata column label

  • right_on (str) - a target catalog column label

  • missing_value - (optional) a score to fill null values

  • label - (optional) a label for the output feature Series

Specific parameters are documented in the __init__ method of each class.

All classes in this module implement recordlinkage.base.BaseCompareFeature, and can be added to the feature extractor object recordlinkage.Compare.

Usage:

>>> import recordlinkage as rl
>>> from soweego.linker import features
>>> extractor = rl.Compare()
>>> source_column, target_column = 'birth_name', 'fullname'
>>> feature = features.ExactMatch(source_column, target_column)
>>> extractor.add(feature)
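
Once all features are added, the extractor computes the feature vectors from a candidate pairs index and the two preprocessed datasets. The candidate_pairs, wikidata, and target variables below are assumptions, as per the blocking and preprocessing output:

>>> feature_vectors = extractor.compute(candidate_pairs, wikidata, target)
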
class soweego.linker.features.ExactMatch(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]

Compare pairs of lists through exact match on each pair of elements.

__init__(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • match_value (float) – (optional) a score when element pairs match

  • non_match_value (float) – (optional) a score when element pairs do not match

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SimilarStrings(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]

Compare pairs of lists holding strings through similarity measures on each pair of elements.

__init__(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • algorithm (str) – (optional) {'cosine', 'levenshtein'}. A string similarity algorithm: respectively, the cosine similarity or the Levenshtein distance

  • threshold (Optional[float]) – (optional) a threshold to filter features with a lower or equal score

  • missing_value (float) – (optional) a score to fill null values

  • analyzer (Optional[str]) –

    (optional, only applies when algorithm='cosine') {'soweego', 'word', 'char', 'char_wb'}. A text analyzer to preprocess input. It is passed to the analyzer parameter of sklearn.feature_extraction.text.CountVectorizer.

    • 'soweego' is soweego.commons.text_utils.tokenize()

    • {'word', 'char', 'char_wb'} are scikit-learn built-ins. See the scikit-learn documentation for more details

    • None is str.split(), and means the input is already preprocessed

  • ngram_range (Tuple[int, int]) – (optional, only applies when algorithm='cosine' and analyzer is not 'soweego'). Lower and upper boundaries for n-gram extraction, passed to CountVectorizer

  • label (Optional[str]) – (optional) a label for the output feature Series
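
For instance, a sketch of a cosine similarity feature over tokenized names (the column labels are assumptions):

>>> from soweego.linker import features
>>> feature = features.SimilarStrings(
...     'name_tokens', 'name_tokens', algorithm='cosine', analyzer='soweego'
... )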

class soweego.linker.features.SimilarDates(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding dates through match by maximum shared precision.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokens(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedOccupations(left_on, right_on, missing_value=0.0, label=None)[source]

Compare pairs of lists holding occupation QIDs (ontology classes) through expansion of the class hierarchy, plus intersection of values.

__init__(left_on, right_on, missing_value=0.0, label=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

class soweego.linker.features.SharedTokensPlus(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]

Compare pairs of lists holding string tokens through weighted intersection.

This feature is similar to SharedTokens, but has extra functionality:

  • handles arbitrary stop words

  • accepts nested lists of tokens

  • the output score is the percentage of tokens in the smaller set that are shared by both sets

__init__(left_on, right_on, missing_value=0.0, label=None, stop_words=None)[source]
Parameters
  • left_on (str) – a Wikidata DataFrame column label

  • right_on (str) – a target catalog DataFrame column label

  • missing_value (float) – (optional) a score to fill null values

  • label (Optional[str]) – (optional) a label for the output feature Series

  • stop_words (Optional[Set[~T]]) – (optional) a set of stop words to be filtered from input pairs
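
For instance, a sketch with a custom stop words set (the column labels are assumptions):

>>> from soweego.linker import features
>>> feature = features.SharedTokensPlus(
...     'name_tokens', 'name_tokens', stop_words={'junior', 'senior'}
... )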

classifiers

A set of custom supervised classifiers suitable for the Record Linkage Toolkit. It includes neural networks and support-vector machines.

All classes implement recordlinkage.base.BaseClassifier: typically, you will use its fit(), predict(), and prob() methods.
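
A minimal sketch of the typical call sequence, where feature_vectors and positive_samples are assumptions, e.g., as returned by build_training_set() below:

>>> from soweego.linker import classifiers
>>> classifier = classifiers.RandomForest()
>>> classifier.fit(feature_vectors, positive_samples)
>>> predictions = classifier.prob(feature_vectors)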

class soweego.linker.classifiers.GatedEnsembleClassifier(num_features, **kwargs)[source]

Ensemble of classifiers whose predictions are combined by a meta-learner, which decides the final output based on the base classifiers' predictions.

This classifier uses mlens.ensemble.SuperLearner to implement the gating functionality.

The parameters and their default values are:

  • meta_layer: name of the classifier to use as a meta-layer. Defaults to single_layer_perceptron

  • folds: number of cross-validation folds used when generating the training set for the meta_layer. Defaults to 2. For a deeper explanation of this parameter, see Polley, Eric C. and van der Laan, Mark J., "Super Learner In Prediction" (May 2010), U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 266, https://biostats.bepress.com/ucbbiostat/paper266/

class soweego.linker.classifiers.MultiLayerPerceptron(num_features, **kwargs)[source]

A multi-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • Dense layer 1, with a 128-dimensional output and relu activation function

  • BatchNormalization layer

  • Dense layer 2, with a 32-dimensional output and relu activation function

  • BatchNormalization layer

  • Dense layer 3, with a 1-dimensional output and sigmoid activation function

  • adadelta optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.

class soweego.linker.classifiers.RandomForest(*args, **kwargs)[source]

A Random Forest classifier.

This class implements sklearn.ensemble.RandomForestClassifier, and receives the same parameters.

It fits multiple decision trees on sub-samples of the dataset and averages the results, thus improving accuracy and reducing over-fitting.

The default parameters are:

  • n_estimators: 500

  • criterion: entropy

  • max_features: None

  • bootstrap: True

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

class soweego.linker.classifiers.SVCClassifier(*args, **kwargs)[source]

A support-vector machine classifier.

This class implements sklearn.svm.SVC, which is based on the libsvm library.

This classifier differs from recordlinkage.classifiers.SVMClassifier, which implements sklearn.svm.LinearSVC, based on the liblinear library.

Main highlights:

  • outputs probability scores

  • supports non-linear kernels

  • higher training time, quadratic in the number of samples

prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

class soweego.linker.classifiers.SingleLayerPerceptron(num_features, **kwargs)[source]

A single-layer perceptron classifier.

This class implements a keras.Sequential model with the following default architecture:

  • single Dense layer

  • sigmoid activation function

  • adam optimizer

  • binary_crossentropy loss function

  • accuracy metric for evaluation

Default parameters can be overridden by passing keyword arguments to the constructor.

class soweego.linker.classifiers.StackedEnsembleClassifier(num_features, **kwargs)[source]

Ensemble of stacked classifiers: classifiers are arranged in layers, and each layer takes the output of the previous one as input. The predictions of the final layer are merged by a meta-learner (as in GatedEnsembleClassifier), which decides the final output based on the base classifiers' predictions.

This classifier uses mlens.ensemble.SuperLearner to implement the stacking functionality.

The parameters and their default values are:

  • meta_layer: name of the classifier to use as a meta-layer. Defaults to single_layer_perceptron

  • folds: number of cross-validation folds used when generating the training set for the meta_layer. Defaults to 2. For a deeper explanation of this parameter, see Polley, Eric C. and van der Laan, Mark J., "Super Learner In Prediction" (May 2010), U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 266, https://biostats.bepress.com/ucbbiostat/paper266/

class soweego.linker.classifiers.VotingClassifier(num_features, **kwargs)[source]

A basic ensemble classifier which uses a voting procedure to decide the final outcome of a prediction.

This class implements sklearn.ensemble.VotingClassifier.

It combines a set of classifiers and uses majority vote or averaged predicted probabilities to pick the final prediction. See the scikit-learn user guide for more details.

The voting parameter accepts either 'hard' or 'soft':

  • 'hard' – the label predicted by the majority of base classifiers is used as the final prediction. Note that this does not return probabilities, only the final label

  • 'soft' – the probability that a pair is a match is taken from all base classifiers and then averaged. This average is what the classifier returns

By default, voting='soft'.

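For instance, a sketch of a hard-voting ensemble (passing the number of feature columns as num_features is an assumption):

>>> from soweego.linker import classifiers
>>> classifier = classifiers.VotingClassifier(
...     feature_vectors.shape[1], voting='hard'
... )
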
prob(feature_vectors)[source]

Classify record pairs and include the probability score of being a match.

Parameters

feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details

Return type

Series

Returns

the classification results

train

Train supervised linking algorithms.

soweego.linker.train.build_training_set(catalog, entity, dir_io)[source]

Build a training set.

Parameters
  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • dir_io (str) – input/output directory where working files will be read/written

Return type

Tuple[DataFrame, MultiIndex]

Returns

the feature vectors and positive samples pair. Features are computed by comparing (QID, catalog ID) pairs. Positive samples are catalog IDs available in Wikidata
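
A usage sketch (the I/O directory is an example):

>>> from soweego.linker import train
>>> feature_vectors, positive_samples = train.build_training_set(
...     'discogs', 'musician', '/tmp/soweego'
... )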

soweego.linker.train.execute(classifier, catalog, entity, tune, k, dir_io, **kwargs)[source]

Train a supervised linker.

  1. build the training set relevant to the given catalog and entity

  2. train a model with the given classifier

Parameters
  • classifier (str) – {'naive_bayes', 'linear_support_vector_machines', 'support_vector_machines', 'single_layer_perceptron', 'multi_layer_perceptron'}. A supported classifier

  • catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog

  • entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity

  • tune (bool) – whether to run grid search for hyperparameter tuning

  • k (int) – the number of folds for hyperparameter tuning. Only used when tune=True

  • dir_io (str) – input/output directory where working files will be read/written

  • kwargs – extra keyword arguments that will be passed to the model initialization

Return type

BaseClassifier

Returns

the trained model
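
A usage sketch, training a single-layer perceptron with no hyperparameter tuning (the I/O directory is an example):

>>> from soweego.linker import train
>>> model = train.execute(
...     'single_layer_perceptron', 'discogs', 'musician',
...     False, 5, '/tmp/soweego'
... )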

evaluate

Evaluate supervised linking algorithms.