linker
This is soweego’s core, where Wikidata items get linked to target catalog identifiers.
workflow
Record linkage workflow. It is a pipeline composed of the following main steps:
build the Wikidata (build_wikidata()) and target (build_target()) datasets
preprocess both (preprocess_wikidata() and preprocess_target())
extract features by comparing pairs of Wikidata and target values (extract_features())
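The three steps chain naturally as generators over dataset chunks. The following is only a minimal sketch of that shape in plain Python: build, preprocess, and extract are hypothetical stand-ins (not soweego's functions), records are dictionaries rather than pandas.DataFrame chunks, and the identifiers are made up.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, str]


def build(records: List[Record], chunk_size: int) -> Iterator[List[Record]]:
    """Step 1 stand-in: yield the dataset in fixed-size chunks."""
    for start in range(0, len(records), chunk_size):
        yield records[start:start + chunk_size]


def preprocess(chunks: Iterator[List[Record]]) -> Iterator[List[Record]]:
    """Step 2 stand-in: normalize names, one chunk at a time."""
    for chunk in chunks:
        yield [{**r, 'name': r['name'].strip().lower()} for r in chunk]


def extract(wd_chunk: List[Record], target_chunk: List[Record]) -> Dict[Tuple[str, str], float]:
    """Step 3 stand-in: one exact-match feature per (Wikidata, target) pair."""
    return {
        (w['id'], t['id']): float(w['name'] == t['name'])
        for w in wd_chunk
        for t in target_chunk
    }


# Made-up sample data, for illustration only
wikidata = [{'id': 'Q1', 'name': ' John Doe '}, {'id': 'Q2', 'name': 'Jane Roe'}]
target = [{'id': 't1', 'name': 'john doe'}]

features = {}
target_chunk = next(preprocess(build(target, 10)))
for wd_chunk in preprocess(build(wikidata, 1)):
    features.update(extract(wd_chunk, target_chunk))
```

Because every step is lazy, only one Wikidata chunk is held in memory at a time.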
-
soweego.linker.workflow.build_target(goal, catalog, entity, identifiers)
Build a target catalog dataset for training or classification purposes: workflow step 1.
Data is gathered by querying the s51434__mixnmatch_large_catalogs_p database. This is where the importer inserts processed catalog dumps.
The database is located in ToolsDB under the Wikimedia Toolforge infrastructure. See how to connect.
- Parameters
goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
identifiers (Set[str]) – a set of catalog IDs to gather data for
- Return type
Iterator[DataFrame]
- Returns
the generator yielding pandas.DataFrame chunks
-
soweego.linker.workflow.build_wikidata(goal, catalog, entity, dir_io)
Build a Wikidata dataset for training or classification purposes: workflow step 1.
Data is gathered from the SPARQL endpoint and the Web API.
How it works:
gather relevant Wikidata items that hold (for training) or lack (for classification) identifiers of the given catalog
gather relevant item data
dump the dataset to a gzipped JSON Lines file
read the dataset into a generator of pandas.DataFrame chunks for memory-efficient processing
- Parameters
goal (str) – {'training', 'classification'}. Whether to build a dataset for training or classification
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
dir_io (str) – input/output directory where working files will be read/written
- Return type
JsonReader
- Returns
the generator yielding pandas.DataFrame chunks
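The dump-then-read-in-chunks pattern can be illustrated with the standard library alone. This is a sketch under assumptions, not soweego's implementation: dump_jsonl_gz and read_chunks are hypothetical helpers, the file name and records are made up, and chunks are plain lists instead of pandas.DataFrame objects.

```python
import gzip
import json
import os
import tempfile
from itertools import islice
from typing import Dict, Iterable, Iterator, List


def dump_jsonl_gz(records: Iterable[Dict], path: str) -> None:
    """Write one JSON object per line to a gzipped file."""
    with gzip.open(path, 'wt', encoding='utf-8') as fout:
        for record in records:
            fout.write(json.dumps(record) + '\n')


def read_chunks(path: str, chunk_size: int) -> Iterator[List[Dict]]:
    """Lazily yield lists of at most chunk_size decoded records."""
    with gzip.open(path, 'rt', encoding='utf-8') as fin:
        while True:
            chunk = [json.loads(line) for line in islice(fin, chunk_size)]
            if not chunk:
                return
            yield chunk


# Made-up records, for illustration only
records = [{'qid': f'Q{i}', 'name': [f'name {i}']} for i in range(1, 6)]
path = os.path.join(tempfile.mkdtemp(), 'wikidata_sample.jsonl.gz')
dump_jsonl_gz(records, path)

chunk_sizes = [len(chunk) for chunk in read_chunks(path, chunk_size=2)]
```

With pandas, calling read_json with lines=True and a chunksize returns a JsonReader, which matches the return type documented above.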
-
soweego.linker.workflow.extract_features(candidate_pairs, wikidata, target, path_io)
Extract feature vectors by comparing pairs of (Wikidata, target catalog) records.
Main features:
exact match on full names and URLs
match on tokenized names, URLs, and genres
Levenshtein distance on name tokens
string kernel similarity on name tokens
weighted intersection on name tokens
match on dates by maximum shared precision
cosine similarity on textual descriptions
match on occupation QIDs
See features for more details.
This function uses multithreaded parallel processing.
- Parameters
candidate_pairs (MultiIndex) – an index of (QID, target ID) pairs that should undergo comparison
wikidata (DataFrame) – a preprocessed Wikidata dataset (typically a chunk)
target (DataFrame) – a preprocessed target catalog dataset (typically a chunk)
path_io (str) – input/output path to an extracted feature file
- Return type
DataFrame
- Returns
the feature vectors dataset
-
soweego.linker.workflow.preprocess_target(goal, target_reader)
Preprocess a target catalog dataset: workflow step 2.
This function consumes pandas.DataFrame chunks and should be pipelined after build_target().
Preprocessing actions:
drop unneeded columns holding target DB primary keys
rename non-null catalog ID columns & drop others
drop columns with null values only
pair dates with their precision and drop precision columns when applicable
aggregate denormalized data on target ID
(shared with preprocess_wikidata()) normalize columns with names, occupations, dates, when applicable
- Parameters
goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification
target_reader (Iterator[DataFrame]) – a dataset reader as returned by build_target()
- Return type
DataFrame
- Returns
the generator yielding preprocessed pandas.DataFrame chunks
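Two of the preprocessing actions above, pairing dates with their precision and aggregating denormalized rows on the target ID, can be sketched in plain Python. The column names (tid, name, born, born_precision), the precision code, and the aggregate helper are all hypothetical; soweego works on pandas.DataFrame chunks and may proceed differently.

```python
from collections import defaultdict
from typing import Dict, List

# Made-up denormalized rows: the same target ID appears once per name variant
rows = [
    {'tid': '42', 'name': 'John Doe', 'born': '1970-01-01', 'born_precision': 9},
    {'tid': '42', 'name': 'J. Doe', 'born': '1970-01-01', 'born_precision': 9},
    {'tid': '77', 'name': 'Jane Roe', 'born': None, 'born_precision': None},
]


def aggregate(rows: List[Dict]) -> Dict[str, Dict[str, list]]:
    """Group rows by target ID, collecting distinct values into lists."""
    merged = defaultdict(lambda: {'name': [], 'born': []})
    for row in rows:
        record = merged[row['tid']]
        if row['name'] not in record['name']:
            record['name'].append(row['name'])
        if row['born'] is not None:
            # Pair the date with its precision; the separate
            # precision column is no longer needed afterwards
            pair = (row['born'], row['born_precision'])
            if pair not in record['born']:
                record['born'].append(pair)
    return dict(merged)


aggregated = aggregate(rows)
```

Each target ID ends up with list-valued fields, which is the input shape the features module expects (pairs of list objects).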
-
soweego.linker.workflow.preprocess_wikidata(goal, wikidata_reader)
Preprocess a Wikidata dataset: workflow step 2.
This function consumes pandas.DataFrame chunks and should be pipelined after build_wikidata().
Preprocessing actions:
set QIDs as pandas.core.indexes.base.Index of the chunk
drop columns with null values only
(training) ensure one target ID per QID
tokenize names, URLs, genres, when applicable
(shared with preprocess_target()) normalize columns with names, occupations, dates, when applicable
- Parameters
goal (str) – {'training', 'classification'}. Whether the dataset is for training or classification
wikidata_reader (JsonReader) – a dataset reader as returned by build_wikidata()
- Return type
Iterator[DataFrame]
- Returns
the generator yielding preprocessed pandas.DataFrame chunks
blocking
Custom blocking technique for the Record Linkage Toolkit, where blocking stands for record pairs indexing.
In a nutshell, blocking means finding candidate pairs suitable for comparison: this is essential to avoid blind comparison of all records, thus reducing the overall complexity of the task. In a supervised learning scenario, this translates into finding relevant training and classification samples.
Given a Wikidata pandas.Series (dataset column), this technique finds samples through full-text search in natural language mode against the target catalog database.
Target catalog identifiers of the output pandas.MultiIndex are also passed to build_target() for building the actual target dataset.
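To see why blocking cuts down the number of comparisons, consider the toy sketch below. It swaps the database full-text search for an in-memory inverted index over name tokens, so it only illustrates the idea; the helper names and sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def build_token_index(target_names: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased name token to the target IDs that contain it."""
    index = defaultdict(set)
    for target_id, name in target_names.items():
        for token in name.lower().split():
            index[token].add(target_id)
    return index


def find_candidate_pairs(
    wikidata_names: Dict[str, List[str]], index: Dict[str, Set[str]]
) -> List[Tuple[str, str]]:
    """Return sorted (QID, target ID) pairs sharing at least one token."""
    pairs = set()
    for qid, names in wikidata_names.items():
        for name in names:
            for token in name.lower().split():
                for target_id in index.get(token, ()):
                    pairs.add((qid, target_id))
    return sorted(pairs)


# Made-up sample data, for illustration only
target_names = {'t1': 'John Doe', 't2': 'Jane Roe', 't3': 'Richard Miles'}
wikidata_names = {'Q1': ['John Doe', 'Johnny Doe'], 'Q2': ['Jane Smith']}

index = build_token_index(target_names)
candidate_pairs = find_candidate_pairs(wikidata_names, index)
```

Here only 2 of the 6 possible (QID, target ID) pairs survive as candidates for feature extraction.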
-
soweego.linker.blocking.find_samples(goal, catalog, wikidata_column, chunk_number, target_db_entity, dir_io)
Build a blocking index by looking up target catalog identifiers given a Wikidata dataset column. A meaningful column should hold strings.
Under the hood, run full-text search in natural language mode against the target catalog database.
This function uses multithreaded parallel processing.
- Parameters
goal (str) – {'training', 'classification'}. Whether the samples are for training or classification
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
wikidata_column (Series) – a Wikidata dataset column holding values suitable for full-text search against the target database
chunk_number (int) – which Wikidata chunk will undergo blocking. Typically returned by calling enumerate() over preprocess_wikidata()
target_db_entity (~DB_ENTITY) – an ORM entity (AKA table) of the target catalog database that full-text search should aim at
dir_io (str) – input/output directory where index chunks will be read/written
- Return type
MultiIndex
- Returns
the blocking index holding candidate pairs
features
A set of custom features suitable for the Record Linkage Toolkit, where feature extraction stands for record pairs comparison.
Input: pairs of list objects coming from Wikidata and target catalog pandas.DataFrame columns as per preprocess_wikidata() and preprocess_target() output.
Output: a feature vector pandas.Series.
All classes in this module share the following constructor parameters:
left_on (str) - a Wikidata column label
right_on (str) - a target catalog column label
missing_value - (optional) a score to fill null values
label - (optional) a label for the output feature Series
Specific parameters are documented in the __init__ method of each class.
All classes in this module implement recordlinkage.base.BaseCompareFeature, and can be added to the feature extractor object recordlinkage.Compare.
Usage:
>>> import recordlinkage as rl
>>> from soweego.linker import features
>>> extractor = rl.Compare()
>>> source_column, target_column = 'birth_name', 'fullname'
>>> feature = features.ExactMatch(source_column, target_column)
>>> extractor.add(feature)
-
class soweego.linker.features.ExactMatch(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)
Compare pairs of lists through exact match on each pair of elements.
-
__init__(left_on, right_on, match_value=1.0, non_match_value=0.0, missing_value=0.0, label=None)
-
-
class soweego.linker.features.SimilarStrings(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)
Compare pairs of lists holding strings through similarity measures on each pair of elements.
-
__init__(left_on, right_on, algorithm='levenshtein', threshold=None, missing_value=0.0, analyzer=None, ngram_range=(2, 2), label=None)
- Parameters
algorithm (str) – (optional) {'cosine', 'levenshtein'}. A string similarity algorithm, either the cosine similarity or the Levenshtein distance respectively
threshold (Optional[float]) – (optional) a threshold to filter features with a lower or equal score
missing_value (float) – (optional) a score to fill null values
analyzer – (optional, only applies when algorithm='cosine') {'soweego', 'word', 'char', 'char_wb'}. A text analyzer to preprocess input. It is passed to the analyzer parameter of sklearn.feature_extraction.text.CountVectorizer.
'soweego' is soweego.commons.text_utils.tokenize()
{'word', 'char', 'char_wb'} are scikit-learn built-ins. See the CountVectorizer documentation for more details
None is str.split(), and means input is already preprocessed
ngram_range (Tuple[int, int]) – (optional, only applies when algorithm='cosine' and analyzer is not 'soweego'). Lower and upper boundary for n-gram extraction, passed to CountVectorizer
label (Optional[str]) – (optional) a label for the output feature Series
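As an illustration of the 'levenshtein' option, the sketch below scores a pair of string lists with a normalized Levenshtein similarity. It is not the class's actual code: the normalization by the longest string and the choice to keep the best score across all element pairs are assumptions, and the function names are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programming scheme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Turn the distance into a [0, 1] score (1 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_similarity(left: List[str], right: List[str], missing_value: float = 0.0) -> float:
    """ASSUMPTION: a pair of lists is scored by its best-matching elements."""
    if not left or not right:
        return missing_value
    return max(similarity(l, r) for l in left for r in right)
```

For instance, best_similarity(['john doe'], ['j. doe', 'john doe']) gives 1.0, while an empty list on either side falls back to missing_value.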
-
-
class soweego.linker.features.SimilarDates(left_on, right_on, missing_value=0.0, label=None)
Compare pairs of lists holding dates through match by maximum shared precision.
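One possible reading of "match by maximum shared precision" is to compare two dates only down to the finest precision that both of them have. The sketch below follows that reading and is an assumption throughout: the date representation (a components tuple plus the count of significant components), the binary score, and the function names are all hypothetical, and the real class may score pairs differently.

```python
from typing import List, Tuple

# ASSUMPTION: a date is ((year, month, day), precision), where precision
# counts the leading components that are significant (1 = year, 3 = day)
Date = Tuple[Tuple[int, int, int], int]


def dates_match(left: Date, right: Date) -> bool:
    """Compare only the components both dates are precise about."""
    (left_parts, left_precision), (right_parts, right_precision) = left, right
    shared = min(left_precision, right_precision)
    return left_parts[:shared] == right_parts[:shared]


def similar_dates(left: List[Date], right: List[Date], missing_value: float = 0.0) -> float:
    """Score 1.0 if any pair of dates matches at its shared precision."""
    if not left or not right:
        return missing_value
    return float(any(dates_match(l, r) for l in left for r in right))


year_only = ((1970, 1, 1), 1)   # only the year is known
full_date = ((1970, 5, 20), 3)  # known down to the day
other_day = ((1970, 1, 1), 3)
```

Under this reading, year_only matches full_date because they agree on the year, whereas other_day and full_date do not match since both are precise to the day.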
Compare pairs of lists holding string tokens through weighted intersection.
Compare pairs of lists holding occupation QIDs (ontology classes) through expansion of the class hierarchy, plus intersection of values.
Compare pairs of lists holding string tokens through weighted intersection.
This feature is similar to SharedTokens, but has extra functionality:
handles arbitrary stop words
accepts nested list of tokens
output score is the percentage of tokens in the smallest set which are shared among both sets
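The score described above, the share of the smallest token set that also appears in the other set, can be sketched as follows. This is an illustration under assumptions, not the class's code: the function names and the stop_words argument are hypothetical, and the score is returned as a fraction in [0, 1].

```python
from typing import FrozenSet, Iterable, Iterator


def flatten(tokens: Iterable) -> Iterator[str]:
    """Yield tokens from arbitrarily nested lists, tuples, or sets."""
    for item in tokens:
        if isinstance(item, (list, tuple, set)):
            yield from flatten(item)
        else:
            yield item


def shared_tokens_score(
    left: Iterable,
    right: Iterable,
    stop_words: FrozenSet[str] = frozenset(),
    missing_value: float = 0.0,
) -> float:
    """Shared tokens divided by the size of the smallest token set."""
    left_set = set(flatten(left)) - stop_words
    right_set = set(flatten(right)) - stop_words
    if not left_set or not right_set:
        return missing_value
    shared = left_set & right_set
    return len(shared) / min(len(left_set), len(right_set))
```

Dividing by the smallest set means a short name fully contained in a longer one still scores 1.0.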
classifiers
A set of custom supervised classifiers suitable for the Record Linkage Toolkit. It includes neural networks and support-vector machines.
All classes implement recordlinkage.base.BaseClassifier: typically, you will use its fit(), predict(), and prob() methods.
-
class soweego.linker.classifiers.GatedEnsembleClassifier(num_features, **kwargs)
Ensemble of classifiers, whose predictions are joined by using a further meta-learner, which decides the final output based on the prediction of the base classifiers.
This classifier uses mlens.ensemble.SuperLearner to implement the gating functionality.
The parameters, and their default values, are:
- meta_layer: Name of the classifier to use as a meta layer. By default this is single_layer_perceptron
- folds: The number of folds to use for cross validation when generating the training set for the meta_layer. The default value for this is 2.
For a better explanation of this parameter, see:
Polley, Eric C. and van der Laan, Mark J., “Super Learner In Prediction” (May 2010). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266 https://biostats.bepress.com/ucbbiostat/paper266/
-
class soweego.linker.classifiers.MultiLayerPerceptron(num_features, **kwargs)
A multi-layer perceptron classifier.
This class implements a keras.Sequential model with the following default architecture:
Dense layer 1, with 128 output dimension and relu activation function
BatchNormalization layer
Dense layer 2, with 32 output dimension and relu activation function
BatchNormalization layer
Dense layer 3, with 1 output dimension and sigmoid activation function
adadelta optimizer
binary_crossentropy loss function
accuracy metric for evaluation
If you want to override the default parameters, you can pass the following keyword arguments to the constructor:
activations - a triple with values for (dense layer 1, dense layer 2, dense layer 3). See available activations
optimizer - see optimizers
loss - see available loss functions
metrics - see available metrics
-
class soweego.linker.classifiers.RandomForest(*args, **kwargs)
A Random Forest classifier.
This class implements sklearn.ensemble.RandomForestClassifier, and receives the same parameters.
It fits multiple decision trees on sub-samples of the dataset and averages their results to improve accuracy and reduce over-fitting.
The default parameters are:
n_estimators: 500
criterion: entropy
max_features: None
bootstrap: True
-
prob(feature_vectors)
Classify record pairs and include the probability score of being a match.
- Parameters
feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details
- Return type
Series
- Returns
the classification results
-
class soweego.linker.classifiers.SVCClassifier(*args, **kwargs)
A support-vector machine classifier.
This class implements sklearn.svm.SVC, which is based on the libsvm library.
This classifier differs from recordlinkage.classifiers.SVMClassifier, which implements sklearn.svm.LinearSVC, based on the liblinear library.
Main highlights:
output probability scores
can use non-linear kernels
higher training time (quadratic to the number of samples)
-
prob(feature_vectors)
Classify record pairs and include the probability score of being a match.
- Parameters
feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details
- Return type
Series
- Returns
the classification results
-
class soweego.linker.classifiers.SingleLayerPerceptron(num_features, **kwargs)
A single-layer perceptron classifier.
This class implements a keras.Sequential model with the following default architecture:
single Dense layer
sigmoid activation function
adam optimizer
binary_crossentropy loss function
accuracy metric for evaluation
If you want to override the default parameters, you can pass the following keyword arguments to the constructor:
activation - see available activations
optimizer - see optimizers
loss - see available loss functions
metrics - see available metrics
-
class soweego.linker.classifiers.StackedEnsembleClassifier(num_features, **kwargs)
Ensemble of stacked classifiers: classifiers are arranged in layers, with each layer taking the output of the previous one as input. The predictions of the final layer are merged by a meta-learner (as in GatedEnsembleClassifier), which decides the final output based on the predictions of the base classifiers.
This classifier uses mlens.ensemble.SuperLearner to implement the stacking functionality.
The parameters, and their default values, are:
- meta_layer: Name of the classifier to use as a meta layer. By default this is single_layer_perceptron
- folds: The number of folds to use for cross validation when generating the training set for the meta_layer. The default value for this is 2.
For a better explanation of this parameter, see:
Polley, Eric C. and van der Laan, Mark J., “Super Learner In Prediction” (May 2010). U.C. Berkeley Division of Biostatistics Working Paper Series. Working Paper 266 https://biostats.bepress.com/ucbbiostat/paper266/
-
class soweego.linker.classifiers.VotingClassifier(num_features, **kwargs)
A basic ensemble classifier which uses a voting procedure to decide the final outcome of a prediction.
This class implements sklearn.ensemble.VotingClassifier.
It combines a set of classifiers and uses majority vote or average predicted probabilities to pick the final prediction. See scikit’s user guide.
The parameter voting can have as values either “hard” or “soft”.
- hard - the label predicted by the majority of base classifiers is used as the final prediction. Note that this does not return probabilities, only the final label.
- soft - the probability that a pair is a match is taken from all base classifiers and then averaged. This average is what is returned by the classifier.
By default voting=soft.
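The difference between the two voting modes is easiest to see on numbers. The sketch below is plain Python rather than the scikit-learn implementation; hard_vote, soft_vote, the 0.5 decision threshold, and the scores are illustrative assumptions.

```python
from typing import List


def hard_vote(probabilities: List[float], threshold: float = 0.5) -> int:
    """Each base classifier casts a match/non-match vote; majority wins."""
    votes = [p >= threshold for p in probabilities]
    return int(sum(votes) > len(votes) / 2)


def soft_vote(probabilities: List[float]) -> float:
    """Average the match probabilities of all base classifiers."""
    return sum(probabilities) / len(probabilities)


# Made-up match probabilities from three base classifiers for one pair
base_scores = [0.9, 0.4, 0.45]

hard_label = hard_vote(base_scores)  # two of three vote non-match
soft_score = soft_vote(base_scores)  # one confident classifier lifts the mean
```

With these scores, hard voting labels the pair a non-match, while soft voting returns an average of about 0.58, which would count as a match at a 0.5 threshold.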
-
prob(feature_vectors)
Classify record pairs and include the probability score of being a match.
- Parameters
feature_vectors (DataFrame) – a DataFrame computed via record pairs comparison. This should be recordlinkage.Compare.compute() output. See extract_features() for more details
- Return type
Series
- Returns
the classification results
train
Train supervised linking algorithms.
-
soweego.linker.train.build_training_set(catalog, entity, dir_io)
Build a training set.
- Parameters
- Return type
Tuple[DataFrame, MultiIndex]
- Returns
the feature vectors and positive samples pair. Features are computed by comparing (QID, catalog ID) pairs. Positive samples are catalog IDs available in Wikidata
-
soweego.linker.train.execute(classifier, catalog, entity, tune, k, dir_io, **kwargs)
Train a supervised linker.
Build the training set relevant to the given catalog and entity
train a model with the given classifier
- Parameters
classifier (str) – {'naive_bayes', 'linear_support_vector_machines', 'support_vector_machines', 'single_layer_perceptron', 'multi_layer_perceptron'}. A supported classifier
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
tune (bool) – whether to run grid search for hyperparameters tuning or not
k (int) – number of folds for hyperparameters tuning. It is used only when tune=True
dir_io (str) – input/output directory where working files will be read/written
kwargs – extra keyword arguments that will be passed to the model initialization
- Return type
- Returns
the trained model
link
Run supervised linkers.
-
soweego.linker.link.execute(model_path, catalog, entity, threshold, name_rule, dir_io)
Run a supervised linker.
Build the classification set relevant to the given catalog and entity
generate links between Wikidata items and catalog identifiers
- Parameters
model_path (str) – path to a trained model file
catalog (str) – {'discogs', 'imdb', 'musicbrainz'}. A supported catalog
entity (str) – {'actor', 'band', 'director', 'musician', 'producer', 'writer', 'audiovisual_work', 'musical_work'}. A supported entity
threshold (float) – minimum confidence score for generated links. Those below this value are discarded. Must be a float between 0 and 1
name_rule (bool) – whether to enable the rule on full names or not: if True, links with different full names are discarded after classification
dir_io (str) – input/output directory where working files will be read/written
- Return type
Iterator[Series]
- Returns
the generator yielding chunks of links
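The effect of threshold and name_rule on classified pairs can be sketched as a simple post-filter. This is a hypothetical illustration, not soweego's code: filter_links, its extra arguments, and the reading of "different full names" as "no full name in common" are assumptions, and the data is made up.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def filter_links(
    scores: Dict[Pair, float],
    threshold: float,
    name_rule: bool = False,
    wd_names: Optional[Dict[str, List[str]]] = None,
    target_names: Optional[Dict[str, List[str]]] = None,
) -> Dict[Pair, float]:
    """Keep pairs scoring at least threshold, optionally requiring a shared full name."""
    wd_names = wd_names or {}
    target_names = target_names or {}
    links = {}
    for (qid, tid), score in scores.items():
        if score < threshold:
            continue  # below the minimum confidence score
        if name_rule:
            # ASSUMPTION: names differ when no full name is shared
            shared = set(wd_names.get(qid, ())) & set(target_names.get(tid, ()))
            if not shared:
                continue
        links[(qid, tid)] = score
    return links


# Made-up classification output and names, for illustration only
scores = {('Q1', 't1'): 0.95, ('Q2', 't2'): 0.80, ('Q3', 't3'): 0.30}
wd_names = {'Q1': ['john doe'], 'Q2': ['jane roe'], 'Q3': ['r. miles']}
target_names = {'t1': ['john doe'], 't2': ['janet roe'], 't3': ['r. miles']}

by_threshold = filter_links(scores, threshold=0.5)
with_name_rule = filter_links(scores, 0.5, True, wd_names, target_names)
```

The threshold alone drops the low-scoring pair, and enabling the name rule additionally discards the pair whose full names differ.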