Skip to content

Commit eaf8963

Browse files
committed
feat(dedupe): Handle Geonames records with overridden parents
Background ========== In pelias/geonames#93 we added some special case logic to the Geonames importer that ensures Geonames records in the `locality` and `localadmin` layer have themselves as parents in that layer. Before this change, they would have a Who's on First parent, but these parents didn't always line up perfectly. Sometimes it would lead to broken labels, and as I recall it could also break search queries that rely on locality/localadmin names. Hierarchy checks ================ This special logic causes problems with our hierarchy checks, which expect records that can be considered duplicates to share all parents higher than the _lower_ record. So for example, if a locality and localadmin are to be considered duplicates, the hierarchy must be the same from the country layer down to localadmin. Geonames localadmins ==================== Geonames seems to have a penchant for having both a `locality` _and_ a `localadmin` record for a given city, even when the local administrative divisions don't really support such nuance. These records often have a name following the format 'City of X', which makes them very disruptive and confusing when shown in a list of results. Deduplication ============= Our deduplication code can handle minor name differences like 'City of' after #1371, but can't handle the hierarchy differences that generally occur with these records. Generally, there will be one of two scenarios: - A WOF locality record for the city can't deduplicate with the Geonames localadmin because the WOF record is parented by a WOF localadmin - A WOF locality record for the city can't deduplicate with the Geonames localadmin beause the WOF record has no localadmin parent at all Concordances (from #1606) generally don't help either, since ther often isn't a localadmin in WOF to even have a concordance to the Geonames localadmin. Adding a hierarchy exception ============================ This PR works by skipping the hierarchy checks for any layer where a Geonames record has itself as a parent. This means that assuming all the other layers are the same, the names are compatible, etc, deduplication is still possible. Impact ====== Of the 314 cities in our [`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json) fuzzy tests, most of them (125) had a 'City of X' record somewhere in the results when querying via the autocomplete endpoint. With this PR, there are only 15 cases. Potential regressions ===================== Theoretically, this could allow records that aren't actually duplicates to be deduped, but they would have to have a similar name and likely share at least a `county`, so it feels like the chance for error is limited.
1 parent 533884c commit eaf8963

File tree

2 files changed

+57
-0
lines changed

2 files changed

+57
-0
lines changed

helper/diffPlaces.js

Lines changed: 23 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -46,6 +46,24 @@ function isUsState(item) {
4646
return item.parent.country_a[0] === 'USA' && item.layer === 'region';
4747
}
4848

49+
// Geonames records in the locality and localadmin layer are parented by themselves
50+
// This breaks our other hierarchy logic, so check for this special case
51+
function isGeonamesWithSelfParent(item, placeType) {
52+
if (item.source !== 'geonames') { return false; }
53+
if (item.layer !== placeType) { return false; }
54+
55+
// get the relevant parent id(s) for the placeType in question
56+
const parent_records = item.parent[`${placeType}_id`];
57+
58+
// check if the parent ids at this layer match this Geonames record
59+
// we have special cased Geonames parents in many cases
60+
// handle both array and scalar values
61+
if (item.source_id === parent_records) { return true; }
62+
if (parent_records.includes(item.source_id)) { return true; }
63+
64+
return false;
65+
}
66+
4967
/**
5068
* Compare the parent properties if they exist.
5169
* Returns false if the objects are the same, else true.
@@ -108,6 +126,11 @@ function isParentHierarchyDifferent(item1, item2){
108126
// skip layers that are more granular than $maxRank
109127
if (rank > maxRank){ return false; }
110128

129+
// Special case Geonames records that are parented by themselves and would otherwise break these checks
130+
if (isGeonamesWithSelfParent(item1, placeType) || isGeonamesWithSelfParent(item2, placeType)) {
131+
return false;
132+
}
133+
111134
// ensure the parent ids are the same for all placetypes
112135
return isPropertyDifferent( parent1, parent2, `${placeType}_id` );
113136
});

test/unit/helper/diffPlaces.js

Lines changed: 34 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -790,6 +790,40 @@ module.exports.tests.geonames = function (test, common) {
790790
});
791791
};
792792

793+
module.exports.tests.geonames_self_parent = function (test, common) {
794+
test('geonames records that parent themselves should not break hierarchy checks', function(t) {
795+
const gn_record = {
796+
source: 'geonames',
797+
layer: 'localadmin',
798+
source_id: '7163824',
799+
name: {
800+
default: 'City of New York'
801+
},
802+
parent: {
803+
country_id: ['85633793'], // USA, WOF
804+
region_id: ['85688543'], // NY, WOF
805+
localadmin_id: ['7163824'] // New York County, Geonames
806+
}
807+
};
808+
const wof_record = {
809+
source: 'whosonfirst',
810+
layer: 'locality',
811+
source_id: '85977539',
812+
name: {
813+
default: 'New York'
814+
},
815+
parent: {
816+
country_id: ['85633793'], // USA, WOF
817+
region_id: ['85688543'], // NY, WOF
818+
//localadmin_id: null // no localadmin is set for NYC in WOF
819+
}
820+
};
821+
822+
t.false(isDifferent(gn_record, wof_record), 'should be the same based on hierarchy');
823+
t.end();
824+
});
825+
};
826+
793827
module.exports.all = function (tape, common) {
794828

795829
function test(name, testFunction) {

0 commit comments

Comments
 (0)