Commit eaf8963
feat(dedupe): Handle Geonames records with overridden parents
Background
==========
In pelias/geonames#93 we added some special case
logic to the Geonames importer that ensures Geonames records in the
`locality` and `localadmin` layer have themselves as parents in that
layer.
Before this change, they would have a Who's on First parent, but these
parents didn't always line up perfectly. Sometimes it would lead to
broken labels, and as I recall it could also break search queries that
rely on locality/localadmin names.
Hierarchy checks
================
This special logic causes problems with our hierarchy checks, which
expect records that can be considered duplicates to share all parents
higher than the _lower_ record.
So for example, if a locality and localadmin are to be considered
duplicates, the hierarchy must be the same from the country layer down
to localadmin.
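
As a rough illustration of the check described above, here is a minimal sketch. It is not the actual Pelias implementation: the layer list, the `sharesHierarchy` name, and the `layer` / `parent.<layer>_id` record shape are all assumptions made for the example.

```javascript
// Hypothetical layer ordering, broadest first (an assumption for this sketch).
const LAYERS = ['country', 'region', 'county', 'localadmin', 'locality'];

// Compare every parent layer above the lower (more specific) of the two records.
// The `layer` and `parent.<layer>_id` field names are assumed for illustration.
function sharesHierarchy(a, b) {
  const lower = Math.max(LAYERS.indexOf(a.layer), LAYERS.indexOf(b.layer));
  return LAYERS.slice(0, lower).every(
    (layer) => a.parent[`${layer}_id`] === b.parent[`${layer}_id`]
  );
}

// A WOF locality and the WOF localadmin that parents it.
const wofLocality = {
  layer: 'locality',
  parent: { country_id: '1', region_id: '2', county_id: '3', localadmin_id: '4', locality_id: '5' }
};
const wofLocaladmin = {
  layer: 'localadmin',
  parent: { country_id: '1', region_id: '2', county_id: '3', localadmin_id: '4' }
};

// A Geonames localadmin that lists itself (made-up id '900') as its localadmin parent.
const geonamesLocaladmin = {
  layer: 'localadmin',
  parent: { country_id: '1', region_id: '2', county_id: '3', localadmin_id: '900' }
};

console.log(sharesHierarchy(wofLocality, wofLocaladmin));      // true
console.log(sharesHierarchy(wofLocality, geonamesLocaladmin)); // false
```

The second call fails only at the `localadmin` layer, which is exactly where the self-parenting logic from the Geonames importer changes the hierarchy.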
Geonames localadmins
====================
Geonames seems to have a penchant for having both a `locality` _and_ a
`localadmin` record for a given city, even when the local administrative
divisions don't really support such nuance.
These records often have a name following the format 'City of X', which
makes them very disruptive and confusing when shown in a list of
results.
Deduplication
=============
Our deduplication code can handle minor name differences like 'City of'
after #1371, but can't handle the
hierarchy differences that generally occur with these records.
Generally, there will be one of two scenarios:
- A WOF locality record for the city can't deduplicate with the Geonames
localadmin because the WOF record is parented by a WOF localadmin
- A WOF locality record for the city can't deduplicate with the Geonames
localadmin because the WOF record has no localadmin parent at all
Concordances (from #1606) generally
don't help either, since there often isn't a localadmin in WOF to even
have a concordance to the Geonames localadmin.
Adding a hierarchy exception
============================
This PR works by skipping the hierarchy checks for any layer
where a Geonames record has itself as a parent. This means that, assuming
all the other layers are the same, the names are compatible, etc.,
deduplication is still possible.
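
A minimal sketch of the exception, under stated assumptions: the layer list, the function names, and the `source` / `id` / `parent.<layer>_id` record shape are hypothetical and chosen for illustration, not taken from the code in this commit.

```javascript
// Hypothetical layer ordering, broadest first (an assumption for this sketch).
const LAYERS = ['country', 'region', 'county', 'localadmin', 'locality'];

// True when a Geonames record lists itself as the parent for `layer`.
// The `source`, `id`, and `parent` field names are assumed for illustration.
function isSelfParented(record, layer) {
  return record.source === 'geonames' && record.parent[`${layer}_id`] === record.id;
}

// Hierarchy check that skips any layer where either record is a
// self-parented Geonames record; all other layers must still match.
function hierarchiesCompatible(a, b) {
  const lower = Math.max(LAYERS.indexOf(a.layer), LAYERS.indexOf(b.layer));
  return LAYERS.slice(0, lower).every((layer) => {
    if (isSelfParented(a, layer) || isSelfParented(b, layer)) {
      return true; // skip the check for this layer
    }
    return a.parent[`${layer}_id`] === b.parent[`${layer}_id`];
  });
}

// Geonames localadmin that is its own localadmin parent (made-up ids).
const geonamesLocaladmin = {
  source: 'geonames', id: '900', layer: 'localadmin',
  parent: { country_id: '1', region_id: '2', county_id: '3', localadmin_id: '900' }
};

// Scenario 1: WOF locality parented by a WOF localadmin.
const wofWithLocaladmin = {
  source: 'whosonfirst', id: '5', layer: 'locality',
  parent: { country_id: '1', region_id: '2', county_id: '3', localadmin_id: '4', locality_id: '5' }
};

// Scenario 2: WOF locality with no localadmin parent at all.
const wofWithoutLocaladmin = {
  source: 'whosonfirst', id: '5', layer: 'locality',
  parent: { country_id: '1', region_id: '2', county_id: '3', locality_id: '5' }
};

console.log(hierarchiesCompatible(wofWithLocaladmin, geonamesLocaladmin));    // true
console.log(hierarchiesCompatible(wofWithoutLocaladmin, geonamesLocaladmin)); // true
```

Both scenarios from the list above now pass the hierarchy check, while a mismatch in any other layer (a different `county_id`, say) still blocks deduplication.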
Impact
======
Of the 314 cities in our
[`top_us_cities`](https://github.com/pelias/fuzzy-tests/blob/master/test_cases/top_us_cities.json)
fuzzy tests, 125 (about 40%) had a 'City of X' record
somewhere in the results when querying via the autocomplete endpoint.
With this PR, there are only 15 such cases.
Potential regressions
=====================
Theoretically, this could allow records that aren't actually duplicates
to be deduped, but they would have to have a similar name and likely
share at least a `county`, so it feels like the chance for error is
limited.

1 parent 533884c, commit eaf8963
2 files changed (+57, -0)