
Romanian company addresses parse issues #685

@maxbasmanov


I installed the latest version as of today (commit ee7aa9a, 5 days ago) with the Senzing model and tested it with some Romanian company addresses.

The parser output has several issues:
Șos Mihai Bravu, 1, Bl:2, Sc:c, Et:12, Ap:129, -, București, Sect 2
road: șos mihai bravu
house_number: 1 bl 2 sc:c
level: et 12
unit: ap 129
road: bucurești sect 2 - it should be city and suburb

Șos Mihai Bravu, 1, Bl:2, Sc:c, Et:12, Ap:129, Compartiment 2, București, Sect 2
road: șos mihai bravu
house_number: 1 bl 2 sc:c
level: et 12
unit: ap 129
house: compartiment 2 bucurești - bucurești should be city
road: sect 2 - sect 2 should be suburb
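
To make the expectation for the two addresses above concrete, here is a minimal sketch of a check. It assumes the parser output is available as a list of `(value, label)` pairs; the helper `missing_expected` and the variable names are hypothetical and not part of libpostal. The pairs are copied from the output reported above.

```python
def missing_expected(parsed, expected):
    """Return the expected (value, label) pairs that are absent from the parser output."""
    got = set(parsed)
    return [pair for pair in expected if pair not in got]


# Output reported above for the first address.
first = [
    ("șos mihai bravu", "road"),
    ("1 bl 2 sc:c", "house_number"),
    ("et 12", "level"),
    ("ap 129", "unit"),
    ("bucurești sect 2", "road"),
]

# Output reported above for the second address.
second = [
    ("șos mihai bravu", "road"),
    ("1 bl 2 sc:c", "house_number"),
    ("et 12", "level"),
    ("ap 129", "unit"),
    ("compartiment 2 bucurești", "house"),
    ("sect 2", "road"),
]

# What I would expect to see for both addresses.
expected = [("bucurești", "city"), ("sect 2", "suburb")]

for name, parsed in (("first", first), ("second", second)):
    print(name, missing_expected(parsed, expected))
```

In both cases the city and the suburb come back as missing, because "bucurești" and "sect 2" are attached to `road` or `house` instead.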

P-ța Emanuil Gojdu, 37, Bl:a5, Parter, Oradea
house: p-ța - p-ța should be part of road (https://github.com/openvenues/libpostal/blob/master/resources/dictionaries/ro/street_types.txt)
road: emanuil gojdu
house_number: 37
road: bl:a - should be bl:a5
house_number: 5 - should be part of previous line
level: parter
city: oradea

B-dul Pipera, 1/i, Et:7, Constructia C2, Biroul Nr.10, Compartiment 59, Oraș Voluntari
road: b-dul pipera
house_number: 1/i
level: et 7
house: constructia c2 biroul nr.10 compartiment 59 oraș - oraș is just "city"
city: voluntari

The same address written in a different format:
Oraş Voluntari, B-dul PIPERA, Nr. 1/I, CONSTRUCTIA C2, BIROUL NR.10, COMPARTIMENT 59, Etaj 7, Județ Ilfov, Cod poștal 77190
house: oraş voluntari - Oraş Voluntari is City Voluntari
road: b-dul pipera
house_number: nr. 1/i
house: constructia c2 biroul nr.10 compartiment
house_number: 59
level: etaj 7
house: județ ilfov cod poștal - județ is suburb, "cod poștal" should not be here
postcode: 77190

The following two are parsed well:
Str. Eufrosina Popescu, 46, -, București, Sect 3
road: str. eufrosina popescu
house_number: 46
city: bucurești
suburb: sect 3

Str. Balta Albina, 4, Et:1, Inedit Building, București, Sect 3
road: str. balta albina
house_number: 4
level: et 1
house: inedit building
city: bucurești
suburb: sect 3

The most confusing part is that the parsed data contains several "house_number", "road", or "house" items for a single address, which makes it very difficult to tell them apart later.
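
As an illustration of the problem, a minimal sketch of how repeated labels could be detected on the consumer side. It assumes the result is a list of `(value, label)` pairs; the helpers `group_by_label` and `repeated_labels` are hypothetical names, not libpostal functions.

```python
from collections import defaultdict


def group_by_label(parsed):
    """Collect values per label, preserving the order in which they appear."""
    grouped = defaultdict(list)
    for value, label in parsed:
        grouped[label].append(value)
    return dict(grouped)


def repeated_labels(parsed):
    """Return only the labels that occur more than once, with all their values."""
    return {
        label: values
        for label, values in group_by_label(parsed).items()
        if len(values) > 1
    }


# Output reported above for the first Șos Mihai Bravu address.
parsed = [
    ("șos mihai bravu", "road"),
    ("1 bl 2 sc:c", "house_number"),
    ("et 12", "level"),
    ("ap 129", "unit"),
    ("bucurești sect 2", "road"),
]

print(repeated_labels(parsed))
```

This only flags the ambiguity (two `road` values here); it cannot decide which of the values is the real road.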

Romanian company DB can be found here (open data):
working companies:
https://data.gov.ro/dataset/firme-inregistrate-la-registrul-comertului-pana-la-data-de-18-decembrie-2024/resource/3043787a-832a-4ccc-9712-f10da0092e14?inner_span=True

closed companies:
https://data.gov.ro/dataset/firme-inregistrate-la-registrul-comertului-pana-la-data-de-18-decembrie-2024/resource/88a7eaae-32c3-4d0c-9024-2bb846cb0bc4?inner_span=True

These datasets contain a large number of addresses that could be used to train the model.

Looking forward to your support.
Thanks
