How to reduce duplication of Works #127

@VladimirAlexiev

This query about SemOpenAlex works:

PREFIX dct: <http://purl.org/dc/terms/>
PREFIX Service: <http://www.metaphacts.com/ontologies/platform/service/>
PREFIX entitylookup: <http://www.metaphacts.com/ontologies/platform/service/entitylookup/>
SELECT * WHERE {
  SERVICE Service:entityLookup {
    ?subject entitylookup:entityName "semopen";
             entitylookup:limit 100 ;
             entitylookup:score ?score;
             entitylookup:rank ?rank.
  }
  ?subject dct:title ?title
} ORDER BY DESC (?score) DESC (?rank) 

It returns 7 Works: 3 "true" works plus 4 variants/versions thereof:

| subject | rank | title | comment |
|---|---|---|---|
| https://semopenalex.org/work/W4388144113 | 90.0 | SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples | published |
| https://semopenalex.org/work/W4385682125 | 45.0 | SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples | arxiv preprint |
| https://semopenalex.org/work/W4393797504 | 38.0 | SemOpenAlex Embeddings | Zenodo "all versions" |
| https://semopenalex.org/work/W4393798967 | 38.0 | SemOpenAlex Embeddings | Zenodo new version |
| https://semopenalex.org/work/W4393895619 | 38.0 | SemOpenAlex Embeddings | Zenodo old version |
| https://semopenalex.org/work/W4393691335 | 36.0 | RDF Knowledge Graph SemOpenAlex-SemanticWeb | Zenodo "all versions" |
| https://semopenalex.org/work/W4393746966 | 35.0 | RDF Knowledge Graph SemOpenAlex-SemanticWeb | Zenodo only version |

If this pattern holds, then over half of all Works in SOA are duplicates.
The problem is especially galling for the last case: every Zenodo resource, even if it has no versions at all, is represented by two Zenodo DOIs and consequently twice in SOA.
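
To make the "over half" estimate concrete, here is a minimal sketch that groups the 7 results above by title and counts the surplus entries. Matching on a normalized title is only an assumption for illustration: it is not how SemOpenAlex or OpenAlex identify duplicates, and it would wrongly merge distinct works that happen to share a title. The helper name `group_by_title` is hypothetical.

```python
from collections import defaultdict

# (work IRI, title) pairs copied from the query results listed in this issue.
RESULTS = [
    ("https://semopenalex.org/work/W4388144113",
     "SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples"),
    ("https://semopenalex.org/work/W4385682125",
     "SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples"),
    ("https://semopenalex.org/work/W4393797504", "SemOpenAlex Embeddings"),
    ("https://semopenalex.org/work/W4393798967", "SemOpenAlex Embeddings"),
    ("https://semopenalex.org/work/W4393895619", "SemOpenAlex Embeddings"),
    ("https://semopenalex.org/work/W4393691335",
     "RDF Knowledge Graph SemOpenAlex-SemanticWeb"),
    ("https://semopenalex.org/work/W4393746966",
     "RDF Knowledge Graph SemOpenAlex-SemanticWeb"),
]


def group_by_title(rows):
    """Group work IRIs by a case- and whitespace-normalized title (hypothetical helper)."""
    groups = defaultdict(list)
    for iri, title in rows:
        key = " ".join(title.casefold().split())
        groups[key].append(iri)
    return dict(groups)


groups = group_by_title(RESULTS)
distinct = len(groups)                # number of distinct titles
duplicates = len(RESULTS) - distinct  # entries beyond the first per title
share = duplicates / len(RESULTS)

print(f"{len(RESULTS)} Works, {distinct} distinct titles, "
      f"{duplicates} duplicates ({share:.0%})")
# prints: 7 Works, 3 distinct titles, 4 duplicates (57%)
```

For this sample, 4 of 7 entries (57%) are surplus. Whether the same ratio holds across all of SOA is exactly the open question.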

I know that these problems come from OpenAlex but I don't know where to report them.
I also understand that deduplicating works is not easy, e.g. the arXiv preprint doesn't state the DOI of the published version.
But maybe at least you can remove the Zenodo "all versions" URL/DOI, and maybe all old versions?
