DBSTREAM incorrect labelling of noisy micro-clusters #1730

@th3sh3ph3rd

Versions

river version: 0.23.0
Python version: 3.12.8
Operating system: Ubuntu 24.04.3 LTS

Describe the bug

The original DBSTREAM paper defines noisy micro-clusters as follows:
[Image: the paper's definition of noisy micro-clusters]
However, the current River implementation of DBSTREAM does not handle noisy micro-clusters correctly. Noisy micro-clusters that have not been removed by cleanup are included in the list of clusters, which is not in line with the behaviour described in the original paper.

I propose the following fix:
Add

if self._micro_clusters[index].weight < self.minimum_weight:
    continue

after line 332 in the DBSTREAM implementation. This ensures that noisy micro-clusters are not labelled as clusters.
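
To illustrate the effect of that guard in isolation, here is a minimal standalone sketch. It is not River's actual code: `MicroCluster` and `non_noisy_indices` are hypothetical names made up for this example, and the loop only mimics the shape of the proposed check. The data mirrors the reproduction below (three points, weight 1 each, minimum_weight=5).

```python
from dataclasses import dataclass


@dataclass
class MicroCluster:
    """Hypothetical stand-in for a micro-cluster; not River's class."""
    center: dict
    weight: float


def non_noisy_indices(micro_clusters: dict, minimum_weight: float) -> list:
    """Return the indices of micro-clusters whose weight reaches minimum_weight."""
    kept = []
    for index in micro_clusters:
        # Same guard as the proposed fix: skip noisy micro-clusters.
        if micro_clusters[index].weight < minimum_weight:
            continue
        kept.append(index)
    return kept


# Three micro-clusters of weight 1, as in the reproduction below.
micro_clusters = {
    0: MicroCluster({0: 0, 1: 0}, 1),
    1: MicroCluster({0: 50, 1: 50}, 1),
    2: MicroCluster({0: 100, 1: 100}, 1),
}
print(non_noisy_indices(micro_clusters, minimum_weight=5))  # prints []
```

With the guard in place, none of the three micro-clusters would be labelled as a cluster, which is the behaviour I would expect from the paper's definition.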

Steps/code to reproduce

The following code outputs 3 clusters, even though their respective weights are clearly below the minimum_weight threshold, qualifying them as noisy micro-clusters.

from river import cluster
from river import stream

X = [
    [0, 0], [50, 50], [100, 100]
]

dbstream = cluster.DBSTREAM(
    clustering_threshold=1.0,
    fading_factor=0.001,
    cleanup_interval=10,
    intersection_factor=0.3,
    minimum_weight=5
)

for x, _ in stream.iter_array(X):
    dbstream.learn_one(x)

for _, c in dbstream.clusters.items():
    print(f"center: {c.center}, weight: {c.weight}")

Output:

center: {0: 0, 1: 0}, weight: 1
center: {0: 50, 1: 50}, weight: 1
center: {0: 100, 1: 100}, weight: 1
