Inverted index for keyword search within bigger text fields #7

@neeraj9

Let's say your data has an attribute that is a blob of text. Searching through that text currently requires a full scan in egraphdb. To avoid a full scan, which is impractical, how about creating a simple inverted index from that text and making it keyword-searchable?

Potential Steps:

  • Tokenize
  • Drop common words and retain only the useful ones. For example, create another table holding such common words, which egraphdb can then load into memory for quick access.
  • Simple spelling correction would be useful too.
  • Store multiple rows {keyword, sourceid, count} for the same data within the index table for a particular attribute, so that you could run, say, "select count(keyword), sum(count), sourceid from xyz where keyword in ('a', 'b') group by sourceid limit 10000". This is just a suggestion and not a strong rule.
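The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not egraphdb code: the `STOP_WORDS` set, the `VOCABULARY` list, and the helper names `tokenize`, `correct`, and `index_rows` are all hypothetical, and spelling correction is approximated with the standard library's `difflib.get_close_matches` rather than a dedicated spell checker.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stop-word list; the proposal is to load these from a table.
STOP_WORDS = frozenset({"a", "an", "and", "the", "is", "of", "to", "in"})

# Hypothetical vocabulary of known-good keywords used for spelling correction.
VOCABULARY = ["graph", "database", "index", "search"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def correct(token):
    """Map a token to its closest vocabulary entry, or keep it unchanged."""
    matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


def index_rows(source_id, text):
    """Return sorted (keyword, source_id, count) rows for one text attribute."""
    keywords = [correct(t) for t in tokenize(text) if t not in STOP_WORDS]
    counts = Counter(keywords)
    return sorted((kw, source_id, n) for kw, n in counts.items())


# "databse" is a deliberate typo to exercise the correction step.
rows = index_rows("doc1", "The graph databse is a graph of nodes")
for row in rows:
    print(row)
```

Each returned tuple corresponds to one row of the index table, so a single text attribute produces one row per distinct retained keyword.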

sample table:

CREATE TABLE `egraph_lookup_rindex_base` (
  -- keyword extracted from the text attribute
  `key_data` varbinary(255) NOT NULL,
  -- identifier of the source record that contains the keyword
  `id` binary(8) NOT NULL,
  `count` int NOT NULL COMMENT 'number of occurrences of keyword in id',
  CONSTRAINT pkey PRIMARY KEY (`id`, `key_data`),
  KEY `key_data` (`key_data`),
  KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
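
To see how the suggested lookup query behaves against a table of this shape, here is a small sketch using Python's built-in `sqlite3` module as a stand-in for MySQL. The assumptions are: text values replace the binary columns for readability, the query uses the sample table's column names (`key_data`, `id`) instead of `keyword` and `sourceid`, the sample rows are made up, and an `ORDER BY` is added so that the best matches come first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simplified SQLite version of the sample table (assumption: TEXT columns
# instead of varbinary/binary, which SQLite does not distinguish this way).
conn.execute(
    """
    CREATE TABLE egraph_lookup_rindex_base (
        key_data TEXT NOT NULL,
        id TEXT NOT NULL,
        "count" INTEGER NOT NULL,
        PRIMARY KEY (id, key_data)
    )
    """
)
conn.execute(
    "CREATE INDEX key_data_idx ON egraph_lookup_rindex_base (key_data)"
)

# Made-up index rows: (keyword, source id, occurrences in that source).
conn.executemany(
    "INSERT INTO egraph_lookup_rindex_base VALUES (?, ?, ?)",
    [
        ("graph", "doc1", 2),
        ("database", "doc1", 1),
        ("graph", "doc2", 1),
        ("index", "doc3", 4),
    ],
)

# For each source: how many of the searched keywords matched, and how many
# times they occur in total.
results = conn.execute(
    """
    SELECT id, COUNT(key_data) AS matched, SUM("count") AS occurrences
    FROM egraph_lookup_rindex_base
    WHERE key_data IN (?, ?)
    GROUP BY id
    ORDER BY matched DESC, occurrences DESC
    LIMIT 10000
    """,
    ("graph", "database"),
).fetchall()

for row in results:
    print(row)
```

Sources that match more of the searched keywords sort first, which gives a crude relevance ranking without any full scan of the original text.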
