Inverted index for keyword search within bigger text fields #7

Let's say your data has an attribute that is a large blob of text. Searching through that text currently requires a full scan in egraphdb, which is impractical. To avoid it, how about creating a simple inverted index from that text and making it keyword-searchable?

Potential Steps:

* Tokenize
* Drop common words and retain only the useful ones. The common words could be kept in a separate table, which egraphdb can then load into memory for quick access.
* Simple spelling correction would be useful too.
* Store multiple rows {keyword, sourceid, count} in the index table for the same data of a particular attribute, one row per keyword. A lookup could then be something like `select count(keyword), sum(count), sourceid from xyz where keyword in ('a', 'b') group by sourceid limit 10000`. This is just a suggestion and not a strong rule.
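
The steps above could be sketched roughly as follows. This is a minimal Python illustration and not egraphdb code: `STOP_WORDS`, `VOCABULARY`, `tokenize`, `correct` and `index_rows` are hypothetical names, the stop-word set would in practice be loaded from the table mentioned above, and the spelling correction is only a closest-match lookup using the standard library's `difflib`.

```python
import re
from collections import Counter
from difflib import get_close_matches

# Hypothetical stop-word list; in practice loaded from a dedicated table.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in"}

# Hypothetical vocabulary of known keywords, used only for spelling correction.
VOCABULARY = ["graph", "database", "index", "search"]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def correct(token):
    """Map a token to its closest known keyword, or keep it unchanged."""
    matches = get_close_matches(token, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else token


def index_rows(source_id, text):
    """Return sorted (keyword, source_id, count) rows for one text attribute."""
    keywords = (correct(t) for t in tokenize(text) if t not in STOP_WORDS)
    counts = Counter(keywords)
    return sorted((kw, source_id, n) for kw, n in counts.items())
```

Calling `index_rows(source_id, text)` once per text attribute would yield the rows to insert into the index table. A real implementation would probably need a language-aware tokenizer and a larger vocabulary.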

Sample table:

```sql
-- One row per (keyword, source id) pair of the indexed attribute.
create table `egraph_lookup_rindex_base` (
  `key_data` varbinary(255) NOT NULL,  -- the keyword
  `id` binary(8) NOT NULL,             -- id of the source data
  `count` int NOT NULL COMMENT 'number of occurrences of keyword in id',
  CONSTRAINT pkey PRIMARY KEY (`id`, `key_data`),
  KEY `key_data` (`key_data`),
  KEY `id` (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```
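
A lookup against such a table could then look roughly like the sketch below. To stay self-contained it uses Python's built-in `sqlite3` as a stand-in for MySQL, with a simplified copy of the table above (text and integer columns instead of `varbinary`/`binary`). The helper names `make_index` and `search`, and the ranking by number of matched keywords and then total occurrences, are assumptions for illustration and not something egraphdb provides.

```python
import sqlite3


def make_index(rows):
    """Create an in-memory stand-in for the index table and load rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE egraph_lookup_rindex_base (
            key_data TEXT NOT NULL,
            id TEXT NOT NULL,
            "count" INTEGER NOT NULL,
            PRIMARY KEY (id, key_data)
        )
        """
    )
    conn.execute(
        "CREATE INDEX idx_key_data ON egraph_lookup_rindex_base (key_data)"
    )
    conn.executemany(
        "INSERT INTO egraph_lookup_rindex_base (key_data, id, \"count\") "
        "VALUES (?, ?, ?)",
        rows,
    )
    return conn


def search(conn, keywords, limit=10000):
    """Return (id, matched_keywords, total_occurrences), best matches first."""
    if not keywords:
        return []
    # One bound parameter per keyword, so values are never spliced into the SQL.
    placeholders = ", ".join("?" for _ in keywords)
    sql = f"""
        SELECT id, COUNT(key_data) AS matched, SUM("count") AS total
        FROM egraph_lookup_rindex_base
        WHERE key_data IN ({placeholders})
        GROUP BY id
        ORDER BY matched DESC, total DESC, id
        LIMIT ?
    """
    return conn.execute(sql, [*keywords, limit]).fetchall()
```

The `ORDER BY` is an addition to the suggested query: without it, the `limit` would cut off an arbitrary subset of the matching ids rather than the weakest matches.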
