Full text search? #592


Closed
mindplay-dk opened this issue Feb 15, 2017 · 17 comments

@mindplay-dk

mindplay-dk commented Feb 15, 2017

No plans to support full text search?

There's nothing on the roadmap.

I'm more than a little interested in DGraph as an alternative to PostgreSQL as the back-end of application state, but I recall the days of struggling to integrate something like Solr for full-text search at the application-level, and don't wish to go back there.

Any plans (or community initiative) to integrate a full-text search-engine?

@manishrjain
Contributor

Hey @mindplay-dk,

Does term matching not work for you?
https://wiki.dgraph.io/Query_Language#Term_matching

What sort of queries do you intend to run?

@mindplay-dk
Author

What sort of queries do you intend to run?

Well, full-text search against article body content - presumably that would require a full-text search engine with language support and features such as stemming, stopwords, synonyms, etc. and special index-types.

It's not clear to me how indexing works - simply adding @index doesn't specify an index-type, so there's currently no way to specify index-types to optimize for different access patterns and query-types etc.?

You have just one type of index for all data-types? How does that work?
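To make concrete why plain term matching may fall short for body text, here is a toy Go sketch. The suffix stripping below is a stand-in for a real stemmer such as Porter, not Dgraph's actual behaviour; the function name is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveStem is a toy stand-in for a real stemmer (e.g. Porter): it
// strips a few common English suffixes so inflected forms compare equal.
func naiveStem(w string) string {
	for _, suf := range []string{"ing", "ed", "es", "s"} {
		if strings.HasSuffix(w, suf) && len(w)-len(suf) >= 3 {
			return strings.TrimSuffix(w, suf)
		}
	}
	return w
}

func main() {
	doc, query := "jumped", "jumps"
	fmt.Println(doc == query)                       // plain term equality: false
	fmt.Println(naiveStem(doc) == naiveStem(query)) // stemmed: true
}
```

Without stemming, "jumped" and "jumps" never match on equality; any real engine also needs stop words, synonyms, and per-language rules on top of this.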

@manishrjain
Contributor

@mindplay-dk:

We have different tokenizers for different data types. We choose those tokenizers automatically, which is very basic in the sense that you only get one tokenizer per data type.

We have now changed that to allow multiple tokenizers for the same data type, so you can specify the tokenizer you want. It will be part of the release we're doing today. Currently, the string tokenizer we have breaks a phrase into terms and does equality matches.

We could add more tokenizers to handle the English language better, with stemming, stopwords, etc. But that won't scale to other languages.

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?
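A user-supplied tokenizer could be wired in through something like the following Go sketch. The `Tokenizer` interface, the registry, and the `termTokenizer` here are all hypothetical, made up to illustrate the proposal, not Dgraph's real plug-in API:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenizer is a hypothetical plug-in interface: users register their
// own implementation under a name, and the index schema refers to it.
type Tokenizer interface {
	Name() string
	Tokens(value string) []string
}

var registry = map[string]Tokenizer{}

func Register(t Tokenizer) { registry[t.Name()] = t }

// termTokenizer mimics the built-in behaviour described above:
// split on non-letter runes and lowercase each term.
type termTokenizer struct{}

func (termTokenizer) Name() string { return "term" }
func (termTokenizer) Tokens(v string) []string {
	fields := strings.FieldsFunc(v, func(r rune) bool { return !unicode.IsLetter(r) })
	for i, f := range fields {
		fields[i] = strings.ToLower(f)
	}
	return fields
}

func main() {
	Register(termTokenizer{})
	fmt.Println(registry["term"].Tokens("Full-Text Search")) // [full text search]
}
```

The schema would then name a registered tokenizer per predicate, letting users swap in language-specific analysis without changes to the core.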

@mindplay-dk
Author

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?

Full text search is not only a matter of tokenization; the search itself requires processing of the search terms: stemming, removal of stop words, dictionary replacements, and so on.

If you want to provide real text search, you should consider integrating a real FTS engine; there are, for example, numerous C ports of Lucene. We would need a language annotation that can be applied at the field level.

Attempting to roll your own likely means we'll end up integrating a stand-alone search package (Solr etc.) at the application level anyway, just to get the search quality users expect today, which is really clumsy. FTS is a big and complex domain, and in my opinion not where you should be investing your time; pick an available open-source library and integrate that instead.

@manishrjain
Contributor

Full text search is not only a matter of tokenization; the search itself requires processing of the search terms.

We do that already: the tokenizer is applied to both the data and the search query. Otherwise, good point.

@tzdybal: Can you look into Bleve and see if we could use certain packages from it? We don't want to use its data storage layer, only the layer that analyzes the language, generates the tokens, and finally does the same at the query level. This would also allow us to cut out ICU and go native Go.
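The symmetry mentioned above, applying the same analysis at index time and at query time, can be sketched in plain Go. The `analyze` function below is a toy stand-in for whatever analyzer the index uses, not Bleve's or Dgraph's code:

```go
package main

import (
	"fmt"
	"strings"
)

// analyze stands in for whatever analyzer an index uses. The key point:
// the SAME function runs over stored values at index time and over the
// search terms at query time, so both sides normalize identically.
func analyze(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

func main() {
	docs := map[int]string{1: "The Quick Fox", 2: "lazy dogs"}

	// Index time: analyze each value, map tokens to doc IDs.
	idx := map[string][]int{}
	for id, text := range docs {
		for _, tok := range analyze(text) {
			idx[tok] = append(idx[tok], id)
		}
	}

	// Query time: run the query through the same analyzer.
	for _, tok := range analyze("QUICK") {
		fmt.Println(idx[tok]) // [1]
	}
}
```

If only one side were analyzed, "QUICK" would never find the document containing "Quick"; that is why the analyzer belongs to the index definition rather than to the query.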

@joeblew99

I would also suggest that bleve be looked at.
The FTS engine of Couchbase uses bleve to provide cluster-ready FTS.

Bleve can use numerous data stores too.

There are two ways to approach this. You can run bleve separate from dgraph:
whenever you mutate data in dgraph, pass it to bleve to do whatever index mapping you need. So when FTS is required you call into bleve and it will return the matching record IDs stored in dgraph.

Either way, I have found bleve to be fantastic, and it is well supported by Couchbase too.
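The application-level pattern described above can be sketched with in-memory stand-ins. `graphStore` and `ftsIndex` below are hypothetical maps simulating Dgraph and bleve, not real clients; the point is the data flow (mutate both on write, search the index for IDs, then fetch from the graph):

```go
package main

import (
	"fmt"
	"strings"
)

type graphStore map[string]string // uid → body: stand-in for Dgraph
type ftsIndex map[string][]string // token → uids: stand-in for bleve

func tokens(s string) []string { return strings.Fields(strings.ToLower(s)) }

// put writes the record to the graph, then feeds the same value to the
// FTS index — the "whenever you mutate, pass it to bleve" step.
func put(g graphStore, f ftsIndex, uid, body string) {
	g[uid] = body
	for _, t := range tokens(body) {
		f[t] = append(f[t], uid)
	}
}

// search hits the FTS index first, which returns matching record IDs;
// the records themselves are then fetched from the graph store.
func search(g graphStore, f ftsIndex, term string) []string {
	var out []string
	for _, uid := range f[strings.ToLower(term)] {
		out = append(out, g[uid])
	}
	return out
}

func main() {
	g, f := graphStore{}, ftsIndex{}
	put(g, f, "0x1", "graph databases are fun")
	fmt.Println(search(g, f, "Graph")) // [graph databases are fun]
}
```

The weakness of this shape is exactly what the thread goes on to discuss: the two stores can drift apart, and the query planner on the graph side knows nothing about the text index.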

@mindplay-dk
Author

So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.

This is what I would call a high-level integration.

I'd suggest a more low-level integration: you should be able to plan the overall query better if you can assess in advance the dimensions of the other indices (on other fields) involved in the query.

While high-level and low-level integrations will likely provide the same convenience and client-facing features, a high-level integration will likely have net performance similar to an application-level integration with an external FTS service, whereas a low-level integration might be able to make optimizations that aren't possible at the application level.

@manishrjain
Contributor

Bleve would need to be integrated in a way where we still control how the data gets stored. We use our own mechanism for data storage, and all we need from Bleve is to do the tokenization for us, taking into account Porter stemming, stop words, and so on. So we need the library part of Bleve, not the storage part.

Running anything outside of Dgraph is out of the question.

@joeblew99

@manishrjain @mindplay-dk
I fully agree, and it would be awesome. DGraph needs only the library part.

I would be a happy chappy if the facets part of bleve were included too. It's very powerful:
in terms of GUI, it's an amazingly useful way to search through data when you have a ton of it.

@mindplay-dk
Author

@joeblew99 facets would be killer, but doesn't need to arrive with the first feature release :-)

@joeblew99

joeblew99 commented Feb 21, 2017

@mindplay-dk

It's actually a tiny amount of code:
https://github.com/blevesearch/bleve/tree/master/search/facet

https://github.com/blevesearch/bleve/search?p=1&q=facet

@tzdybal
Contributor

tzdybal commented Feb 22, 2017

I looked through the code of bleve. The separation of concerns is clear, packages are very fine-grained, API seems reusable. It's easy to select only some of the functionalities.

All we need is a tokenizer (probably Unicode) and some token filters. Natural candidates are Lowercase, Stemmer, and Stop Token.

@manishrjain: Replacing the current ICU tokenizer with a bleve-based solution (tokenizer + filters) should be straightforward.

@joeblew99 @mindplay-dk
From the code-level perspective, integration of facets building logic is definitely possible.
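The tokenizer-plus-filters chain described above can be sketched in Go. The `TokenFilter` type and the three filters are toy stand-ins for bleve's Lowercase, Stop Token, and Stemmer filters, written here from scratch for illustration, not taken from bleve:

```go
package main

import (
	"fmt"
	"strings"
)

// TokenFilter mirrors the shape of an analysis chain: a tokenizer
// produces terms, then each filter transforms the token stream.
type TokenFilter func([]string) []string

func lowercase(ts []string) []string {
	out := make([]string, len(ts))
	for i, t := range ts {
		out[i] = strings.ToLower(t)
	}
	return out
}

func stop(ts []string) []string {
	stopwords := map[string]bool{"the": true, "a": true, "and": true}
	var out []string
	for _, t := range ts {
		if !stopwords[t] {
			out = append(out, t)
		}
	}
	return out
}

// stem is a toy suffix stripper standing in for a real stemmer.
func stem(ts []string) []string {
	out := make([]string, len(ts))
	for i, t := range ts {
		out[i] = t
		for _, suf := range []string{"ing", "ed", "s"} {
			if strings.HasSuffix(t, suf) && len(t)-len(suf) >= 3 {
				out[i] = strings.TrimSuffix(t, suf)
				break
			}
		}
	}
	return out
}

// analyze runs the tokenizer (simplified to whitespace splitting here,
// where bleve would use its Unicode tokenizer), then each filter in order.
func analyze(text string, filters ...TokenFilter) []string {
	ts := strings.Fields(text)
	for _, f := range filters {
		ts = f(ts)
	}
	return ts
}

func main() {
	fmt.Println(analyze("The dogs barked", lowercase, stop, stem)) // [dog bark]
}
```

Because each filter is an independent function over the token stream, swapping in per-language stemmers or stop-word lists only changes the chain's composition, which is what makes the fine-grained bleve packages attractive.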

@manishrjain
Contributor

This looks awesome. Let's get on it.

@tzdybal
Contributor

tzdybal commented Mar 17, 2017

Bleve is integrated for full text search. The changes are available in master.
Implemented features:

  • new functions for FTS matching
  • tokenization, UTF-normalization, stemming, stop words
  • support for multiple languages (stemmers and stop words lists)

For more FTS-related features, please feel free to open new GitHub issues.

@tzdybal tzdybal closed this as completed Mar 17, 2017
@mindplay-dk
Author

Exciting news! :-)

Documentation updates pending?

@manishrjain
Contributor

Sorry, didn't notice this question. Here are the docs:
https://docs.dgraph.io/v0.7.5/query-language/#full-text-search

@dahankzter

Fantastic!
