Full text search? #592


Closed
mindplay-dk opened this issue Feb 15, 2017 · 17 comments

@mindplay-dk

mindplay-dk commented Feb 15, 2017

No plans to support full text search?

There's nothing on the roadmap.

I'm more than a little interested in DGraph as an alternative to PostgreSQL as the back-end of application state, but I recall the days of struggling to integrate something like Solr for full-text search at the application-level, and don't wish to go back there.

Any plans (or community initiative) to integrate a full-text search-engine?

@manishrjain
Contributor

Hey @mindplay-dk,

Does term matching not work for you?
https://wiki.dgraph.io/Query_Language#Term_matching

What sort of queries do you intend to run?

@mindplay-dk
Author

What sort of queries do you intend to run?

Well, full-text search against article body content - presumably that would require a full-text search engine with language support and features such as stemming, stopwords, synonyms, etc. and special index-types.

It's not clear to me how indexing works - simply adding @index doesn't specify an index-type, so there's currently no way to specify index-types to optimize for different access patterns and query-types etc.?

You have just one type of index for all data-types? How does that work?
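To make concrete why plain term matching may fall short for body text, here is a toy Go sketch. The suffix stripping below is a stand-in for a real stemmer such as Porter, not Dgraph's actual behaviour; the function name is made up for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// naiveStem is a toy stand-in for a real stemmer (e.g. Porter): it
// strips a few common English suffixes so inflected forms compare equal.
func naiveStem(w string) string {
	for _, suf := range []string{"ing", "ed", "es", "s"} {
		if strings.HasSuffix(w, suf) && len(w)-len(suf) >= 3 {
			return strings.TrimSuffix(w, suf)
		}
	}
	return w
}

func main() {
	doc, query := "jumped", "jumps"
	fmt.Println(doc == query)                       // plain term equality: false
	fmt.Println(naiveStem(doc) == naiveStem(query)) // stemmed: true
}
```

Without stemming, "jumped" and "jumps" never match on equality; any real engine also needs stop words, synonyms, and per-language rules on top of this.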

@manishrjain
Contributor

@mindplay-dk:

We have different tokenizers for different data types. We choose those tokenizers automatically, which is very basic in the sense that you only get one tokenizer per data type.

We have now changed that to allow multiple tokenizers for the same data type, so you can specify the tokenizer you want. It will be part of the release we're doing today. Currently, the string tokenizer we have breaks a phrase into terms and does equality matches.

We could add more tokenizers to handle the English language better, with stemming, stopwords, etc. But that won't scale to other languages.

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?
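A user-supplied tokenizer could be wired in through something like the following Go sketch. The `Tokenizer` interface, the registry, and the `termTokenizer` here are all hypothetical, made up to illustrate the proposal, not Dgraph's real plug-in API:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Tokenizer is a hypothetical plug-in interface: users register their
// own implementation under a name, and the index schema refers to it.
type Tokenizer interface {
	Name() string
	Tokens(value string) []string
}

var registry = map[string]Tokenizer{}

func Register(t Tokenizer) { registry[t.Name()] = t }

// termTokenizer mimics the built-in behaviour described above:
// split on non-letter runes and lowercase each term.
type termTokenizer struct{}

func (termTokenizer) Name() string { return "term" }
func (termTokenizer) Tokens(v string) []string {
	fields := strings.FieldsFunc(v, func(r rune) bool { return !unicode.IsLetter(r) })
	for i, f := range fields {
		fields[i] = strings.ToLower(f)
	}
	return fields
}

func main() {
	Register(termTokenizer{})
	fmt.Println(registry["term"].Tokens("Full-Text Search")) // [full text search]
}
```

The schema would then name a registered tokenizer per predicate, letting users swap in language-specific analysis without changes to the core.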

@mindplay-dk
Author

We could allow a way by which the users can write their own tokenizer, specify that in the index, and we can use that. How does that sound?

Full text search is not only a matter of tokenization; the search itself requires processing of the search terms: stemming, removal of stop words, dictionary replacements, and so on.

If you want to provide real text search, you should consider integrating a real FTS engine; there are, for example, numerous C ports of Lucene. We would need a language annotation that can be applied at the field level.

Attempting to roll your own likely means we'll end up integrating a stand-alone search package (Solr etc.) at the application level anyway, just to get the search quality users expect today, which is really clumsy. FTS is a big and complex domain, and in my opinion not where you should be investing your time; pick an available open-source library and integrate that instead.

@manishrjain
Contributor

Full text search is not only a matter of tokenization; the search itself requires processing of the search terms.

We do that already: the tokenizer is applied to both the data and the search query. Otherwise, good point.

@tzdybal: Can you look into Bleve and see if we could use certain packages from it? We don't want to use its data storage layer, only the layer that analyzes the language, generates the tokens, and finally does the same at the query level. This would also allow us to cut out ICU and go native Go.
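The symmetry mentioned above, applying the same analysis at index time and at query time, can be sketched in plain Go. The `analyze` function below is a toy stand-in for whatever analyzer the index uses, not Bleve's or Dgraph's code:

```go
package main

import (
	"fmt"
	"strings"
)

// analyze stands in for whatever analyzer an index uses. The key point:
// the SAME function runs over stored values at index time and over the
// search terms at query time, so both sides normalize identically.
func analyze(s string) []string {
	return strings.Fields(strings.ToLower(s))
}

func main() {
	docs := map[int]string{1: "The Quick Fox", 2: "lazy dogs"}

	// Index time: analyze each value, map tokens to doc IDs.
	idx := map[string][]int{}
	for id, text := range docs {
		for _, tok := range analyze(text) {
			idx[tok] = append(idx[tok], id)
		}
	}

	// Query time: run the query through the same analyzer.
	for _, tok := range analyze("QUICK") {
		fmt.Println(idx[tok]) // [1]
	}
}
```

If only one side were analyzed, "QUICK" would never find the document containing "Quick"; that is why the analyzer belongs to the index definition rather than to the query.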

@joeblew99

I would also suggest that bleve be looked at.
The FTS engine of Couchbase uses bleve to provide cluster-ready FTS.

Bleve can use numerous data stores too.

There are two ways to approach this. You can run bleve separate from dgraph:
whenever you mutate data in dgraph, pass it to bleve to do whatever index mapping you need. So when FTS is required you call into bleve and it will return the matching record IDs stored in dgraph.

Either way, I have found bleve to be fantastic, and it is well supported by Couchbase too.
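The application-level pattern described above can be sketched with in-memory stand-ins. `graphStore` and `ftsIndex` below are hypothetical maps simulating Dgraph and bleve, not real clients; the point is the data flow (mutate both on write, search the index for IDs, then fetch from the graph):

```go
package main

import (
	"fmt"
	"strings"
)

type graphStore map[string]string // uid → body: stand-in for Dgraph
type ftsIndex map[string][]string // token → uids: stand-in for bleve

func tokens(s string) []string { return strings.Fields(strings.ToLower(s)) }

// put writes the record to the graph, then feeds the same value to the
// FTS index — the "whenever you mutate, pass it to bleve" step.
func put(g graphStore, f ftsIndex, uid, body string) {
	g[uid] = body
	for _, t := range tokens(body) {
		f[t] = append(f[t], uid)
	}
}

// search hits the FTS index first, which returns matching record IDs;
// the records themselves are then fetched from the graph store.
func search(g graphStore, f ftsIndex, term string) []string {
	var out []string
	for _, uid := range f[strings.ToLower(term)] {
		out = append(out, g[uid])
	}
	return out
}

func main() {
	g, f := graphStore{}, ftsIndex{}
	put(g, f, "0x1", "graph databases are fun")
	fmt.Println(search(g, f, "Graph")) // [graph databases are fun]
}
```

The weakness of this shape is exactly what the thread goes on to discuss: the two stores can drift apart, and the query planner on the graph side knows nothing about the text index.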

@mindplay-dk
Author

So when a FTS is required you call into bleve and it will return matching record IDs stored in dgraph.

This is what I would call a high-level integration.

I'd suggest a more low-level integration: you should be able to plan the overall query better if you can assess in advance the dimensions of the other indices (on other fields) involved in the query.

While high-level and low-level integrations will likely provide the same convenience and client-facing features, a high-level integration will likely have net performance similar to an application-level integration with an external FTS service, whereas a low-level integration might be able to make optimizations that aren't possible at the application level.

@manishrjain
Contributor

Bleve would need to be integrated in a way where we still control how the data gets stored. We use our own mechanism for data storage, and all we need from Bleve is to do the tokenization for us, taking into account Porter stemming, stop words, and so on. So we need the library part of Bleve, not the storage part.

Running anything outside of Dgraph is out of the question.

@joeblew99

@manishrjain @mindplay-dk
I fully agree, and it would be awesome. DGraph needs only the library part.

I would be a happy chappy if the facets part of bleve were included too. It's very powerful:
in terms of GUI, it's an amazingly useful way to search through data when you have a ton of it.

@mindplay-dk
Author

@joeblew99 facets would be killer, but doesn't need to arrive with the first feature release :-)

@joeblew99

joeblew99 commented Feb 21, 2017

@mindplay-dk

It's actually a tiny amount of code:
https://github.com/blevesearch/bleve/tree/master/search/facet

https://github.com/blevesearch/bleve/search?p=1&q=facet

@tzdybal
Contributor

tzdybal commented Feb 22, 2017

I looked through the code of bleve. The separation of concerns is clear, packages are very fine-grained, API seems reusable. It's easy to select only some of the functionalities.

All we need is a tokenizer (probably Unicode) and some token filters. Natural candidates are Lowercase, Stemmer, and Stop Token.

@manishrjain: Replacing the current ICU tokenizer with a bleve-based solution (tokenizer + filters) should be straightforward.

@joeblew99 @mindplay-dk
From the code-level perspective, integration of facets building logic is definitely possible.
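The tokenizer-plus-filters chain described above can be sketched in Go. The `TokenFilter` type and the three filters are toy stand-ins for bleve's Lowercase, Stop Token, and Stemmer filters, written here from scratch for illustration, not taken from bleve:

```go
package main

import (
	"fmt"
	"strings"
)

// TokenFilter mirrors the shape of an analysis chain: a tokenizer
// produces terms, then each filter transforms the token stream.
type TokenFilter func([]string) []string

func lowercase(ts []string) []string {
	out := make([]string, len(ts))
	for i, t := range ts {
		out[i] = strings.ToLower(t)
	}
	return out
}

func stop(ts []string) []string {
	stopwords := map[string]bool{"the": true, "a": true, "and": true}
	var out []string
	for _, t := range ts {
		if !stopwords[t] {
			out = append(out, t)
		}
	}
	return out
}

// stem is a toy suffix stripper standing in for a real stemmer.
func stem(ts []string) []string {
	out := make([]string, len(ts))
	for i, t := range ts {
		out[i] = t
		for _, suf := range []string{"ing", "ed", "s"} {
			if strings.HasSuffix(t, suf) && len(t)-len(suf) >= 3 {
				out[i] = strings.TrimSuffix(t, suf)
				break
			}
		}
	}
	return out
}

// analyze runs the tokenizer (simplified to whitespace splitting here,
// where bleve would use its Unicode tokenizer), then each filter in order.
func analyze(text string, filters ...TokenFilter) []string {
	ts := strings.Fields(text)
	for _, f := range filters {
		ts = f(ts)
	}
	return ts
}

func main() {
	fmt.Println(analyze("The dogs barked", lowercase, stop, stem)) // [dog bark]
}
```

Because each filter is an independent function over the token stream, swapping in per-language stemmers or stop-word lists only changes the chain's composition, which is what makes the fine-grained bleve packages attractive.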

@manishrjain
Contributor

This looks awesome. Let's get on it.

@tzdybal
Contributor

tzdybal commented Mar 17, 2017

Bleve is integrated for full text search. The changes are available in master.
Implemented features:

  • new functions for FTS matching
  • tokenization, UTF-normalization, stemming, stop words
  • support for multiple languages (stemmers and stop words lists)

For more FTS-related features, please feel free to open new GitHub issues.

@tzdybal tzdybal closed this as completed Mar 17, 2017
@mindplay-dk
Author

Exciting news! :-)

Documentation updates pending?

@manishrjain
Contributor

Sorry, didn't notice this question. Here are the docs:
https://docs.dgraph.io/v0.7.5/query-language/#full-text-search

@dahankzter

Fantastic!
