Full text search? #592
Hey @mindplay-dk, does term matching not work for you? What sort of queries do you intend to run?
Well, full-text search against article body content - presumably that would require a full-text search engine with language support and features such as stemming, stop words, synonyms, etc., and special index types. It's not clear to me how indexing works. You have just one type of index for all data types? How does that work?
We have different tokenizers for different data types, and we choose those tokenizers automatically. This is very basic, in the sense that you only get one tokenizer per data type. We have now changed that to allow multiple tokenizers for the same data type, so you can specify the tokenizer you want. It will be part of the release we're doing today. Currently, the string tokenizer we have breaks a phrase into terms and does equality matches. We could add more tokenizers to handle the English language better with stemming, stop words, etc., but that won't scale to other languages. We could allow users to write their own tokenizer, specify it in the index, and we'd use that. How does that sound?
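As a rough illustration of what "specify the tokenizer you want" could look like in a Dgraph schema, here is a hypothetical example (the predicate names are made up, and the exact directive syntax depends on the Dgraph release - check the docs for your version):

```
# Hypothetical schema fragment: different index tokenizers per predicate.
name: string @index(term) .       # term-based equality/term matching
body: string @index(fulltext) .   # full-text tokenizer (stemming, stop words)
```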
Full-text search is not only a matter of tokenization; the search itself requires processing of the search terms: stemming, removal of stop words, dictionary replacements, and so on. If you want to provide real text search, you should consider integrating a real FTS engine - there are, for example, numerous C ports of Lucene. We would need a language annotation that can be applied at the field level. Attempting to roll your own likely only means we'll need to integrate a stand-alone search package (Solr, etc.) at the application level to get the search quality users expect today, which is really clumsy. FTS is a really big and complex domain and, in my opinion, not where you should be investing your time; pick an available open-source library and integrate that instead - it's a much better use of your time.
We do that: the tokenizer is applied to both the data and the search query. Otherwise, good point. @tzdybal: Can you look into Bleve and see if we could use certain packages from it? We don't want to use their data storage layer, only the layer which analyzes the language, generates the tokens, and finally does the same at the query level. This would also allow us to cut off ICU and go native Go.
I would also suggest that Bleve be looked at. Bleve can use numerous data stores too. There are two ways to approach this, too: you can run Bleve separate from Dgraph. Either way, I have found Bleve to be fantastic and highly supported by Couchbase too.
This is what I would call a high-level integration. I'd suggest a more low-level integration: you should be able to plan the overall query better if you can assess in advance the dimensions of the other indices (of other fields) involved in the query, and so on. While high-level and low-level integrations will likely provide the same convenience and client-facing features, a high-level integration will likely have net performance similar to an application-level integration with an external FTS service, whereas a low-level integration might be able to make some optimizations we can't make at the application level.
Bleve would need to be integrated in a way where we still control how the data gets stored. We use our own mechanism for data storage, and all we need from Bleve is to do the tokenization for us, taking into account Porter stemming, stop words, and so on. So, we need the library part of Bleve, not the storage part. Running anything outside of Dgraph is out of the question.
@manishrjain @mindplay-dk I would be a happy chappy if the facets part of Bleve is included too. It's very powerful.
@joeblew99 facets would be killer, but they don't need to arrive with the first feature release :-)
I looked through the code of Bleve. The separation of concerns is clear, the packages are very fine-grained, and the API seems reusable. It's easy to select only some of the functionality. All we need is a tokenizer (probably Unicode) and some token filters. Natural candidates are: Lowercase, Stemmer, and Stop Token. @manishrjain: Replacing the current ICU tokenizer with a Bleve-based solution (tokenizer + filters) should be straightforward. @joeblew99 @mindplay-dk
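The analysis chain discussed above (tokenize, lowercase, drop stop words, stem) can be sketched in plain Go. This is a toy illustration of the pipeline's shape, not Bleve's actual API; the stop-word list and the suffix-stripping "stemmer" are deliberately minimal stand-ins for the real Unicode tokenizer and Porter stemmer:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopWords is a tiny English stop-word list for illustration;
// a real analyzer (e.g. Bleve's) ships a much larger one.
var stopWords = map[string]bool{
	"the": true, "a": true, "is": true, "and": true, "of": true,
}

// tokenize splits text on any non-letter/non-digit rune,
// roughly what a Unicode word tokenizer does.
func tokenize(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
}

// stem is a crude stand-in for a Porter stemmer: it strips a few
// common English suffixes. Real stemming is far more involved.
func stem(token string) string {
	for _, suf := range []string{"ing", "ed", "s"} {
		if strings.HasSuffix(token, suf) && len(token) > len(suf)+2 {
			return strings.TrimSuffix(token, suf)
		}
	}
	return token
}

// analyze runs the pipeline: tokenize -> lowercase -> stop filter -> stem.
// Crucially, the same function must be applied to both stored data and
// queries, so a document containing "jumped" matches a search for "jumps".
func analyze(text string) []string {
	var out []string
	for _, tok := range tokenize(text) {
		tok = strings.ToLower(tok)
		if stopWords[tok] {
			continue
		}
		out = append(out, stem(tok))
	}
	return out
}

func main() {
	fmt.Println(analyze("The dog jumped")) // [dog jump]
	fmt.Println(analyze("jumps"))          // [jump]
}
```

The point of the sketch is the symmetry: index-time and query-time text must pass through the identical analyzer, which is exactly why reusing Bleve's analysis packages on both sides is attractive.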
This looks awesome. Let's get on it.
Bleve integrated for full text search. Changes available in master.
For more FTS-related features, please feel free to open new GitHub issues.
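For anyone landing here later, a full-text query against the new index might look roughly like the following (the predicate name `body` and the search terms are illustrative; see the docs for the exact function names in your Dgraph version):

```
{
  articles(func: alloftext(body, "graph databases")) {
    uid
    body
  }
}
```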
Exciting news! :-) Documentation updates pending?
Sorry, didn't notice this question. Here are the docs:
Fantastic!
No plans to support full text search?
There's nothing on the roadmap.
I'm more than a little interested in Dgraph as an alternative to PostgreSQL as the back-end of application state, but I recall the days of struggling to integrate something like Solr for full-text search at the application level, and don't wish to go back there.
Any plans (or community initiative) to integrate a full-text search-engine?