
Support searching for an exact phrase #2850

Closed
pv opened this issue Jan 26, 2018 · 19 comments
Labels
feature request good first issue This issue is ideal for first-time contributors! search Opensearch, search filters, and so on usability

Comments

@pv
Contributor

pv commented Jan 26, 2018

The search box does not appear to have a way to search for a specific exact phrase.

Based on how other search engines work, putting a phrase in quotes should require an exact phrase match.

However, currently searches with and without quotes produce the same result:

https://pypi.org/search/?q=%22Image+processing+routines+for+SciPy%22
and
https://pypi.org/search/?q=Image+processing+routines+for+SciPy

I would have expected the first one with quotes to only produce the result containing the exact phrase.
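
To make the expected behaviour concrete, here is a minimal sketch of how a raw search string could be split into exact phrases (the quoted parts) and loose terms (everything else). The helper name `split_query` and its return shape are assumptions for illustration; this is not Warehouse's actual code.

```python
import re

# Hypothetical helper, not part of Warehouse: split a raw search string into
# exact phrases (text inside double quotes) and loose terms (everything else).
_QUOTED = re.compile(r'"([^"]+)"')


def split_query(raw):
    """Return (phrases, terms) for a raw search string."""
    phrases = [p.strip() for p in _QUOTED.findall(raw) if p.strip()]
    # Remove the quoted parts, then drop any stray unbalanced quote characters.
    remainder = _QUOTED.sub(" ", raw).replace('"', " ")
    terms = remainder.split()
    return phrases, terms


phrases, terms = split_query('"Image processing routines for SciPy" numpy')
print(phrases)  # ['Image processing routines for SciPy']
print(terms)    # ['numpy']
```

An unbalanced quote is treated as ordinary text here, so a half-typed query degrades to a normal search rather than an error. That is one possible design choice, not a requirement stated in this issue.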


Good First Issue: This issue is good for first time contributors. If there is not a corresponding pull request for this issue, it is up for grabs. For directions for getting set up, see our Getting Started Guide. If you are working on this issue and have questions, please feel free to ask them here, #pypa-dev on Freenode, or the pypa-dev mailing list.

@di di added feature request search Opensearch, search filters, and so on labels Jan 26, 2018
@brainwane brainwane added this to the 3: Publicize beta milestone Feb 12, 2018
@brainwane brainwane added good first issue This issue is ideal for first-time contributors! usability bug 🐛 and removed bug 🐛 labels Feb 12, 2018
@brainwane
Contributor

Thanks for your report, @pv, and sorry for the slow response!

The folks working on Warehouse have gotten funding to concentrate on improving and deploying Warehouse, and have kicked off work towards our development roadmap -- the most urgent task is to improve Warehouse to the point where we can redirect pypi.python.org to pypi.org so the site is more sustainable and reliable.

We discussed this issue in our meeting today to prioritize it. Since search in Warehouse is already much better than search on legacy PyPI, but users will probably expect search to work as you suggest, I've moved this issue to a future milestone that we'll work on in the next few months.

Thanks, @pv, and sorry again for the wait.

Note to people thinking about contributing to Warehouse: this would be a great first issue for a new contributor to tackle if the new contributor were already familiar with Elasticsearch.

@waseem18
Contributor

I'm looking into this issue.

This doesn't look like a simple match to phrase_match change.

@di
Member

di commented Feb 26, 2018

@waseem18 Yeah, I'm not totally sure that this issue is actually a "good first issue" -- elasticsearch tuning has generally been pretty tricky in my experience. But since you're not a new contributor anymore, should be a good issue for you. 🙂

@waseem18
Contributor

waseem18 commented Mar 1, 2018

Update:

After changing match to match_phrase at this line and searching Warehouse with a string containing spaces (example: cli github), we get a TransportError:

elasticsearch.exceptions.TransportError: TransportError(500, 'search_phase_execution_exception', 'field "normalized_name" was indexed without position data; cannot run PhraseQuery (phrase=normalized_name:"cli github")')

Since phrase queries require index_options: positions, I changed
normalized_name = Text(analyzer=NameAnalyzer, index_options="docs") to normalized_name = Text(analyzer=NameAnalyzer, index_options="positions") here and then reindexed.

At this point the search functionality works but I found the results are not efficient.
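
The error above comes down to what the index stores for each term. The following toy inverted index is a sketch of the idea only; it is not how Elasticsearch is implemented, and `build_index`, `phrase_search`, and the sample documents are made up for illustration. With document ids alone, a phrase query cannot check that the words are adjacent; with positions it can.

```python
from collections import defaultdict

# Made-up sample documents for illustration.
DOCS = {1: "github cli", 2: "cli github helper", 3: "cli for github"}


def build_index(docs, with_positions):
    """Map term -> {doc_id: [positions]}, or {doc_id: None} without positions."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            if with_positions:
                index[term].setdefault(doc_id, []).append(pos)
            else:
                index[term][doc_id] = None
    return index


def phrase_search(index, phrase):
    """Return ids of documents that contain the terms of `phrase` in sequence."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    # Candidate documents contain every term, in any order.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = []
    for doc_id in sorted(candidates):
        if any(index[t][doc_id] is None for t in terms):
            raise ValueError("field was indexed without position data")
        # The phrase matches if, for some start, term i sits at start + i.
        starts = index[terms[0]][doc_id]
        if any(all(s + i in index[t][doc_id] for i, t in enumerate(terms))
               for s in starts):
            hits.append(doc_id)
    return hits


print(phrase_search(build_index(DOCS, with_positions=True), "cli github"))  # [2]
```

All three sample documents contain both words, but only document 2 has them next to each other in that order. Calling `phrase_search` on an index built with `with_positions=False` raises the `ValueError`, which mirrors the situation reported in the TransportError.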

@honzakral
Contributor

@waseem18 what do you mean "not efficient"? I would be happy to help fine-tune the queries

@brainwane
Contributor

@waseem18 Maybe you have done some profiling and you have some specific performance numbers to share?

@waseem18
Contributor

waseem18 commented Mar 1, 2018

@honzakral By not efficient I mean the search results are way better when index_options=docs.

For example, when we search for image processing routines, index_options=docs gives good results while index_options=positions doesn't return any results.

The same happens with other queries containing spaces.

@brainwane
Contributor

@waseem18 it's going to be helpful if you give concrete numbers or results, I think. Saying "good results" might mean that there are a lot of them, or they're relevant, or they're well ordered... feel free to cut and paste or take screenshots to show what you mean. :)

@waseem18
Contributor

waseem18 commented Mar 1, 2018

Sure @brainwane

@honzakral
Contributor

In this particular case, which documents match should not depend on index_options as long as the query stays the same. Only the ordering (_score) of hits would differ, because docs doesn't take term frequency into account (how many times the word occurs in the document). positions is the default and should be used here.
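
The point about matching versus ordering can be illustrated with a toy scorer. This is a deliberately simplified sketch, not Elasticsearch's actual relevance formula: the function `match`, the flag `use_frequencies`, and the sample documents are assumptions made for the example.

```python
# Made-up sample documents for illustration.
DOCS = {
    "a": "image processing",
    "b": "image image image viewer",
    "c": "audio tools",
}


def match(docs, query, use_frequencies):
    """Return (doc_id, score) pairs for documents containing any query term."""
    terms = query.lower().split()
    scored = []
    for doc_id, text in docs.items():
        words = text.lower().split()
        score = 0
        for t in terms:
            count = words.count(t)
            if count:
                # Without frequencies, a term counts once however often it occurs.
                score += count if use_frequencies else 1
        if score:
            scored.append((doc_id, score))
    # Highest score first; ties broken by document id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


print(match(DOCS, "image processing", use_frequencies=True))   # [('b', 3), ('a', 2)]
print(match(DOCS, "image processing", use_frequencies=False))  # [('a', 2), ('b', 1)]
```

Both calls return the same two documents; only their order changes.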

@waseem18
Contributor

waseem18 commented Mar 1, 2018

Thanks for the information @honzakral I'll get back with some examples and screenshots.

@waseem18
Contributor

waseem18 commented Mar 1, 2018

@honzakral Thanks for the help.

  • Changed index_options to positions for the Package index and made the corresponding match_phrase change.
  • Changed the SEARCH_BOOST value of description from 5 to 10, which I believe improved phrase queries.
  • Attached some screenshots which show how relevant the results are without and with double quotes respectively.

[Screenshots 1, 2, and 3: search results without and with double quotes]

@pv
Contributor Author

pv commented Mar 1, 2018 via email

@waseem18
Contributor

waseem18 commented Mar 1, 2018

Good use case, @pv.

I'll keep that in mind.

@honzakral
Contributor

honzakral commented Mar 1, 2018

That looks good @waseem18! What I would have expected! Note that for doing bool operators you can just use the Q object from elasticsearch-dsl: should.append(Q("match", ...) & Q("match_phrase", ...) & Q(...))
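
As a rough illustration of combining clauses with boolean operators, the sketch below builds the query body as plain dictionaries in Elasticsearch's JSON query DSL rather than with elasticsearch-dsl's Q objects. The helper names `any_field` and `build_query` are hypothetical, and the field list is an assumption based on the normalized_name and description fields mentioned in this thread; it is not the change that was merged.

```python
def any_field(query_type, text, fields):
    """One clause that matches if `text` matches in at least one field."""
    return {
        "bool": {
            "should": [{query_type: {field: text}} for field in fields],
            "minimum_should_match": 1,
        }
    }


def build_query(terms, phrases, fields=("normalized_name", "description")):
    """AND together one match_phrase clause per quoted phrase and one
    match clause for the remaining loose terms."""
    must = [any_field("match_phrase", phrase, fields) for phrase in phrases]
    if terms:
        must.append(any_field("match", " ".join(terms), fields))
    if not must:
        # Nothing to search for: fall back to matching every document.
        return {"match_all": {}}
    return {"bool": {"must": must}}


query = build_query(terms=["numpy"], phrases=["image processing"])
print(query["bool"]["must"][0]["bool"]["should"][0])
# {'match_phrase': {'normalized_name': 'image processing'}}
```

Putting every clause under `must` means a document has to satisfy each quoted phrase as well as the loose terms, which is the behaviour requested at the top of this issue.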

@waseem18
Contributor

waseem18 commented Mar 2, 2018

Thanks for the suggestion @honzakral I'll surely follow that.

@brainwane brainwane modified the milestones: 3: Publicize beta, 6. Post Legacy Shutdown Mar 6, 2018
@brainwane
Contributor

In today's Warehouse developers' meeting we decided to pare down our near-future milestones on our development roadmap so they really only contain the essential bugfixes and features we need to launch, replace legacy PyPI, and shut down the old site. So I'm moving this issue into a milestone further in the future; sorry for the wait.

@di di closed this as completed in #3111 May 16, 2018
@bslade

bslade commented Feb 24, 2024

Searching for "JSON5" (with the quotes) brings up 1200 results of mostly garbage:

Here are some example results (1st page only):

  • anyconfig-json5-backend 0.2.1 (Jul 9, 2023)
  • json-five 1.1.1 (Aug 4, 2023)
  • json2json 0.1.0 (Mar 6, 2020)
  • JSON4JSON 0.4.3 (Mar 21, 2021)
  • pySpack 0.2.1 (Aug 24, 2020)
  • anyser 0.2.0 (Mar 20, 2021)
  • sick-json 0.0.4 (May 8, 2023)
  • bit-field 1.0.1 (May 9, 2023)
  • Yucebio-Config 0.1.0 (Sep 28, 2021)
  • json5 0.9.17 (Feb 19, 2024)
  • json3 1.0 (May 4, 2022)
  • json2 0.8.0 (Oct 10, 2023)
  • json5kit 0.4.0 (May 16, 2023)
  • easy-json2json 0.0.2 (Jun 4, 2020)
  • environmentinator 0.1.5 (Oct 3, 2023)
  • btmhdw 2.2.2 (Mar 25, 2020)
  • jinja2-cli-tddschn 0.9.2 (Nov 28, 2023)
  • mjson5 0.9.13.2 (Mar 23, 2023)

So it doesn't look "completed" to me.

@miketheman
Member

@bslade See #10718 instead

7 participants