


chnaveen138/Quora-question-pair-similarity-prediction


Quora-question-pair-similarity-prediction

Quora is one of the most popular question-and-answer platforms on the internet, and millions of questions are asked on it every year. One problem is that many people ask similar questions. Multiple questions with the same intent can cause seekers to spend more time finding the best answer to their question and make writers feel they need to answer multiple versions of the same question. If we can find similar questions, we can link them so that the answers to one also serve the others. Doing this by hand would require a great deal of manpower, since more than a million questions are asked every year, so I used machine learning techniques to try to automate the task.

The IPython notebook goes through the following stages: data acquisition, data preparation (cleaning and preprocessing), exploratory data analysis, and modeling. Basic features such as word similarity and word frequency are used to check how they affect the prediction. Advanced fuzzy-matching features (fuzz features) such as fuzz_ratio and token_sort_ratio added much more value in predicting similarity. Because the prediction is expressed as the probability that two questions are similar, log-loss is used as the metric. TF-IDF (term frequency-inverse document frequency) vectorization is used to represent the text as numerical vectors. The data is visualized using t-SNE (t-distributed stochastic neighbor embedding). Logistic regression, linear support vector machines, and gradient boosting are hyperparameter-tuned to find the best model.
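As a rough illustration of the feature and metric ideas described above, here is a minimal sketch that uses only the Python standard library. The function names (`tokenize`, `word_share`, `token_sort_similarity`, `log_loss`) are hypothetical and are not taken from the notebook. `word_share` is assumed here to be a Jaccard overlap, and `difflib.SequenceMatcher` stands in for the fuzzy-matching library, so its scores will not match the notebook's fuzz_ratio or token_sort_ratio values exactly.

```python
import math
from difflib import SequenceMatcher


def tokenize(question):
    """Lowercase a question and split it on whitespace (assumed preprocessing)."""
    return question.lower().split()


def word_share(q1, q2):
    """Jaccard overlap of the distinct words in two questions, in [0, 1].

    Assumption: the notebook's 'word similarity' feature may be defined
    differently; this is one common choice.
    """
    w1, w2 = set(tokenize(q1)), set(tokenize(q2))
    union = w1 | w2
    if not union:
        return 0.0
    return len(w1 & w2) / len(union)


def token_sort_similarity(q1, q2):
    """Character-level similarity in [0, 1] after sorting the tokens.

    Sorting makes the score insensitive to word order, which is the idea
    behind a token-sort ratio. difflib is a stand-in, not the real library.
    """
    s1 = " ".join(sorted(tokenize(q1)))
    s2 = " ".join(sorted(tokenize(q2)))
    return SequenceMatcher(None, s1, s2).ratio()


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary log-loss for labels in {0, 1} and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


if __name__ == "__main__":
    a = "How can I learn Python quickly"
    b = "How can I quickly learn Python"
    print("word share:", word_share(a, b))
    print("token sort similarity:", token_sort_similarity(a, b))
    # A model that always predicts 0.5 scores ln(2) regardless of the labels.
    print("log-loss at p=0.5:", log_loss([1, 0, 1], [0.5, 0.5, 0.5]))
```

Features like these would be computed for each question pair and fed to a classifier alongside the TF-IDF vectors. Log-loss rewards well-calibrated probabilities rather than just correct labels, which suits a task where the output is the probability of similarity.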
