To create an environment from the `environment.yml` file, navigate to the folder containing it and run `conda env create -f environment.yml` in your terminal. To activate the environment, run `conda activate env-name`, where `env-name` is the environment's name. The environment defined in the yml file is named `IR`.
Dependencies
sys
os
pickle
csv
random
re
urllib
validators
requests
http.client
BeautifulSoup
numpy
matplotlib
nltk
rake
- `class Graph` constructs the web graph object that is used for calculating PageRank.
- `class Doc` stores information about a document for calculating tf-idf (term frequency-inverse document frequency):
  - `doc_id`: ID of the document
  - `term_freq`: frequency of each term in the document
  - `most_common`: maximum term frequency in the document
- `class BIR` creates a 2D `numpy` array with a (term, `Doc`) pair for every term in the document list. It also creates and stores the `tf-idf` scores for the given documents.
- `class Conversion` converts an advanced search query into the appropriate format and builds an expression tree to process the query.
- `class PageRank` estimates the importance/reliability of a given web page based on the PageRank algorithm.
- `extract_keywords(query, stem=True, return_score=False)` extracts keywords from a given query.
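To make the role of the `Doc` fields concrete, here is a minimal sketch of a tf-idf computation. The field names `doc_id`, `term_freq`, and `most_common` come from the list above; the constructor, the `tf_idf` helper, and the exact formula (term frequency normalised by the most common term, times a log-scaled inverse document frequency) are assumptions for illustration and may differ from what the project actually does.

```python
import math
from collections import Counter


class Doc:
    """Per-document term statistics (constructor signature is hypothetical)."""

    def __init__(self, doc_id, tokens):
        self.doc_id = doc_id
        self.term_freq = Counter(tokens)  # term -> count in this document
        self.most_common = max(self.term_freq.values(), default=0)


def tf_idf(term, doc, docs):
    """Max-normalised tf times log-scaled idf (an assumed variant)."""
    if doc.most_common == 0 or term not in doc.term_freq:
        return 0.0
    tf = doc.term_freq[term] / doc.most_common
    # Number of documents that contain the term at least once.
    doc_count = sum(1 for d in docs if term in d.term_freq)
    return tf * math.log(len(docs) / doc_count)


docs = [
    Doc(0, "web graph page rank web".split()),
    Doc(1, "page content text".split()),
]
score_web = tf_idf("web", docs[0], docs)    # "web" appears only in document 0
score_page = tf_idf("page", docs[0], docs)  # "page" appears in every document
```

In this variant a term that occurs in every document gets a score of zero, which is why `score_page` carries no weight while `score_web` does.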
- This program scrapes web pages, starting from a given seed URL and following the hyperlinks found in each page's content, and builds the web graph while scraping.
Output:
- `/Contents`
  - Contents of each crawled web page in `.txt` format
- `/Data`
  - `need_to_crawl.csv`: web pages added to the graph but not yet crawled
  - `id_url.csv` and `url_id.csv`: dictionaries of the web pages in the graph
  - `c_id_url.csv` and `c_url_id.csv`: dictionaries of the web pages that have been crawled
  - `graph.p`: the web graph, stored as a `Graph` object
  - `bir.p`: the binary inverted table used to find documents relevant to a query
  - `tf-idf.p`: a 2D array storing the `tf-idf` scores for all `(term, document)` pairs
  - `Pagerank_score.npy`: a numpy array of PageRank scores calculated from the web graph
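The crawl-and-build-graph idea can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `crawl`, `LinkParser`, the `fetch` callable, and the `max_pages` limit are hypothetical names, and the standard-library `html.parser` is used here in place of BeautifulSoup so the example runs without extra packages or network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl; returns (graph, need_to_crawl).

    graph maps each crawled URL to the list of URLs it links to.
    need_to_crawl holds URLs that were discovered but not yet crawled.
    fetch is a hypothetical callable: URL -> HTML string, or None on failure.
    """
    graph = {}
    seen = {seed_url}
    queue = deque([seed_url])
    while queue and len(graph) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # pages that cannot be fetched are skipped
        parser = LinkParser()
        parser.feed(html)
        out_links = []
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in out_links:
                out_links.append(target)
            if target not in seen:
                seen.add(target)
                queue.append(target)
        graph[url] = out_links
    return graph, list(queue)


# Tiny in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.test/": '<a href="/b">b</a> <a href="http://c.test/">c</a>',
    "http://a.test/b": '<a href="/">home</a>',
    "http://c.test/": "no links here",
}
graph, pending = crawl("http://a.test/", PAGES.get)
```

Stopping early (for example with `max_pages=1`) leaves the undiscovered frontier in the second return value, which plays the same role as `need_to_crawl.csv`.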
- This program loads the web graph stored in `graph.p`, calculates the PageRank scores, and saves the tf-idf table built from the contents of the crawled web pages. It also writes the binary inverted index table from the web page contents stored in the `Contents` folder. To build these yourself, run `python3 setup.py`. Otherwise, all the necessary setup files are provided in the `Data` zip file; simply unzip it before running a search.
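For readers unfamiliar with how PageRank scores come out of a web graph, here is a minimal power-iteration sketch. It assumes the graph is a plain dictionary mapping each page to the pages it links to; the `pagerank` function, the damping factor of 0.85, and the handling of pages without out-links are assumptions for illustration, not the project's `PageRank` class.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration over graph: {node: [nodes it links to]}."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    if not nodes:
        return {}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not graph.get(node))
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        for node in nodes:
            targets = graph.get(node, [])
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


scores = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

In this three-page example, `c` ends up with the highest score because both `a` and `b` link to it, and `b` the lowest because it only receives half of `a`'s rank.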
This program starts a new session for a client and returns a list of web pages relevant to the user's query, based on the tf-idf inverted index and the web graph.
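How the two signals are combined is not spelled out here, so the following is only a sketch under an assumption: each candidate document's tf-idf scores for the query terms are summed and then multiplied by its PageRank score. The `search` function, its parameters, and the data layout are hypothetical.

```python
def search(query_terms, tfidf, pagerank, top_k=5):
    """Rank documents for a query.

    tfidf: {term: {doc_id: score}}, pagerank: {doc_id: score}.
    The combination rule (tf-idf sum times PageRank) is an assumption.
    """
    scores = {}
    for term in query_terms:
        for doc_id, weight in tfidf.get(term, {}).items():
            scores[doc_id] = scores.get(doc_id, 0.0) + weight
    ranked = sorted(
        scores,
        key=lambda doc_id: scores[doc_id] * pagerank.get(doc_id, 0.0),
        reverse=True,
    )
    return ranked[:top_k]


tfidf = {"web": {0: 0.5, 1: 0.25}, "graph": {1: 0.5, 2: 0.25}}
pagerank_scores = {0: 0.5, 1: 0.25, 2: 0.25}
results = search(["web", "graph"], tfidf, pagerank_scores)
```

Only documents that contain at least one query term are candidates, so a query with no known terms returns an empty list.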
Type `python main.py` into the terminal to run the search machine.
The program lets the user choose from 3 options (s, a, q):
- s: search
  - For search, simply input any query as you would for Google :)
- a: advanced search
  - Use the `AND`, `NOT`, and `OR` keywords to search more accurately
- q: quit the search machine

If the user selects search or advanced search, the program prompts for a query.
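An advanced query with `AND`, `NOT`, and `OR` can be processed by turning it into an expression tree and evaluating that tree against an inverted index. The sketch below illustrates the idea; the `parse` and `evaluate` functions, the tuple-based tree, support for parentheses, and the operator precedence (NOT, then AND, then OR) are all assumptions rather than the behaviour of the project's `Conversion` class.

```python
def parse(tokens):
    """Parse tokens into a nested-tuple expression tree.

    Grammar (precedence NOT > AND > OR is an assumption):
        expr   := term ("OR" term)*
        term   := factor ("AND" factor)*
        factor := "NOT" factor | "(" expr ")" | WORD
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expr():
        node = term()
        while peek() == "OR":
            advance()
            node = ("OR", node, term())
        return node

    def term():
        node = factor()
        while peek() == "AND":
            advance()
            node = ("AND", node, factor())
        return node

    def factor():
        token = advance()
        if token is None or token in ("AND", "OR", ")"):
            raise ValueError(f"unexpected token: {token!r}")
        if token == "NOT":
            return ("NOT", factor())
        if token == "(":
            node = expr()
            if advance() != ")":
                raise ValueError("missing closing parenthesis")
            return node
        return ("WORD", token.lower())

    tree = expr()
    if peek() is not None:
        raise ValueError(f"unexpected token: {peek()!r}")
    return tree


def evaluate(node, index, all_docs):
    """Evaluate a tree against index: {word: set of document ids}."""
    op = node[0]
    if op == "WORD":
        return index.get(node[1], set())
    if op == "NOT":
        return all_docs - evaluate(node[1], index, all_docs)
    left = evaluate(node[1], index, all_docs)
    right = evaluate(node[2], index, all_docs)
    return left & right if op == "AND" else left | right


index = {"python": {1, 2}, "java": {2, 3}, "web": {3, 4}}
all_docs = {1, 2, 3, 4}
tree = parse("python AND NOT java OR web".split())
result = evaluate(tree, index, all_docs)
```

Representing the query as a tree keeps evaluation simple: each operator maps to one set operation on the document ids returned by its children.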