UCL IRDM 2017 Group Project - Option 1
We use two open-source IR packages, Apache Nutch (1.12) and Apache Solr. The steps below cover crawling, link scoring, and indexing:
- Build Nutch with `ant`.
- Create a `urls` directory under `apache-nutch-1.12/runtime/local`.
- Create a `seed.txt` file under `urls` and put `http://www.cs.ucl.ac.uk/` into the file.
- Create a new crawldb by executing `bin/nutch inject crawl/crawldb urls` from the `apache-nutch-1.12/runtime/local` folder.
- Start crawling with our `fetch.sh` script, which is in the `nutch_shell` folder. Run it as `./fetch.sh <Iterations>`.
- Deduplicate the crawldb with `bin/nutch dedup crawl/crawldb`.
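
The seed setup and crawl loop can be sketched in shell as follows. This is only an illustration under stated assumptions: `NUTCH`, `make_seed` and `crawl_rounds` are hypothetical names, and `crawl_rounds` is a generic generate/fetch/parse/updatedb loop, not the actual contents of `fetch.sh`. It assumes the working directory is `apache-nutch-1.12/runtime/local`.

```shell
#!/bin/sh
# Sketch only. NUTCH is a hypothetical variable for overriding the binary path.
NUTCH="${NUTCH:-bin/nutch}"

# Seed directory and seed file, as described in the steps above.
make_seed() {
    mkdir -p urls
    printf '%s\n' 'http://www.cs.ucl.ac.uk/' > urls/seed.txt
}

# Hypothetical stand-in for fetch.sh (its real contents are not shown here):
# one generate/fetch/parse/updatedb round per iteration.
crawl_rounds() {
    iterations="$1"
    i=0
    while [ "$i" -lt "$iterations" ]; do
        "$NUTCH" generate crawl/crawldb crawl/segments || return 1
        # Segment directories are named by timestamp, so the newest sorts last.
        segment=$(ls -d crawl/segments/* | sort | tail -n 1)
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" parse "$segment" || return 1
        "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
        i=$((i + 1))
    done
}
```

With these functions loaded, a run might look like `make_seed && "$NUTCH" inject crawl/crawldb urls && crawl_rounds 3`, followed by the dedup step.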
- Generate the webgraph with `bin/nutch webgraph -webgraphdb crawl/webgraphdb -segment crawl/segments/*`.
- Execute PageRank with `bin/nutch org.apache.nutch.scoring.webgraph.PageRank -webgraphdb crawl/webgraphdb`.
- Update the scores in the crawldb with `bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb`.
- Put `scoring-link` into the `<value>` tag of the property with `<name>plugin.includes</name>` in `apache-nutch-1.12/runtime/local/conf/nutch-site.xml`. Alternatively, put it in `apache-nutch-1.12/conf/nutch-site.xml` and rebuild with `ant`.
- Reindex Solr.
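
The three link-scoring commands depend on each other, so it can help to chain them and stop at the first failure. The sketch below does that, and adds a crude check that a given `nutch-site.xml` mentions `scoring-link`. The names `NUTCH`, `update_link_scores` and `has_scoring_link` are hypothetical, and the check is a plain text search rather than real XML parsing.

```shell
#!/bin/sh
# Sketch only. NUTCH is a hypothetical variable for overriding the binary path.
NUTCH="${NUTCH:-bin/nutch}"

# Runs webgraph, PageRank and scoreupdater in order; '&&' stops on the first failure.
update_link_scores() {
    "$NUTCH" webgraph -webgraphdb crawl/webgraphdb -segment crawl/segments/* &&
    "$NUTCH" org.apache.nutch.scoring.webgraph.PageRank -webgraphdb crawl/webgraphdb &&
    "$NUTCH" scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
}

# Succeeds if the given config file contains the string "scoring-link".
# This does not verify that it sits inside the plugin.includes property.
has_scoring_link() {
    grep -q 'scoring-link' "$1"
}
```

For example, `has_scoring_link conf/nutch-site.xml && update_link_scores` would run the scoring steps only when the plugin name appears in the local configuration.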
- Start the Solr server.
- Create a new core `ucl` with `bin/solr create -c ucl`.
- Modify the schema of `ucl`, either by editing `managed-schema.xml` and restarting the server, or through the Solr API. Change the type of `content` to `text_general`.
- Index with Nutch using `bin/nutch solrindex http://localhost:8983/solr/ucl crawl/crawldb crawl/segments/* -normalize -deleteGone`.
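
For the API route to the schema change, a request to Solr's Schema API might look like the sketch below. It assumes the `content` field already exists in the `ucl` core, so `replace-field` applies. The `stored` and `indexed` values are assumptions and should match what the field needs. `SOLR_URL`, `schema_payload` and `set_field_type` are hypothetical names.

```shell
#!/bin/sh
# Sketch only. SOLR_URL is a hypothetical variable pointing at the ucl core.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/ucl}"

# Builds the JSON body for a Schema API replace-field command.
# stored/indexed are assumed values; replace-field needs the full field definition.
schema_payload() {
    printf '{"replace-field":{"name":"%s","type":"%s","stored":true,"indexed":true}}\n' "$1" "$2"
}

# Posts the command to the core's /schema endpoint with curl.
set_field_type() {
    curl -sS -X POST -H 'Content-type: application/json' \
        --data-binary "$(schema_payload "$1" "$2")" "$SOLR_URL/schema"
}
```

With the server running, `set_field_type content text_general` would send the request. Reindexing afterwards is still needed for the new field type to take effect on existing documents.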