Nilesh Chakraborty edited this page Jun 9, 2014 · 1 revision

M1 - Single language extraction [done]

Downloads the dump and calculates redirects locally, then uses worker nodes for extraction.

M2 - Implement redirect parallelization [done]

Investigate whether redirects are calculated faster locally or on worker nodes. If the worker-node approach is not faster, make the choice configurable.
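A minimal sketch of what a configurable redirect calculation could look like. All names here (`find_redirects`, `compute_redirects`, the `mode` values) are hypothetical, and a thread pool stands in for the worker nodes; it is not the project's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def find_redirects(pages):
    """Return {title: target} for pages that are redirects.

    `pages` is an iterable of (title, target_or_None) pairs; this input
    shape is an assumption made for the sketch.
    """
    return {title: target for title, target in pages if target is not None}


def compute_redirects(partitions, mode="local", max_workers=4):
    """Compute redirects either sequentially or across a pool of workers."""
    result = {}
    if mode == "local":
        # Process every partition in the calling process.
        for part in partitions:
            result.update(find_redirects(part))
        return result
    if mode == "distributed":
        # The thread pool is a stand-in for worker nodes.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            for partial in pool.map(find_redirects, partitions):
                result.update(partial)
        return result
    raise ValueError(f"unknown mode: {mode!r}")
```

Both modes return the same mapping, so the two can be timed against each other and the faster one selected through configuration.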

M3 - Multiple language extraction

Downloads the dumps for all languages locally, then uses worker nodes for extraction.

M4 - Downloading parallelization

Download Wikipedia dumps from worker nodes.

M5 - Extraction for a language starts as soon as the dump is downloaded.
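One way to picture this milestone is a pipeline in which each language's extraction is submitted the moment its download finishes, rather than after all downloads complete. The sketch below assumes hypothetical `download` and `extract` callables and uses a thread pool in place of worker nodes.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipeline(languages, download, extract, max_workers=4):
    """Download every language and extract each one as soon as it is ready.

    `download(lang)` returns a dump handle and `extract(lang, dump)` returns
    the extraction result; both are placeholders supplied by the caller.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start all downloads up front.
        downloads = {pool.submit(download, lang): lang for lang in languages}
        extractions = {}
        # as_completed yields each download when it finishes, so extraction
        # for that language is scheduled without waiting for the others.
        for future in as_completed(downloads):
            lang = downloads[future]
            extractions[pool.submit(extract, lang, future.result())] = lang
        for future in as_completed(extractions):
            results[extractions[future]] = future.result()
    return results
```

Usage with stand-in functions: `run_pipeline(["en", "de"], lambda l: f"dump-{l}", lambda l, d: d.upper())`.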

M6 - Stream extraction from download stream (if there is time)

Start redirect calculation and extraction directly from the download stream.
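As an illustration of processing a dump while it is still arriving, the sketch below decompresses chunks incrementally and yields complete lines as soon as they are available. It assumes the dump is a single bz2 stream and that chunks come from some download iterator; the function name `iter_lines_from_stream` is hypothetical.

```python
import bz2


def iter_lines_from_stream(chunks):
    """Yield decoded text lines from an iterable of bz2-compressed chunks.

    Assumes a single bz2 stream encoded as UTF-8. Lines are emitted as soon
    as enough data has arrived, so downstream work can begin before the
    download has finished.
    """
    decompressor = bz2.BZ2Decompressor()
    pending = b""
    for chunk in chunks:
        pending += decompressor.decompress(chunk)
        # Everything before the last newline is complete; keep the remainder.
        *lines, pending = pending.split(b"\n")
        for line in lines:
            yield line.decode("utf-8")
    if pending:
        # Final line without a trailing newline.
        yield pending.decode("utf-8")
```

Each yielded line could then be fed to redirect calculation or extraction without first writing the whole dump to disk.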

M7 - Investigate an abstract extractor implementation (if there is time)

M8 - Make this project as extensible as possible so that it can be reused in other data processing pipelines.