|
1 | 1 | # Web Content Summarizer
|
2 | 2 |
|
3 |
| -A Python tool that scrapes web articles and generates summaries using the Generative AI. |
| 3 | +A Python tool that aggregates and summarizes web content using AI. |
| 4 | + |
| 5 | +## Features |
| 6 | + |
| 7 | +| Feature | Description | Status | |
| 8 | +|------------------------|-----------------------------------------------------------------------------|--------| |
| 9 | +| Web Scraping | Extract articles from websites and blogs | ✅ | |
| 10 | +| AI Summarization | Generate concise summaries using Gemini models | ✅ | |
| 11 | +| RSS Feed Support | Process content from RSS/Atom feeds | ✅ | |
| 12 | +| PDF Processing | Extract text content from PDF documents | ✅ | |
| 13 | +| CI/CD Integration | Automated daily summaries via GitHub Actions | ✅ | |
| 14 | +| Date Filtering | Filter content by publication date | ✅ | |
| 15 | +| Dynamic Content | Handle JavaScript-rendered pages using Playwright | ✅ | |
4 | 16 |
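The "Date Filtering" row above does not say how filtering is done. As a rough illustration only, the following sketch keeps articles whose publication date falls inside a window. The function name `filter_by_date`, the `published` key, and the ISO 8601 date format are assumptions made for this example, not the project's actual API.

```python
from datetime import date, datetime


def filter_by_date(articles, since, until=None):
    """Keep articles whose ISO 'published' date falls in [since, until].

    Hypothetical helper: the 'published' key and its ISO 8601 format
    are assumptions for this sketch, not taken from the project.
    """
    until = until or date.today()
    kept = []
    for article in articles:
        published = article.get("published")
        if not published:
            continue  # no date available, so the article cannot be matched
        published_day = datetime.fromisoformat(published).date()
        if since <= published_day <= until:
            kept.append(article)
    return kept
```

For example, `filter_by_date(articles, date(2025, 1, 15), date(2025, 2, 28))` would drop an article dated `2025-01-10` and keep one dated `2025-02-01T08:00:00`. Articles without a date are skipped here; a real implementation might instead fall back to the date-extraction model named in the configuration.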
|
5 | 17 | ## Setup
|
6 | 18 |
|
7 |
| -1. Clone the repository |
8 |
| -2. Create a virtual environment: |
9 |
| - ```bash |
10 |
| - python3 -m venv venv |
11 |
| - source venv/bin/activate # On Windows: venv\Scripts\activate |
12 |
| - ``` |
13 |
| -3. Install dependencies: |
14 |
| - ```bash |
15 |
| - pip install -e . # install package in editable mode with dependencies |
16 |
| - playwright install chromium |
17 |
| - ``` |
18 |
| -4. Configure environment variables: |
19 |
| - ```bash |
20 |
| - cp .env.example .env |
21 |
| - ``` |
22 |
| - Edit `.env` with: |
23 |
| - ``` |
24 |
| - GEMINI_API_KEY=your_api_key_here |
25 |
| - GEMINI_MODEL_SUMMARIZE=gemini-1.5-pro-latest |
26 |
| - GEMINI_MODEL_DATE_EXTRACT=gemini-1.5-pro-latest |
27 |
| - ``` |
| 19 | +### Installation |
| 20 | +```bash |
| 21 | +# Clone repository |
| 22 | +git clone https://github.com/yourusername/content-aggregator.git |
| 23 | +cd content-aggregator |
| 24 | + |
| 25 | +# Create and activate virtual environment |
| 26 | +python3 -m venv venv |
| 27 | +source venv/bin/activate # Windows: venv\Scripts\activate |
| 28 | + |
| 29 | +# Install with dependencies |
| 30 | +pip install -e . |
| 31 | +playwright install chromium |
| 32 | +playwright install-deps |
| 33 | +``` |
| 34 | + |
| 35 | +### Configuration |
| 36 | +1. Create `.env` file: |
| 37 | + ```bash |
| 38 | + cp .env.example .env |
| 39 | + ``` |
| 40 | +2. Edit `.env` with your Gemini API details: |
| 41 | + ```env |
| 42 | + GEMINI_API_KEY=your_api_key_here |
| 43 | + GEMINI_MODEL_SUMMARIZE=gemini-2.0-flash-exp |
| 44 | + GEMINI_MODEL_DATE_EXTRACT=gemini-2.0-flash-exp |
| 45 | + ``` |
28 | 46 |
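The three `.env` values above have to reach the Python code somehow. The sketch below shows one way to read them using only the standard library. The helper names `load_env_file` and `read_gemini_settings` are hypothetical, and the project itself may well use a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def read_gemini_settings(path=".env"):
    """Return the three settings, preferring real environment variables.

    Hypothetical helper for illustration; only the variable names come
    from the README.
    """
    file_values = load_env_file(path)
    names = (
        "GEMINI_API_KEY",
        "GEMINI_MODEL_SUMMARIZE",
        "GEMINI_MODEL_DATE_EXTRACT",
    )
    settings = {name: os.environ.get(name, file_values.get(name)) for name in names}
    missing = [name for name, value in settings.items() if not value]
    if missing:
        raise RuntimeError("Missing settings: " + ", ".join(missing))
    return settings
```

Letting environment variables override the file is a design choice in this sketch. It keeps the same code usable in GitHub Actions, where secrets usually arrive as environment variables and no `.env` file exists.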
|
29 | 47 | ## Usage
|
30 | 48 |
|
31 |
| -### Local Execution |
| 49 | +### Basic Usage |
32 | 50 | ```bash
|
33 |
| -# generate issue |
| 51 | +# Run aggregator and generate issue |
34 | 52 | scripts/run.sh
|
35 | 53 | ```
|
36 | 54 |
|
| 55 | +### CLI Commands |
| 56 | +| Command | Description | Example | |
| 57 | +|------------------------|---------------------------------------------|----------------------------------| |
| 58 | +| `run` | Default aggregation process | `content-aggregator run` | |
| 59 | + |
| 60 | +### Testing |
| 61 | +```bash |
| 62 | +# Install with development dependencies |
| 63 | +pip install -e '.[dev]' |
| 64 | +
|
| 65 | +# Run all tests |
| 66 | +pytest tests/ -v -s |
| 67 | +
|
| 68 | +# Generate coverage report |
| 69 | +pytest --cov=content_aggregator --cov-report=html -s |
| 70 | +``` |
| 71 | + |
37 | 72 | ### Automated Daily Summaries
|
38 |
| -The system includes GitHub Actions configured to: |
39 |
| -- Run daily at 08:00 UTC |
40 |
| -- Process up to 5 articles |
41 |
| -- Create GitHub issues with summaries |
42 |
| -- Store results as workflow artifacts |
| 73 | +[Issues](https://github.com/jhengy/content-aggregator/issues) |
| 74 | + |
| 75 | +The GitHub Actions workflow: |
| 76 | +- Runs daily (off-peak time) |
| 77 | +- Processes configured content sources |
| 78 | +- Creates GitHub issues with summaries |
| 79 | +- Stores JSON results and summaries as artifacts |
43 | 80 |
|
44 | 81 | Output files will be created in:
|
45 | 82 | - `outputs/results_*.json`: Full results in JSON format
|
@@ -121,19 +158,3 @@ For GitHub Actions execution, ensure these repository settings:
|
121 | 158 | - Summarize and extract directly from the web URL, skipping the scraping step before passing content to the LLM
|
122 | 159 | - To what extent can an AI model extract and summarize content given only the URL? (Signal-to-noise ratio)
|
123 | 160 |
|
124 |
| -## For Developers |
125 |
| - |
126 |
| -### Installation |
127 |
| -```bash |
128 |
| -# Install with development dependencies |
129 |
| -pip install -e '.[dev]' |
130 |
| -``` |
131 |
| - |
132 |
| -### Running Tests |
133 |
| -```bash |
134 |
| -# Basic tests |
135 |
| -pytest tests/ -v -s |
136 |
| - |
137 |
| -# With coverage report |
138 |
| -pytest --cov=content_aggregator --cov-report=html -s |
139 |
| -``` |
|