
Commit 75b1fa9: Update README.md

Author: jhengy (committed)
Parent: a954174

File tree: 1 file changed (+66, -45 lines)


README.md

Lines changed: 66 additions & 45 deletions
@@ -1,45 +1,82 @@
  # Web Content Summarizer

- A Python tool that scrapes web articles and generates summaries using the Generative AI.
+ A Python tool that aggregates and summarizes web content using AI. Features include:
+
+ ## Features
+
+ | Feature           | Description                                        | Status |
+ |-------------------|----------------------------------------------------|--------|
+ | Web Scraping      | Extract articles from websites and blogs           | ✅     |
+ | AI Summarization  | Generate concise summaries using Gemini models     | ✅     |
+ | RSS Feed Support  | Process content from RSS/Atom feeds                | ✅     |
+ | PDF Processing    | Extract text content from PDF documents            | ✅     |
+ | CI/CD Integration | Automated daily summaries via GitHub Actions       | ✅     |
+ | Date Filtering    | Filter content by publication date                 | ✅     |
+ | Dynamic Content   | Handle JavaScript-rendered pages using Playwright  | ✅     |
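The RSS row above maps to straightforward feed parsing; a minimal sketch, assuming the `feedparser` package (this diff does not name the library the project actually uses):

```python
# Minimal sketch of RSS/Atom ingestion; feedparser is an assumption,
# not a dependency confirmed by this diff.
import feedparser

feed = feedparser.parse("https://example.com/feed.xml")  # hypothetical feed URL
for entry in feed.entries[:5]:
    print(entry.title, "->", entry.link)
```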

  ## Setup

- 1. Clone the repository
- 2. Create a virtual environment:
- ```bash
- python3 -m venv venv
- source venv/bin/activate  # On Windows: venv\Scripts\activate
- ```
- 3. Install dependencies:
- ```bash
- pip install -e .  # install package in editable mode with dependencies
- playwright install chromium
- ```
- 4. Configure environment variables:
- ```bash
- cp .env.example .env
- ```
- Edit `.env` with:
- ```
- GEMINI_API_KEY=your_api_key_here
- GEMINI_MODEL_SUMMARIZE=gemini-1.5-pro-latest
- GEMINI_MODEL_DATE_EXTRACT=gemini-1.5-pro-latest
- ```
+ ### Installation
+ ```bash
+ # Clone repository
+ git clone https://github.com/yourusername/content-aggregator.git
+ cd content-aggregator
+
+ # Create and activate virtual environment
+ python3 -m venv venv
+ source venv/bin/activate  # Windows: venv\Scripts\activate
+
+ # Install with dependencies
+ pip install -e .
+ playwright install chromium
+ playwright install-deps
+ ```
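The `playwright install chromium` step above backs the Dynamic Content row; a minimal sketch of fetching a JavaScript-rendered page with that browser (the project's real scraping module is not part of this diff):

```python
# Minimal sketch only; the project's actual Playwright usage is not shown here.
from playwright.sync_api import sync_playwright

def fetch_rendered_html(url: str) -> str:
    """Return the DOM after JavaScript has run, using headless Chromium."""
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for dynamic content
        html = page.content()
        browser.close()
        return html

print(fetch_rendered_html("https://example.com")[:200])
```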
+
+ ### Configuration
+ 1. Create `.env` file:
+ ```bash
+ cp .env.example .env
+ ```
+ 2. Edit `.env` with your Gemini API details:
+ ```env
+ GEMINI_API_KEY=your_api_key_here
+ GEMINI_MODEL_SUMMARIZE=gemini-2.0-flash-exp
+ GEMINI_MODEL_DATE_EXTRACT=gemini-2.0-flash-exp
+ ```
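These variables are read at runtime by the summarizer; a minimal sketch of consuming them, assuming `python-dotenv` and the `google-generativeai` client (neither dependency is shown in this diff):

```python
# Illustrative only; the project's real summarizer module is not shown here.
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()  # pull the GEMINI_* values from the .env created above

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(os.environ["GEMINI_MODEL_SUMMARIZE"])

article_text = "Full article body fetched by the scraper..."  # placeholder
response = model.generate_content(f"Summarize concisely:\n\n{article_text}")
print(response.text)
```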

  ## Usage

- ### Local Execution
+ ### Basic Usage
  ```bash
- # generate issue
+ # Run aggregator and generate issue
  scripts/run.sh
  ```

+ ### CLI Commands
+ | Command | Description                 | Example                  |
+ |---------|-----------------------------|--------------------------|
+ | `run`   | Default aggregation process | `content-aggregator run` |
+
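The `content-aggregator` console command implies a packaged entry point; a hypothetical sketch of what the `run` subcommand's wiring could look like (the CLI module is outside this diff, so every name here is illustrative):

```python
# Hypothetical CLI wiring for `content-aggregator run`; illustrative only.
import argparse


def main() -> None:
    parser = argparse.ArgumentParser(prog="content-aggregator")
    subcommands = parser.add_subparsers(dest="command", required=True)
    subcommands.add_parser("run", help="Default aggregation process")

    args = parser.parse_args()
    if args.command == "run":
        # The real command would scrape, summarize, and write outputs/.
        print("Aggregating configured sources...")


if __name__ == "__main__":
    main()
```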
+ ### Testing
+ ```bash
+ # Install with development dependencies
+ pip install -e '.[dev]'
+
+ # Run all tests
+ pytest tests/ -v -s
+
+ # Generate coverage report
+ pytest --cov=content_aggregator --cov-report=html -s
+ ```
+
  ### Automated Daily Summaries
- The system includes GitHub Actions configured to:
- - Run daily at 08:00 UTC
- - Process up to 5 articles
- - Create GitHub issues with summaries
- - Store results as workflow artifacts
+ [![CI](https://github.com/jhengy/content-aggregator/actions/workflows/run.yml/badge.svg)](https://github.com/jhengy/content-aggregator/issues)
+
+ The GitHub Actions workflow:
+ - Runs daily (off-peak time)
+ - Processes configured content sources
+ - Creates GitHub issues with summaries
+ - Stores JSON results and summaries as artifacts

  Output files will be created in:
  - `outputs/results_*.json`: Full results in JSON format
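A short sketch for inspecting those result files locally; the record structure is an assumption, since the JSON schema is not part of this diff:

```python
# Assumes each results file holds a JSON array of per-article records;
# the actual schema is not shown in this diff.
import glob
import json

for path in sorted(glob.glob("outputs/results_*.json")):
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    print(f"{path}: {len(records)} records")
```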
@@ -121,19 +158,3 @@ For GitHub Actions execution, ensure these repository settings:
  - summarization and extraction from web url, skip web scraping content before passing to llm
  - to what extent can ai model successfully extract content and summarize it based on the url? Signal to noise ratio

- ## For Developers
-
- ### Installation
- ```bash
- # Install with development dependencies
- pip install -e '.[dev]'
- ```
-
- ### Running Tests
- ```bash
- # Basic tests
- pytest tests/ -v -s
-
- # With coverage report
- pytest --cov=content_aggregator --cov-report=html -s
- ```
