Commit 575f1e5: Merge pull request #91 from alif-munim/main
feat: Add arXiv database search skill

---
name: arxiv-database
description: Search and retrieve preprints from arXiv via the Atom API. Use this skill when searching for papers in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, or economics by keywords, authors, arXiv IDs, date ranges, or categories.
license: MIT
metadata:
  skill-author: Orchestra Research
---

# arXiv Database

## Overview

This skill provides Python tools for searching and retrieving preprints from arXiv.org via its public Atom API. It supports keyword search, author search, category filtering, arXiv ID lookup, and PDF download. Results are returned as structured JSON with titles, abstracts, authors, categories, and links.

## When to Use This Skill

Use this skill when:
- Searching for preprints in CS, ML, AI, physics, math, statistics, q-bio, q-fin, or economics
- Looking up specific papers by arXiv ID (e.g., `2309.10668`)
- Tracking an author's recent preprints
- Filtering papers by arXiv category (e.g., `cs.LG`, `cs.CL`, `stat.ML`)
- Downloading PDFs for full-text analysis
- Building literature review datasets for AI/ML research
- Monitoring new submissions in a subfield

Consider alternatives when:
- Searching for biomedical literature specifically -> Use **pubmed-database** or **biorxiv-database**
- You need citation counts or impact metrics -> Use **openalex-database**
- You need peer-reviewed journal articles only -> Use **pubmed-database**

## Core Search Capabilities

### 1. Keyword Search

Search for papers by keywords in titles, abstracts, or all fields.

```bash
python scripts/arxiv_search.py \
  --keywords "sparse autoencoders" "mechanistic interpretability" \
  --max-results 20 \
  --output results.json
```

With a category filter:
```bash
python scripts/arxiv_search.py \
  --keywords "transformer" "attention mechanism" \
  --category cs.LG \
  --max-results 50 \
  --output transformer_papers.json
```

Search specific fields:
```bash
# Title only
python scripts/arxiv_search.py \
  --keywords "GRPO" \
  --search-field ti \
  --max-results 10

# Abstract only
python scripts/arxiv_search.py \
  --keywords "reward model" "RLHF" \
  --search-field abs \
  --max-results 30
```

### 2. Author Search

```bash
python scripts/arxiv_search.py \
  --author "Anthropic" \
  --max-results 50 \
  --output anthropic_papers.json
```

```bash
python scripts/arxiv_search.py \
  --author "Ilya Sutskever" \
  --category cs.LG \
  --max-results 20
```

### 3. arXiv ID Lookup

Retrieve metadata for specific papers:

```bash
python scripts/arxiv_search.py \
  --ids 2309.10668 2406.04093 2310.01405 \
  --output sae_papers.json
```

Full arXiv URLs are also accepted:
```bash
python scripts/arxiv_search.py \
  --ids "https://arxiv.org/abs/2309.10668"
```
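
When scripting lookups, it helps to normalize whatever form of ID a caller supplies before passing it along. A minimal sketch (the `normalize_arxiv_id` helper below is illustrative, not part of this skill's scripts):

```python
import re

def normalize_arxiv_id(value: str) -> str:
    """Reduce a bare ID, abs/pdf URL, or versioned ID to a bare arXiv ID."""
    # Strip a leading arxiv.org URL prefix if present
    value = re.sub(r"^https?://arxiv\.org/(abs|pdf)/", "", value.strip())
    # Drop a trailing .pdf extension, then a version suffix (e.g. v2)
    value = re.sub(r"\.pdf$", "", value)
    value = re.sub(r"v\d+$", "", value)
    return value
```

Old-style IDs such as `hep-th/9901001` pass through unchanged.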

### 4. Category Browsing

List recent papers in a category:
```bash
python scripts/arxiv_search.py \
  --category cs.AI \
  --max-results 100 \
  --sort-by submittedDate \
  --output recent_cs_ai.json
```

### 5. PDF Download

```bash
python scripts/arxiv_search.py \
  --ids 2309.10668 \
  --download-pdf papers/
```

Batch download from search results:
```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Search first
results = searcher.search(query='ti:"sparse autoencoder"', max_results=5)

# Download each result's PDF
for paper in results:
    arxiv_id = paper["arxiv_id"]
    searcher.download_pdf(arxiv_id, f"papers/{arxiv_id.replace('/', '_')}.pdf")
```

## arXiv Categories

### Computer Science (cs.*)
| Category | Description |
|----------|-------------|
| `cs.AI` | Artificial Intelligence |
| `cs.CL` | Computation and Language (NLP) |
| `cs.CV` | Computer Vision and Pattern Recognition |
| `cs.LG` | Machine Learning |
| `cs.NE` | Neural and Evolutionary Computing |
| `cs.RO` | Robotics |
| `cs.CR` | Cryptography and Security |
| `cs.DS` | Data Structures and Algorithms |
| `cs.IR` | Information Retrieval |
| `cs.SE` | Software Engineering |

### Statistics & Math
| Category | Description |
|----------|-------------|
| `stat.ML` | Machine Learning (Statistics) |
| `stat.ME` | Methodology |
| `math.OC` | Optimization and Control |
| `math.ST` | Statistics Theory |

### Other Relevant Categories
| Category | Description |
|----------|-------------|
| `q-bio.BM` | Biomolecules |
| `q-bio.GN` | Genomics |
| `q-bio.QM` | Quantitative Methods |
| `q-fin.ST` | Statistical Finance |
| `eess.SP` | Signal Processing |
| `physics.comp-ph` | Computational Physics |

Full list: see [references/api_reference.md](references/api_reference.md).

## Query Syntax

The arXiv API uses prefix-based field searches combined with Boolean operators. Wrap multi-word phrases in double quotes (e.g., `ti:"sparse autoencoder"`) so they are matched as phrases rather than separate terms.

**Field prefixes:**
- `ti:` - Title
- `au:` - Author
- `abs:` - Abstract
- `cat:` - Category
- `all:` - All fields (default)
- `co:` - Comment
- `jr:` - Journal reference
- `id:` - arXiv ID

**Boolean operators** (must be UPPERCASE):
```
ti:transformer AND abs:attention
au:bengio OR au:lecun
cat:cs.LG ANDNOT cat:cs.CV
```

**Grouping with parentheses:**
```
(ti:sparse AND ti:autoencoder) AND cat:cs.LG
au:anthropic AND (abs:interpretability OR abs:alignment)
```

**Examples:**
```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Papers about SAEs in ML
results = searcher.search(
    query='ti:"sparse autoencoder" AND cat:cs.LG',
    max_results=50,
    sort_by="submittedDate"
)

# Specific author in a specific field
results = searcher.search(
    query='au:"neel nanda" AND cat:cs.LG',
    max_results=20
)

# Complex Boolean query
results = searcher.search(
    query='(abs:RLHF OR abs:"reinforcement learning from human feedback") AND cat:cs.CL',
    max_results=100
)
```

## Output Format

All searches return structured JSON:

```json
{
  "query": "ti:\"sparse autoencoder\" AND cat:cs.LG",
  "result_count": 15,
  "results": [
    {
      "arxiv_id": "2309.10668",
      "title": "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning",
      "authors": ["Trenton Bricken", "Adly Templeton", "..."],
      "abstract": "Full abstract text...",
      "categories": ["cs.LG", "cs.AI"],
      "primary_category": "cs.LG",
      "published": "2023-09-19T17:58:00Z",
      "updated": "2023-10-04T14:22:00Z",
      "doi": "10.48550/arXiv.2309.10668",
      "pdf_url": "http://arxiv.org/pdf/2309.10668v1",
      "abs_url": "http://arxiv.org/abs/2309.10668v1",
      "comment": "42 pages, 30 figures",
      "journal_ref": ""
    }
  ]
}
```
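
Saved result files can be sanity-checked before downstream use. A small sketch assuming the schema above (the `check_results` helper is hypothetical, not part of the skill's scripts):

```python
REQUIRED_KEYS = {"arxiv_id", "title", "authors", "abstract", "categories"}

def check_results(data: dict) -> int:
    """Validate a results payload against the schema above; return the count."""
    # The reported count should match the number of result records
    assert data["result_count"] == len(data["results"])
    for paper in data["results"]:
        missing = REQUIRED_KEYS - paper.keys()
        assert not missing, f"missing fields: {missing}"
    return data["result_count"]
```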

## Common Usage Patterns

### Literature Review Workflow

```python
import json

import pandas as pd

from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# 1. Broad search
results = searcher.search(
    query='abs:"mechanistic interpretability" AND cat:cs.LG',
    max_results=200,
    sort_by="submittedDate"
)

# 2. Save results
with open("interp_papers.json", "w") as f:
    json.dump({"result_count": len(results), "results": results}, f, indent=2)

# 3. Filter and analyze
df = pd.DataFrame(results)
print(f"Total papers: {len(df)}")
print(f"Date range: {df['published'].min()} to {df['published'].max()}")
print("\nTop categories:")
print(df["primary_category"].value_counts().head(10))
```

### Track a Research Group

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

groups = {
    "anthropic": "au:anthropic AND (cat:cs.LG OR cat:cs.CL)",
    "openai": "au:openai AND cat:cs.CL",
    "deepmind": "au:deepmind AND cat:cs.LG",
}

for name, query in groups.items():
    results = searcher.search(query=query, max_results=50, sort_by="submittedDate")
    print(f"{name}: {len(results)} recent papers")
```

### Monitor New Submissions

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher()

# Most recent ML papers
results = searcher.search(
    query="cat:cs.LG",
    max_results=50,
    sort_by="submittedDate",
    sort_order="descending"
)

for paper in results[:10]:
    print(f"[{paper['published'][:10]}] {paper['title']}")
    print(f"  {paper['abs_url']}\n")
```

## Python API

```python
from scripts.arxiv_search import ArxivSearcher

searcher = ArxivSearcher(verbose=True)

# Free-form query (uses arXiv query syntax)
results = searcher.search(query="...", max_results=50)

# Lookup by ID
papers = searcher.get_by_ids(["2309.10668", "2406.04093"])

# Download PDF
searcher.download_pdf("2309.10668", "paper.pdf")

# Build a query from components
query = ArxivSearcher.build_query(
    title="sparse autoencoder",
    author="anthropic",
    category="cs.LG"
)
results = searcher.search(query=query, max_results=20)
```
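
For intuition, a helper like `build_query` can be composed directly from the field prefixes described under Query Syntax. A rough standalone sketch (illustrative only; the real implementation lives in `scripts/arxiv_search.py` and may differ):

```python
def build_query(title=None, author=None, abstract=None, category=None, op="AND"):
    """Compose an arXiv query string from optional field components.

    Multi-word values are double-quoted so the API treats them as phrases.
    """
    parts = []
    for prefix, value in (("ti", title), ("au", author),
                          ("abs", abstract), ("cat", category)):
        if value:
            quoted = f'"{value}"' if " " in value else value
            parts.append(f"{prefix}:{quoted}")
    # Join all supplied components with the chosen Boolean operator
    return f" {op} ".join(parts)
```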

## Best Practices

1. **Respect rate limits**: arXiv asks for a 3-second delay between successive API calls. The script handles this automatically.
2. **Use category filters**: They dramatically reduce noise; `cs.LG` is where most ML papers live.
3. **Cache results**: Save to JSON to avoid re-fetching.
4. **Choose the right sort**: Use `sort_by=submittedDate` for recent papers and `relevance` for keyword searches.
5. **Max 300 results per query**: The API caps each request at 300 results; for larger sets, paginate with the `start` parameter.
6. **Use bare arXiv IDs**: Prefer `2309.10668` over full URLs in programmatic code.
7. **Combine with openalex-database**: For the citation counts and impact metrics arXiv doesn't provide.
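
The pagination advice above can be sketched as a page-URL generator against arXiv's public export endpoint (`search_query`, `start`, and `max_results` are the API's own parameters; the `page_urls` helper is illustrative, and fetching each page with the 3-second delay is left to the caller):

```python
from urllib.parse import urlencode

API_URL = "http://export.arxiv.org/api/query"

def page_urls(query: str, total: int, page_size: int = 100):
    """Yield one API URL per page of results, stepping the `start` offset."""
    for start in range(0, total, page_size):
        params = {"search_query": query, "start": start, "max_results": page_size}
        yield f"{API_URL}?{urlencode(params)}"
```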

## Limitations

- **No full-text search**: Only metadata is searched (title, abstract, authors, comments)
- **No citation data**: Use openalex-database or Semantic Scholar for citations
- **Max 300 results per query**: Use pagination for larger sets
- **Rate limited**: ~1 request per 3 seconds recommended
- **Atom XML responses**: The script parses these into JSON automatically
- **Search lag**: New papers may take hours to appear in API results
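
To illustrate the Atom-to-JSON step, here is a minimal parse of an Atom entry using only the standard library (the sample feed is fabricated for the example, and the script's actual field mapping may differ):

```python
import xml.etree.ElementTree as ET

# Atom elements live in this XML namespace
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>http://arxiv.org/abs/2309.10668v1</id>
    <title>Example Title</title>
    <summary>Example abstract.</summary>
    <author><name>A. Author</name></author>
  </entry>
</feed>"""

def parse_entries(xml_text):
    """Extract a few paper fields from an Atom feed into plain dicts."""
    root = ET.fromstring(xml_text)
    out = []
    for entry in root.findall("atom:entry", ATOM_NS):
        out.append({
            # The <id> element is a URL; the last path segment is the arXiv ID
            "arxiv_id": entry.find("atom:id", ATOM_NS).text.rsplit("/", 1)[-1],
            "title": entry.find("atom:title", ATOM_NS).text,
            "abstract": entry.find("atom:summary", ATOM_NS).text,
            "authors": [a.find("atom:name", ATOM_NS).text
                        for a in entry.findall("atom:author", ATOM_NS)],
        })
    return out
```

The namespace map keeps the Atom tags (`entry`, `title`, `summary`) addressable without hard-coding the namespace URI into every tag name.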

## Reference Documentation

- **API Reference**: See [references/api_reference.md](references/api_reference.md) for full endpoint specs, all categories, and response schemas