A powerful Streamlit web app that intelligently extracts 50+ comprehensive product attributes from any e-commerce website and creates enhanced supplemental feeds for Google Shopping.
Extracts everything - from basic details (price, images, descriptions) to category-specific attributes (processor specs for electronics, ISBN for books, assembly requirements for furniture) automatically.
Works across different e-commerce platforms using multiple extraction strategies:
🎯 Structured Data First (JSON-LD Schema.org)
- Extracts rich product data from JSON-LD markup
- Handles Product, Offer, Brand schemas automatically
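As a rough illustration of this first layer, here is a stdlib-only sketch (a hypothetical standalone helper, not the app's actual `extract_structured_data()`):

```python
import json
import re

def extract_jsonld_product(html: str) -> dict:
    """Find JSON-LD script blocks and return fields from the first Product schema."""
    pattern = re.compile(
        r'<script[^>]*type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    for block in pattern.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed JSON-LD blocks
        # JSON-LD may be a single object, a list, or an @graph container
        candidates = data if isinstance(data, list) else data.get("@graph", [data])
        for node in candidates:
            if isinstance(node, dict) and node.get("@type") == "Product":
                offer = node.get("offers", {})
                if isinstance(offer, list):
                    offer = offer[0] if offer else {}
                brand = node.get("brand", {})
                return {
                    "title": node.get("name"),
                    "description": node.get("description"),
                    "brand": brand.get("name") if isinstance(brand, dict) else brand,
                    "price": offer.get("price"),
                    "currency": offer.get("priceCurrency"),
                }
    return {}
```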
🏷️ Meta Tags (Open Graph, Twitter Cards)
- Fallback to social media meta tags
- Extracts images, prices, descriptions
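A minimal sketch of this fallback (the property-to-field mapping is illustrative, and the regex assumes `property`/`name` appears before `content` within each tag):

```python
import re

# Illustrative subset of Open Graph / Twitter Card properties
META_MAP = {
    "og:title": "title",
    "og:description": "description",
    "og:image": "image_url",
    "product:price:amount": "price",
    "product:price:currency": "currency",
    "twitter:title": "title",
}

def extract_meta_tags(html: str) -> dict:
    """Collect known meta properties; the first occurrence on the page wins."""
    found = {}
    for match in re.finditer(
        r'<meta[^>]+(?:property|name)=["\']([^"\']+)["\'][^>]+content=["\']([^"\']*)["\']',
        html, re.IGNORECASE,
    ):
        prop, content = match.group(1), match.group(2)
        key = META_MAP.get(prop.lower())
        if key and key not in found:
            found[key] = content
    return found
```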
🔍 Smart HTML Parsing
- Common CSS selectors for e-commerce elements
- Intelligent pattern matching for product details
📊 Table & List Extraction
- Parses specification tables automatically
- Extracts from definition lists (dl/dt/dd)
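A stdlib sketch of the definition-list case (a hypothetical helper, independent of the app's own parser):

```python
from html.parser import HTMLParser

class SpecListParser(HTMLParser):
    """Pull key/value pairs out of <dl><dt>…</dt><dd>…</dd></dl> blocks."""
    def __init__(self):
        super().__init__()
        self.specs = {}
        self._tag = None
        self._key = None

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "dt":
            # Normalise the spec name: "Weight:" -> "weight"
            self._key = text.lower().rstrip(":")
        elif self._tag == "dd" and self._key:
            self.specs[self._key] = text
            self._key = None

def parse_definition_list(html: str) -> dict:
    parser = SpecListParser()
    parser.feed(html)
    return parser.specs
```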
🔎 Pattern-Based Extraction
- Regex patterns for dimensions, weights, colors
- Context-aware text mining
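The patterns might look roughly like this (illustrative regexes only, not the app's actual ones):

```python
import re

# Illustrative patterns: "120 x 60 x 75 cm" style dimensions, "2.5 kg" style weights
DIMENSION_RE = re.compile(
    r'(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*(cm|mm|in(?:ch(?:es)?)?)',
    re.IGNORECASE,
)
WEIGHT_RE = re.compile(r'(\d+(?:\.\d+)?)\s*(kg|g|lbs?|oz)\b', re.IGNORECASE)

def find_dimensions(text: str):
    m = DIMENSION_RE.search(text)
    if m:
        return f"{m.group(1)} x {m.group(2)} x {m.group(3)} {m.group(4).lower()}"
    return None

def find_weight(text: str):
    m = WEIGHT_RE.search(text)
    return f"{m.group(1)} {m.group(2).lower()}" if m else None
```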
Extracts 50+ attributes across all product categories:
Core Product Data:
- Product title & description
- Price, sale price & currency
- Main product image + additional images
- Brand name
- SKU, GTIN, MPN codes
- Product condition (new/refurbished/used)
- Availability status
- Product highlights/key features
Google Shopping Feed Attributes:
- Apparel (Required): color, size, material, pattern
- Apparel (Recommended): age_group, gender, fit_type
- Product type/category
- Multipack quantity
- Energy efficiency class
- Item group ID (for variants)
Physical Attributes:
- Product dimensions
- Shipping dimensions (package size)
- Weight (product & shipping)
Ratings & Reviews:
- Rating value
- Review count
Category-Specific Attributes:
📱 Electronics:
- Processor (Intel Core, AMD Ryzen, Apple M-series)
- RAM memory
- Storage capacity
- Screen size
- Model number
📚 Books:
- Author name
- ISBN
- Page count
- Format (Hardcover/Paperback/eBook)
- Publisher
🪑 Furniture:
- Assembly required (yes/no)
- Weight capacity
- Material composition
- Care instructions
👕 Apparel:
- Fit type (Slim/Regular/Relaxed/Oversized)
- Care instructions
- Fabric composition
- Size chart
⚡ Appliances:
- Motor/Power specifications
- Energy efficiency rating
- Warranty information
- Voltage/Wattage
📄 Paper Products:
- GSM (paper weight)
- Sheet count
- Material type
And more attributes automatically detected based on product type!
- 📤 Upload Google Shopping XML feeds
- 📊 Real-time progress tracking
- 📈 Detailed attribute coverage statistics
- 💾 Download results as CSV or Excel
- ⚙️ Configurable scraping delay and URL limits
- 🎯 Works with most e-commerce platforms automatically
- Clone this repository:

```bash
git clone https://github.com/yourusername/feed-attribute-scraper.git
cd feed-attribute-scraper
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Run the app:

```bash
streamlit run app.py
```

The app will open in your browser at http://localhost:8501
- Push this repository to GitHub
- Go to share.streamlit.io
- Sign in with GitHub
- Click "New app"
- Select your repository, branch (main), and main file path (`app.py`)
- Click "Deploy"
Your app will be live in minutes at https://your-app-name.streamlit.app
- Heroku: Add a `Procfile` with `web: streamlit run app.py`
- Railway: Works out of the box with `requirements.txt`
- Render: Set build command to `pip install -r requirements.txt` and start command to `streamlit run app.py`
- Upload Feed: Upload your Google Shopping XML feed file
- Configure Settings (sidebar):
- Set delay between requests (default: 1 second)
- Optionally limit number of URLs for testing
- Preview URLs: Check the URLs that will be scraped
- Start Scraping: Click the button and wait for completion
- Review Results: View extracted attributes and statistics
- Download: Get your supplemental feed as CSV or Excel
The app expects Google Shopping XML feeds with URLs in this format:
```xml
<item>
  <g:link><![CDATA[ https://example.com/product-url ]]></g:link>
</item>
```

Or standard:

```xml
<item>
  <link>https://example.com/product-url</link>
</item>
```

The supplemental feed can include 50+ attributes (depending on availability):
Core Fields:
- `id` - Product ID from XML feed
- `title` - Product title (from XML or page)
- `url` - Product URL
- `description` - Product description
- `price` - Regular product price
- `sale_price` - Sale/promotional price
- `currency` - Price currency (USD, GBP, etc.)
- `image_url` - Main product image URL
- `additional_image_link` - Additional product images (comma-separated)
- `brand` - Brand name
- `condition` - Product condition (new, refurbished, used)
Product Identifiers:
- `sku` - Stock Keeping Unit
- `gtin` - Global Trade Item Number (UPC/EAN)
- `mpn` - Manufacturer Part Number
Product Categorization:
- `product_type` - Product category
- `product_highlight` - Key features/highlights (pipe-separated)
- `keywords` - Product keywords
Apparel & Variants:
- `color` - Product color (REQUIRED for apparel)
- `size` - Apparel size (REQUIRED for apparel: S, M, L, etc.)
- `material` - Material composition (REQUIRED for apparel)
- `pattern` - Pattern type
- `age_group` - Target age group (newborn, infant, toddler, kids, adult)
- `gender` - Target gender (male, female, unisex)
- `fit_type` - Fit style (Slim, Regular, Relaxed, Oversized)
Physical Properties:
- `size_dimensions` - Product dimensions
- `shipping_dimensions` - Package dimensions
- `weight` - Product/shipping weight
Product Details:
- `availability` - Stock status
- `multipack` - Bundle/pack quantity
- `energy_efficiency_class` - Energy rating (A+++, A++, A+, A, B, C, D, E, F, G)
- `rating` - Product rating value
- `review_count` - Number of reviews
- `warranty` - Warranty information
Category-Specific Attributes:
- Electronics: `processor`, `ram`, `storage`, `screen_size`
- Books: `author`, `isbn`, `pages`, `format`
- Furniture: `assembly_required`, `weight_capacity`
- Appliances: `motor`
- Paper Products: `gsm`
- Start small: Use the URL limit setting to test with 10-20 URLs first
- Check coverage: Review the attribute coverage statistics to see what's being extracted
- Compare results: Try products from different categories to test extraction quality
- Rate limiting: Keep delay at 1s minimum to respect website servers and avoid being blocked
- Large feeds: 300 URLs at 1s delay = ~5 minutes processing time
- Timeout: Default 15s timeout per URL - adjust if needed for slow sites
- Modern e-commerce sites (Shopify, WooCommerce, Magento): 80-95% attribute coverage
- Sites with JSON-LD: Near 100% coverage for structured attributes
- Custom/legacy sites: 40-70% coverage (relies on pattern matching)
- Best results: Sites that implement Schema.org Product markup
The scraper uses a layered approach - you can customize any layer:
Edit `extract_structured_data()` to add support for additional Schema.org types or custom JSON-LD schemas.
Modify these methods to add site-specific selectors:
- `extract_price_from_html()` - Add price selectors
- `extract_image_from_html()` - Add image selectors
- `extract_description_from_html()` - Add description selectors
Enhance regex patterns in:
- `extract_dimensions()` - Dimension formats
- `extract_weight()` - Weight patterns
- `extract_colour()` - Color names
- `extract_material()` - Material keywords
- `extract_size()` - Size formats
Update `extract_table_data()` to map additional table headers to attributes.
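That mapping amounts to a lookup table along these lines (a hypothetical standalone version; the names and entries here are illustrative, not the app's actual code):

```python
# Hypothetical header-to-attribute mapping; extend with headers seen on your target sites
HEADER_TO_ATTRIBUTE = {
    "weight": "weight",
    "item weight": "weight",
    "dimensions": "size_dimensions",
    "product dimensions": "size_dimensions",
    "colour": "color",
    "color": "color",
    "material": "material",
}

def map_spec_row(header: str, value: str, attributes: dict) -> None:
    """Write a spec-table row into the attributes dict if its header is known."""
    key = HEADER_TO_ATTRIBUTE.get(header.strip().lower().rstrip(":"))
    if key and key not in attributes:
        attributes[key] = value.strip()
```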
- Add an extraction method (e.g., `extract_rating()`)
- Call it in `scrape_product_attributes()`
- Add to structured data extraction if applicable
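For the first step, a new method might look like this (a sketch only; the regex and return type are illustrative, not the app's actual code):

```python
import re

def extract_rating(text: str):
    """Pull a star rating like '4.5 out of 5' or '3/5' from page text."""
    m = re.search(r'(\d(?:\.\d)?)\s*(?:out of|/)\s*5\b', text, re.IGNORECASE)
    return float(m.group(1)) if m else None
```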
No URLs found:
- Check your XML uses `<g:link>` or `<link>` tags
- Verify the XML feed is valid and properly formatted
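To sanity-check a feed outside the app, a minimal stdlib sketch (assumes the standard Google Shopping `g:` namespace URI):

```python
import xml.etree.ElementTree as ET

G_NS = "http://base.google.com/ns/1.0"  # Google Shopping feed namespace

def extract_feed_urls(xml_text: str) -> list:
    """Collect product URLs from <g:link> or plain <link> inside each <item>."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.iter("item"):
        link = item.find(f"{{{G_NS}}}link")  # namespaced <g:link> first
        if link is None:
            link = item.find("link")         # fall back to plain <link>
        if link is not None and link.text:
            urls.append(link.text.strip())   # CDATA is parsed transparently
    return urls
```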
Low attribute coverage for a specific site:
- Check if the site uses JSON-LD (view page source, search for "application/ld+json")
- The site may use non-standard HTML structure - add custom selectors
- Some attributes may be loaded dynamically via JavaScript (not accessible to this scraper)
Missing specific attributes:
- Review the attribute coverage statistics to see what's being found
- Check the page source to see how the attribute is marked up
- Add custom patterns to the relevant extraction method
Slow performance:
- Normal behavior - respects rate limiting to avoid being blocked
- Adjust delay in settings (minimum 1s recommended)
- Consider processing in batches
Request errors (403, 429):
- Website may be blocking scraper traffic
- Increase delay between requests
- Some sites require additional headers or authentication
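One way to handle these responses is exponential backoff (a hedged sketch using stdlib `urllib` for self-containment; the app itself may use a different HTTP client, and the User-Agent string here is illustrative):

```python
import time
import urllib.error
import urllib.request

# A browser-like User-Agent often avoids blanket bot blocks, but is no guarantee
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; FeedScraper/1.0)"}

def backoff_delay(attempt: int, base: float = 2.0) -> float:
    """Exponential backoff: 2s, 4s, 8s… for attempts 0, 1, 2."""
    return base * (2 ** attempt)

def fetch_with_backoff(url: str, retries: int = 3) -> str:
    """Retry 403/429 responses with exponential backoff; re-raise after `retries`."""
    for attempt in range(retries):
        req = urllib.request.Request(url, headers=HEADERS)
        try:
            with urllib.request.urlopen(req, timeout=15) as resp:
                return resp.read().decode("utf-8", errors="replace")
        except urllib.error.HTTPError as e:
            if e.code in (403, 429) and attempt < retries - 1:
                time.sleep(backoff_delay(attempt))
                continue
            raise  # other status codes, or retries exhausted
```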
MIT License - feel free to use and modify
Pull requests welcome! Please:
- Fork the repository
- Create a feature branch
- Submit a pull request with a clear description