Using Scrapy to extract data from an e-commerce platform

I’m learning how to use Scrapy to gather info from a big online store. I’ve managed to get basic details like product names, prices, and links from the category pages. But now I want to go deeper. How can I use those links I’ve scraped to get more info from each product’s page? I want to do this all in one go, without having to run separate scripts. Here’s what I’ve got so far:

import scrapy

class OnlineStoreSpider(scrapy.Spider):
    name = 'store_crawler'
    start_urls = ['https://www.example-store.com/category/electronics']

    def parse(self, response):
        for item in response.css('.product-item'):
            yield {
                'name': item.css('.product-name::text').get(),
                'price': item.css('.product-price::text').get(),
                'url': item.css('.product-link::attr(href)').get()
            }

        next_page = response.css('.next-page::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

How can I modify this to also grab details from each product page? Any tips would be great!

Ooh, I love a good scraping challenge! :spider: The trick here is to chain requests: instead of yielding the item straight from the category page, yield a new Request for each product page and carry along what you’ve already scraped via the request’s meta dict.

Maybe something like this could work:

def parse(self, response):
    for item in response.css('.product-item'):
        # response.follow resolves relative product URLs against the page URL
        yield response.follow(
            item.css('.product-link::attr(href)').get(),
            callback=self.parse_product,
            meta={'initial_data': {
                'name': item.css('.product-name::text').get(),
                'price': item.css('.product-price::text').get()
            }}
        )

    # Keep the pagination from your original parse method
    next_page = response.css('.next-page::attr(href)').get()
    if next_page:
        yield response.follow(next_page, callback=self.parse)

def parse_product(self, response):
    # Merge the data carried over from the category page
    item = response.meta['initial_data']
    item.update({
        'description': response.css('.product-description::text').get(),
        'specs': response.css('.product-specs li::text').getall(),
        # Add more fields as needed
    })
    yield item

This way, you’re grabbing the basic info from the category page, then diving deeper into each product page. Cool, right? What other details are you hoping to snag from the product pages?

To grab details from each product page, you’ll want to modify your spider to follow the product URLs and parse those pages. Here’s how you can adjust your code:

In your current parse method, yield a new request to the product page instead of yielding the item directly. Then, create a new method (for example, parse_product) to handle parsing the product page details.

For instance, in parse, after extracting the product link, initiate a scrapy.Request with a callback to parse_product and pass any data you already have via meta. In parse_product, extract additional details like the product description and specifications, and then yield the complete item.

This approach enables you to extract both summary data from the category pages and in-depth details from each product page within a single run.

Hey, to get more details from each product page, you could modify your parse method to yield a new request for each product URL. Then create a parse_product method to handle the individual pages. Something like:

def parse(self, response):
    for item in response.css('.product-item'):
        # response.follow handles relative product URLs for you
        yield response.follow(
            item.css('.product-link::attr(href)').get(),
            callback=self.parse_product,
            meta={'name': item.css('.product-name::text').get()}
        )

Then parse_product can grab the extra details you want. Hope that helps!