Bilal K.

Scraping the Amazon.

Python is so powerful!

This is the second article of my web scraping guide. In the first article, I showed how to use BeautifulSoup & Requests to build a quick and effective web scraper for a TSLA stock price alert, where an email was automatically generated to notify me of the price drop.

Scraping a single page with Beautiful Soup is easy, but what if you want to scrape multiple pages at a time, or automate the process? Yes, you could do it manually, but that approach is time-consuming, not scalable, and strongly discouraged.

Image Credits: Real Python

In this post, we’ll learn how to use BeautifulSoup & Requests to quickly and effectively scrape product titles from multiple pages on Amazon. Let’s dive in.

import requests
from bs4 import BeautifulSoup

def prod_title():
    url = 'https://www.amazon.com/s?k=macbook&ref=nb_sb_noss_2'
    HEADERS = {'User-Agent':
                   'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
               'Accept-Language': 'en-US, en;q=0.5'}
    request = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(request.text, 'lxml')

    # Scrape the product titles from the first results page.
    for i in soup.find_all('span', 'a-size-medium a-color-base a-text-normal'):
        title = i.get_text()
        print("Title- ", title, '\n')

    base_url = 'https://www.amazon.com'

    # Follow each pagination link and scrape the remaining pages.
    for link in soup.find_all('li', class_='a-normal'):
        href = link.find('a').get('href')
        url = base_url + href

        request = requests.get(url, headers=HEADERS)
        page_soup = BeautifulSoup(request.text, 'lxml')

        for i in page_soup.find_all('span', 'a-size-medium a-color-base a-text-normal'):
            title = i.get_text()
            print("Title- ", title, '\n')


prod_title()

1- Import the Necessary Libraries.

import requests
from bs4 import BeautifulSoup
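If you don’t have these libraries yet, they can be installed with pip; lxml is included because we pass 'lxml' to BeautifulSoup as the parser below.

pip install requests beautifulsoup4 lxml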

2- Loading Webpages with Requests.

The requests module lets us send HTTP requests in Python. The next step is to find the CSS classes of the data we need in the page’s source code: select a product card and click the ‘Inspect Element’ option to see the source code of that particular section. You will get something similar to this:

Inspecting the titles.
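One thing worth adding before parsing: check that the request actually succeeded, since Amazon may answer with a CAPTCHA or an error page instead of results. A minimal sketch, using the same headers as the function below:

import requests

HEADERS = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

response = requests.get('https://www.amazon.com/s?k=macbook&ref=nb_sb_noss_2', headers=HEADERS)

# Amazon may return a CAPTCHA or error page instead of results,
# so verify the status code before handing the HTML to a parser.
if response.status_code != 200:
    raise RuntimeError('Request failed with status ' + str(response.status_code))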

Now we need to write a function that scrapes the titles from the webpage.

def prod_title():
    url = 'https://www.amazon.com/s?k=macbook&ref=nb_sb_noss_2'
    HEADERS = {'User-Agent':
                   'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
               'Accept-Language': 'en-US, en;q=0.5'}
    request = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(request.text, 'lxml')

    # Scrape the product titles from the first results page.
    for i in soup.find_all('span', 'a-size-medium a-color-base a-text-normal'):
        title = i.get_text()
        print("Title- ", title, '\n')
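As a design note, instead of printing each title as soon as it is found, you could collect the titles into a list and return them, which makes the function easier to reuse. A small sketch under the same assumption about Amazon’s CSS classes (prod_titles is a hypothetical name):

def prod_titles(soup):
    # Collect the product titles from one parsed results page
    # into a list instead of printing them.
    titles = []
    for span in soup.find_all('span', 'a-size-medium a-color-base a-text-normal'):
        titles.append(span.get_text().strip())
    return titles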

So far, we have scraped the titles of the first page. Now we want to scrape the titles from the other pages.

If we inspect the pagination box at the bottom of the page, we can see the hrefs for the second and third pages.

Inspecting the pagination box.

3- Analyzing the URLs.

Let’s see what page 2’s URL looks like:

https://www.amazon.com/s?k=macbook&page=2&qid=1608130203&ref=sr_pg_2

And then page 3’s URL:

https://www.amazon.com/s?k=macbook&page=3&qid=1608131542&ref=sr_pg_3

Here, we can see that the part of the URL after amazon.com can be scraped from the first page’s content, and if we append the hrefs of the second and third pages to our base URL (amazon.com), we can scrape those pages along with page 1.

base_url = 'https://www.amazon.com'

# Append each pagination href to the base URL.
for link in soup.find_all('li', class_='a-normal'):
    href = link.find('a').get('href')
    url = base_url + href
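As a side note, since only the page (and qid) parameters change between pages, you could also build the page URLs directly instead of scraping the pagination links. A sketch, where the page count is an assumed example:

search_url = 'https://www.amazon.com/s?k=macbook'

# Build the URLs for pages 2 and 3 directly; adjust the range
# to however many pages you want to scrape.
page_urls = [search_url + '&page=' + str(page) for page in range(2, 4)]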

4- Repeating the Procedure.

Now that we have the URLs of the other pages to scrape, let’s scrape them along with the first page.

base_url = 'https://www.amazon.com'

# Request and parse each paginated results page, then scrape
# its titles the same way as the first page.
for link in soup.find_all('li', class_='a-normal'):
    href = link.find('a').get('href')
    url = base_url + href

    request = requests.get(url, headers=HEADERS)
    page_soup = BeautifulSoup(request.text, 'lxml')

    for i in page_soup.find_all('span', 'a-size-medium a-color-base a-text-normal'):
        title = i.get_text()
        print("Title- ", title, '\n')
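One practical refinement, sketched below: pause between page requests so the scraper is less likely to be blocked, and gather everything into one list. The helper names (get_soup, titles_from) and the two-second delay are illustrative choices, not part of the original code:

import time

import requests
from bs4 import BeautifulSoup

BASE_URL = 'https://www.amazon.com'
HEADERS = {'User-Agent':
               'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36',
           'Accept-Language': 'en-US, en;q=0.5'}

def get_soup(url):
    # Fetch one page and parse it with lxml.
    response = requests.get(url, headers=HEADERS)
    return BeautifulSoup(response.text, 'lxml')

def titles_from(soup):
    # Extract the product titles from a parsed results page.
    return [span.get_text().strip()
            for span in soup.find_all('span', 'a-size-medium a-color-base a-text-normal')]

first_page = get_soup('https://www.amazon.com/s?k=macbook&ref=nb_sb_noss_2')
all_titles = titles_from(first_page)

for link in first_page.find_all('li', class_='a-normal'):
    time.sleep(2)  # polite pause between requests (example value)
    page = get_soup(BASE_URL + link.find('a').get('href'))
    all_titles += titles_from(page)

print(len(all_titles), 'titles scraped')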

So that’s it! I hope you found it helpful, and I hope you have a nice day.

Till next time :) Happy Learning.

Bilal Khan.

 