Amazon Scraping with Scrapy, Part I - Scraping

Written by Jordan H.

Recently I was doing a research project to learn more about the top sellers on Amazon.

In this first part I describe my process of scraping the necessary data from Amazon. In Part II I’ll discuss some of my findings.


Amazon Scraping

Full disclosure: I’m usually not on Amazon. I prefer shopping local as much as I can to bolster my local economy.

I’d encourage others to do the same this Christmas season.

The Goal of the Project

First: the objective. I wanted to find out what countries Amazon sellers are from to see how many are in the United States.

But how do I know what sellers there are?

My first order of business was to find out whether Amazon lists its sellers. I figured Amazon wouldn’t list all of their sellers on one page, let alone have an API where I could get all the sellers.

I needed to use a scraper.

By far the best open source scraper available is Scrapy.

First, though, I needed to determine how to find these sellers.

Investigating the URL Schemes

Individual Seller Pages

Turns out Amazon has seller pages.

And on that page there’s an address.

seller address

I found this out by choosing a random product. For most (but not all) products, Amazon lists the seller on the right side of the page (typically it says Fulfilled by...).

So I realize I’ll eventually scrape that. Since I know I’m going to scrape this page, and others like it, I need a way to determine the pattern for the seller page URL. Let’s take a look at the URL in the address bar:

https://www.amazon.com/sp/?seller=A1W4F5UCY68C8L

OK, no mystery here. Looks like each seller has a code, which in this case is A1W4F5UCY68C8L.

Now we need to determine where the list of sellers is.

List Of All the Sellers

Here I just did a quick DuckDuckGo search for “list of Amazon sellers.” The first page that popped up was this one:

https://www.amazon.com/gp/search/other/ref=sr_in_-2_1?rh=i%3Aappliances%2Cn%3A2619525011&page=2&pickerToList=enc-merchantbin&ie=UTF8&qid=1607290974

Let’s save that URL; bookmark it or save it to a text file. We’ll return to it later.

top appliance sellers

Getting the Seller URL from the List

I right clicked on one of the sellers and copied the URL. Here is one for “Goodman’s:”

https://www.amazon.com/s/ref=sr_in_-2_p_6_19?fst=as%3Aoff&rh=n%3A2619525011%2Cp_6%3AA1H5N2R2HL9LG2&bbn=2619525011&ie=UTF8&qid=1607287774&rnid=2661622011

Let’s examine the URL in more detail. For this I like to use a REST API client for parsing the parts of the URL. I spun up Insomnia and imported the URL as arguments.

insomnia URL parsing

The objective here is to find the seller code within the URL. This will help us when constructing the seller address page.

Here it’s a bit more hidden. Essentially what we’re looking for is a random-looking alphanumeric string. Almost everything looks numeric, except for the rh string:

n:2619525011,p_6:A1H5N2R2HL9LG2
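
If you’d rather not eyeball percent-encoded URLs, the standard library will decode the rh parameter for you. Here’s a quick sketch (my own illustration, not part of the spider) that pulls the seller code out of the “Goodman’s” URL above:

from urllib.parse import urlparse, parse_qs

url = (
    "https://www.amazon.com/s/ref=sr_in_-2_p_6_19?fst=as%3Aoff"
    "&rh=n%3A2619525011%2Cp_6%3AA1H5N2R2HL9LG2&bbn=2619525011&ie=UTF8"
)
# parse_qs undoes the percent-encoding for us.
rh = parse_qs(urlparse(url).query)["rh"][0]
print(rh)                    # n:2619525011,p_6:A1H5N2R2HL9LG2
print(rh.split("p_6:")[-1])  # A1H5N2R2HL9LG2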

In fact, let’s test that we can get the seller info page. Remember the page for the “AntiAgingBed” seller from earlier?

https://www.amazon.com/sp/?seller=A1W4F5UCY68C8L

See that seller code? Let’s replace A1W4F5UCY68C8L with A1H5N2R2HL9LG2, and we should have seller info for “Goodman’s:”

https://www.amazon.com/sp/?seller=A1H5N2R2HL9LG2

Goodman's Seller Page

And Bingo was his name–oh!

Note that there’s no seller address for Goodman’s. I noticed that for a lot of the seller info pages, so I just skipped them in data collection.

List of Departments

So, we’re ready to start scraping, right?

Well, not so fast. Based on the information gathered so far, we only have the sellers for the Appliances department. Amazon is not just an appliance store (and based on your experience you may say it’s not even an appliance store).

Remember that “Best Appliance Sellers” page? Let’s examine that URL:

https://www.amazon.com/gp/search/other?page=2&pickerToList=enc-merchantbin&rh=n%3A2619525011

Here it is parsed with Insomnia:

Insomnia URL parsing

Based on inference (and experimentation) we can determine what each of these query parameters does:

page (optional)
I’d imagine that if the results were paginated, this would select one of the pages. However, changing it to `1` or `7` leaves the results the same.
pickerToList (required)
I had a feeling this one is needed. To check, I removed it from the URL and tried the same page. A “Sorry” page pops up with a cute little dog, telling me the department page isn’t found.
rh (required)
OK, this seems interesting. It’s the only remaining query parameter that could carry the department code. I make a note of it for later (see the sketch below).
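
Putting those pieces together, here is a rough sketch of how a top-sellers URL can be rebuilt from just a department name and node ID (the path and parameters are the ones visible in the URLs above; the helper itself is my own illustration):

from urllib.parse import urlencode, urlunsplit

def top_sellers_url(name, node):
    # rh encodes the department as "i:<name>,n:<node>".
    query = urlencode({
        "rh": f"i:{name},n:{node}",
        "pickerToList": "enc-merchantbin",
        "ie": "UTF8",
    })
    return urlunsplit(("https", "www.amazon.com", "/gp/search/other", query, None))

print(top_sellers_url("appliances", "2619525011"))
# https://www.amazon.com/gp/search/other?rh=i%3Aappliances%2Cn%3A2619525011&pickerToList=enc-merchantbin&ie=UTF8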

We just need a page that lists all Amazon departments. Fortunately, there is one:

https://www.amazon.com/gp/site-directory

All Amazon Departments

Now, remember the rh query parameter for the “Appliances” department? It contained n:2619525011. Let’s find the “Appliances” department on this page, parse its URL, and look for a match.

We’re doing this because this page is the entry point of the spider. We’ll be able to go through each department and get the sellers associated with the department.

Let’s do a simple search to find the “Appliances” link. It’s highlighted below.

Amazon Appliances Highlighted.

The URL is here:

https://www.amazon.com/s?_encoding=UTF8&bbn=256643011&ref_=sd_allcat_nav_desktop_sa_intl_appliances&rh=i%3Aspecialty-aps,n%3A256643011,n%3A!468240,n%3A13397451

Here is what it looks like in Insomnia:

Insomnia

If you navigate to that URL, it won’t take you to the “Top Sellers” page like before. The URL for that is here:

https://www.amazon.com/gp/search/other/ref=sr_in_-2_1?rh=i%3Aappliances%2Cn%3A2619525011&page=2&pickerToList=enc-merchantbin&ie=UTF8&qid=1607290974

Insomnia

See any similarities?

Remember, we are trying to construct a URL similar to the 2nd URL using information from the 1st URL.

What I see is the rh string from the 1st URL:

i:specialty-aps,n:256643011,n:!468240,n:13397451

Specifically, the bbn value (256643011), which matches the n value from the rh query param in the 2nd URL (256643011).

Initial Program Flow

Scraping the “All departments” page

So, we have a plan for scraping the department page:

  1. Find all department links.
  2. For each department link, extract the n value from the rh query parameter.

From there we can scrape the “top sellers” page for each department.

Scraping the “Top Sellers” of each department.

Using the URL from the previous step (using the n value), fetch the HTML from that page.

  1. Find all seller links on the page.
  2. Get the name (stripping all leading and trailing spaces) and, as an added bonus, the product count in the parentheses (see the helper sketch after this list).
  3. Get the URL for the seller.
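
For step 2, the product count arrives as a parenthesized string like “(1,234)”, so a tiny helper along these lines (my own sketch) covers the cleanup:

def parse_product_count(text):
    # "(1,234)" -> 1234
    return int(text.strip().strip("()").replace(",", ""))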

Scraping the seller information page.

  1. Construct a URL for the seller info page from the seller URL extracted above: pull the seller code (the p_6 value inside the rh query param) and plug it into the seller query parameter of the info page (see the sketch below).
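
Here is a sketch of that translation step, assuming the seller code is the p_6 value we spotted in the rh parameter earlier:

import re
from urllib.parse import urlparse, parse_qs

def seller_info_url(seller_link):
    """Turn a seller search URL into a seller info page URL, or return None."""
    rh = parse_qs(urlparse(seller_link).query).get("rh", [""])[0]
    match = re.search(r"p_6:(?P<code>[A-Z0-9]+)", rh)
    if not match:
        return None
    return f"https://www.amazon.com/sp/?seller={match['code']}"

print(seller_info_url(
    "https://www.amazon.com/s/?rh=n%3A2619525011%2Cp_6%3AA1H5N2R2HL9LG2&bbn=2619525011"
))
# https://www.amazon.com/sp/?seller=A1H5N2R2HL9LG2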

Amazon Didn’t Like It

Amazon doesn’t like to be scraped, and I found it blocked me quite a bit. I’ll describe later how I managed to fake out Amazon and get some data.

For now, let me describe how I planned to store the data.

Storing the Data

To save time, we’ll store as much as possible in a database.

I chose MongoDB for storage, since a lot of this info looks hierarchical and I didn’t feel like spending half an hour drumming up a schema. If this were a longer project with more needs, I would have used PostgreSQL.
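
For the curious, the storage layer is nothing fancier than pymongo talking to a local instance; nested documents go in as plain Python dicts, with no schema required. A minimal sketch (the database and collection names are the ones the spiders below use):

from pymongo import MongoClient

client = MongoClient()  # local MongoDB on the default port
db = client.amazon_scraping
db.department.insert_one({
    "name": "Appliances",
    "query": {"node": ["2619525011"]},  # nested sub-documents, no schema needed
})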

Departments

I wanted to first find all departments on the page and record them. At this point I didn’t care if the seller info page exists…I’ll take care of that later.

So our first spider would only be fetching one page: https://www.amazon.com/gp/site-directory.

Here is the redacted code for the all_department spider:

class AzStoreDirectorySpider(scrapy.Spider):
    name = "amazon_store_directory"
    allowed_domains = ["amazon.com"]
    start_urls = [
        "https://www.amazon.com/gp/site-directory",
    ]

    def department_rh(self, name: T.AnyStr, node: T.Union[int, T.AnyStr]):
        return f"i:{name},n:{node}"

    def get_department_info(self, url: T.AnyStr):
        """
        :param url: URL for department

        :return: (name, node)
        """
        parts = urlparse(url)
        query = parse_qs(parts.query)
        if "rh" not in query:
            log.warn("`rh` key not in %s", list(url))
        rh = query["rh"][0]
        res = RE_NAME_NODE(rh)
        if not res:
            log.warning("name and node not found for rh='%s', url='%s", rh, url)
            return None
        name = res["name"]
        node = res["node"]
        return (res["name"], node)

    def department_query(
        self, name: T.AnyStr, node: T.Union[int, T.AnyStr], indexField=None
    ):
        val = {
            "rh": self.department_rh(name, node),
            "pickerToList": "enc-merchantbin",
            "ie": "UTF8",
        }
        if indexField:
            val.update(
                {
                    "indexField": indexField,
                }
            )
        return val

    def queue_department_page(
        self, name: T.AnyStr, node: T.Union[int, T.AnyStr], indexField=None
    ):
        query = urlencode(self.department_query(name, node))
        parts = (
            "https",
            "www.amazon.com",
            DEPARTMENT_PATH,
            query,
            None,
        )
        url = urlunsplit(parts)
        log.info("+ queue '%s' -- '%s'", name, url)
        yield scrapy.Request(url)

    def process_department_link(
        self,
        response: scrapy.http.Response,
        dep_link: T.Any,
    ):
        a = dep_link
        name = a.text
        href = a["href"]
        url = urlparse(href)
        query = parse_qs(url.query)
        log.info("Processing %s", name)
        if "node" not in query:
            log.info("`node` not given in query: %s", list(query.keys()))
            return
        yield from self.queue_department_page(name, query["node"][0])
        if self.mg_department.find_one(
            {
                "name": name,
            }
        ):
            log.info("# %s already added", name)
            return
        doc = {
            "name": name,
            "path": url.path,
            "query": query,
        }
        log.info("+ inserting %s", doc)
        self.mg_department.insert_one(doc)

    def parse(self, response: scrapy.http.Response):
        soup = BeautifulSoup(response.text, "lxml")
        dept_links = soup.find_all("a", class_="fsdDeptLink")
        log.info("Found %d department links", len(dept_links))
        for a in dept_links:
            yield from self.process_department_link(response, a)

I used BeautifulSoup for parsing since this was my first attempt at a spider. Scrapy has its own HTML/XML selector mechanism.
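
For reference, the same extraction with Scrapy’s built-in selectors would look roughly like this (a sketch, not what I actually ran):

    def parse(self, response: scrapy.http.Response):
        # Equivalent of BeautifulSoup's find_all("a", class_="fsdDeptLink").
        dept_links = response.css("a.fsdDeptLink")
        log.info("Found %d department links", len(dept_links))
        for a in dept_links:
            name = a.xpath("text()").get()
            href = a.attrib.get("href")
            log.info("Department '%s' -> %s", name, href)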

I tried to print lots of warnings indicating when something went wrong. I don’t expect Amazon pages to be consistent (as you saw with the seller above), but the warnings give me an idea if something is REALLY wrong.

I mark a link as invalid if I’m unable to come up with a sellers page for it, which happens from time to time. Obviously, no seller addresses are collected in that case.

What’s important to note is that I’m both storing the department in the MongoDB collection (in process_department_link) as well as constructing a department URL and passing it to Scrapy (in queue_department_page).

When storing the department document I use the following schema:

{
    "_id" : <Object ID>,
    "name" : <String of the Department ID>,
    "path" : <URL Path portion>,
    "query" : <SubDocument of query params>,
    "invalid" : <boolean - Does the department page exist?>,
}

For example, here is the stored document of “Appliances.”

{
  "_id": ObjectId("5fc70f5e1244453a23c34067"),
  "name": "Appliances",
  "path": "/Appliances/b",
  "query": {
    "ie": ["UTF8"],
    "node": ["2619525011"],
    "ref_": ["sd_allcat_ha"]
  }
}

Sellers

This step is pretty intensive. My initial draft was single-threaded, and after parsing it stored the result. I found this was too slow, so I decided to spawn off MongoDB storage on a different thread (hence the executor).

Ready? It’s a long one. I try not to include long-winded code samples, but couldn’t find ways of making this shorter…

class AmazonTopSellersSpider(scrapy.Spider):
    name = "amazon_top_sellers"
    allowed_domains = ["amazon.com"]
    switch_user_agent_range = (3, 5)

    def __init__(self, *args, **kwargs):
        super(AmazonTopSellersSpider, self).__init__(*args, **kwargs)

        self.mg_client = MongoClient()
        self.mg_db = self.mg_client.amazon_scraping

        self.mg_department = self.mg_db.department
        self.mg_top_sellers = self.mg_db.top_sellers

        # URLs of department sellers
        self.department_urls = self.get_department_urls()

        # Shuffle this.
        random.shuffle(self.department_urls)

        self.start_urls = self.department_urls

        self.future_to_add_sellers = {
            "task_add_sellers": {},
            "task_update_product_count": {},
        }

        self.db_executor = concurrent.futures.ThreadPoolExecutor(max_workers=2)
        for future in concurrent.futures.as_completed(self.future_to_add_sellers["task_add_sellers"]):
            items_added = self.future_to_add_sellers["task_add_sellers"][future]
            log.debug("FUTURE: %d items added to %s", len(items_added), future)
        for future in concurrent.futures.as_completed(self.future_to_add_sellers["task_update_product_count"]):
            prodCountUpdated = self.future_to_add_sellers["task_update_product_count"][
                future
            ]
            log.debug(
                "FUTURE: '%s' product count updated",
                prodCountUpdated.get("productCount", "<?>"),
            )

    def get_department_urls(self):
        urls = []
        selector = {
            "invalid": {
                "$exists": False,
            },
        }
        departments = self.mg_department.find(selector)
        for department in departments:
            urls.extend(get_department_urls_from_doc(department))
        return urls

    def response_in_salt(self, resp: scrapy.http.Response):
        return resp.url in self.salt_urls or resp.url in self._mark_salt

    def find_seller_department(self, node_id) -> dict:
        if not isinstance(node_id, (list, tuple)):
            node_id = [node_id]
        query = {"query.node": node_id}
        dep = self.mg_department.find_one(query)
        return dep

    def add_seller(
        self, seller_url: str, seller_name: str, indexField: str = None, _logger=log
    ):
        url = urlparse(seller_url)
        qs = parse_qs(url.query)
        bnn = qs.get("bbn", None)
        dep = None
        dep_id = None
        if indexField is None:
            indexField = "__top__"
        found = self.mg_top_sellers.count_documents(
            {
                "name": seller_name,
                "indexField": indexField,
            }
        )
        if found:
            # _logger.info(
            #     "SKIP - Seller %s (index='%s') already exists: %s",
            #     seller_name,
            #     indexField,
            #     found,
            # )
            return
        if bnn is None:
            _logger.warning(
                "Seller '%s': bbn param not given; leaving blank; url='%s'",
                seller_name,
                seller_url,
            )
        else:
            dep = self.find_seller_department(bnn)
            if not dep:
                _logger.warning(
                    "Seller '%s': Could not get department for node '%s' (url=%s)",
                    seller_name,
                    bnn,
                    seller_url,
                )
            else:
                dep_id = dep.get("_id")
        doc = {
            "department_id": dep_id,
            "name": seller_name,
            "qs": qs,
            "path": url.path,
            "indexField": indexField,
        }
        _logger.debug("Inserting document %s", doc)
        inserted = self.mg_top_sellers.insert_one(doc)
        return inserted

    def normalize(self, href: str):
        url = urlparse(href)
        scheme = url.scheme or "https"
        netloc = url.netloc or "www.amazon.com"
        path = url.path
        params = url.params
        query = url.query
        fragment = url.fragment
        return urlunparse((scheme, netloc, path, params, query, fragment))

    def add_sellers(self, resp: scrapy.http.Response, _logger=log):
        url = urlparse(resp.url)
        q = parse_qs(url.query)
        indexField = q.get("indexField", None)
        if isinstance(indexField, list):
            indexField = indexField[0]
        added = []
        for link in resp.css(".s-see-all-indexbar-column a"):
            url = self.normalize(link.attrib["href"])
            name = link.xpath("span/text()")[0].get()
            name = name.strip()
            doc = self.add_seller(url, name, indexField=indexField, _logger=_logger)
            added.append(doc)
        return added

    def update_product_count(self, resp: scrapy.http.Response):
        updated = []
        for link in resp.css(".s-see-all-indexbar-column a"):
            url = self.normalize(link.attrib["href"])
            try:
                name = link.xpath("span/text()")[0].get().strip()
                count = link.xpath("span/text()").getall()[1]
                count = int(count.strip().strip("(").rstrip(")").replace(",", ""))
            except Exception as exc:
                # A malformed link shouldn't stop the rest of the page.
                log.warning("Could not parse seller link on %s: %s", resp.url, exc)
                continue
            self.add_seller(url, name)
            self.mg_top_sellers.find_one_and_update(
                {
                    "name": name,
                },
                {
                    "$set": {
                        "productCount": count,
                    }
                },
            )
            updated.append(self.mg_top_sellers.find_one({"name": name}))
        return updated

    def parse(self, response: scrapy.http.Response):
        parts = urlparse(response.url)
        qs = parse_qs(parts.query)
        if "gp/search/other" in parts.path:
            self.future_to_add_sellers["task_add_sellers"].update(
                {
                    self.db_executor.submit(self.add_sellers, response): response.url,
                }
            )
            self.future_to_add_sellers["task_update_product_count"].update(
                {
                    self.db_executor.submit(
                        self.update_product_count, response
                    ): response.url,
                }
            )
            # this was the original code, without the executor.
            # self.add_sellers(response, _logger=log)
            # self.update_product_count(response)
            return
        log.info("other url: %s", response.url)

I added the update_product_count as a separate method because I had previously run this without taking product_count into–er–account. However, this works fine even if the seller isn’t yet stored.

This is just the information scraped from the “Top Sellers” page for the department. The data stored in Mongo has the following schema:

{
    "_id" : <standard object ID>,
    "department_id" : <ID for the department>,
    "name" : <name of the seller>,
    "qs" : <subdocument of query parameters in the URL>,
    "path" : <Path component of the URL>,
    "productCount" : <The number in the parentheses>,
    "indexField" : <The Index Field, if given; defaults to "__top__">
}

For example, here’s the stored data for the “AntiAgingBed” seller:

{
  "_id": ObjectId("5fc7d467b0957b4fa057f9fa"),
  "department_id": ObjectId("5fc70f5e1244453a23c34064"),
  "name": "AntiAgingBed",
  "qs": {
    "fst": ["as:off"],
    "rh": [
      "n:1055398,n:!1063498,n:1063306,n:1063308,n:3732961,p_6:A1W4F5UCY68C8L"
    ],
    "bbn": ["3732961"],
    "ie": ["UTF8"],
    "qid": ["1606930036"],
    "rnid": ["331544011"]
  },
  "path": "/s/ref=sr_in_-2_p_6_3/146-7467592-8962054",
  "productCount": 14,
  "indexField": "__top__"
}

I included the “indexField” for when I went back later to collect the sellers that were not “top sellers.” I might cover that in another article.

Seller Information

The final stage of the scraping part is to scrape the seller’s address. I use the information from the previous scraper to construct a URL for getting the seller info.

Extracting the seller address is relatively straightforward. In fact, I just need to supply the XPath to the element:

    def extract_seller_address(self, resp: scrapy.http.Response):
        addrs = [
            i.get()
            for i in resp.xpath(
                "/html/body/div[1]/div[1]/div[2]/div/ul/li[2]/span/ul/li/*/text()"
            )
        ]
        return addrs
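
The code that writes the address back isn’t shown here, but conceptually it’s one more update on the seller document. A minimal sketch, assuming the same mg_top_sellers collection as before:

    def store_seller_address(self, seller_id, addrs):
        # Attach the scraped address lines to the existing seller document.
        self.mg_top_sellers.find_one_and_update(
            {"_id": seller_id},
            {"$set": {"address": addrs}},
        )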

For the example of “AntiAgingBed” this updates the document like this:

{
  "address": [
    "1844 West Fairbanks Ave",
    "523",
    "Winter Park",
    "FL",
    "32789",
    "US"
  ]
}

Pretty handy!

Fetching Geographic Information, via ArcGIS

This is all well and good, but I wanted to get the country of origin.

“But you already have the country of origin!” You might say, “Just use that!”

Well, I wouldn’t recommend it. That would have to be a system all on its own. Because what if the seller left off the “US” part of the address? You and I can easily see it’s a U.S. address, but my code wouldn’t be able to tell.

“But that’s because of the ‘FL.’ Just look for states.”

In other words, we’re going down a machine learning rabbit hole I don’t want to go down.

In addition, I thought it would be helpful to also collect the GPS coordinates of each address. I’m not going to guesstimate all those, manually or automatically.

So what I need to do is called geocoding: getting geographic information based on a query string.

You do it all the time with Google Maps (or, if we’re privacy-conscious, OpenStreetMap). If you type “Grandma’s House” into the directions, the map will figure out where “Grandma’s House” is based on a combination of AI and magic (mostly magic). And we all know where Grandma’s house is! (5 miles past the troll’s cave)

So I’ll let AI do all the work at this point. I first got an API key at ArcGIS, and created one final script to fetch more precise seller coordinates.

Here’s the redacted code.

class SellerFinder(object):

    arcgis_auth_url = "https://www.arcgis.com/sharing/rest/oauth2/token"
    arcgis_search_url = "https://geocode.arcgis.com/arcgis/rest/services/World/GeocodeServer/findAddressCandidates"

    # If the token will expire within the next 2 minutes, fetch a new one.
    arcgis_buffer = timedelta(minutes=2)

    arcgis_auth_headers = {
        "content-type": "application/x-www-form-urlencoded",
        "accept": "application/json",
        "cache-control": "no-cache",
        "postman-token": "11df29d1-17d3-c58c-565f-2ca4092ddf5f",
    }

    gis_for_storage = True

    query_out_fields = "Country,X,Y,Match_addr,Addr_type,Score,URL"

    def __init__(self):

        if not ARCGIS_CLIENT_ID:
            raise RuntimeError("ARCGIS_CLIENT_ID not set")
        if not ARCGIS_CLIENT_SECRET:
            raise RuntimeError("ARCGIS_CLIENT_SECRET not set")

        self.mg_client = MongoClient()
        self.mg_db = self.mg_client.amazon_scraping

        self.mg_department = self.mg_db.department
        self.mg_top_sellers = self.mg_db.top_sellers

        self.arcgis_token = None
        self.arcgis_token_expiration = None

    @property
    def sellers_with_addresses(self):
        """ Returns all the sellers that do have an address string but
        do not have a "geo" document.
        """
        return self.mg_top_sellers.find(
            {
                "address": {
                    "$exists": True,
                },
                "geo": {
                    "$exists": False,
                },
            }
        )

    def auth_arcgis(self):
        """ Housekeeping: basically updates the arcgis token.
        This will run on the first iteration, and execute if the token will expire soon.
        """
        log.debug("Requesting new arcgis token")
        resp = requests.post(
            self.arcgis_auth_url,
            data=self.auth_argcis_payload,
            headers=self.arcgis_auth_headers,
        )
        data = resp.json()
        self.arcgis_token = data["access_token"]
        self.arcgis_token_expiration = datetime.now() + timedelta(
            minutes=data["expires_in"]
        )
        log.debug("New expiration token will expire %s", self.arcgis_token_expiration)

    @property
    def auth_argcis_payload(self):
        # Request a 2-week token (the expiration parameter is in minutes).
        return f"client_id={ARCGIS_CLIENT_ID}&client_secret={ARCGIS_CLIENT_SECRET}&grant_type=client_credentials&expiration=20160"

    def get_arcgis_search_params(self, query: T.AnyStr):
        return {
            "f": "json",
            "singleLine": query,
            "token": self.arcgis_token,
            "outFields": self.query_out_fields,
            ## IMPORTANT!!! -- `gis_for_storage` must be set to `True`
            # https://developers.arcgis.com/rest/geocode/api-reference/geocoding-free-vs-paid.htm
            "forStorage": self.gis_for_storage,
        }

    def argcis_search(self, query: T.AnyStr):
        """ Performs the arcgis search.

        :param query: the query given in the mongodb document.

        :return: Json data of the response.
        """
        payload = self.get_arcgis_search_params(query)
        resp = requests.get(self.arcgis_search_url, params=payload)
        if resp.status_code != 200:
            log.error(resp.request.url)
            log.error("GET %s : %s", resp.status_code, resp.text)
            return None
        return resp.json()

    def seller_location(self, seller: dict):
        """ This updates the seller information.

        :param seller: seller information fetched from mongo.

        First it fetches the seller information from mongo. Then it performs the
        arcgis query (based on the entire "address" string if given.)
        """
        search = " ".join(seller["address"])
        log.info("Looking for '%s'", search)
        result = self.argcis_search(search)
        if result is None:
            log.warning("Could not find location for %s", seller["name"])
            return None
        log.debug("Setting geo for '%s'", seller["name"])
        log.debug(pprint.pformat(result))
        self.mg_top_sellers.find_one_and_update(
            {
                "_id": seller["_id"],
            },
            {
                "$set": {
                    "geo": result,
                }
            },
        )

    @property
    def arcgis_about_to_expire(self):
        return datetime.now() + self.arcgis_buffer > self.arcgis_token_expiration

    def update_arcgis_auth(self):
        if self.arcgis_token_expiration is None or self.arcgis_about_to_expire:
            self.auth_arcgis()

    def run(self):
        for seller in self.sellers_with_addresses:
            self.update_arcgis_auth()
            self.seller_location(seller)


def main():
    finder = SellerFinder()
    finder.run()


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()

It’s important to note that the forStorage query argument must be True in order for the data to be stored legally.

That requires a paid account. Only about 500 queries can be done on the free tier; since I was going to need more than that, I decided to spend about $20 on additional queries.

Since I was not able to extract address information for “Goodman’s,” I was not able to get geo information for it either. However, the document below shows an example of geo data retrieved for another seller.

In Part II I’ll show a sneaky way in MongoDB to make this (as well as other data) more statistics-friendly.

"geo" : {
    "spatialReference" : {
        "wkid" : 4326,
        "latestWkid" : 4326
    },
    "candidates" : [
        {
            "address" : "39158, Ridgeland, Mississippi",
            "location" : {
                "x" : -90.123555,
                "y" : 32.409505
            },
            "score" : 98,
            "attributes" : {
                "Score" : 98,
                "Match_addr" : "39158, Ridgeland, Mississippi",
                "Addr_type" : "Postal",
                "URL" : "",
                "Country" : "USA",
                "X" : -90.123555,
                "Y" : 32.409505
            },
            "extent" : {
                "xmin" : -90.1285549999999,
                "ymin" : 32.404505,
                "xmax" : -90.118555,
                "ymax" : 32.414505
            }
        }
    ]
}

This was the last step in finding where the sellers are located. We now have a guessed “Country” value that’s most probably right.

When Amazon Blocks You

amazon captcha

It may come as no surprise that Amazon does block you if you scrape. But the good news is that you can still navigate the site, so Amazon doesn’t block everyone, just people who act like a bot.

Most annoying is the fact that Scrapy doesn’t see when Amazon blocks you.

“Why’s that?” You ask.

Good question, astute reader.

Mainly because the captcha page returns a response code 200, so Scrapy doesn’t see it as an error and keeps right on truckin’.

To detect whether a captcha was served, I created some DownloaderMiddleware that searches for part of the captcha notice. If it’s found, the middleware rewrites the response status to 403 and notifies the spider that it needs to invoke the salt shaker (described below).

For now, just note how I swap in a new status code.

class GenericAmazonMiddleware:

    def response_is_captcha(self, resp: Response):
        # This string is in the response if a captcha is detected.
        return resp.status == 200 and ("api-services-support@amazon.com" in resp.text)

    def amazon_is_sorry(self, resp: Response):
        if not resp.css("h2"):
            return False
        h2 = resp.css("h2")
        return any(["we're sorry!" in h.get().lower() for h in h2])

    def process_response(self, request: Request, response: Response, spider: Spider):
        if self.amazon_is_sorry(response):
            # This is raised if a page is not found, specifically a seller.
            # this still returns a response code of 200, so Scrapy doesn't detect it.
            spider.logger.warning('Amazon said "We\'re sorry!" for %s', response.url)
            on_amazon_sorry = getattr(spider, "on_amazon_sorry", None)
            if on_amazon_sorry:
                on_amazon_sorry(self, request, response)
            raise IgnoreRequest()
        if self.response_is_captcha(response):
            # Same with captchas. But in this instance we want to ask the throttle
            # to slow down (although, the damage has probably already been done).
            # To tell scrapy to slow down, we throw an error 403.
            spider.logger.warning(
                "Encountered captcha; sleeping for %d seconds, adding '%s' to queue",
                CAPTCHA_SLEEP_TIME,
                request.url,
            )
            time.sleep(CAPTCHA_SLEEP_TIME)
            # I'm sure there's a better way to do this. This is just a quick and
            # dirty method.
            setattr(spider, "__invoke_shaker", True)
            response.status = 403
            spider.start_urls.append(request.url)
        return response

So after a while I started to see "Encountered captcha; ..." in the log output.

Here are a few strategies I found that were very effective in being able to scrape the site.

Change User Agent.

The user agent is a string sent with every request to the server. It’s used for several purposes, most notably so website maintainers know how to render a page. But it’s also used by sites for tracking you (as well as bot detection).

Switching user agents took care of my blocks about 90% of the time. Scrapy makes this easy with the scrapy-user-agents package.

All this does is change the user agent string every so often.
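
If I recall the package’s README correctly, enabling it is just a couple of lines in settings.py (double-check against the current docs before copying):

DOWNLOADER_MIDDLEWARES = {
    # Disable the built-in user agent middleware and let the package rotate agents.
    "scrapy.downloadermiddlewares.useragent.UserAgentMiddleware": None,
    "scrapy_user_agents.middlewares.RandomUserAgentMiddleware": 400,
}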

Set a proper AUTOTHROTTLE_START_DELAY and AUTOTHROTTLE_MAX_DELAY

I set mine as follows:

AUTOTHROTTLE_START_DELAY = 15
AUTOTHROTTLE_MAX_DELAY = 160

It seemed to do just fine.
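
One caveat: those two settings only do anything when AutoThrottle itself is switched on, so the relevant chunk of settings.py ends up looking like this:

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 15
AUTOTHROTTLE_MAX_DELAY = 160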

If your links are oddly specific and repetitive, it doesn’t take a security expert to detect you’re a bot.

Let’s add some salt to our navigation.

The term “salt” is used in cryptography for random data added to an input to add entropy (a fancy way of saying “a bunch of garbage to confuse anyone who’s not supposed to be reading it”). I decided to apply the same approach to scraping.

Essentially I provided initial breadcrumbs to Amazon to “prove” that I wasn’t a bot. This involves adding additional links to the start_urls property in the spider and crawling them as if I were navigating there myself.

I added salt URLs as the first 5-7 URLs preceding the actual scraping, and sprinkled in some more salt URLs into the actual start_urls.

I incorporated this as a middleware and listened for the __invoke_shaker attribute set by the GenericAmazonMiddleware:

class SaltShakerMiddleware:

    initial_salt_k = 4

    SALT_USAGE = 100
    SHAKE_PROBABILITY = 25

    # Set this to whatever you want...better yet, set it up in
    # settings!
    salt_urls = [
        "https://www.amazon.com/b?node=16225009011",
        "https://www.amazon.com/gp/customer-preferences/select-currency/ref=aistrust3",
        "https://www.amazon.com/business?_encoding=UTF8&ref_=footer_retail_b2b",
        "https://www.amazon.com/gp/help/customer/display.html?ie=UTF8&nodeId=GDFU3JS5AL6SYHRD&ref_=footer_covid",
        "https://www.amazon.jobs/",
        "https://blog.aboutamazon.com/?utm_source=gateway&utm_medium=footer",
    ]

    def __init__(self, *args, **kwargs):
        super(SaltShakerMiddleware, self).__init__(*args, **kwargs)
        self._mark_salt = []

    def spider_opened(self, spider: Spider):
        spider.logger.info("Adding salt to spider '%s'", spider.name)
        self.add_salt(spider)

    def normalize(self, href: str):
        url = urlparse(href)
        scheme = url.scheme or "https"
        netloc = url.netloc or "www.amazon.com"
        path = url.path
        params = url.params
        query = url.query
        fragment = url.fragment
        return urlunparse((scheme, netloc, path, params, query, fragment))

    def shake(self, resp: Response, spider: Spider):
        links = resp.xpath("//a/@href").getall()
        if not links:
            spider.logger.warning("!! NO-LINK: No links found for %s", resp.url)
            return
        n = max(min(len(links), 10), 3)
        k = random.randint(3, n)
        # We'll choose a random selection of the links...again, a human thing to do.
        chosen = random.sample(links, k=min(k, len(links)))
        for link in chosen:
            lnk = self.normalize(link)
            self._mark_salt.append(lnk)
            spider.logger.debug("salt shaker: <%s>", lnk)
            resp.follow(lnk)

    def maybe_shake(self, response: Response, spider: Spider):

        shake_salt = random.randint(0, 100) < self.SHAKE_PROBABILITY
        if getattr(spider, "__invoke_shaker", False) is True:
            self.shake(response, spider)
        elif len(self._mark_salt) < self.SALT_USAGE and shake_salt:
            spider.logger.info(".: invoking salt shaker :.")
            self.shake(response, spider)

    def add_salt(self, spider: Spider):

        random.shuffle(self.salt_urls)

        # Start with an initial salt to throw Amazon off
        self.initial_salt = random.sample(self.salt_urls, k=self.initial_salt_k)

        spider.start_urls = self.salt_urls + spider.start_urls
        random.shuffle(spider.start_urls)
        spider.start_urls = self.initial_salt + spider.start_urls

        self._mark_salt = []

    def process_response(self, request: Request, response: Response, spider: Spider):
        self.maybe_shake(response, spider)
        return response

Because we want to stay focused on scraping, I’ve set a limit on the amount of salt we’ll chase. This is handled by the SALT_USAGE class variable; once the length of _mark_salt hits SALT_USAGE, we don’t follow any more random links.

Below that limit, there are 2 cases in which the salt shaker is executed:

  • A SHAKE_PROBABILITY percent chance (e.g. 25% in this example)
  • If a captcha is detected, to assure Amazon we’re really not scraping the site (but we are). In this case, the shake is invoked through the spider by the GenericAmazonMiddleware.
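
For completeness, both middlewares have to be wired into the project settings. The module path and priority numbers below are placeholders; point them at wherever the classes actually live in your project (and in a real settings.py they would share one DOWNLOADER_MIDDLEWARES dict with the user-agent entries shown earlier):

DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.GenericAmazonMiddleware": 543,
    "myproject.middlewares.SaltShakerMiddleware": 544,
}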

Lessons Learned

That was probably the longest, most involved article I’ve written. And it’s only the first part!

Here are some takeaways:

  • Scrapy is an awesome scraping tool!
  • MongoDB makes an excellent storage engine.
  • Amazon does not like to be scraped.
  • Save as much as you can on the first round so you don’t have to scrape a 2nd time.
  • All it takes is a little investigative work (and the help of a REST API client!) to find the seller codes.
  • We first needed to find the department URLs from Amazon’s “All Departments” page. From there, we navigated to each department, extracting each seller’s URL from its “Top Sellers” page. Then came a third stage: scraping each seller’s info page (if it exists) to extract the address.
  • Amazon returns HTTP status code 200 when it blocks you, so additional detection techniques have to be employed.
  • 3rd party middleware, or your own, can be used to get around some of the blocks.
  • Geocoding can be done with ArcGIS (but remember to set the forStorage flag!)

In Part II (which I hope to post next week) I’ll talk about my findings. Where are the top sellers on Amazon?

Did you like this article?

Did you know I'm available for hire?

Send me an email here: hireme@jordanhewitt.pro

Or connect with me on Wire: @nswered