January 16th, 2019

Setting headers on Scrapy to request JSON versions of websites/APIs

Scrapy is a great tool for scraping info off of websites. Recently I was trying to pull info via Scrapy from EventBrite’s API tools. I say trying because instead of getting a JSON response like I was expecting, it was returning a full HTML webpage. Not very helpful when trying to parse JSON.

EventBrite’s API is a little unusual because it supplies a very useful web interface for building queries in the browser. With Scrapy, though, that interface is more of a hindrance than a help.

EventBrite's API Interface

A screenshot of the EventBrite API page.

I suspected EventBrite was looking at the request headers and returning a different view depending on whether the client asked for HTML or JSON. Scrapy, being a web scraper, asks for the HTML version of pages by default.
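To make that suspicion concrete, here is a rough sketch of how a server might rank the media types in an Accept header. The `parse_accept` helper is hypothetical and heavily simplified (real content negotiation also considers things like wildcards and specificity), and the first header is my understanding of Scrapy's documented default, so treat both as assumptions rather than a description of what EventBrite actually does.

```python
def parse_accept(header):
    """Return the media types in an Accept header, most preferred first.

    Simplified: sorts by the q value (default 1.0) and breaks ties by
    the order in which the types appear in the header.
    """
    entries = []
    for position, part in enumerate(header.split(",")):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0]
        quality = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                quality = float(value)
        # Negate quality so that a plain ascending sort puts high q first.
        entries.append((-quality, position, media_type))
    return [media_type for _, _, media_type in sorted(entries)]


# Assumed to match Scrapy's default Accept header.
SCRAPY_DEFAULT_ACCEPT = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
# The Accept header used in this post, with JSON listed first.
JSON_FIRST_ACCEPT = "application/json,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"

print(parse_accept(SCRAPY_DEFAULT_ACCEPT)[0])  # text/html
print(parse_accept(JSON_FIRST_ACCEPT)[0])      # application/json
```

Under this simplified ranking, the default header prefers HTML, while putting "application/json" first makes JSON the top choice.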

Setting the headers for Scrapy is straightforward:

import scrapy
import json

class ScrapyHeaderSpider(scrapy.Spider):
    name = "scrapy_header"

    # start_requests() is a built-in Scrapy method that produces the spider's first requests.
    # Overriding it lets us replace the default headers.
    # Documentation: https://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spiders.Spider.start_requests
    def start_requests(self):
        url = "https://www.eventbriteapi.com/v3/organizers/[ORG_ID]/events/?token=[YOUR_TOKEN]"

        # Set the headers here. The important part is "application/json"
        headers = {
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.84 Safari/537.36',
            'Accept': 'application/json,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, sdch',
            'Accept-Language': 'en-US,en;q=0.8,zh-CN;q=0.6,zh;q=0.4',
        }

        yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        # Decode the JSON body into regular Python objects (dicts and lists)
        parsed_json = json.loads(response.body)

Then run via:

scrapy runspider scrapy_header.py

That’s it!
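The parse callback above only decodes the response body. As a sketch of a possible next step, the helper below pulls event names out of the decoded JSON. The function name `extract_event_names`, the sample body, and the `events` / `name` / `text` keys are all assumptions made for illustration, so check them against the payload you actually get back.

```python
import json


def extract_event_names(body):
    """Decode a JSON response body and return the event names it contains."""
    payload = json.loads(body)
    names = []
    # "events" and "name" -> "text" are assumed keys; adjust to the real payload.
    for event in payload.get("events", []):
        name = (event.get("name") or {}).get("text")
        if name:
            names.append(name)
    return names


# A made-up body standing in for response.body inside parse().
sample_body = b'{"events": [{"name": {"text": "Launch party"}}, {"name": {"text": "Workshop"}}, {"name": null}]}'
print(extract_event_names(sample_body))  # ['Launch party', 'Workshop']
```

Inside the spider, something like this could be called with response.body from parse, yielding one item per name.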

Learn More

If you want to learn more about Scrapy's default settings, including the request headers it sends by default, see the settings reference in the Scrapy documentation.

Why use Scrapy for a JSON endpoint?

"Why are you using Scrapy for something that could easily be solved by just using Requests?"

That's true. In most cases, doing something like this is much simpler:

import requests
response = requests.get("http://api.open-notify.org/iss-now.json")

However, there may be cases where you need to set a header in Scrapy, so hopefully this tutorial is useful to someone.