How to Use a Proxy with Python Requests
Proxies are essential to any large-scale web scraping project. There are two common ways to keep websites from blocking you while you scrape them: rotating IP addresses and using proxies.
While rotating IPs on its own is simple, combining the two techniques gets you past most anti-scraping measures and keeps you from being detected as a scraper and blocked. This article focuses on using the Python requests library behind a proxy to minimize your chances of getting banned.
What is a Proxy?
A proxy is a third-party server that routes your requests through its own servers and IP address. The website you are scraping or making a request to sees the proxy's IP address rather than your device's. This lets you remain anonymous, scrape more securely, and avoid getting blocked as you scrape the web.
Python Requests and Proxies
To use a proxy with Python requests:
1. Add the requests package to your project.
The requests library is the most popular library for sending HTTP requests. It is not included in the Python standard library, but it makes Python code for working with HTTP brief, easy, and straightforward. Import it at the top of your script:

import requests

2. Create a Proxy Dictionary
The next step is to create a proxy dictionary that defines the HTTP and HTTPS connections. This dictionary maps each protocol to a proxy URL. Set the URL argument to the webpage you're scraping:

proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}

r = requests.get("http://toscrape.com", proxies=proxies)
Proxy Authentication
When defining a proxy, it is not enough to specify just the address and port; you also need to include the protocol. You can use the same proxy for multiple protocols. If your proxy requires authentication, use this syntax:
http://user:pass@10.10.10.10:8000
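As a minimal sketch, assuming a proxy at 10.10.10.10:8000 that accepts the placeholder credentials user and pass, the authenticated proxy dictionary looks like this:

import requests

# user:pass are placeholder credentials; substitute your own.
proxies = {
    "http": "http://user:pass@10.10.10.10:8000",
    "https": "http://user:pass@10.10.10.10:8000",
}

r = requests.get("http://toscrape.com", proxies=proxies)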
Environmental Variables
Defining the proxies in every individual request can get repetitive. If you use the same proxies for each request, you can set environment variables instead:

export HTTP_PROXY='http://10.10.10.10:8000'
export HTTPS_PROXY='http://10.10.10.10:1212'
This way you don’t need to define any proxies in your code. Just make the request and it will work.
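For example, assuming the variables above are exported in your shell, a request needs no proxies argument at all, since requests reads the proxy settings from the environment by default:

import requests

# HTTP_PROXY / HTTPS_PROXY are picked up automatically
# (trust_env is True by default), so no proxies argument is needed.
r = requests.get("http://toscrape.com")
print(r.status_code)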
Proxy With Sessions
Sometimes you may find yourself scraping a website that uses sessions. In that case, you need to create a session and use a proxy at the same time. First create a new Session object and attach your proxies to it, then send the request through the session, passing the URL as the argument.
import requests

# Create a session and attach the proxies to it; every request
# sent through this session will use them.
session = requests.Session()
session.proxies = {
    "http": "http://10.10.10.10:8000",
    "https": "http://10.10.10.10:8000",
}

url = "http://mywebsite.com/example"
response = session.get(url)
IP Rotation
As we mentioned earlier, a common problem when scraping the web is getting blocked, which is frustrating when you can't access the website you want data from. When you use a single proxy, there is a good chance your scraper will be blocked and your IP address banned. The solution is to use multiple rotating proxies: a rotating proxy solution gets around IP bans by assigning a new IP address to each connection.
To rotate IPs, you first need a pool of IP addresses. You can use free proxies found on the internet, but if your product or service relies on scraped data, free proxies are unlikely to satisfy your scraping needs.
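As a minimal sketch, assuming you already have a pool of proxy addresses (the ones below are placeholders), you can pick one at random for each request:

import random
import requests

# Placeholder proxy pool; replace with your own working proxies.
proxy_pool = [
    "http://10.10.10.10:8000",
    "http://10.10.10.11:8000",
    "http://10.10.10.12:8000",
]

url = "http://toscrape.com"

for _ in range(3):
    # Choose a different proxy for each connection.
    proxy = random.choice(proxy_pool)
    proxies = {"http": proxy, "https": proxy}
    try:
        response = requests.get(url, proxies=proxies, timeout=10)
        print(proxy, response.status_code)
    except requests.exceptions.RequestException:
        # A dead proxy simply fails; move on to the next one.
        print(proxy, "failed")

Picking a proxy at random is the simplest rotation strategy; a production scraper would typically also drop proxies that fail repeatedly and retry the request through another one.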