Monday, 18 January 2016

Crawling Through Forms and Logins Using Python

One of the first questions that comes up when you start to move beyond the basics of web scraping is: “How do I access information behind a login screen?” The Web is increasingly moving toward interaction, social media, and user-generated content. Forms and logins are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.
Up until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. In this chapter, we’ll focus on the POST method, which pushes information to a web server for storage and analysis.
Forms basically give users a way to submit a POST request that the web server can understand and use. Just like link tags on a website help users format GET requests, HTML forms help them format POST requests. Of course, with a little bit of coding, it is possible to simply create these requests ourselves and submit them with a scraper.


Python Requests Library

Although it’s possible to navigate web forms using only the Python core libraries, sometimes a little syntactic sugar makes life a lot sweeter. When you start to do more than a basic GET request with urllib it can help to look outside the Python core libraries. 
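For comparison, here is a minimal sketch of what a form POST might look like using only the core libraries. The URL and field values are the ones used in the example later in this chapter; the helper name build_form_post is hypothetical. The request is only constructed here, not sent (passing it to urllib.request.urlopen would submit it).

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_post(url, fields):
    """Build (but do not send) a urllib POST request for the given form fields."""
    # urllib expects the form data URL-encoded and converted to bytes.
    body = urlencode(fields).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


req = build_form_post(
    "http://pythonscraping.com/files/processing.php",
    {"firstname": "Ryan", "lastname": "Mitchell"},
)
print(req.get_method())
print(req.data)
```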
The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more. 
Here’s what Requests creator Kenneth Reitz has to say about Python’s core tools:
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time—and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.
As with any Python library, the Requests library can be installed with any third-party Python library manager, such as pip, or by downloading and installing the source file.




Submitting a Basic Form



Most web forms consist of a few HTML fields, a submit button, and an “action” page, where the actual form processing is done. The HTML fields usually consist of text but might also contain a file upload or some other non-text content. 
Most popular websites block access to their login forms in their robots.txt file, so to play it safe I’ve constructed a series of different types of forms and logins at pythonscraping.com that you can run your web scrapers against. The most basic of these forms is located at http://bit.ly/1AGKPRU.
The entirety of the form is:
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>
A couple of things to notice here: first, the names of the two input fields are firstname and lastname. This is important. The names of these fields determine the names of the variable parameters that will be POSTed to the server when the form is submitted. If you want to mimic the action that the form will take when POSTing your own data, you need to make sure that your variable names match up.
The second thing to note is that the action of the form is actually at processing.php (the absolute path is http://bit.ly/1d7TPVk). Any POST requests to the form should be made on this page, not on the page on which the form itself resides. Remember: the purpose of HTML forms is only to help website visitors format proper requests to send to the page that does the real action. Unless you are doing research to format the request itself, you don’t need to bother much with the page that the form can be found on.
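Because the action attribute is usually a relative path, it has to be resolved against the URL of the page that contains the form. A minimal sketch using the standard library follows; the form page URL is an assumption for illustration, since the text gives only a shortened link.

```python
from urllib.parse import urljoin

# Assumed location of the page containing the form (hypothetical; the text
# gives only a shortened link).
form_page_url = "http://pythonscraping.com/files/form.html"
# The action attribute exactly as it appears in the form markup.
action = "processing.php"

# urljoin resolves the relative action against the form page, as a browser would.
post_url = urljoin(form_page_url, action)
print(post_url)
```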
Submitting a form with the Requests library can be done in four lines, including the import and the instruction to print the content (yes, it’s that easy):
import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)
After the form is submitted, the script should return with the page’s content:
Hello there, Ryan Mitchell!
This script can be applied to many simple forms encountered on the Internet. The form to sign up for the O’Reilly Media newsletter, for example, looks like this:
<form action="http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi" id="example_form2" method="POST">
    <input name="client_token" type="hidden" value="oreilly" />
    <input name="subscribe" type="hidden" value="optin" />
    <input name="success_url" type="hidden" value="http://oreilly.com/store/newsletter-thankyou.html" />
    <input name="error_url" type="hidden" value="http://oreilly.com/store/newsletter-signup-error.html" />
    <input name="topic_or_dod" type="hidden" value="1" />
    <input name="source" type="hidden" value="orm-home-t1-dotd" />
    <fieldset>
        <input class="email_address long" maxlength="200" name="email_addr" size="25" type="text" value="Enter your email here" />
        <button alt="Join" class="skinny" name="submit" onclick="return addClickTracking('orm','ebook','rightrail','dod');" value="submit">Join</button>
    </fieldset>
</form>
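Although it can look daunting at first, remember that in most cases (we’ll cover the exceptions later), you’re only looking for two things: the name of the field (or fields) you want to submit with data (in this case, email_addr), and the action attribute of the form itself, that is, the page the form actually posts to (in this case, http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi).
Just add in the required information and run it: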
import requests

params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post(
    "http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi",
    data=params)
print(r.text)
In this case, the page returned is simply another form to fill out before you can actually make it onto O’Reilly’s mailing list, but the same concept could be applied to that form as well. However, if you want to try this at home, I would request that you use your powers for good and not spam the publisher with invalid signups.
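The field names and the action URL can also be read out of the markup programmatically. The following is a rough sketch, using only the standard library, that collects the action and the named input fields (with any preset values, including hidden ones) from a form. The class name FormInspector and the sample markup are hypothetical, and a real page may need a more forgiving parser.

```python
from html.parser import HTMLParser


class FormInspector(HTMLParser):
    """Collect the action URL and named <input> fields of the first form seen."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and self.action is None:
            self.action = attrs.get("action")
        elif tag == "input" and "name" in attrs:
            # Hidden inputs carry preset values; text inputs usually start empty.
            self.fields[attrs["name"]] = attrs.get("value", "")


# Made-up markup, loosely modeled on a newsletter signup form.
sample = """
<form action="http://example.com/signup.cgi" method="POST">
    <input name="client_token" type="hidden" value="oreilly" />
    <input name="email_addr" type="text" />
    <input type="submit" value="Join" />
</form>
"""

inspector = FormInspector()
inspector.feed(sample)
print(inspector.action)
print(inspector.fields)
```

The resulting dictionary can be updated with your own values and passed as the data argument to requests.post.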

Radio Buttons, Checkboxes, and Other Inputs

Obviously, not all web forms are a collection of text fields followed by a submit button. Standard HTML contains a wide variety of possible form input fields: radio buttons, checkboxes, and select boxes, to name a few. In HTML5, there’s the addition of sliders (range input fields), email, dates, and more. With custom JavaScript fields the possibilities are endless, with colorpickers, calendars, and whatever else the developers come up with next.
Regardless of the seeming complexity of any sort of form field, there are only two things you need to worry about: the name of the element and its value. The element’s name can be easily determined by looking at the source code and finding the name attribute. The value can sometimes be trickier, as it might be populated by JavaScript immediately before form submission. Colorpickers, as an example of a fairly exotic form field, will likely have a value of something like #F03030.
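Whatever the widget looks like on screen, what reaches the server is a set of name and value pairs. Here is a small sketch of how a few field types might encode; all field names and values are made up for illustration.

```python
from urllib.parse import urlencode

# Hypothetical form: each key is an element's name attribute, each value is
# what that element would submit.
params = {
    "favorite_color": "#F03030",       # color picker: a hex string
    "size": "large",                   # radio button: value of the selected option
    "toppings": ["cheese", "olives"],  # checkboxes sharing a name: one pair per checked box
}

# doseq=True expands list values into repeated name=value pairs.
encoded = urlencode(params, doseq=True)
print(encoded)
```

Requests accepts the same kind of dictionary through its data argument, including list values for repeated names.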

If you’re unsure of the format of an input field’s value, there are a number of tools you can use to track the GET and POST requests your browser is sending to and from sites. The best and perhaps most obvious way to track GET requests, as mentioned before, is to simply look at the URL of a site. If the URL is something like:
http://domainname.com?thing1=foo&thing2=bar
You know that this corresponds to a form of this type:
<form method="GET" action="someProcessor.php">
<input type="someCrazyInputType" name="thing1" value="foo" />
<input type="anotherCrazyInputType" name="thing2" value="bar" />
<input type="submit" value="Submit" />
</form>
Which corresponds to the Python parameter object:
{'thing1':'foo', 'thing2':'bar'}
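The same mapping can be recovered from a URL in code. A short sketch using the standard library:

```python
from urllib.parse import parse_qsl, urlsplit

url = "http://domainname.com?thing1=foo&thing2=bar"

# Everything after the "?" is the query string; parse_qsl splits it into
# (name, value) pairs.
params = dict(parse_qsl(urlsplit(url).query))
print(params)
```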
If you’re stuck with a complicated-looking POST form, and you want to see exactly which parameters your browser is sending to the server, the easiest way is to use your browser’s inspector or developer tool to view them (see Figure 9-1).



Figure 9-1. The Form Data section, highlighted in a box, shows the POST parameters “thing1” and “thing2” with their values “foo” and “bar”


The Chrome developer tool can be accessed via the menu by going to View → Developer → Developer Tools. It provides a list of all queries that your browser produces while interacting with the current website and can be a good way to view the composition of these queries in detail.



Submitting Files and Images

Although file uploads are common on the Internet, they are not often used in web scraping. It is possible, however, that you might want to write a test for your own site that involves a file upload. At any rate, it’s a useful thing to know how to do.
There is a practice file upload form at http://pythonscraping.com/files/form2.html. The form on the page has the following markup:
<form action="processing2.php" method="post" enctype="multipart/form-data">
  Submit a jpg, png, or gif: <input type="file" name="uploadFile"><br>
  <input type="submit" value="Upload File">
</form>
Except for the <input> tag having the type attribute file, it looks essentially the same as the text-based forms used in the previous examples. Fortunately, the way the forms are used by the Python Requests library is also very similar:
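import requests

# Open the image in binary mode; the with block closes it once the upload finishes
with open('../files/Python-logo.png', 'rb') as image_file:
    files = {'uploadFile': image_file}
    r = requests.post("http://pythonscraping.com/pages/processing2.php",
                      files=files)
print(r.text)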
Note that in lieu of a simple string, the value submitted to the form field (with the name uploadFile) is now a Python File object, as returned by the open function. In this example, I am submitting an image file, stored on my local machine, at the path ../files/Python-logo.png, relative to where the Python script is being run from.
Yes, it’s really that easy!
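If you want to see what such an upload looks like before sending it, Requests can prepare a request without transmitting it. This sketch assumes Requests is installed and uses an in-memory stand-in for the image (the bytes are not a real PNG); the three-part tuple that supplies an explicit filename and content type is optional.

```python
import io

import requests

# In-memory stand-in for an image file (placeholder bytes, not a real PNG).
fake_image = io.BytesIO(b"not really a png")

# (filename, file object, content type): the tuple form sets these explicitly.
files = {"uploadFile": ("Python-logo.png", fake_image, "image/png")}

# prepare() builds the request locally; nothing is sent over the network.
prepared = requests.Request(
    "POST", "http://pythonscraping.com/pages/processing2.php", files=files
).prepare()

print(prepared.headers["Content-Type"])
print(prepared.body[:120])
```

The printed body shows the multipart encoding that the enctype attribute in the form markup asks for.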

Handling Logins and Cookies



So far, we’ve mostly discussed forms that allow you to submit information to a site or let you view needed information on the page immediately after the form. How is this different from a login form, which lets you exist in a permanent “logged in” state throughout your visit to the site?
Most modern websites use cookies to keep track of who is logged in and who is not. Once a site authenticates your login credentials, it stores a cookie in your browser, which usually contains a server-generated token, timeout, and tracking information. The site then uses this cookie as a sort of proof of authentication, which is shown to each page you visit during your time on the site. Before the widespread use of cookies in the mid-1990s, keeping users securely authenticated and tracking them was a huge problem for websites.
Although cookies are a great solution for web developers, they can be problematic for web scrapers. You can submit a login form all day long, but if you don’t keep track of the cookie the form sends back to you afterward, the next page you visit will act as though you’ve never logged in at all.
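At the HTTP level the exchange is simple: the server sends a Set-Cookie header, and the client echoes the name and value back in a Cookie header on later requests. Here is a sketch with the standard library; the cookie name and value are invented for illustration.

```python
from http.cookies import SimpleCookie

# A header a server might send after a successful login (hypothetical values).
set_cookie_header = "session_token=abc123; Path=/; Max-Age=3600"

jar = SimpleCookie()
jar.load(set_cookie_header)

# Only the name=value pairs go back to the server; attributes such as Path
# stay on the client side.
cookie_header = "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())
print(cookie_header)
```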
I’ve created a simple login form at http://bit.ly/1KwvSSG (the username can be anything, but the password must be “password”).
This form is processed at http://bit.ly/1d7U2I1, and contains a link to the “main site” page, http://bit.ly/1JcansT.
If you attempt to access the welcome page or the profile page without logging in first, you’ll get an error message and instructions to log in first before continuing. On the profile page, a check is done on your browser’s cookies to see whether its cookie was set on the login page.
Keeping track of cookies is easy with the Requests library:
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", 
                 cookies=r.cookies)
print(r.text)
Here I am sending the login parameters to the welcome page, which acts as the processor for the login form. I retrieve the cookies from the results of the last request, print the result for verification, and then send them to the profile page by setting the cookies argument.
This works well for simple situations, but what if you’re dealing with a more complicated site that frequently modifies cookies without warning, or if you’d rather not even think about the cookies to begin with? The Requests Session object works perfectly in this case:
import requests

session = requests.Session()

params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
In this case, the session object (retrieved by calling requests.Session()) keeps track of session information such as cookies and headers, and it holds the transport adapters (such as HTTPAdapter) that manage the underlying connections.
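To see that persistence without touching the network, you can seed a session by hand and ask it to prepare a request. This is only a sketch: it assumes Requests is installed, and the cookie name, cookie value, and User-Agent string are invented.

```python
import requests

session = requests.Session()
# State stored on the session is applied to every request made through it.
session.headers.update({"User-Agent": "example-scraper/0.1"})  # hypothetical value
session.cookies.set("loggedin", "1")  # hypothetical cookie

# prepare_request merges the session's headers and cookies into the request
# without sending it.
request = requests.Request("GET", "http://pythonscraping.com/pages/cookies/profile.php")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```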
Requests is a fantastic library, second perhaps only to Selenium (which we’ll cover in Chapter 10) in the completeness of what it handles without programmers having to think about it or write the code themselves. Although it might be tempting to sit back and let the library do all the work, it’s extremely important to always be aware of what the cookies look like and what they are controlling when writing web scrapers. It could save many hours of painful debugging or figuring out why a website is behaving strangely!



HTTP Basic Access Authentication



Before the advent of cookies, one popular way to handle logins was with HTTP basic access authentication. You still see it from time to time, especially on high-security or corporate sites, and with some APIs. I’ve created a page at http://pythonscraping.com/pages/auth/login.php that has this type of authentication (Figure 9-2).




Figure 9-2. The user must provide a username and password to get to the page protected by basic access authentication
As usual with these examples, you can log in with any username, but the password must be “password.”
The Requests package contains an auth module specifically designed to handle HTTP authentication:
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php",
                  auth=auth)
print(r.text)
Although this appears to be a normal POST request, an HTTPBasicAuth object is passed as the auth argument in the request. The resulting text will be the page protected by the username and password (or an Access Denied page, if the request failed).
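Under the hood, basic access authentication is just one extra request header: the username and password joined by a colon and Base64-encoded. Here is a sketch of how that header is built, using the same example credentials; the helper name basic_auth_header is hypothetical.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header value for HTTP basic access authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header_value = basic_auth_header("ryan", "password")
print(header_value)
```

Base64 is an encoding, not encryption, so credentials sent this way are protected only if the connection itself uses HTTPS.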
