One of the first questions that comes up when you start to
move beyond the basics of web scraping is: “How do I access information
behind a login screen?” The Web is increasingly moving toward
interaction, social media, and user-generated content. Forms and logins
are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.
Up until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. In this chapter, we'll focus on the POST method, which pushes information to a web server for storage and analysis.

Forms basically give users a way to submit a POST request that the web server can understand and use. Just as link tags on a website help users format GET requests, HTML forms help them format POST requests. Of course, with a little bit of coding, it is possible to simply create these requests ourselves and submit them with a scraper.

Python Requests Library
Although it's possible to navigate web forms using only the Python core libraries, sometimes a little syntactic sugar makes life a lot sweeter. When you start to do more than a basic GET request with urllib, it can help to look outside the Python core libraries.
The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more.
Here’s what Requests creator Kenneth Reitz has to say about Python’s core tools:
Python's standard urllib2
module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time—and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
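To give a rough sense of the contrast, here is a minimal sketch of submitting a simple form using only the standard library (Python 3's urllib modules rather than the older urllib2); the URL and field names are borrowed from the processing.php example later in this chapter.

from urllib.parse import urlencode
from urllib.request import urlopen

# Encode the form fields by hand and POST them; Requests handles this for you.
data = urlencode({'firstname': 'Ryan', 'lastname': 'Mitchell'}).encode('utf-8')
with urlopen('http://pythonscraping.com/files/processing.php', data=data) as response:
    print(response.read().decode('utf-8'))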
As with any Python library, the Requests library can be installed with any third-party Python library manager, such as pip, or by downloading and installing the source file.
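For example, assuming pip is available on your system, installation is typically a single command:

$ pip install requests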
Submitting a Basic Form
Most web forms
consist of a few HTML fields, a submit button, and an “action” page,
where the actual form processing is done. The HTML fields usually
consist of text but might also contain a file upload or some other
non-text content.
Most popular websites block access to their login forms in their robots.txt file, so to play it safe I’ve
constructed a series of different types of forms and logins at pythonscraping.com that you can run your web scrapers against. The most basic of these forms is located at http://bit.ly/1AGKPRU.
The entirety of the form is:

<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>

A couple of things to notice here: first, the names of the two input fields are firstname and lastname. This is important. The names of these fields determine the names of the variable parameters that will be POSTed to the server when the form is submitted. If you want to mimic the action that the form will take when POSTing your own data, you need to make sure that your variable names match up.

The second thing to note is that the action of the form is actually at processing.php (the absolute path is http://bit.ly/1d7TPVk). Any POST requests to the form should be made on this page, not on the page where the form itself resides. Remember: the purpose of HTML forms is only to help website visitors format proper requests to send to the page that does the real action. Unless you are doing research to format the request itself, you don't need to bother much with the page that the form can be found on.

Submitting a form with the Requests library can be done in four lines, including the import and the instruction to print the content (yes, it's that easy):
import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)

After the form is submitted, the script should return with the page's content:

Hello there, Ryan Mitchell!

This script can be applied to many simple forms encountered on the Internet. The form to sign up for the O'Reilly Media newsletter, for example, looks like this:
<form action="http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi" id="example_form2" method="POST">
    <input name="client_token" type="hidden" value="oreilly" />
    <input name="subscribe" type="hidden" value="optin" />
    <input name="success_url" type="hidden" value="http://oreilly.com/store/newsletter-thankyou.html" />
    <input name="error_url" type="hidden" value="http://oreilly.com/store/newsletter-signup-error.html" />
    <input name="topic_or_dod" type="hidden" value="1" />
    <input name="source" type="hidden" value="orm-home-t1-dotd" />
    <fieldset>
        <input class="email_address long" maxlength="200" name="email_addr" size="25" type="text" value="Enter your email here" />
        <button alt="Join" class="skinny" name="submit" onclick="return addClickTracking('orm','ebook','rightrail','dod');" value="submit">Join</button>
    </fieldset>
</form>

Although it can look daunting at first, remember that in most cases (we'll cover the exceptions later), you're only looking for two things:

- The name of the field (or fields) you want to submit with data (in this case, the name is email_addr)
- The action attribute of the form itself; that is, the page that the form actually posts to (in this case, http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi)

Just add the required information and run it:

import requests

params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi", data=params)
print(r.text)

In this case, the website returned is simply another form to fill out, before you can actually make it onto O'Reilly's mailing list, but the same concept could be applied to that form as well. However, I would request that you use your powers for good, and not spam the publisher with invalid signups, if you want to try this at home.
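One small addition worth knowing about: the Response object returned by Requests carries more than the page text, so it can be useful to check the HTTP status code before trusting the content. The following sketch simply reuses the parameters from the previous example and adds that check.

import requests

# Same signup parameters as above; the status check is the only addition.
params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post('http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi', data=params)
if r.status_code == 200:
    # 200 means the server returned a page normally; it doesn't guarantee the signup was accepted
    print(r.text)
else:
    print('Form submission failed with status', r.status_code)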
Radio Buttons, Checkboxes, and Other Inputs
Obviously, not all web forms are a collection of text fields followed by a submit button. Standard HTML contains a wide variety of possible form input fields: radio buttons, checkboxes, and select boxes, to name a few. HTML5 adds sliders (range input fields), email, dates, and more. With custom JavaScript fields the possibilities are endless, with colorpickers, calendars, and whatever else the developers come up with next.

Regardless of the seeming complexity of any sort of form field, there are only two things you need to worry about: the name of the element and its value. The element's name can be easily determined by looking at the source code and finding the name attribute. The value can sometimes be trickier, as it might be populated by JavaScript immediately before form submission. Colorpickers, as an example of a fairly exotic form field, will likely have a value of something like #F03030.

If you're unsure of the format of an input field's value, there are a number of tools you can use to track the GET and POST requests your browser is sending to and from sites. The best and perhaps most obvious way to track GET requests, as mentioned before, is to simply look at the URL of a site. If the URL is something like:
http://domainname.com?thing1=foo&thing2=bar
You know that this corresponds to a form of this type:

<form method="GET" action="someProcessor.php">
    <input type="someCrazyInputType" name="thing1" value="foo" />
    <input type="anotherCrazyInputType" name="thing2" value="bar" />
    <input type="submit" value="Submit" />
</form>

Which corresponds to the Python parameter object:

{'thing1': 'foo', 'thing2': 'bar'}

If you're stuck with a complicated-looking POST form, and you want to see exactly which parameters your browser is sending to the server, the easiest way is to use your browser's inspector or developer tool to view them. You can see this in Figure 9-1.

Figure 9-1. The Form Data section, highlighted in a box, shows the POST parameters "thing1" and "thing2" with their values "foo" and "bar"
The Chrome developer tool can be accessed via the menu by going to View → Developer → Developer Tools. It provides a list of all queries that your browser produces while interacting with the current website and can be a good way to view the composition of these queries in detail.
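To make the "name and value are all that matter" point concrete, here is a sketch of submitting a form that mixes a radio button group, checkboxes, and a colorpicker. The URL and field names are made up for illustration, but the dictionary is all Requests needs, regardless of how exotic the widgets look in the browser.

import requests

# Hypothetical field names and values: a radio group, a set of checkboxes,
# and a colorpicker all reduce to name/value pairs in the POST data.
params = {
    'size': 'medium',              # the value of the selected radio button
    'toppings': ['ham', 'corn'],   # repeated checkbox values can be passed as a list
    'favoriteColor': '#F03030'     # a colorpicker value, as mentioned earlier
}
r = requests.post('http://example.com/someProcessor.php', data=params)
print(r.status_code)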
Submitting Files and Images
Although file uploads are common on the Internet, they are not something often used in web scraping. It is possible, however, that you might want to write a test for your own site that involves a file upload. At any rate, it's a useful thing to know how to do.

There is a practice file upload form at http://pythonscraping.com/files/form2.html. The form on the page has the following markup:
<form action="processing2.php" method="post" enctype="multipart/form-data">
Submit a jpg, png, or gif: <input type="file" name="image"><br>
<input type="submit" value="Upload File">
</form>

Except for the <input> tag having the type attribute file, it looks essentially the same as the text-based forms used in the previous examples. Fortunately, the way the forms are used by the Python Requests library is also very similar:

import requests

files = {'uploadFile': open('../files/Python-logo.png', 'rb')}
r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files)
print(r.text)

Note that in lieu of a simple string, the value submitted to the form field (with the name uploadFile) is now a Python File object, as returned by the open function. In this example, I am submitting an image file, stored on my local machine, at the path ../files/Python-logo.png, relative to where the Python script is being run from.

Yes, it's really that easy!
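If you need more control over how the upload is presented, Requests also accepts a tuple in place of the bare file object, letting you set the filename and content type explicitly. The following is a sketch using the same form as above; the filename and MIME type shown are just illustrative choices.

import requests

# The tuple form is (filename, file object, content type).
files = {'uploadFile': ('Python-logo.png',
                        open('../files/Python-logo.png', 'rb'),
                        'image/png')}
r = requests.post('http://pythonscraping.com/pages/processing2.php', files=files)
print(r.text)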
Handling Logins and Cookies
So far, we've mostly discussed forms that allow you to submit information to a site or let you view needed information on the page immediately after the form. How is this different from a login form, which lets you exist in a permanent "logged in" state throughout your visit to the site?
Most modern websites use cookies to keep track of who is logged in and who is not. Once a site authenticates your login credentials, it stores a cookie in your browser, which usually contains a server-generated token, a timeout, and tracking information. The site then uses this cookie as a sort of proof of authentication, which your browser presents on each page you visit during your time on the site. Before the widespread use of cookies in the mid-90s, keeping users securely authenticated and tracking them was a huge problem for websites.
Although cookies are a great solution for web developers, they can be problematic for web scrapers. You can submit a login form all day long, but if you don’t keep track of the cookie the form sends back to you afterward, the next page you visit will act as though you’ve never logged in at all.
I’ve created a simple login form at http://bit.ly/1KwvSSG (the username can be anything, but the password must be “password”).
This form is processed at http://bit.ly/1d7U2I1, and contains a link to the “main site” page, http://bit.ly/1JcansT.
If you attempt to access the welcome page or the profile page without logging in first, you'll get an error message and instructions to log in before continuing. On the profile page, a check is done on your browser's cookies to see whether the site's cookie was set on the login page.
Keeping track of cookies is easy with the Requests library:
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", cookies=r.cookies)
print(r.text)

Here I am sending the login parameters to the welcome page, which acts as the processor for the login form. I retrieve the cookies from the results of the last request, print the result for verification, and then send them to the profile page by setting the cookies argument.

This works well for simple situations, but what if you're dealing with a more complicated site that frequently modifies cookies without warning, or if you'd rather not even think about the cookies to begin with? The Requests session function works perfectly in this case:
import requests

session = requests.Session()

params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)

In this case, the session object (retrieved by calling requests.Session()) keeps track of session information, such as cookies, headers, and even information about protocols you might be running on top of HTTP, such as HTTPAdapters.

Requests is a fantastic library, second perhaps only to Selenium (which we'll cover in Chapter 10) in the completeness of what it handles without programmers having to think about it or write the code themselves. Although it might be tempting to sit back and let the library do all the work, it's extremely important to always be aware of what the cookies look like and what they are controlling when writing web scrapers. It could save many hours of painful debugging or figuring out why a website is behaving strangely!
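One follow-up sketch on the session object: because it persists between requests, it is also a convenient place to set things you want sent every time, such as headers or extra cookies. The header and cookie values below are purely illustrative, not anything the example site requires.

import requests

session = requests.Session()

# Anything set on the session is sent with every subsequent request it makes.
session.headers.update({'User-Agent': 'my-scraper/0.1'})   # illustrative header value
session.cookies.set('tracking_opt_out', '1')               # hypothetical cookie

r = session.get('http://pythonscraping.com/pages/cookies/profile.php')
print(r.request.headers)   # the headers (including the Cookie header) actually sent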
HTTP Basic Access Authentication
Before the advent of cookies, one popular way to handle logins was with HTTP basic access authentication.
You still see it from time to time, especially on high-security or
corporate sites, and with some APIs. I’ve created a page at http://pythonscraping.com/pages/auth/login.php that has this type of authentication (Figure 9-2).
As usual with these examples, you can log in with any username, but the password must be “password.”
The Requests package contains an
auth module specifically designed to handle HTTP authentication:

import requests
from requests.auth import AuthBase
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)

Although this appears to be a normal POST request, an HTTPBasicAuth object is passed as the auth argument in the request. The resulting text will be the page protected by the username and password (or an Access Denied page, if the request failed).
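As a shorthand, Requests also accepts a plain (username, password) tuple for the auth argument and treats it as HTTP basic access authentication, equivalent to the HTTPBasicAuth object above:

import requests

# A (username, password) tuple implies HTTP basic access authentication.
r = requests.post('http://pythonscraping.com/pages/auth/login.php',
                  auth=('ryan', 'password'))
print(r.text)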