Monday, 18 January 2016

Crawling Through Forms and Logins Using Python

One of the first questions that comes up when you start to move beyond the basics of web scraping is: “How do I access information behind a login screen?” The Web is increasingly moving toward interaction, social media, and user-generated content. Forms and logins are an integral part of these types of sites and almost impossible to avoid. Fortunately, they are also relatively easy to deal with.
Up until this point, most of our interactions with web servers in our example scrapers have consisted of using HTTP GET to request information. In this chapter, we’ll focus on the POST method, which pushes information to a web server for storage and analysis.
Forms basically give users a way to submit a POST request that the web server can understand and use. Just like link tags on a website help users format GET requests, HTML forms help them format POST requests. Of course, with a little bit of coding, it is possible to simply create these requests ourselves and submit them with a scraper.


Python Requests Library

Although it’s possible to navigate web forms using only the Python core libraries, sometimes a little syntactic sugar makes life a lot sweeter. When you start to do more than a basic GET request with urllib, it can help to look outside the Python core libraries.
The Requests library is excellent at handling complicated HTTP requests, cookies, headers, and much more. 
Here’s what Requests creator Kenneth Reitz has to say about Python’s core tools:
Python’s standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time—and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.

Things shouldn’t be this way. Not in Python.
As with any Python library, the Requests library can be installed with any third-party Python library manager, such as pip, or by downloading and installing the source file.
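If you want to confirm the installation worked, a minimal GET request is enough. This is just a sketch that assumes Requests was installed with pip and fetches the example site used throughout this chapter:
import requests

# A quick sanity check: fetch a page and look at the response metadata
r = requests.get("http://pythonscraping.com")
print(r.status_code)                   # 200 if the request succeeded
print(r.headers.get("Content-Type"))   # e.g., "text/html; charset=UTF-8"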




Submitting a Basic Form



Most web forms consist of a few HTML fields, a submit button, and an “action” page, where the actual form processing is done. The HTML fields usually consist of text but might also contain a file upload or some other non-text content. 
Most popular websites block access to their login forms in their robots.txt file, so to play it safe I’ve constructed a series of different types of forms and logins at pythonscraping.com that you can run your web scrapers against. The most basic of these forms is located at http://bit.ly/1AGKPRU.
The entirety of the form is:
<form method="post" action="processing.php">
First name: <input type="text" name="firstname"><br>
Last name: <input type="text" name="lastname"><br>
<input type="submit" value="Submit">
</form>
A couple of things to notice here: first, the names of the two input fields are firstname and lastname. This is important. The names of these fields determine the names of the variable parameters that will be POSTed to the server when the form is submitted. If you want to mimic the action that the form will take when POSTing your own data, you need to make sure that your variable names match up.
The second thing to note is that the action of the form is actually processing.php (the absolute path is http://bit.ly/1d7TPVk). Any POST requests made through the form should be sent to this page, not to the page where the form itself resides. Remember: the purpose of HTML forms is only to help website visitors format proper requests to send to the page that does the real work. Unless you are doing research into how to format the request itself, you don’t need to bother much with the page the form is found on.
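If you’d rather discover the action URL and field names programmatically than read them out of the source, a short sketch like the following works. It assumes BeautifulSoup (bs4) is installed, and the form-page URL shown here is an assumption (it’s the page the bit.ly link above points to on pythonscraping.com):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed URL for the basic form page described above
page_url = "http://pythonscraping.com/files/form.html"

html = requests.get(page_url).text
form = BeautifulSoup(html, "html.parser").find("form")

# Resolve the relative action ("processing.php") against the page it was found on
action_url = urljoin(page_url, form.get("action"))
field_names = [field.get("name") for field in form.find_all("input")
               if field.get("name")]

print(action_url)    # where the POST request should actually be sent
print(field_names)   # ['firstname', 'lastname']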
Submitting a form with the Requests library can be done in four lines, including the import and the instruction to print the content (yes, it’s that easy):
import requests

params = {'firstname': 'Ryan', 'lastname': 'Mitchell'}
r = requests.post("http://pythonscraping.com/files/processing.php", data=params)
print(r.text)
After the form is submitted, the script should return the contents of the processing page:
Hello there, Ryan Mitchell!
This script can be applied to many simple forms encountered on the Internet. The form to sign up for the O’Reilly Media newsletter, for example, looks like this:
<form action="http://post.oreilly.com/client/o/oreilly/forms/
              quicksignup.cgi" id="example_form2" method="POST">
    <input name="client_token" type="hidden" value="oreilly" /> 
    <input name="subscribe" type="hidden" value="optin" /> 
    <input name="success_url" type="hidden" value="http://oreilly.com/store/
                 newsletter-thankyou.html" /> 
    <input name="error_url" type="hidden" value="http://oreilly.com/store/
                 newsletter-signup-error.html" /> 
    <input name="topic_or_dod" type="hidden" value="1" /> 
    <input name="source" type="hidden" value="orm-home-t1-dotd" />
    <fieldset>
        <input class="email_address long" maxlength="200" name=
                     "email_addr" size="25" type="text" value=
                     "Enter your email here" />
        <button alt="Join" class="skinny" name="submit" onclick=
                       "return addClickTracking('orm','ebook','rightrail','dod'
                                                );" value="submit">Join</button>
    </fieldset>
</form>
Although it can look daunting at first, remember that in most cases (we’ll cover the exceptions later), you’re only looking for two things: the name of the field (or fields) you want to submit data to (in this case, email_addr), and the action attribute of the form itself, which tells you where the request should be sent (in this case, http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi).
Just add in the required information and run it:
import requests
params = {'email_addr': 'ryan.e.mitchell@gmail.com'}
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi",
                  data=params)
print(r.text)
In this case, the page returned is simply another form to fill out before you can actually make it onto O’Reilly’s mailing list, but the same concept could be applied to that form as well. However, I would ask that you use your powers for good and not spam the publisher with invalid signups if you want to try this at home.
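The HTML alone doesn’t tell you whether the server also requires the form’s hidden fields, so if a bare email_addr submission is rejected, a reasonable next step is to send them too, copying their names and values straight from the markup. This is only a sketch; whether the server actually checks these fields is an assumption:
import requests

# Hidden-field values copied verbatim from the form markup above;
# whether the server requires them is an assumption
params = {
    'email_addr': 'ryan.e.mitchell@gmail.com',
    'client_token': 'oreilly',
    'subscribe': 'optin',
    'success_url': 'http://oreilly.com/store/newsletter-thankyou.html',
    'error_url': 'http://oreilly.com/store/newsletter-signup-error.html',
    'topic_or_dod': '1',
    'source': 'orm-home-t1-dotd',
}
r = requests.post("http://post.oreilly.com/client/o/oreilly/forms/quicksignup.cgi",
                  data=params)
print(r.status_code)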

Radio Buttons, Checkboxes, and Other Inputs

Obviously, not all web forms are a collection of text fields followed by a submit button. Standard HTML contains a wide variety of possible form input fields: radio buttons, checkboxes, and select boxes, to name a few. HTML5 adds sliders (range input fields), email fields, date pickers, and more. With custom JavaScript fields, the possibilities are endless: colorpickers, calendars, and whatever else developers come up with next.
Regardless of the seeming complexity of any sort of form field, there are only two things you need to worry about: the name of the element and its value. The element’s name can be easily determined by looking at the source code and finding the name attribute. The value can sometimes be trickier, as it might be populated by JavaScript immediately before form submission. Colorpickers, as an example of a fairly exotic form field, will likely have a value of something like #F03030.
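In practice, that means an exotic-looking input is still just another key in the parameter dictionary. Here’s a sketch against a purely hypothetical form; the field names, values, and endpoint are all made up for illustration:
import requests

# Hypothetical field names, values, and endpoint, purely for illustration
params = {
    'size': 'large',              # radio button: the value of the selected option
    'newsletter': 'on',           # checkbox: browsers usually send "on" when checked
    'favorite_color': '#F03030',  # colorpicker: a hex string, as described above
}
r = requests.post("http://example.com/someProcessor.php", data=params)
print(r.status_code)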

If you’re unsure of the format of an input field’s value, there are a number of tools you can use to track the GET and POST requests your browser exchanges with sites. The best and perhaps most obvious way to track GET requests, as mentioned before, is to simply look at the URL of a site. If the URL is something like:
http://domainname.com?thing1=foo&thing2=bar
you know that it corresponds to a form of this type:
<form method="GET" action="someProcessor.php">
<input type="someCrazyInputType" name="thing1" value="foo" />
<input type="anotherCrazyInputType" name="thing2" value="bar" />
<input type="submit" value="Submit" />
</form>
which corresponds to the Python parameter object:
{'thing1':'foo', 'thing2':'bar'}
You can see this in Figure 9-1.
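To build that same query string from Python, you can hand the dictionary to Requests as the params argument; the sketch below uses a prepared request so the URL is constructed without actually being sent (the domain and processor names are the placeholders from the example above):
import requests

# Placeholder URL from the example above; Requests URL-encodes the parameters
params = {'thing1': 'foo', 'thing2': 'bar'}
prepared = requests.Request('GET', 'http://domainname.com/someProcessor.php',
                            params=params).prepare()
print(prepared.url)   # http://domainname.com/someProcessor.php?thing1=foo&thing2=bar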
If you’re stuck with a complicated-looking POST form, and you want to see exactly which parameters your browser is sending to the server, the easiest way is to use your browser’s inspector or developer tool to view them.



Figure 9-1. The Form Data section, highlighted in a box, shows the POST parameters “thing1” and “thing2” with their values “foo” and “bar”


The Chrome developer tool can be accessed via the menu by going to View → Developer → Developer Tools. It provides a list of all queries that your browser produces while interacting with the current website and can be a good way to view the composition of these queries in detail.



Submitting Files and Images

Although file uploads are common on the Internet, they are not often used in web scraping. It is possible, however, that you might want to write a test for your own site that involves a file upload. At any rate, it’s a useful thing to know how to do.
There is a practice file upload form at http://pythonscraping.com/files/form2.html. The form on the page has the following markup:
<form action="processing2.php" method="post" enctype="multipart/form-data">
  Submit a jpg, png, or gif: <input type="file" name="image"><br>
  <input type="submit" value="Upload File">
</form>
Except for the <input> tag having a type attribute of file, it looks essentially the same as the text-based forms used in the previous examples. Fortunately, the way the form is used by the Python Requests library is also very similar:
import requests

files = {'image': open('../files/Python-logo.png', 'rb')}
r = requests.post("http://pythonscraping.com/pages/processing2.php", 
                  files=files)
print(r.text)
Note that in lieu of a simple string, the value submitted to the form field (with the name image, matching the form markup above) is now a Python file object, as returned by the open function. In this example, I am submitting an image file stored on my local machine at the path ../files/Python-logo.png, relative to where the Python script is being run.
Yes, it’s really that easy!
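If the server cares about the filename or MIME type it receives, Requests also accepts a (filename, file object, content type) tuple in place of a bare file object. This is a sketch; whether the practice form checks either of those is an assumption:
import requests

# (filename, file object, content type) instead of a bare file object
files = {'image': ('Python-logo.png',
                   open('../files/Python-logo.png', 'rb'),
                   'image/png')}
r = requests.post("http://pythonscraping.com/pages/processing2.php", files=files)
print(r.text)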

Handling Logins and Cookies



So far, we’ve mostly discussed forms that allow you to submit information to a site or let you view needed information on the page immediately after the form is submitted. How is this different from a login form, which lets you exist in a permanent “logged in” state throughout your visit to the site?
Most modern websites use cookies to keep track of who is logged in and who is not. Once a site authenticates your login credentials, it stores a cookie in your browser, which usually contains a server-generated token, a timeout, and tracking information. The site then uses this cookie as a sort of proof of authentication, which is shown to each page you visit during your time on the site. Before the widespread use of cookies in the mid-90s, keeping users securely authenticated and tracking them was a huge problem for websites.
Although cookies are a great solution for web developers, they can be problematic for web scrapers. You can submit a login form all day long, but if you don’t keep track of the cookie the form sends back to you afterward, the next page you visit will act as though you’ve never logged in at all.
I’ve created a simple login form at http://bit.ly/1KwvSSG (the username can be anything, but the password must be “password”).
This form is processed at http://bit.ly/1d7U2I1, and contains a link to the “main site” page, http://bit.ly/1JcansT.
If you attempt to access the welcome page or the profile page without logging in first, you’ll get an error message and instructions to log in before continuing. On the profile page, a check is done against your browser’s cookies to see whether the login cookie was set on the login page.
Keeping track of cookies is easy with the Requests library:
import requests

params = {'username': 'Ryan', 'password': 'password'}
r = requests.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(r.cookies.get_dict())
print("-----------")
print("Going to profile page...")
r = requests.get("http://pythonscraping.com/pages/cookies/profile.php", 
                 cookies=r.cookies)
print(r.text)
Here I am sending the login parameters to the welcome page, which acts as the processor for the login form. I retrieve the cookies from the results of the last request, print the result for verification, and then send them to the profile page by setting the cookies argument.
This works well for simple situations, but what if you’re dealing with a more complicated site that frequently modifies cookies without warning, or if you’d rather not even think about the cookies to begin with? The Requests Session object works perfectly in this case:
import requests

session = requests.Session()

params = {'username': 'username', 'password': 'password'}
s = session.post("http://pythonscraping.com/pages/cookies/welcome.php", params)
print("Cookie is set to:")
print(s.cookies.get_dict())
print("-----------")
print("Going to profile page...")
s = session.get("http://pythonscraping.com/pages/cookies/profile.php")
print(s.text)
In this case, the session object (retrieved by calling requests.Session()) keeps track of session information, such as cookies, headers, and even information about protocols you might be running on top of HTTP, such as HTTPAdapters.
Requests is a fantastic library, second perhaps only to Selenium (which we’ll cover in Chapter 10) in the completeness of what it handles without programmers having to think about it or write the code themselves. Although it might be tempting to sit back and let the library do all the work, it’s extremely important to always be aware of what the cookies look like and what they are controlling when writing web scrapers. It could save many hours of painful debugging or figuring out why a website is behaving strangely!
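One way to stay aware of them is simply to print what the Session object is holding after each request. A quick sketch against the same practice login:
import requests

session = requests.Session()
session.post("http://pythonscraping.com/pages/cookies/welcome.php",
             data={'username': 'username', 'password': 'password'})

# Print every cookie the session has accumulated so far
for cookie in session.cookies:
    print(cookie.name, cookie.value, cookie.domain, cookie.path)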



HTTP Basic Access Authentication



Before the advent of cookies, one popular way to handle logins was with HTTP basic access authentication. You still see it from time to time, especially on high-security or corporate sites, and with some APIs. I’ve created a page at http://pythonscraping.com/pages/auth/login.php that has this type of authentication (Figure 9-2).




Figure 9-2. The user must provide a username and password to get to the page protected by basic access authentication
As usual with these examples, you can log in with any username, but the password must be “password.”
The Requests package contains an auth module specifically designed to handle HTTP authentication:
import requests
from requests.auth import HTTPBasicAuth

auth = HTTPBasicAuth('ryan', 'password')
r = requests.post(url="http://pythonscraping.com/pages/auth/login.php", auth=auth)
print(r.text)
Although this appears to be a normal POST request, an HTTPBasicAuth object is passed as the auth argument in the request. The resulting text will be the page protected by the username and password (or an Access Denied page, if the request failed).
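Requests also accepts a plain (username, password) tuple for the auth argument and treats it as basic access authentication; the sketch below uses a GET request, though the POST above works just as well:
import requests

# The (username, password) tuple is shorthand for HTTPBasicAuth
r = requests.get("http://pythonscraping.com/pages/auth/login.php",
                 auth=('ryan', 'password'))
print(r.status_code)   # 401 if the credentials were rejected
print(r.text)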

Sunday, 8 December 2013

Quotes That Earn Their Keep in Programming

It's been a while since I've had a chance to write a post here. Currently, I'm really busy with school and work (I just got a "job" at Initas Technologies - love it!), which takes up a lot of my time.
While searching for inspiring programming quotes, I found loads of others that are really funny (and true) that I (and probably many others) can relate to.
Inspiring Programming Quotes
Here are two of my favourite programming quotes:
“ Java is to JavaScript what Car is to Carpet. ” - Chris Heilmann
“ It's hard enough to find an error in your code when you're looking for it; it's even harder when you've assumed your code is error-free. ” - Steve McConnell
Check out these other 25 to see if you can relate!
27 inspiring top notch programming quotes
These quotations are in no particular order.
“ If debugging is the process of removing software bugs, then programming must be the process of putting them in. ” - Edsger Dijkstra
“ Rules of Optimization:
Rule 1: Don't do it.
Rule 2 (for experts only): Don't do it yet.
 ” - Michael A. Jackson
“ The best method for accelerating a computer is the one that boosts it by 9.8 m/s². ” - Anonymous
“ Walking on water and developing software from a specification are easy if both are frozen. ” - Edward V Berard
“ Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it. ” - Brian Kernighan
“ It's not at all important to get it right the first time. It's vitally important to get it right the last time. ” - Andrew Hunt and David Thomas
“ First, solve the problem. Then, write the code. ” - John Johnson
“ Should array indices start at 0 or 1? My compromise of 0.5 was rejected without, I thought, proper consideration. ” - Stan Kelly-Bootle
“ Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live. ” - Rick Osborne
“ Any fool can write code that a computer can understand. Good programmers write code that humans can understand. ” - Martin Fowler
“ Software sucks because users demand it to. ” - Nathan Myhrvold
“ Linux is only free if your time has no value. ” - Jamie Zawinski
“ Beware of bugs in the above code; I have only proved it correct, not tried it. ” - Donald Knuth
“ There is not now, nor has there ever been, nor will there ever be, any programming language in which it is the least bit difficult to write bad code. ” - Flon's Law
“ The first 90% of the code accounts for the first 90% of the development time. The remaining 10% of the code accounts for the other 90% of the development time. ” - Tom Cargill
“ Good code is its own best documentation. As you're about to add a comment, ask yourself, "How can I improve the code so that this comment isn't needed?" Improve the code and then document it to make it even clearer. ” - Steve McConnell
“ Programs must be written for people to read, and only incidentally for machines to execute. ” - Abelson / Sussman
“ Most software today is very much like an Egyptian pyramid with millions of bricks piled on top of each other, with no structural integrity, but just done by brute force and thousands of slaves. ” - Alan Kay
“ Programming can be fun, so can cryptography; however they should not be combined. ” - Kreitzberg and Shneiderman
“ Copy and paste is a design error. ” - David Parnas
“ Before software can be reusable it first has to be usable. ” - Ralph Johnson
“ Without requirements or design, programming is the art of adding bugs to an empty text file. ” - Louis Srygley
“ When someone says, "I want a programming language in which I need only say what I want done," give him a lollipop. ” - Alan Perlis
“ Computers are good at following instructions, but not at reading your mind. ” - Donald Knuth
“ Any code of your own that you haven't looked at for six or more months might as well have been written by someone else. ” - Eagleson's law