acloudtree


(Nerd) Python 2.6, Screen Scraping, and JavaScript Cookies

Recently I tried *scraping some data from a website and ran into problems. I don’t have a complete fix at the moment, but I have made the first big breakthrough.

My first attempt at scraping the data with Python was met with immediate denial. I was able to get similar (though not identical) results by disabling cookies in my browser (Firefox 3.5) and accessing the site. The fact that the results were not identical confused me somewhat, but I figured it came down to a subtle difference between how Firefox handled the request and how I was handling it programmatically with Python, mechanize, urllib2, and cookielib.

After several hours I was still unable to make the desired request to the server, so I started digging. It turns out that these libraries cannot automatically handle cookies set by JavaScript, because they never execute the page’s scripts. To test this, I disabled JavaScript in my browser, made the request, and got exactly the same results as my Python program. YES!!!
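Here is a minimal sketch of that limitation, not the actual site I was scraping. It is written for Python 3, where urllib2 and cookielib live on as urllib.request and http.cookiejar. The local test server, the cookie names `header_token` and `js_token`, and the function name `cookies_seen_by_jar` are all made up for illustration. The server sets one cookie through a Set-Cookie header and another through a script in the page body.

```python
import threading
from http.cookiejar import CookieJar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import HTTPCookieProcessor, ProxyHandler, build_opener

# Hypothetical page: the script would set a cookie in a real browser.
PAGE = b"<html><script>document.cookie = 'js_token=abc';</script></html>"


class DemoHandler(BaseHTTPRequestHandler):
    """Stand-in for the remote site (an assumption for this demo)."""

    def do_GET(self):
        self.send_response(200)
        # This cookie travels in an HTTP header, so the cookie jar sees it.
        self.send_header("Set-Cookie", "header_token=xyz")
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()
        self.wfile.write(PAGE)

    def log_message(self, *args):
        # Keep the demo quiet.
        pass


def cookies_seen_by_jar():
    """Fetch the demo page and return the names of the cookies stored."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever)
    thread.daemon = True
    thread.start()
    try:
        jar = CookieJar()
        # ProxyHandler({}) ignores any proxy settings from the environment.
        opener = build_opener(ProxyHandler({}), HTTPCookieProcessor(jar))
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        opener.open(url).read()
        return sorted(cookie.name for cookie in jar)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(cookies_seen_by_jar())
```

Only the header cookie should end up in the jar. The script in the page body is just text to the library, so `js_token` never gets set.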

As a quick test, I extracted the cookie’s value using the Live HTTP Headers extension in Firefox, then manually assigned that value to the Cookie header of my Python request. With that in place, my Python program got the desired results. I’ll post an example of my solution when I get it up and running.
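Until then, here is a rough sketch of what the manual workaround looks like. It is not my final solution. It uses Python 3’s urllib.request (the successor to urllib2), and the URL, the cookie string, and the function names `build_request` and `fetch` are placeholders I made up for illustration.

```python
from urllib.request import Request, urlopen


def build_request(url, cookie_header):
    """Return a request that carries a hand-copied Cookie header.

    cookie_header is the raw "name=value" string as shown by the
    browser's header inspector (assumed to be copied by hand).
    """
    request = Request(url)
    request.add_header("Cookie", cookie_header)
    return request


def fetch(url, cookie_header):
    """Send the request and return the response body as bytes."""
    response = urlopen(build_request(url, cookie_header))
    try:
        return response.read()
    finally:
        response.close()


if __name__ == "__main__":
    # Placeholder values: substitute the real URL and the cookie copied
    # from the browser.
    request = build_request("http://example.com/data", "session=abc123")
    print(request.get_header("Cookie"))
```

The obvious weakness is that the cookie value is pasted in by hand, so this only works until the site issues a new one.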

Troubleshooting:

To recreate in your browser what is happening in your program, I would try the following:

  1. Disable JavaScript
  2. Disable cookies
  3. Inspect the headers with a Firefox plugin

*Programmatically extracting data from a website

Copyright © Jared Folkins