acloudtree


(Nerd) Mechanize & Javascript

This is from the Mechanize site; I wish I had read it before I started.

Since Javascript is completely visible to the client, it cannot be used to prevent a scraper from following links. But it can make life difficult, and until someone writes a Javascript interpreter for Perl or a Mechanize clone to control Firefox, there will be no general solution. But if you want to scrape specific pages, then a solution is always possible.

One typical use of Javascript is to perform argument checking before posting to the server. The URL you want is probably just buried in the Javascript function. Do a regular expression match on $mech->content() to find the link that you want and $mech->get it directly (this assumes that you know what you are looking for in advance).
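The regular-expression approach can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl; the page content, the `find_js_url` helper, and the `window.location` pattern are all assumptions made up for the example, so the pattern would need adjusting to whatever the real page does.

```python
import re
from urllib.parse import urljoin

# Hypothetical page content: the handler validates the form, then
# navigates to a URL that never appears as a plain <a href> link.
PAGE = """
<script>
function checkAndGo(form) {
    if (form.q.value == "") { alert("Enter a query"); return false; }
    window.location = "/search/results?q=" + form.q.value;
}
</script>
"""

def find_js_url(html, base_url):
    """Return the first URL assigned to window.location, made absolute.

    Returns None when the page contains no such assignment.
    """
    match = re.search(r'window\.location\s*=\s*"([^"]+)"', html)
    if match is None:
        return None
    return urljoin(base_url, match.group(1))

# The base URL is also an assumption for the sake of the example.
print(find_js_url(PAGE, "http://example.com/form.html"))
```

With Mechanize itself, the same match would run against $mech->content() and the result would be passed to $mech->get.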

In more difficult cases, the Javascript is used for URL mangling to satisfy the needs of some middleware. In this case you need to figure out what the Javascript is doing (why are these URLs always really long?). There is probably some function with one or more arguments which calculates the new URL. Step one: using your favorite browser, get the before and after URLs and save them to files. Edit each file, converting the argument separators (‘?’, ‘&’ or ‘;’) into newlines. Now it is easy to use diff or comm to find out what Javascript did to the URL. Step two: find the function call which created the URL; you will need to parse and interpret its argument list. Using the Javascript Debugger Extension for Firefox may help with the analysis. At this point, it is fairly trivial to write your own function which emulates the Javascript for the pages you want to process.
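Step one above can also be done without temporary files. The sketch below, in Python, splits each URL on the same separators and reports which pieces were added or removed. The two URLs and the helper names `split_args` and `diff_urls` are hypothetical, chosen only to illustrate the comparison.

```python
import re

def split_args(url):
    """Split a URL on the argument separators '?', '&' and ';'."""
    return [part for part in re.split(r"[?&;]", url) if part]

def diff_urls(before, after):
    """Return (removed, added) pieces, keeping their original order."""
    before_parts = split_args(before)
    after_parts = split_args(after)
    removed = [p for p in before_parts if p not in after_parts]
    added = [p for p in after_parts if p not in before_parts]
    return removed, added

# Hypothetical before/after URLs as captured from the browser.
before = "http://example.com/app?page=3&sid=abc123"
after = "http://example.com/app?page=3&sid=abc123&token=9f2e;view=full"

removed, added = diff_urls(before, after)
print("removed:", removed)
print("added:", added)
```

Whatever shows up under "added" is what the Javascript function contributed, and that is what your own emulating function has to reproduce.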

Here’s another approach that answers the question, “It works in Firefox, but why not Mech?” Everything the web server knows about the client is present in the HTTP request. If two requests are identical, the results should be identical. So the real question is “What is different between the mech request and the Firefox request?”

The Firefox extension “Tamper Data” is an effective tool for examining the headers of the requests to the server. Compare that with what LWP is sending. Once the two are identical, the action of the server should be the same as well.

I say “should”, because this is an oversimplification – some values are naturally unique, e.g. a SessionID, but if a SessionID is present, that is probably sufficient, even though the value will be different between the LWP request and the Firefox request. The server could use the session to store information which is troublesome, but that’s not the first place to look (and highly unlikely to be relevant when you are requesting the login page of your site).

Generally the problem is to be found in missing or incorrect POSTDATA arguments, Cookies, User-Agents, Accepts, etc. If you are using mech, then redirects and cookies should not be a problem, but are listed here for completeness. If you are missing headers, $mech->add_header can be used to add the headers that you need.
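Once both sets of headers have been written down, the comparison itself is mechanical. Here is a minimal sketch in Python; the header values and the `compare_headers` helper are invented for illustration and are not real captures from Tamper Data or LWP.

```python
def normalize(headers):
    """Lower-case header names so the comparison ignores case."""
    return {name.lower(): value for name, value in headers.items()}

def compare_headers(browser, script):
    """Return (missing, different) header names, each sorted.

    missing:   sent by the browser but not by the script
    different: sent by both, but with different values
    """
    browser, script = normalize(browser), normalize(script)
    missing = sorted(set(browser) - set(script))
    different = sorted(
        name for name in set(browser) & set(script)
        if browser[name] != script[name]
    )
    return missing, different

# Hypothetical headers, as if copied from the browser and from the script.
firefox = {
    "User-Agent": "Mozilla/5.0 (X11; Linux) Firefox/3.5",
    "Accept": "text/html",
    "Referer": "http://example.com/login",
    "Cookie": "SessionID=aaa111",
}
mech = {
    "user-agent": "WWW-Mechanize/1.0",
    "cookie": "SessionID=bbb222",
}

missing, different = compare_headers(firefox, mech)
print("missing:", missing)
print("different:", different)
```

A differing Cookie value is expected when it only carries a session ID; the missing headers and the User-Agent are the more likely culprits.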


(Nerd) Ubuntu + Vim + ZendFramework-1.9.2 + .phtml syntax highlighting

1) From the command line ‘cd’ to your ‘home’ directory


test-box@jbuntu:~$ cd

2a) Check to see if the .vimrc file exists


test-box@jbuntu:~$ ls .vimrc

If the terminal prints an error such as ‘No such file or directory’, then the file does not exist.

2b) If you get the following


test-box@jbuntu:~$ ls .vimrc
.vimrc

It means that the file does exist and we just need to edit it.

3) If the file does not exist just ‘touch’ the file. If it DOES exist, just skip this step.


test-box@jbuntu:~$ touch .vimrc

4) From this point, open the ‘.vimrc’ file in ‘vi’. You primarily need the following lines, and you are more than welcome to copy/paste. Write/quit when finished.


if has("autocmd")
  " Highlight Zend Framework .phtml templates as PHP
  autocmd BufEnter *.phtml set syn=php
endif
" Turn syntax highlighting on
syn on

Now the next time you open Vim, it should have the desired highlighting for the .phtml files found in the Zend Framework.

Below is my ‘.vimrc’ file in its entirety, just for the record. It also enables syntax highlighting in CakePHP .ctp files, along with some other settings that I prefer.


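" Indent with two spaces instead of tabs
set tabstop=2
set shiftwidth=2
set expandtab
if has("autocmd")
  " Highlight CakePHP (.ctp) and Zend Framework (.phtml) templates as PHP
  autocmd BufEnter *.ctp set syn=php
  autocmd BufEnter *.phtml set syn=php
endif
syn on
" Keep the indentation of the previous line (autoindent)
set ai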

Let my dataset change your mindset

If you follow the link provided, it will take you to a brilliant video by Hans Rosling. My hope is that anyone watching at least begins to understand the power of data: maybe not the exact methods for manipulating it, but at least the desire to know more about it.

This video is quite timely for me, for I have thought long and hard about putting together a series of posts on data analysis, particularly on the work I have done looking at the Deschutes County Clerk’s data.

For those of you who don’t know, in the fall of 2007 I created a program that went to the county website and pulled public records and sales data concerning housing. It loaded this data into a usable database, and I used it to educate family and friends on when our housing bubble started and how bad it really was.

I built a small website with this data, but stopped work on it for several complex reasons.

Anyway, if people are interested in how to analyze data, or would appreciate an ongoing conversation about it, let me know.

iPhone Methods : Shifty & Pulse

ShiftyAndPulseMethods

Over on www.iphonedevsdk.com someone was asking how to make their screen pulse. These methods are from a little experiment I did. They fire and visually show the user when their login attempt to the server fails. Pretty basic, and not perfect, but hopefully they get people started.

PS: sorry for the watermark; I do not record my screen often enough to justify the expense.

Copyright © Jared Folkins
Programming, Computers, Writing, Economics, and Life
