Friday, September 14, 2007

More Fun with Lynx

I grew up using Gopher servers before there was a WWW or HTTP, so when the real "web" came along it was, needless to say, awesome. One of the first web browsers I used was Lynx.

Lynx is a very simple browser, useful in scripts and for checking how a search engine views a web page. If Lynx cannot see your content, it is doubtful that a search bot will see it either.

The last post showed how to use Lynx to look up Google's caching times. This one shows how you can automate Lynx to retrieve web information for you.

Here is a simple script that reads a file line by line and passes each line to Lynx as a Google search.

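#!/bin/bash
# Usage: pass the file of search terms (one per line) as the first argument.
while IFS= read -r mySearchTerm; do
    # -source prints the raw HTML of the results page to standard output
    lynx -source -accept_all_cookies "http://www.google.com/search?q=$mySearchTerm"
done < "$1"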

This script writes everything to standard output. What I do is redirect that output to a text file, or pipe it to grep for counting purposes.

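#!/bin/bash
# Same loop, but append one count per search term instead of printing the page.
while IFS= read -r mySearchTerm; do
    # grep -c counts matching lines, not individual matches
    lynx -source -accept_all_cookies "http://www.google.com/search?q=$mySearchTerm" \
        | grep -c 'pattern.to.count' >> /path/to/text/file.txt
done < "$1"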

And now we have automatic document retrieval from Google. A word of warning: because the script passes along whatever is on each line, you must be careful with non-alphanumeric characters like !@#$%^&*-\/, as these are sent to Google too and can alter the search results. You can also use the 'date' command or other small *nix programs to alter the URL fed to Lynx. If you want to run this sort of script on a schedule, you can use the crontab functionality found in Unix, Linux, and OS X. Be sure to read the man page for lynx.
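
One way to deal with the special-character problem is to percent-encode each search term before it goes into the URL. The following is only a minimal sketch and not part of the original scripts: the function names urlencode and search_url are my own inventions, and it assumes plain ASCII search terms. It leaves letters, digits, and the characters . ~ _ - alone and turns everything else into %XX.

```shell
#!/bin/bash
# urlencode (hypothetical helper): percent-encode one string for a URL query.
# The body runs in a subshell, so the variables and LC_ALL do not leak out.
urlencode() (
    LC_ALL=C                    # make the ranges below match ASCII only
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}             # everything after the first character
        c=${s%"$rest"}          # the first character itself
        case $c in
            [a-zA-Z0-9.~_-]) out=$out$c ;;                    # safe, keep as-is
            *) out=$out$(printf '%%%02X' "'$c") ;;            # encode as %XX
        esac
        s=$rest
    done
    printf '%s\n' "$out"
)

# search_url (hypothetical helper): build the Google URL used in this post.
search_url() {
    printf 'http://www.google.com/search?q=%s\n' "$(urlencode "$1")"
}
```

With these two functions pasted at the top of either script, the lynx line inside the loop could become: lynx -source -accept_all_cookies "$(search_url "$mySearchTerm")". A term such as c++ & lynx would then be sent as c%2B%2B%20%26%20lynx rather than being cut off at the & sign.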

Enjoy.
