Monday, October 22, 2007

How To: The Urchin Data Extractor part 2

Well, I said I would publish my script on Friday. Sorry about that; my wife and I moved into a new house over the weekend and I guess I was just a little overzealous.

Anyway. The script. In the previous post I showed you how to (basically) set up the Urchin script and make the command-line calls to it. In this part, I will show you what I did to automate the process for YTD 2007, including all the months up to the current month. I leaned on bash quite a bit here. I'm sure there is a faster way to automate this without using as many temp files, but I like to keep my data in stages.

Here is the script:
#!/bin/bash
cd /Users/ChrisCopeland/Apps/scripts/urchin
currentMonth=`date '+%m'`
currentdate=`date '+%Y%m%d'`
currentYear=`date '+%Y'`
i=1
#force base 10 so months 08 and 09 aren't treated as octal
let loopVar=10#$currentMonth+1
echo "Monthly Reports Available :"

while [ $i -lt $loopVar ]; do
echo "$i"
let i=i+1
done
read -p "please select the month or enter ytd for year to date: " -e input
case "$input" in
'ytd')
perl u5data_extractor.pl --begin 20070101 --end $currentdate --max 21000 --report 1201 >>fileTMP;
;;
'1')
perl u5data_extractor.pl --begin 20070101 --end 20070131 --max 21000 --report 1201 >>fileTMP;
;;
'2')
perl u5data_extractor.pl --begin 20070201 --end 20070228 --max 21000 --report 1201 >>fileTMP;
;;
'3')
perl u5data_extractor.pl --begin 20070301 --end 20070331 --max 21000 --report 1201 >>fileTMP;
;;
'4')
perl u5data_extractor.pl --begin 20070401 --end 20070430 --max 21000 --report 1201 >>fileTMP;
;;
'5')
perl u5data_extractor.pl --begin 20070501 --end 20070531 --max 21000 --report 1201 >>fileTMP;
;;
'6')
perl u5data_extractor.pl --begin 20070601 --end 20070630 --max 21000 --report 1201 >>fileTMP;
;;
'7')
perl u5data_extractor.pl --begin 20070701 --end 20070731 --max 21000 --report 1201 >>fileTMP;
;;
'8')
perl u5data_extractor.pl --begin 20070801 --end 20070831 --max 21000 --report 1201 >>fileTMP;
;;
'9')
perl u5data_extractor.pl --begin 20070901 --end 20070930 --max 21000 --report 1201 >>fileTMP;
;;
'10')
perl u5data_extractor.pl --begin 20071001 --end 20071031 --max 21000 --report 1201 >>fileTMP;
;;
'11')
perl u5data_extractor.pl --begin 20071101 --end 20071130 --max 21000 --report 1201 >>fileTMP;
;;
'12')
perl u5data_extractor.pl --begin 20071201 --end 20071231 --max 21000 --report 1201 >>fileTMP;
;;
esac

#cleans the slash for easier editing
tr "/" "_" <>fileTMP2
#start of line removal
sed '
s/_index.cfm//
s/_&safe=vss//
s/_&adlt=strict//
s/.cfm//
s/,//
s/\ $//' < fileTMP2 > fileTMP3
cat cleanThese | while read line; do
sed -ie "/$line/d" fileTMP3
done
less fileTMP3 | cut -c 2-500 | grep -v "^ " | grep -v "^_" | grep -v "^#" >>fileTMPws
tr -s " " fileTMPcond
sed 's/ /,/' < fileTMPcond > /Users/ChrisCopeland/Sites/urchinreports/file$currentYear-Month$input.csv
#cleans temp directory
rm fileTMP*
echo "your report is complete"

Now we can go through the script. The first part should be self-explanatory: I set the values for the current date, month, and year (the year is not yet used in my script, but will be soon). Then I ask the user which monthly report they would like to generate, and show them a numerical list of the reports available (from 1 to the current month). It is in this case statement that I will implement the current year, so that a user can get the month and year data they need. At the moment we are only interested in the year 2007.
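Since the case statement hard-codes every month's begin and end dates, here is a rough sketch of how the year could be folded in and the dates computed instead (the cal trick for finding the last day of a month is just one possible approach - none of this is in the script yet):

#sketch only - assumes $input is a month number from 1 to 12
month=`printf '%02d' $input`
#the last non-empty line of cal ends with the last day of that month
lastDay=`cal $input $currentYear | awk 'NF {d=$NF} END {print d}'`
perl u5data_extractor.pl --begin ${currentYear}${month}01 --end ${currentYear}${month}${lastDay} --max 21000 --report 1201 >>fileTMP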

Next comes the hairy part: the cleaning of useless data. I have a file for the most common items I want filtered, called cleanThese (simple name). Before I open that file, though, I want to clean out certain characters and items that would otherwise get skipped, because the generated log file contains a list of URLs and paths. Paths and URLs sometimes have weird characters in them, like ";", ":", "/", etc. Try passing these on a command line sometime and you will see how troublesome they can be. So let's get them out of there.

tr "/" "_" - replaced "/" with an underscore, which will make it easier to clean the rest of the log.

Now we throw this whole thing to sed - a great program.
sed '
s/_index.cfm//
s/_&safe=vss//
s/_&adlt=strict//
s/.cfm//
s/,//
s/\ $//' < fileTMP2 > fileTMP3

We have a lot of older ColdFusion files, and some items that cause problems in the clean file. This sed command, which chains several substitutions, one per line, cleans these things out and leaves nothing in the pattern's place. You can see now why I cleaned out the extra "/" first: otherwise I would have been passing a /// to sed, which it doesn't understand.
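As a side note, sed will also take a different delimiter for the s command, so another way around the slash problem (not what this script does) would be something like:

#using | as the delimiter, so the literal "/" in the pattern is harmless
echo '/products/index.cfm' | sed 's|/index.cfm||'     #prints /products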

Now the cleanThese file:
cat cleanThese | while read line; do
sed -ie "/$line/d" fileTMP3
done
This reads each line of the cleanThese file (which I can modify whenever I like) and deletes every line of the log that matches the pattern, effectively removing it.
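For instance, if cleanThese held patterns like these (made-up examples, not my real list):

#hypothetical cleanThese entries, one pattern per line:
#  robots.txt
#  favicon.ico
#  test_page
#for each entry the loop effectively runs:
sed -ie "/robots.txt/d" fileTMP3    #drop every line that mentions robots.txt

so any log line containing one of those patterns disappears from fileTMP3.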

Then I want to clean up the formatting from the output of the cleaning:
less fileTMP3 | cut -c 2-500 | grep -v "^ " | grep -v "^_" | grep -v "^#" >>fileTMPws

This line keeps characters 2 through 500 of each line (dropping the leading character from the original output), passes the result through three inverted grep searches that drop lines starting with a space, an underscore, or a #, and then writes what survives to yet another temp file.
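Broken down on a throwaway example line (not real Urchin output):

#cut -c 2-500   keeps characters 2 through 500, dropping the leading character
#grep -v "^ "   drops lines that still start with a space
#grep -v "^_"   drops lines that start with an underscore
#grep -v "^#"   drops comment-style lines that start with #
echo "Xsome_page 42" | cut -c 2-500 | grep -v "^ " | grep -v "^_" | grep -v "^#"
#prints: some_page 42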

The next line:
tr -s " " fileTMPcond
compresses all the space characters to one, outputs that to another file, which sed will take in, and replace the now single space, with a "," - effectively making this a csv file, which is names with the current month and set to a directory.
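On a single made-up line, those last two stages look like this:

#hypothetical cleaned line: a page name, a run of spaces, then the hit count
echo "some_page      340" | tr -s " " | sed 's/ /,/'
#prints: some_page,340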

The next couple of lines clean the directory of tmp files and report to the user in the shell that the report is ready.

This will leave you with a clean, importable, ready-to-query CSV file just aching to be pulled into a SQL engine of some sort.
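For what it's worth, here is one way that import could look with sqlite3, using October's output as an example (the database and table names are just placeholders, not part of my setup):

sqlite3 urchin.db <<'SQL'
CREATE TABLE IF NOT EXISTS pageviews (path TEXT, hits INTEGER);
.mode csv
.import file2007-Month10.csv pageviews
SELECT path, hits FROM pageviews ORDER BY hits DESC LIMIT 10;
SQL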

Again, if you want to run a different type of report, or use the case statement to generate a set of reports, you can visit my website for a list of the Urchin reports available.

I want to find a good way to make this script available in a web interface at some point. I would also like to give the user a list of reports and years at the front of the application, just to help automate the process further.

Please Enjoy!
