The Ramblings of a Geek with A.D.D.: Perl

Monday, October 22, 2007

How To: The Urchin Data Extractor part 2

Well I said I would publish my script on Friday. Sorry about that, my wife and I moved into a new house over the weekend and I guess I was just a little over zealous.

Anyway. The script. In the past post I showed you how to (basically) setup the urchin script and make the command line calls to it. In this part, I will show you what I did to automate the process for ytd 2007, including all the months up to the current month. I utilized the bash quite a bit here. I'm sure there is a faster way to automate this without using as many tmp files, but I like to keep my data in stages.

Here is the script:
#!/bin/bash
cd /Users/ChrisCopeland/Apps/scripts/urchin
currentMonth=`date '+%m'`
currentdate=`date '+%Y%m%d'`
currentYear=`date '+%Y'`
i=1
let loopVar=currentMonth+1
echo "Monthly Reports Available :"

while [ $i -lt $loopVar ]; do
echo "$i"
let i=i+1
done
read -p "please select the month or enter ytd for year to date: " -e input
case "$input" in
'ytd')
perl u5data_extractor.pl --begin 20070101 --end $currentdate --max 21000 --report 1201 >>fileTMP;
;;
'1')
perl u5data_extractor.pl --begin 20070101 --end 20070130 --max 21000 --report 1201 >>fileTMP;
;;
'2')
perl u5data_extractor.pl --begin 20070201 --end 20070230 --max 21000 --report 1201 >>fileTMP;
;;
'3')
perl u5data_extractor.pl --begin 20070301 --end 20070330 --max 21000 --report 1201 >>fileTMP;
;;
'4')
perl u5data_extractor.pl --begin 20070401 --end 20070430 --max 21000 --report 1201 >>fileTMP;
;;
'5')
perl u5data_extractor.pl --begin 20070501 --end 20070530 --max 21000 --report 1201 >>fileTMP;
;;
'6')
perl u5data_extractor.pl --begin 20070601 --end 20070630 --max 21000 --report 1201 >>fileTMP;
;;
'7')
perl u5data_extractor.pl --begin 20070701 --end 20070730 --max 21000 --report 1201 >>fileTMP;
;;
'8')
perl u5data_extractor.pl --begin 20070801 --end 20070830 --max 21000 --report 1201 >>fileTMP;
;;
'9')
perl u5data_extractor.pl --begin 20070901 --end 20070930 --max 21000 --report 1201 >>fileTMP;
;;
'10')
perl u5data_extractor.pl --begin 20071001 --end 20071030 --max 21000 --report 1201 >>fileTMP;
;;
'11')
perl u5data_extractor.pl --begin 20071101 --end 20071130 --max 21000 --report 1201 >>fileTMP;
;;
'12')
perl u5data_extractor.pl --begin 20071201 --end 20071230 --max 21000 --report 1201 >>fileTMP;
;;
esac

#cleans the slash for easier editing
tr "/" "_" <>fileTMP2
#start of line removal
sed '
s/_index.cfm//
s/_&safe=vss//
s/_&adlt=strict//
s/.cfm//
s/,//
s/\ $//' <>fileTMP3
cat cleanThese | while read line; do
sed -ie "/$line/d" fileTMP3
done
less supplierTMP3 | cut -c 2-500 | grep -v "^ " | grep -v "^_" | grep -v "^#" >>fileTMPws
tr -s " " fileTMPcond
sed 's/ /,/' /Users/ChrisCopeland/Sites/urchinreports/file$currentYear-Month$input.csv
#cleans temp directory
rm fileTMP*
echo "your report is complete"

Now we can go through the script. The first part should be self explanatory, I am setting the values for the current date, month, and year (year is not yet implemented in my script, but will be soon). Then I ask the user for which report (monthly) they would like to generate, and show them a numerical list of the reports available (from 1 to current month). It will be in this case function that I will implement the current year, so that a user can get the month and year data needed. At the moment we are only interested in the year 2007.

Next comes the hairy part - the cleaning of useless data. I have a file for the most common items I want filtered. The file is called cleanThese (simple name). Before I open that file though, I want to clean certain characters and items which get skipped over do to the fact that this log file which is generated, will have a list of urls and paths in it. Paths and urls have weird characters sometimes, like ";" ":" "/" etc. Try passing these into a command line sometime and you will see how troublesome they can be. So let's get them out of there.

tr "/" "_" - replaced "/" with an underscore, which will make it easier to clean the rest of the log.

Now we throw this whole thing to sed - a great program.
sed '
s/_index.cfm//
s/_&safe=vss//
s/_&adlt=strict//
s/.cfm//
s/,//
s/\ $//' <>fileTMP3

we have a lot of oler cold fusion files, and some items that have a problem in the clean file, this sed command, which is a chained command, one per line, cleans these things out and leave nothing in the pattern's place. You can see now why I cleaned out the extra "/", I would have been passing a /// to sed, which it doesn't understand.

Now the cleanThese file:
cat cleanThese | while read line; do
sed -ie "/$line/d" fileTMP3
done
This reads each line of the cleanThese file (which I can modify at my desire) and replaces the entire line of the patten with nothing, effectively removing the line.

Then I want to clean up the formatting from the output of the cleaning:
less fileTMP3 | cut -c 2-500 | grep -v "^ " | grep -v "^_" | grep -v "^#" >>fileTMPws

This line will cut certain characters out (based on the original output), pass these to grep with an inverted search 3 times looking for different patterns, then writes that out to yet another tmp file.

The next line:
tr -s " " fileTMPcond
compresses all the space characters to one, outputs that to another file, which sed will take in, and replace the now single space, with a "," - effectively making this a csv file, which is names with the current month and set to a directory.

The next couple of lines clean the directory of tmp files and report to the user in the shell that the report is ready.

This will leave you a clean, importable, ready to query csv file just aching to be imported into a SQL engine of some sort.

Again if you want to run a different type of report, or use the case statement to generate a set of reports, you can visit my website to find a list of urchin reports available.

I want to find a good way to make this script available in a web interface at some point. I would also like to give the user a list of reports and years at the front of the application, just to help automate the process further.

Please Enjoy!

Thursday, October 18, 2007

How To: The Urchin Data Extractor (u5data_extractor)

You can get the perl scripts for customizing Urchin data at the Google Urchin Support Page. I read the little documentation on this subject, which is a basic how to, without much resource. Urchin support firms charge something serious to get this kind of thing done, and here I am being a nice guy, giving away what I learned FOR FREE.

So let's begin with the lessons I learned.

1. Use some form of linux/unix. I could not, for the life of me, get any of these scripts to work with Windows and I think this is because of the path. The perl script is looking for a unix like path. I'm sure there are those people out there, smarter than I, who can get this to work on a windows server, but I am not one of them. The examples I give will be run from a Macintosh running OS X 10.4.10, ActiveState Perl, and the bash. In addition I would like to thank the wonderful folks (yet again) over at macosxhints forums as well as unix.com forums for helping me get my syntax correct in my scripts.

2. Use a step by step process.

3. Verify your data, and backup! The last thing you want to do is run an untested and "use at your own risk" script on your Urchin reports.

4. Do not always believe the available documentation.

5. When report testing, use small segments of data for your report. It saves time and you get to test your text scrubber faster.

Ok - now let's get to the logical process. What I wanted to do was to pull certain reports from Urchin and post them to a database, preferably some flavor of SQL.

The process will look something like this.
1. run perl script with start date, end date, report type, and number of items returned.
2. save report as a text file
3. scrub text file for bad characters, bad lines, and data which is not applicable.
4. comma delimit the file
5. hand csv file to sql import engine.

sounds easy right? It is for the most part.

The u5data_extractor script will do a lot of this work for you. This is the usage section of the script, which will also show up in the command line if you call the script with ~$ perl u5data_extractor. I removed the copyright and some other text for the purpose of posting to the blog.
###########################################################
# Usage: u5data_extractor.pl [--begin YYYYMMDD] [--end YYYYMMDD] [--help]
# [--language LA] [--max N] [--profile PROFILE]
# [--report RRRR] [--urchinpath PATH]
#
# Where:
# '--begin YYYYMMDD' specifies the starting date (default: one week ago)
# '--end YYYYMMDD' specifies the ending date (default: yesterday)
# '--help' displays this message
# '--language LA' specifies the language for the report. Available
# languages are: ch, en, fr, ge, it, ja, ko, po, sp, and sw
# '--max N' is the maximum number of entries printed in the top 10 report
# types (default is 10).
# '--profile PROFILE' specifies the profile to retrieve data from. The
# default is specified at the beginning of this script
# '--report RRRR is the 4-digit number for the report (default is 1102)
# Run this script with --help to see a list of available reports
# '--urchinpath PATH' specifies the path to the Urchin distribution.
# Note that you can edit the script and set your path as a default
###################################################

Giving the script your default path:
You will need to give the script the path to the Urchin Directory.
this is the line for my machine (following a unix path):
my $urchinpath = "/usr/local/urchin"; # Path to the Urchin distribution

Give the script your default profile:
You will need to give the script the default profile.
This is the line for a made up profile in the script.
my $profile = "My Default Profile"; # Name of the default profile
This is important - you do not have to use %20 to represent spaces if you are using the quotes. Urchin, by default, stores the profile directories with %20 for whitespace characters.

The report number is a difficult thing. Where do you find those reports? I found an article, somewhere, which shows the report numbers. Have no fear, I made a list for you of the urchin report numbers.

I will give an example, since none was really given for me. Let's say I want to run a report from Jan 01, 2007 to Jan 27, 2007 for the report "Visitors & Sessions"
so when you call the script, you will be using the following syntax:
perl u5data_extractor --begin 20070101 --end 20071027 --report 1903 --max 10

this will generate the output to the standard out (screen), which I will not post due to privacy reasons.

If you want to redirect the output feel free to do so
perl u5data_extractor --begin 20070101 --end 20071027 --report 1903 --max 10>>output.file

Tomorrow I will post my scrubbing process as well as the script I used to call backup the data and generate the reports.

Enjoy!

Pages

Monday, October 22, 2007

How To: The Urchin Data Extractor part 2

Thursday, October 18, 2007

How To: The Urchin Data Extractor (u5data_extractor)