Thursday, October 18, 2007

How To: The Urchin Data Extractor (u5data_extractor)

You can get the perl scripts for customizing Urchin data at the Google Urchin Support Page. I read the little documentation on this subject, which is a basic how to, without much resource. Urchin support firms charge something serious to get this kind of thing done, and here I am being a nice guy, giving away what I learned FOR FREE.

So let's begin with the lessons I learned.

1. Use some form of linux/unix. I could not, for the life of me, get any of these scripts to work with Windows and I think this is because of the path. The perl script is looking for a unix like path. I'm sure there are those people out there, smarter than I, who can get this to work on a windows server, but I am not one of them. The examples I give will be run from a Macintosh running OS X 10.4.10, ActiveState Perl, and the bash. In addition I would like to thank the wonderful folks (yet again) over at macosxhints forums as well as unix.com forums for helping me get my syntax correct in my scripts.

2. Use a step by step process.

3. Verify your data, and backup! The last thing you want to do is run an untested and "use at your own risk" script on your Urchin reports.

4. Do not always believe the available documentation.

5. When report testing, use small segments of data for your report. It saves time and you get to test your text scrubber faster.

Ok - now let's get to the logical process. What I wanted to do was to pull certain reports from Urchin and post them to a database, preferably some flavor of SQL.

The process will look something like this.
1. run perl script with start date, end date, report type, and number of items returned.
2. save report as a text file
3. scrub text file for bad characters, bad lines, and data which is not applicable.
4. comma delimit the file
5. hand csv file to sql import engine.

sounds easy right? It is for the most part.

The u5data_extractor script will do a lot of this work for you. This is the usage section of the script, which will also show up in the command line if you call the script with ~$ perl u5data_extractor. I removed the copyright and some other text for the purpose of posting to the blog.
###########################################################
# Usage: u5data_extractor.pl [--begin YYYYMMDD] [--end YYYYMMDD] [--help]
# [--language LA] [--max N] [--profile PROFILE]
# [--report RRRR] [--urchinpath PATH]
#
# Where:
# '--begin YYYYMMDD' specifies the starting date (default: one week ago)
# '--end YYYYMMDD' specifies the ending date (default: yesterday)
# '--help' displays this message
# '--language LA' specifies the language for the report. Available
# languages are: ch, en, fr, ge, it, ja, ko, po, sp, and sw
# '--max N' is the maximum number of entries printed in the top 10 report
# types (default is 10).
# '--profile PROFILE' specifies the profile to retrieve data from. The
# default is specified at the beginning of this script
# '--report RRRR is the 4-digit number for the report (default is 1102)
# Run this script with --help to see a list of available reports
# '--urchinpath PATH' specifies the path to the Urchin distribution.
# Note that you can edit the script and set your path as a default
###################################################

Giving the script your default path:
You will need to give the script the path to the Urchin Directory.
this is the line for my machine (following a unix path):
my $urchinpath = "/usr/local/urchin"; # Path to the Urchin distribution

Give the script your default profile:
You will need to give the script the default profile.
This is the line for a made up profile in the script.
my $profile = "My Default Profile"; # Name of the default profile
This is important - you do not have to use %20 to represent spaces if you are using the quotes. Urchin, by default, stores the profile directories with %20 for whitespace characters.

The report number is a difficult thing. Where do you find those reports? I found an article, somewhere, which shows the report numbers. Have no fear, I made a list for you of the urchin report numbers.

I will give an example, since none was really given for me. Let's say I want to run a report from Jan 01, 2007 to Jan 27, 2007 for the report "Visitors & Sessions"
so when you call the script, you will be using the following syntax:
perl u5data_extractor --begin 20070101 --end 20071027 --report 1903 --max 10

this will generate the output to the standard out (screen), which I will not post due to privacy reasons.

If you want to redirect the output feel free to do so
perl u5data_extractor --begin 20070101 --end 20071027 --report 1903 --max 10>>output.file

Tomorrow I will post my scrubbing process as well as the script I used to call backup the data and generate the reports.

Enjoy!

No comments: