For each of the services we gather data for, here's a guide to getting that data:
Contents |
Mailman mailing lists can be downloaded, parsed and stored in a MySQL database using MLStats.
The general idea is to point mlstats at the list archive page, and let it do the work of figuring out what to download.
We are carrying a small local patch to mlstats to ensure that it re-downloads the current month's archives and reparses them. The patch has been submitted upstream.
Once you have downloaded and unpacked MLStats 0.4, patched it with the patch above, and installed it withsetup.py install --prefix=/install/pathyou will need to "prime the pump", and download and import the archives to all of the Maemo mailing lists.
The format of the mlstats command line is:
/path/to/mlstats --db-user=<username> --db-password=<password> http://lists.meego.com/pipermail/meego-announce/The command line option
--no-reportsuppresses the creation of a report after the import, useful for a cron job.
You can list the archives for a number of mailing lists together on the command line. The list of MeeGo mailing lists is here.
for list in meego-adaptation-intel-automotive meego-announce meego-architecture \
meego-commits meego-community meego-dev meego-distribution-tools \
meego-events meego-handset meego-il10n meego-inputmethods meego-it \
meego-ivi meego-kernel meego-packaging meego-pm meego-porting \
meego-python meego-qa meego-releases meego-sdk meego-security \
meego-security-discussion meego-touch-dev meego-tv;
do
/path/to/mlstats --no-report --db-user=<username> --db-password=<password> http://lists.meego.com/pipermail/${list} >> /tmp/output.txt
done;
This should be run every night through cron.
The SQL query (yes, a Big Hairy Beast) which can be used to create a report graphing the messages posted to each list, month by month, is this:
SELECT
`messages`.`mailing_list_url` AS list,
year(first_date) AS y,
monthname(first_date) AS mon,
month(first_date) AS m,
date_format(first_date, '%M %Y') as monthstr,
date_format(first_date,'%Y%m') as monthnum,
count(*) AS c
FROM
`messages`
WHERE
year(first_date) > 1979 and
mailing_list_url not like '%meego-commits%' and first_date<'2011-05-01'
GROUP BY
`messages`.`mailing_list_url`,
y,m
ORDER BY
monthnum ASC,
list ASC,
c ASC
We use $monthnum to order the results over time, and $monthstr as a more meaningful X axis label. We eliminate all emails with bad Date headers and filter meego-commits out of our analysis, and group the messages by mailing list, to give the following graph as the final result (how to create a report will follow later):
SuperSeriousStats is a super serious IRC log analyser, developed by Jos de Ruiter. It is written in PHP.
To use SuperSeriousStats, you need to install the php-cli command line interface. Once you have this, and the other dependencies for sss listed here, you will be able to run sss.
Before running it for the first time, initialise the database. You will need to download all of the IRC logs you wish to import first. Logs are expected in the Supybot format (example).
For the initial download, the following script should help get all the logs:
for y in 2010 2011; do for m in `seq -f '%02g' 1 12`; do for d in `seq -f '%02g' 1 31`; do wget http://mg.pov.lt/meego-irclog/%23meego.$y-$m-$d.log; done; done; done;
Thereafter, you can add the following line in your cron job to get the previous day's logs:
logdate= `date +%Y-%m-%d -d yesterday`
wget -P $logdir http://mg.pov.lt/meego-irclog/%23meego.${logdate}.log;
To run sss, you will need to modify a number of fields in sss.conf. Notably, I had to change the filename format to "*.Y-m-d.\lo\g" to match the log file names, and set log format to supybot. I recommend creating a separate config file for each channel you want to parse. Here is the appropriate sss.conf for #meego logs:
# # This file contains all of sss' settings along with their defaults. # All values must be placed between double quotes, even empty ones. # ################################# Required ################################### channel = "#meego" # Name of the IRC channel. timezone = "UTC" # Timezone the logs are in. Used for time offset # calculations and conversions. # See http://php.net/manual/en/timezones.php db_host = "host" # IP address or FQDN of the MySQL server. db_port = "3306" # Port the MySQL server is listening on. db_user = "db_user" # MySQL user. db_pass = "db_password" # MySQL password. db_name = "db_name" # Name of the MySQL database used for sss. parser = "parser_supybot" # The parser to use depending on logfile format. # e.g. "parser_irssi" or "parser_eggdrop" # This string contains the format of the date within a logfile filename. # Examples: # filename: #chatroom.20030131 dateformat: *.Ymd # filename: #chatroom.20030131 dateformat: \#c\h\atroo\m.Ymd # filename: chatroom.log-31012003 dateformat: *.*-dmY # filename: chatroom.log-31012003.gz dateformat: *.*-dmY.\g\z # filename: chatroom.log-31012003.gz dateformat: *.*-dmY.* # See http://php.net/date_create_from_format for more specific syntax options. logfile_dateformat = "*.Y-m-d.\lo\g"
sss has a mechanism to prevent it from re-importing log files for the same channel, by storing dates for which it has already parsed files in the table "parse_history", but you can store summary data from several channels in one database.
To do the initial import, just run
php sss.php -c meego.conf -i ${logdir} -o report.html
And to do the nightly import, put the following in a shell script, and call it from cron:
logdate=`date +%Y-%m-%d -d yesterday`
logdir=/path/to/logfiles
sssdir=/path/to/sss-4.0
/usr/bin/wget -P $logdir http://mg.pov.lt/meego-irclog/%23meego.${logdate}.log;
/usr/bin/php ${sssdir}/sss.php -c meego.conf -i ${logdir}/\#meego.${logdate}.log
The sss report is built from a number of queries. A number of other useful sss queries are already listed in the Dashboard page
Activity by month
select date_format(`date`, '%Y-%m') as `date`, sum(`l_total`) as `l_total`, sum(`l_night`) as `l_night`, sum(`l_morning`) as `l_morning`, sum(`l_afternoon`) as `l_afternoon`, sum(`l_evening`) as `l_evening` from `channel` group by year(`date`), month(`date`);
Activity by day over last 30 days
select `date`, `l_total`, `l_night`, `l_morning`, `l_afternoon`, `l_evening` from `channel` where `date` > DATE_SUB(CURDATE(),INTERVAL 30 DAY);