For each of the services we gather data for, here's a guide to getting that data:
Contents |
Mailman mailing lists can be downloaded, parsed and stored in a MySQL database using MLStats.
The general idea is to point mlstats at the list archive page, and let it do the work of figuring out what to download.
We are carrying a small local patch to mlstats to ensure that it re-downloads the current month's archives and reparses them. The patch has been submitted upstream.
Once you have downloaded and unpacked MLStats 0.4, patched it with the patch above, and installed it withsetup.py install --prefix=/install/pathyou will need to "prime the pump", and download and import the archives to all of the Maemo mailing lists.
The format of the mlstats command line is:
/path/to/mlstats --db-user=<username> --db-password=<password> http://lists.meego.com/pipermail/meego-announce/The command line option
--no-reportsuppresses the creation of a report after the import, useful for a cron job.
You can list the archives for a number of mailing lists together on the command line. The list of MeeGo mailing lists is here.
for list in meego-adaptation-intel-automotive meego-announce meego-architecture \
meego-commits meego-community meego-dev meego-distribution-tools \
meego-events meego-handset meego-il10n meego-inputmethods meego-it \
meego-ivi meego-kernel meego-packaging meego-pm meego-porting \
meego-python meego-qa meego-releases meego-sdk meego-security \
meego-security-discussion meego-touch-dev meego-tv;
do
/path/to/mlstats --no-report --db-user=<username> --db-password=<password> http://lists.meego.com/pipermail/${list} >> /tmp/output.txt
done;
This should be run every night through cron.
The SQL query (yes, a Big Hairy Beast) which can be used to create a report graphing the messages posted to each list, month by month, is this:
SELECT
`messages`.`mailing_list_url` AS list,
year(first_date) AS y,
monthname(first_date) AS mon,
month(first_date) AS m,
date_format(first_date, '%M %Y') as monthstr,
date_format(first_date,'%Y%m') as monthnum,
count(*) AS c
FROM
`messages`
WHERE
year(first_date) > 1979 and
mailing_list_url not like '%meego-commits%' and first_date<'2011-05-01'
GROUP BY
`messages`.`mailing_list_url`,
y,m
ORDER BY
monthnum ASC,
list ASC,
c ASC
We use $monthnum to order the results over time, and $monthstr as a more meaningful X axis label. We eliminate all emails with bad Date headers and filter meego-commits out of our analysis, and group the messages by mailing list, to give the following graph as the final result (how to create a report will follow later):
SuperSeriousStats is a super serious IRC log analyser, developed by Jos de Ruiter. It is written in PHP.
To use SuperSeriousStats, you need to install the php-cli command line interface. Once you have this, and the other dependencies for sss listed here, you will be able to run sss.
Before running it for the first time, initialise the database. You will need to download all of the IRC logs you wish to import first. Logs are expected in the Supybot format (example).
For the initial download, the following script should help get all the logs:
for y in 2010 2011; do for m in `seq -f '%02g' 1 12`; do for d in `seq -f '%02g' 1 31`; do wget http://mg.pov.lt/meego-irclog/%23meego.$y-$m-$d.log; done; done; done;
Thereafter, you can add the following line in your cron job to get the previous day's logs:
logdate= `date +%Y-%m-%d -d yesterday`
wget -P $logdir http://mg.pov.lt/meego-irclog/%23meego.${logdate}.log;
To run sss, you will need to modify a number of fields in sss.conf. Notably, I had to change the filename format to "*.Y-m-d.\lo\g" to match the log file names, and set log format to supybot. I recommend creating a separate config file for each channel you want to parse. Here is the appropriate sss.conf for #meego logs:
# # This file contains all of sss' settings along with their defaults. # All values must be placed between double quotes, even empty ones. # ################################# Required ################################### channel = "#meego" # Name of the IRC channel. timezone = "UTC" # Timezone the logs are in. Used for time offset # calculations and conversions. # See http://php.net/manual/en/timezones.php db_host = "host" # IP address or FQDN of the MySQL server. db_port = "3306" # Port the MySQL server is listening on. db_user = "db_user" # MySQL user. db_pass = "db_password" # MySQL password. db_name = "db_name" # Name of the MySQL database used for sss. parser = "parser_supybot" # The parser to use depending on logfile format. # e.g. "parser_irssi" or "parser_eggdrop" # This string contains the format of the date within a logfile filename. # Examples: # filename: #chatroom.20030131 dateformat: *.Ymd # filename: #chatroom.20030131 dateformat: \#c\h\atroo\m.Ymd # filename: chatroom.log-31012003 dateformat: *.*-dmY # filename: chatroom.log-31012003.gz dateformat: *.*-dmY.\g\z # filename: chatroom.log-31012003.gz dateformat: *.*-dmY.* # See http://php.net/date_create_from_format for more specific syntax options. logfile_dateformat = "*.Y-m-d.\lo\g"
sss has a mechanism to prevent it from re-importing log files for the same channel, by storing dates for which it has already parsed files in the table "parse_history", but you can store summary data from several channels in one database.
To do the initial import, just run
php sss.php -c meego.conf -i ${logdir} -o report.html
And to do the nightly import, put the following in a shell script, and call it from cron:
logdate=`date +%Y-%m-%d -d yesterday`
logdir=/path/to/logfiles
sssdir=/path/to/sss-4.0
/usr/bin/wget -P $logdir http://mg.pov.lt/meego-irclog/%23meego.${logdate}.log;
/usr/bin/php ${sssdir}/sss.php -c meego.conf -i ${logdir}/\#meego.${logdate}.log
The sss report is built from a number of queries. A number of other useful sss queries are already listed in the Dashboard page
Activity by month
select date_format(`date`, '%Y-%m') as `date`,
sum(`l_total`) as `l_total`,
sum(`l_night`) as `l_night`,
sum(`l_morning`) as `l_morning`,
sum(`l_afternoon`) as `l_afternoon`,
sum(`l_evening`) as `l_evening`
from `channel`
group by year(`date`), month(`date`);
Activity by day over last 30 days
select `date`, `l_total`, `l_night`, `l_morning`, `l_afternoon`, `l_evening` from `channel` where `date` > DATE_SUB(CURDATE(),INTERVAL 30 DAY);
Forum stats are available as a series of CSV files on forums.meego.com, supplied monthly. We need to download the .csv files every month (just the latest ones), parse the CSV files into a database and generate a report from that.
To get started and load up all of the old stats, run the following:
wget -nd -P <local dir for data> -r -l1 --no-parent -A.csv http://forum.meego.com/stats
This will download all CSV files to the local directory specified.
For the monthly refresh, we use wget in a cron script as follows:
y=`/bin/date -d "1 month ago" +%Y`
m=`/bin/date -d "1 month ago" +%m`
wget -nd -P <local dir for data> -r -l1 --no-parent -A "${y}${m}*.csv" http://forum.meego.com/stats
That wget command line is worth explaining:
We created 7 tables, one for each of the statistics provided by the forum.
Here is the database schema:
-- -- Table structure for table `forum_cumulative_posts` -- DROP TABLE IF EXISTS `forum_cumulative_posts`; CREATE TABLE `forum_cumulative_posts` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `forum` varchar(50) NOT NULL DEFAULT '', `posts` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`forum`) ); -- -- Table structure for table `forum_cumulative_threads` -- DROP TABLE IF EXISTS `forum_cumulative_threads`; CREATE TABLE `forum_cumulative_threads` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `forum` varchar(50) NOT NULL DEFAULT '', `threads` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`forum`) ); -- -- Table structure for table `forum_hottest_threads` -- DROP TABLE IF EXISTS `forum_hottest_threads`; CREATE TABLE `forum_hottest_threads` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `rank` int(11) NOT NULL DEFAULT '0', `title` varchar(255) DEFAULT NULL, `posts` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`rank`) ); -- -- Table structure for table `forum_most_viewed_threads` -- DROP TABLE IF EXISTS `forum_most_viewed_threads`; CREATE TABLE `forum_most_viewed_threads` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `rank` int(11) NOT NULL DEFAULT '0', `title` varchar(255) DEFAULT NULL, `views` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`rank`) ); -- -- Table structure for table `forum_posts` -- DROP TABLE IF EXISTS `forum_posts`; CREATE TABLE `forum_posts` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `forum` varchar(50) NOT NULL DEFAULT '', `posts` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`forum`) ); -- -- Table structure for table `forum_top_posters` -- DROP TABLE IF EXISTS `forum_top_posters`; CREATE TABLE `forum_top_posters` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `rank` int(11) NOT NULL DEFAULT '0', `member` varchar(50) DEFAULT NULL, `posts` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`rank`) ); -- -- Table structure for table `forum_top_thanked` -- DROP TABLE IF EXISTS `forum_top_thanked`; CREATE TABLE `forum_top_thanked` ( `month` int(11) NOT NULL, `year` int(11) NOT NULL, `rank` int(11) NOT NULL DEFAULT '0', `member` varchar(50) DEFAULT NULL, `thanks` int(11) DEFAULT NULL, PRIMARY KEY (`month`,`year`,`rank`) );We import the files into the databases via
LOAD DATA LOCAL INFILE
cd <local data directory>
if [ -f ${y}${m}_forum_cumulative_posts.csv ]; then
echo "Importing ${y}${m}_forum_cumulative_posts.csv";
# forum,posts
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_forum_cumulative_posts.csv'
INTO TABLE forum_cumulative_posts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(forum, posts)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_forum_cumulative_threads.csv ]; then
echo "Importing ${y}${m}_forum_cumulative_threads.csv";
# forum,threads
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_forum_cumulative_threads.csv'
INTO TABLE forum_cumulative_threads
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(forum, threads)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_forum_posts.csv ]; then
echo "Importing ${y}${m}_forum_posts.csv";
# forum,posts
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_forum_posts.csv'
INTO TABLE forum_posts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(forum, posts)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_hottest_threads.csv ]; then
echo "Importing ${y}${m}_hottest_threads.csv";
# rank,title,posts
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_hottest_threads.csv'
INTO TABLE forum_hottest_threads
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(rank, title, posts)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_most_viewed_threads.csv ]; then
echo "Importing ${y}${m}_most_viewed_threads.csv";
# rank,title,views
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_most_viewed_threads.csv'
INTO TABLE forum_most_viewed_threads
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(rank, title, views)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_top_posters.csv ]; then
echo "Importing ${y}${m}_top_posters.csv";
# rank,member,posts
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_top_posters.csv'
INTO TABLE forum_top_posters
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(rank, member, posts)
set year=$y, month=$m"
fi;
if [ -f ${y}${m}_top_thanked.csv ]; then
echo "Importing ${y}${m}_top_thanked.csv";
# rank,member,thanks
mysql -u<db_user> -p<db_password> -h <host> <database> -e \
"LOAD DATA LOCAL INFILE '${y}${m}_top_thanked.csv'
INTO TABLE forum_top_thanked
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
IGNORE 1 LINES
(rank, member, thanks)
set year=$y, month=$m"
fi;
We run this script (which downloads and imports the .csv files froim the server) monthly through cron. Not sure when the files are put up on the 1st, so I get them on the 2nd:
# Retrieve forum stats monthly on 2nd of month 15 2 2 * * /home/dneary/bin/forum_stats.sh