The goal is to provide a web page summarising metrics about various aspects of the MeeGo project. The data should update regularly - depending on the metric, that could be real time or updated automatically on a regular basis.
Ideally, the dashboard tracks the following community resources:
The data should also be available for custom reports for usage and analysis in the monthly MeeGo Metrics report published by User:DawnFoster.
To fulfill these goals, the dashboard will gather data from the various resources into a centralised database, using a Business Intelligence platform that includes ETL for data acquisition and storage and a reporting service for generating reports and dashboards. A web page will provide a view into this database with predefined reports.
Download the monthly summary report as a Pentaho report file here: File:Meego metrics summary.prpt
Pentaho runs as a webapp in Tomcat 6. It can use a variety of databases for its internal data structures; the default, Hypersonic, is a Java database. However, I prefer to use MySQL, both because it is standard and well understood and because it allows databases to be consolidated under one DB server. Configuring Pentaho with a MySQL database is a little tricky, but almost all of the steps are covered well in this tutorial.
The data which is useful for metrics will be copied into a local database from each of the services we query. The copying of data will be accomplished by a set of Kettle "xactions", which can be created and edited easily with the Spoon tool.
A number of reports will be generated using the Pentaho Report Designer, including a static HTML/Flash dashboard which will be published regularly. Other reports can be created for the community managers, and a more advanced dashboard, allowing detailed analysis of basic metrics, can be provided via the Community Dashboard Framework.
We will need to see how much load the dashboard will generate on the server. I suspect that it will not be practical to expose the dashboard in public.
We will document here everything you need to do to replicate the MeeGo Community Dashboard, with the exception of data which is not publicly available because it contains security-related or confidential information (mainly Bugzilla).
For SQL databases, we have access to the database server. This applies to MediaWiki, Bugzilla, and Drupal.
For the forum, we integrate the CSV files currently being exported, which provide the basic analytics we need. A cron job with mysqlimport is sufficient.
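As a rough illustration of the import step, the sketch below builds a mysqlimport command line for one exported CSV file. The database name `metrics`, the export path, and the assumption that the CSV has a header row are all hypothetical. mysqlimport takes the target table name from the file name, so the file must be named after the table it loads.

```python
import shlex
from pathlib import Path


def mysqlimport_command(csv_path, database="metrics"):
    """Build the argv for loading one CSV export with mysqlimport.

    The database name is a placeholder. mysqlimport strips the file
    extension and uses the rest as the table name, so forum_stats.csv
    is loaded into the table forum_stats.
    """
    return [
        "mysqlimport",
        "--local",                            # read the file from the client host
        "--fields-terminated-by=,",           # comma-separated values
        '--fields-optionally-enclosed-by="',  # values may be double-quoted
        "--ignore-lines=1",                   # assumes one header row in the export
        "--replace",                          # overwrite rows with duplicate keys
        database,
        str(Path(csv_path)),
    ]


if __name__ == "__main__":
    # Hypothetical export location; print the command a cron job would run.
    cmd = mysqlimport_command("/var/lib/forum-export/forum_stats.csv")
    print(shlex.join(cmd))
```

A monthly cron entry can then run the printed command directly, or call a script like this one through `subprocess.run(cmd, check=True)`.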
Individual mailing lists are parsed by MLStats. We use the resulting database directly in the dashboard.
IRC logs will be parsed with superseriousstats, a PHP command line tool that parses IRC logs and stores the results in an SQL database.
We still need to figure out how to do data interchange with Transifex and OBS, and how to get code metrics from the commit mailing list or git. Dimitris tells me that there are already some analytics available on Transifex, and that there is a RESTful API available to query this data.
For each of the resources, the following statistics (at a minimum) should be extracted:
select subject, year(first_date) as y, monthname(first_date), count(*) as c from messages group by subject, year(first_date), month(first_date), monthname(first_date) order by y, month(first_date), c;
select p.email_address, year(m.first_date) as y, monthname(m.first_date), count(*) as c from messages as m join messages_people as p on m.message_id = p.message_id group by p.email_address, year(m.first_date), month(m.first_date), monthname(m.first_date) order by y, month(m.first_date), c;
To restrict either query to a single month, add a where clause, for example where month(first_date)=3 and year(first_date)=2010 for March 2010. For the current month, where month(first_date)=month(NOW()) and year(first_date)=year(NOW()) works.
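When the month filter is applied from a script rather than typed by hand, the year and month can be passed as query parameters. The following is a minimal sketch, assuming the same messages table as the queries above and a Python DB-API driver that uses `%s` placeholders (MySQLdb and PyMySQL both do). The helper names are hypothetical. Because the monthly report usually covers the month that has just ended, a helper for the previous month is included; it handles the January to December rollover.

```python
import datetime

# Posts per subject for one month; year and month are bound as parameters.
POSTS_PER_SUBJECT = (
    "select subject, count(*) as c "
    "from messages "
    "where year(first_date) = %s and month(first_date) = %s "
    "group by subject "
    "order by c desc"
)


def month_params(day=None):
    """Return (year, month) for the given date, defaulting to today."""
    day = day or datetime.date.today()
    return (day.year, day.month)


def previous_month_params(day=None):
    """Return (year, month) for the month before the given date."""
    day = day or datetime.date.today()
    first_of_month = day.replace(day=1)
    last_of_previous = first_of_month - datetime.timedelta(days=1)
    return (last_of_previous.year, last_of_previous.month)


if __name__ == "__main__":
    print(POSTS_PER_SUBJECT)
    print(month_params(datetime.date(2010, 3, 15)))          # (2010, 3)
    print(previous_month_params(datetime.date(2011, 1, 5)))  # (2010, 12)
```

With an open cursor, the query would be run as `cursor.execute(POSTS_PER_SUBJECT, previous_month_params())`.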
Stats are exported from the Forum in CSV format monthly.
This all depends on what is available from Transifex.
We use a modified version of gitdm to dump Git logs into a MySQL database for analysis. Modifications required:
gitdm has three basic data structures: Hacker, Employer and Patch. Each changeset is a Patch object. Each Patch has an author (a Hacker) and is assigned to an Employer, based on who the Hacker was working for at the time of the Patch. Each Patch also has lists of the Hackers who reviewed, reported, signed off on and tested it. Each Hacker links to a list of the Patches they authored, a list of the email addresses they have used to commit, and separate lists of the Patches they reviewed, reported, signed off on (SOB) and tested. In addition, each Hacker has a list of the Employers they have worked for, and each Employer has a list of the Hackers who have worked for it.
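The relationships described above can be sketched with Python dataclasses. This is not gitdm's actual code: the field names and the `record_patch` helper are hypothetical, chosen only to mirror the description and to show which links would need to become tables or foreign keys in the MySQL dump.

```python
from dataclasses import dataclass, field
from typing import List


# eq=False keeps identity-based comparison, which avoids infinite
# recursion when comparing objects that reference each other.
@dataclass(eq=False)
class Employer:
    name: str
    hackers: List["Hacker"] = field(default_factory=list)


@dataclass(eq=False)
class Hacker:
    name: str
    emails: List[str] = field(default_factory=list)
    employers: List[Employer] = field(default_factory=list)
    authored: List["Patch"] = field(default_factory=list)
    reviewed: List["Patch"] = field(default_factory=list)
    reported: List["Patch"] = field(default_factory=list)
    signed_off: List["Patch"] = field(default_factory=list)
    tested: List["Patch"] = field(default_factory=list)


@dataclass(eq=False)
class Patch:
    commit_id: str
    author: Hacker
    employer: Employer  # who the author worked for at the time
    reviewers: List[Hacker] = field(default_factory=list)
    reporters: List[Hacker] = field(default_factory=list)
    signoffs: List[Hacker] = field(default_factory=list)
    testers: List[Hacker] = field(default_factory=list)


def record_patch(commit_id, author, employer,
                 reviewers=(), reporters=(), signoffs=(), testers=()):
    """Create a Patch and update the back-references on each side."""
    patch = Patch(commit_id, author, employer, list(reviewers),
                  list(reporters), list(signoffs), list(testers))
    author.authored.append(patch)
    if employer not in author.employers:
        author.employers.append(employer)
    if author not in employer.hackers:
        employer.hackers.append(author)
    for hacker in patch.reviewers:
        hacker.reviewed.append(patch)
    for hacker in patch.reporters:
        hacker.reported.append(patch)
    for hacker in patch.signoffs:
        hacker.signed_off.append(patch)
    for hacker in patch.testers:
        hacker.tested.append(patch)
    return patch
```

In a relational schema, each of the list fields would typically become a link table (for example, patch to reviewer), while `author` and `employer` on Patch would be plain foreign keys.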
The area of Business Intelligence is littered with acronyms. Here's a quick overview of the main ones, and how they all fit together.
The community dashboard project uses a business reporting engine to query the collected data and present it in reports.
Modules available:
| Software | License | ETL | OLAP database | BI server | Reporting | Dashboard module |
|---|---|---|---|---|---|---|
| Pentaho | EPL | Kettle | Mondrian | Pentaho BI Platform | Pentaho Reporting | Community Dashboard Framework |
| Jaspersoft | AGPL v3 | JasperETL (Talend Open Studio) | JasperOLAP | JasperReports Server | iReport editor | No (commercial only) |
Pentaho is used as the basis of Mozilla's metrics project, and provides a very strong community software option for both the dashboard and for managing the BI server. Since Mozilla's metrics work overlaps with what we are trying to achieve, particularly their work on SQR, the Software Quality Reports analytic module for Bugzilla and JIRA, Pentaho is my preference for the dashboard project. In general, I have observed that the Pentaho community provides very good support.