Meego Wiki
Views

Metrics/Dashboard

From MeeGo wiki
< Metrics(Difference between revisions)
Jump to: navigation, search
(Community Metrics Dashboard)
 
(29 intermediate revisions not shown)
Line 3: Line 3:
The goal is to provide a web page summarising metrics about various aspects of the MeeGo project. The data should update regularly - depending on the metric, that could be real time or  updated automatically on a regular basis.
The goal is to provide a web page summarising metrics about various aspects of the MeeGo project. The data should update regularly - depending on the metric, that could be real time or  updated automatically on a regular basis.
-
The dashboard will track the following community resources, ideally:  
+
The dashboard tracks track the following community resources, ideally:  
* Drupal members
* Drupal members
 +
* Bugzilla (bugs opened, bugs closed, active users)
* Mailing lists (members, posts, threads)
* Mailing lists (members, posts, threads)
* gitorious (commits, employer details for committers) - should use Jon Corbet's scripts like are used in the LF yearly kernel data.
* gitorious (commits, employer details for committers) - should use Jon Corbet's scripts like are used in the LF yearly kernel data.
* Wiki (edits, new pages)
* Wiki (edits, new pages)
* Forums (members, posts)
* Forums (members, posts)
-
* IRC (totql comments, people on channel)
+
* IRC (total comments, people on channel)
* Transifex (Languages, translators, strings translated)
* Transifex (Languages, translators, strings translated)
* Community OBS (uploads, users)
* Community OBS (uploads, users)
Line 16: Line 17:
The data should also be available for custom reports for usage and analysis in the monthly MeeGo Metrics report published by [[User:DawnFoster]]
The data should also be available for custom reports for usage and analysis in the monthly MeeGo Metrics report published by [[User:DawnFoster]]
-
To fulfill these goals, the dashboard will gather data from the various resource into a centralised database, using an [http://en.wikipedia.org/wiki/Enterprise_Service_Bus ESB] or [http://en.wikipedia.org/wiki/Enterprise_application_integration EAI] of some sort. A web page will be a view into this database with predefined reports.
+
To fulfill these goals, the dashboard will gather data from the various resource into a centralised database, using some sort of Business Intelligence platform including ETL for data acquisition and storage, and a reporting service for generating reports and dashboards.. A web page will provide a view into this database with predefined reports.
-
Candidate reporting/ESB/EAI solutions:
+
Download the monthly summary report as a Pentaho report file here: [[File:Meego metrics summary.prpt]]
-
* [http://jasperforge.org/index.php?q=project/jasperreports JasperReports]
+
== Architecture ==
-
* [http://www.pentaho.com/ Pentaho]
+
-
* [http://www.talend.com/index.php Talend]
+
-
* [http://petals.ow2.org/ Petals]
+
-
* [http://www.mulesoft.com/ MuleSoft]
+
-
For each community resource, we need to figure out how to ETL the data into a usable form, and come up with appropriate queries for metrics reports, and finally present the results on a webpage.
+
Pentaho runs as a webapp in Tomcat6. It can use a variety of databases for its internal data structures, the default (Hypersonic) is a Java database. However, because it's both standard & well understood and to allow consolidation of databases under one DB server, I prefer to use MySQL. The configuration of Pentaho with a MySQL database is a little tricky, but almost all of the steps are covered well [https://docs.google.com/Doc?docid=0AdJmocc0fj_EZDJ3YmZiZF83M2RtaHhwcmRk&hl=en in this tutorial].
 +
 
 +
The data which is useful for metrics will be copied into a local database from each of the services we query. The copying of data will be accomplished by a set of Kettle "xactions", which can be created and edited easily with the Spoon tool.
 +
 
 +
A number of reports will be generated using the Pentaho Report Designer, including a static HTML/Flash dashboard which will be published regularly. Other reports can be created for the community managers, and a more advanced dashboard, allowing detailed analysis of basic metrics, can be provided via the Community Dashboard Framework.
 +
 
 +
We will need to see how much load the dashboard will generate on the server. I suspect that it will not be practical to expose the dashboard in public.
 +
 
 +
We will document here everything you need to do to replicate the MeeGo Community Dashboard, with the exception of data which is not publicly available because it contains security related or confidential information (mainly bugzilla).
 +
 
 +
* [[../Installing Pentaho]]: A guide to getting Pentaho up and running for MeeGo Dashboard
 +
* [[../Gathering data]]: A guide to installing and using the tools to gather data for a dashboard
 +
* [[../Creating a report]]: Given data in a database, how do we generate a report in Pentaho, and deploy it to the dashboard?
 +
 
 +
* [[../Mailing list queries]]: SQL queries against the MLStats database
 +
* [[../MediaWiki queries]]: SQL queries against a <nowiki>MediaWiki</nowiki> database
 +
* [[../IRC queries]]: SQL queries against the superseriousstats database
 +
* [[../Forum queries]]: SQL queries against the forum database
 +
* [[../Bugzilla queries]]: SQL queries against the Bugzilla database
 +
 
 +
 
 +
=== Extracting data ===
 +
 
 +
For SQL databases, we have access to the database server. This applies to MediaWiki, Bugzilla, and Drupal.
 +
 
 +
For the forum, we integrate the CSV files currently being exported, which provide the basic analytics we need. A cron job with mysqlimport is sufficient.
 +
 
 +
Individual mailing lists are parsed by [http://forge.morfeo-project.org/projects/libresoft-tools/ MLStats]. We use the resulting database directly in the dashboard.
 +
 
 +
IRC logs will be parsed with [http://code.google.com/p/superseriousstats/ superseriousstats], a PHP command line tool that parses IRC logs and stores the results in an SQL database.
 +
 
 +
We still need to figure out how to do data interchange with Transifex and OBS, and how to get code metrics from the commit mailing list or git. Dimitris tells me that there are already [http://meego.transifex.net/stats/ some analytics] available on Transifex, and that there is a RESTful API available to query this data.
 +
 
 +
== Data in report ==
 +
 
 +
For each of the resources, the following statistics (at a minimum) should be extracted:
 +
 
 +
=== Mailing lists ===
 +
* Subscriber numbers (+ evolution) - from Mailman directly, not available in mlstats
 +
* Emails sent (+ evolution) - from mlstats
 +
* Active participants (individuals with >=2 emails during month) - from mlstats
 +
* Hot threads - from mlstats
 +
* Top posters - from mlstats
 +
 
 +
==== Useful queries ====
 +
 
 +
* Count posts of most popular threads:
 +
*: <code>select subject,year(first_date) as y, monthname(first_date),count(*) as c from messages group by subject, month(first_date) order by y, month(first_date), c;</code>
 +
* Count the number of posts for each person
 +
*: <code>select p.email_address,year(m.first_date) as y, monthname(m.first_date),count(*) as c from messages as m,messages_people as p where m.message_id=p.message_ID group by p.email_address, month(m.first_date) order by y, month(m.first_date), c;</code>
 +
* Restricting queries to a date range / month
 +
*: The query above gives month-by-month totals, you could add an order by year(first_date) to get the year too
 +
*: The easiest way is to add <code>where month(first_date)=3 and year(first_date)=2010</code> for March 2010. For the current month, <code>month(m.first_date)=month(NOW()) and year(first_date)=year(NOW())</code> works.
 +
 
 +
=== Forum ===
 +
* Posts per month (+evolution)
 +
* Active posters (2+ posts during month)
 +
* Hot topics
 +
* Top posters
 +
 
 +
[http://forum.meego.com/stats/ Stats] are exported from the Forum in CSV format monthly.
 +
 
 +
=== Bugzilla ===
 +
* Bugs created (+ evolution)
 +
* Bugs resolved (+ evolution)
 +
* Comments on bugs this month
 +
* Active Bugzilla contributors (2+ comments during month)
 +
 
 +
==== Useful queries ====
 +
 
 +
* [http://sourceforge.net/projects/qareports/ Software Quality Reports for Pentaho] of course
 +
* See the [http://bugs.meego.com/query.cgi?format=report-table bugzilla tabular reports] and relative dates for more.
 +
* Other potential resources:
 +
** [http://www.ravenbrook.com/project/p4dti/tool/cgi/bugzilla-schema/ Bugzilla database schema by version]
 +
** [https://landfill.bugzilla.org/bugzilla-tip/report.cgi Demo site for standard Bugzilla reports]
 +
** [https://wiki.mozilla.org/Bugzilla:SQLCookBook The Bugzilla SQL cookbook]
 +
** [http://www.bugzillametrics.org/ Bugzilla Metrics] - Third party add-on to Bugzilla that produces lots of metrics
 +
** browse.cgi and weekly-summary.html extensions of GNOME Bugzilla: [https://launchpad.net/bugzilla.gnome.org Code], examples in practice: [http://bugzilla.gnome.org/browse.cgi?product=Evolution browse.cgi], [https://bugzilla.gnome.org/page.cgi?id=weekly-bug-summary.html weekly-summary.cgi]
 +
 
 +
=== Mediawiki ===
 +
* New wiki pages (+ evolution)
 +
* Edits this month (+ evolution)
 +
* Pages deleted this month (+ evolution)
 +
* Unique editors this month (+ evoluion)
 +
 
 +
=== Drupal ===
 +
* Members of meego.com (+ evolution month over month)
 +
* Active members (need a decent way to hook up different ways a person can be active: wiki, ML, IRC, forum, git)
 +
 
 +
=== Transifex ===
 +
* Total languages ordered by translation coverage (+ evolution)
 +
* Top languages
 +
* Top translators/teams
 +
 
 +
This all depends on what is available from Transifex.
 +
 
 +
=== Git ===
 +
* Commits this month (+ evolution)
 +
* Top committers (+ evolution)
 +
* Committers by company (possible)
 +
* Active modules (+ evolution)
 +
 
 +
Using a modified version of gitdm to dump Git logs into a MySQL database for analysis. Modifications required:
 +
* Create database & tables based on the gitdm data structures
 +
* Dump data in correct order, avoiding redundancy if possible, into SQL database
 +
 
 +
gitdm has 3 basic data structures: Hacker, Employer & Patch. Each changeset is a Patch object, each Patch has an Author, and is assigned to an Employer (based on who the Hacker was working for at the time of the Patch). Each Patch also has a list of Hackers who reviewed, reported, signed-off on and tested the patch. Each Hacker links to a list of Patches for which they are the author, a list of email addresses they have used to commit, and separate lists for reviewed, reported, SOB and tested. In addition, each Hacker has a list of Employers he has worked for, and each Employer has a list of Hackers who have worked for them.
 +
 
 +
=== Community & Official OBS ===
 +
* Package submissions (+ evolution)
 +
* Active participants (+ evolution)
 +
 
 +
== Business intelligence lexicography ==
 +
 
 +
The area of Business Intelligence is littered with acronyms. Here's a quick overview of the main ones, and how they all fit together.
 +
 
 +
; BI
 +
: Business Intelligence - general name for any middleware which allows you to query business processes (sales, inventory, etc) and get data overviews from it
 +
; ETL
 +
: Extract, Transform, and Load - the process if extracting data from a data source (database, screen scraping, text file parsing, whatever), transforming it to a well understood format, and loading it in your BI engine database or data warehouse. Good ETL solutions provide a nice way for you to connect another database and have new data sucked in at regular intervals, define views into the source data store which you can then query within your BI engine, etc. Pentaho's ETL, [http://kettle.pentaho.com/ Kettle], and [http://www.jaspersoft.com/jasperetl JasperETL], used by JasperReports, both provide (kind of) straightforward ways to hook into a MySQL database.
 +
; ESB
 +
: [http://en.wikipedia.org/wiki/Enterprise_service_bus Enterprise Service Bus] - a middleware bus providing a unique interface to applications on the front-end and data stores on the back end. Often used to link up many front-end applications (eg. library, student registration, employee payroll, syllabus management, accounting, supply-chain, student lodgement programmes, etc in a university). Not really useful for us, as far as I can tell.
 +
; EAI
 +
: Enterprise Application Integration - using software to integrate different applications together. As far as I can tell, this is a meaningless catch-all phrase for anything from kludges to architected business intelligence solutions.
 +
; DW
 +
: Data Warehouse. Basically the same thing as a database, as far as I can tell, but bigger and more impressive sounding.
 +
; OLAP
 +
: On-Line Analytical Processing. Commonly used acronym for extracting data via multi-dimensional queries. Databases can be configured to provide the results of this kind of query. As far as I can tell this is mostly a buzzword - an "OLAP database" like [http://mondrian.pentaho.com/ Mondrian] is basically the same thing as a database. "speed-of-thought" response times indeed.
 +
; Business reporting
 +
: An application which allows a graphical view of a database, and allows you to construct queries interactively, often using drag & drop. The results of these queries can then be plugged into graphing software for presentation in a dashboard.
 +
; Dashboard
 +
: Organised presentation of information in a web-page or other similar format allowing an at-a-glance overview of the situation for the data being measured.
 +
 
 +
The community dashboard project uses a business reporting engine to query that data and present it in a report.
=== Comparison of candidate ETL/reporting ===
=== Comparison of candidate ETL/reporting ===
-
Support for community infrastructure:
+
Modules available:
-
* JasperReports:
+
{|
-
** Bugzilla: http://jasperforge.org/projects/bugzillareportswitholap
+
! Software !! License !! ETL !! OLAP database !! BI server !! Reporting !! Dashboard module
 +
|-
 +
| Pentaho || EPL || [http://kettle.pentaho.com/ Kettle] || [http://mondrian.pentaho.com/ Mondrian] || [http://community.pentaho.com/projects/bi_platform/ Pentaho BI Platform] || [http://reporting.pentaho.com/ Pentaho Reporting] || [http://wiki.pentaho.com/display/COM/Community+Dashboard+Framework Community Dashboard Framework]
 +
|-
 +
| [http://www.jaspersoft.com/editions Jaspersoft] || AGPL v3 || JasperETL (Talend Open Studio) || JasperOLAP || JasperReports Server || iReports editor || No (commercial only)
 +
|}
-
* Pentaho
+
Pentaho is used as the basis of Mozilla's metrics project, and provides a very strong community software option for both the dashboard and for managing the BI server. Since Mozilla metrics work overlaps what we are trying to achieve, particularly their work on SQR, the Software Quality Reports analytic module for Bugzilla and JIRA, Pentaho is my preference for the dashboard project. In general, I have observed that the Pentaho community provides very good support.
-
** Bugzilla: http://sourceforge.net/projects/qareports/
+

Latest revision as of 09:35, 18 October 2011

Contents

Community Metrics Dashboard

The goal is to provide a web page summarising metrics about various aspects of the MeeGo project. The data should update regularly - depending on the metric, that could be real time or updated automatically on a regular basis.

The dashboard tracks track the following community resources, ideally:

  • Drupal members
  • Bugzilla (bugs opened, bugs closed, active users)
  • Mailing lists (members, posts, threads)
  • gitorious (commits, employer details for committers) - should use Jon Corbet's scripts like are used in the LF yearly kernel data.
  • Wiki (edits, new pages)
  • Forums (members, posts)
  • IRC (total comments, people on channel)
  • Transifex (Languages, translators, strings translated)
  • Community OBS (uploads, users)
  • SDK downloads (potentially extrapolated from meego.com)

The data should also be available for custom reports for usage and analysis in the monthly MeeGo Metrics report published by User:DawnFoster

To fulfill these goals, the dashboard will gather data from the various resource into a centralised database, using some sort of Business Intelligence platform including ETL for data acquisition and storage, and a reporting service for generating reports and dashboards.. A web page will provide a view into this database with predefined reports.

Download the monthly summary report as a Pentaho report file here: File:Meego metrics summary.prpt

Architecture

Pentaho runs as a webapp in Tomcat6. It can use a variety of databases for its internal data structures, the default (Hypersonic) is a Java database. However, because it's both standard & well understood and to allow consolidation of databases under one DB server, I prefer to use MySQL. The configuration of Pentaho with a MySQL database is a little tricky, but almost all of the steps are covered well in this tutorial.

The data which is useful for metrics will be copied into a local database from each of the services we query. The copying of data will be accomplished by a set of Kettle "xactions", which can be created and edited easily with the Spoon tool.

A number of reports will be generated using the Pentaho Report Designer, including a static HTML/Flash dashboard which will be published regularly. Other reports can be created for the community managers, and a more advanced dashboard, allowing detailed analysis of basic metrics, can be provided via the Community Dashboard Framework.

We will need to see how much load the dashboard will generate on the server. I suspect that it will not be practical to expose the dashboard in public.

We will document here everything you need to do to replicate the MeeGo Community Dashboard, with the exception of data which is not publicly available because it contains security related or confidential information (mainly bugzilla).


Extracting data

For SQL databases, we have access to the database server. This applies to MediaWiki, Bugzilla, and Drupal.

For the forum, we integrate the CSV files currently being exported, which provide the basic analytics we need. A cron job with mysqlimport is sufficient.

Individual mailing lists are parsed by MLStats. We use the resulting database directly in the dashboard.

IRC logs will be parsed with superseriousstats, a PHP command line tool that parses IRC logs and stores the results in an SQL database.

We still need to figure out how to do data interchange with Transifex and OBS, and how to get code metrics from the commit mailing list or git. Dimitris tells me that there are already some analytics available on Transifex, and that there is a RESTful API available to query this data.

Data in report

For each of the resources, the following statistics (at a minimum) should be extracted:

Mailing lists

  • Subscriber numbers (+ evolution) - from Mailman directly, not available in mlstats
  • Emails sent (+ evolution) - from mlstats
  • Active participants (individuals with >=2 emails during month) - from mlstats
  • Hot threads - from mlstats
  • Top posters - from mlstats

Useful queries

  • Count posts of most popular threads:
    select subject,year(first_date) as y, monthname(first_date),count(*) as c from messages group by subject, month(first_date) order by y, month(first_date), c;
  • Count the number of posts for each person
    select p.email_address,year(m.first_date) as y, monthname(m.first_date),count(*) as c from messages as m,messages_people as p where m.message_id=p.message_ID group by p.email_address, month(m.first_date) order by y, month(m.first_date), c;
  • Restricting queries to a date range / month
    The query above gives month-by-month totals, you could add an order by year(first_date) to get the year too
    The easiest way is to add where month(first_date)=3 and year(first_date)=2010 for March 2010. For the current month, month(m.first_date)=month(NOW()) and year(first_date)=year(NOW()) works.

Forum

  • Posts per month (+evolution)
  • Active posters (2+ posts during month)
  • Hot topics
  • Top posters

Stats are exported from the Forum in CSV format monthly.

Bugzilla

  • Bugs created (+ evolution)
  • Bugs resolved (+ evolution)
  • Comments on bugs this month
  • Active Bugzilla contributors (2+ comments during month)

Useful queries

Mediawiki

  • New wiki pages (+ evolution)
  • Edits this month (+ evolution)
  • Pages deleted this month (+ evolution)
  • Unique editors this month (+ evoluion)

Drupal

  • Members of meego.com (+ evolution month over month)
  • Active members (need a decent way to hook up different ways a person can be active: wiki, ML, IRC, forum, git)

Transifex

  • Total languages ordered by translation coverage (+ evolution)
  • Top languages
  • Top translators/teams

This all depends on what is available from Transifex.

Git

  • Commits this month (+ evolution)
  • Top committers (+ evolution)
  • Committers by company (possible)
  • Active modules (+ evolution)

Using a modified version of gitdm to dump Git logs into a MySQL database for analysis. Modifications required:

  • Create database & tables based on the gitdm data structures
  • Dump data in correct order, avoiding redundancy if possible, into SQL database

gitdm has 3 basic data structures: Hacker, Employer & Patch. Each changeset is a Patch object, each Patch has an Author, and is assigned to an Employer (based on who the Hacker was working for at the time of the Patch). Each Patch also has a list of Hackers who reviewed, reported, signed-off on and tested the patch. Each Hacker links to a list of Patches for which they are the author, a list of email addresses they have used to commit, and separate lists for reviewed, reported, SOB and tested. In addition, each Hacker has a list of Employers he has worked for, and each Employer has a list of Hackers who have worked for them.

Community & Official OBS

  • Package submissions (+ evolution)
  • Active participants (+ evolution)

Business intelligence lexicography

The area of Business Intelligence is littered with acronyms. Here's a quick overview of the main ones, and how they all fit together.

BI
Business Intelligence - general name for any middleware which allows you to query business processes (sales, inventory, etc) and get data overviews from it
ETL
Extract, Transform, and Load - the process if extracting data from a data source (database, screen scraping, text file parsing, whatever), transforming it to a well understood format, and loading it in your BI engine database or data warehouse. Good ETL solutions provide a nice way for you to connect another database and have new data sucked in at regular intervals, define views into the source data store which you can then query within your BI engine, etc. Pentaho's ETL, Kettle, and JasperETL, used by JasperReports, both provide (kind of) straightforward ways to hook into a MySQL database.
ESB
Enterprise Service Bus - a middleware bus providing a unique interface to applications on the front-end and data stores on the back end. Often used to link up many front-end applications (eg. library, student registration, employee payroll, syllabus management, accounting, supply-chain, student lodgement programmes, etc in a university). Not really useful for us, as far as I can tell.
EAI
Enterprise Application Integration - using software to integrate different applications together. As far as I can tell, this is a meaningless catch-all phrase for anything from kludges to architected business intelligence solutions.
DW
Data Warehouse. Basically the same thing as a database, as far as I can tell, but bigger and more impressive sounding.
OLAP
On-Line Analytical Processing. Commonly used acronym for extracting data via multi-dimensional queries. Databases can be configured to provide the results of this kind of query. As far as I can tell this is mostly a buzzword - an "OLAP database" like Mondrian is basically the same thing as a database. "speed-of-thought" response times indeed.
Business reporting
An application which allows a graphical view of a database, and allows you to construct queries interactively, often using drag & drop. The results of these queries can then be plugged into graphing software for presentation in a dashboard.
Dashboard
Organised presentation of information in a web-page or other similar format allowing an at-a-glance overview of the situation for the data being measured.

The community dashboard project uses a business reporting engine to query that data and present it in a report.

Comparison of candidate ETL/reporting

Modules available:

Software License ETL OLAP database BI server Reporting Dashboard module
Pentaho EPL Kettle Mondrian Pentaho BI Platform Pentaho Reporting Community Dashboard Framework
Jaspersoft AGPL v3 JasperETL (Talend Open Studio) JasperOLAP JasperReports Server iReports editor No (commercial only)

Pentaho is used as the basis of Mozilla's metrics project, and provides a very strong community software option for both the dashboard and for managing the BI server. Since Mozilla metrics work overlaps what we are trying to achieve, particularly their work on SQR, the Software Quality Reports analytic module for Bugzilla and JIRA, Pentaho is my preference for the dashboard project. In general, I have observed that the Pentaho community provides very good support.

Personal tools