(MediaWiki queries) |
(→Queries) |
||
| Line 176: | Line 176: | ||
For the following group-by-month queries, I did a cross join of (2008,2009,2010,2011) and (01-12) to generate a "year and month" data table. | For the following group-by-month queries, I did a cross join of (2008,2009,2010,2011) and (01-12) to generate a "year and month" data table. | ||
| - | ''Top editors by month:'' | + | '''Top editors by month:''' |
SELECT mon.timestamp_year AS yyyy, | SELECT mon.timestamp_year AS yyyy, | ||
mon.timestamp_month AS mm, | mon.timestamp_month AS mm, | ||
| Line 188: | Line 188: | ||
ORDER BY yyyy,mm,c desc; | ORDER BY yyyy,mm,c desc; | ||
| - | ''Number of edits by month:'' | + | '''Number of edits by month:''' |
SELECT mon.timestamp_year as yyyy, | SELECT mon.timestamp_year as yyyy, | ||
mon.timestamp_month as mm, | mon.timestamp_month as mm, | ||
| Line 197: | Line 197: | ||
GROUP BY yyyy,mm; | GROUP BY yyyy,mm; | ||
| - | To get the number of new pages per month is a bit trickier - first we need to query $ | + | '''New pages per month:''' |
| + | To get the number of new pages per month is a bit trickier - first we need to query $revision to get the page_ids and their date of creation, then group by date. The query is O(n²) on the number of pages, although it should be possible to make it O(n) by grouping the result of the subquery without doing in() on the list of timestamps. | ||
| + | |||
| + | SELECT mon.timestamp_year as yyyy, | ||
| + | mon.timestamp_month as mm, | ||
| + | COUNT(*) | ||
| + | FROM mw_revision as rev, | ||
| + | years_months as mon | ||
| + | |||
| + | WHERE rev.rev_timestamp LIKE CONCAT(CONCAT(mon.timestamp_year,mon.timestamp_month),'%') | ||
| + | AND rev.rev_timestamp in ( | ||
| + | SELECT MIN(rev_timestamp) | ||
| + | FROM mw_revision | ||
| + | GROUP BY rev_page) | ||
| + | GROUP BY yyyy,mm; | ||
| + | |||
| + | To get just the list of pages & timestamps (this is used as the subquery for above): | ||
| + | SELECT rev_page as p, | ||
| + | MIN(rev_timestamp) as t | ||
| + | FROM mw_revision | ||
| + | GROUP BY rev_page; | ||
=== Transifex === | === Transifex === | ||
Contents |
The goal is to provide a web page summarising metrics about various aspects of the MeeGo project. The data should update regularly - depending on the metric, that could be real time or updated automatically on a regular basis.
The dashboard will track the following community resources, ideally:
The data should also be available for custom reports for usage and analysis in the monthly MeeGo Metrics report published by User:DawnFoster
To fulfill these goals, the dashboard will gather data from the various resource into a centralised database, using some sort of Business Intelligence platform including ETL for data acquisition and storage, and a reporting service for generating reports and dashboards.. A web page will provide a view into this database with predefined reports.
Candidate reporting solutions:
The following are essentially ETL engines, and do not provide reporting or dashboard functionality:
MuleSoft is an open source ESB, but does not seem adapted to our needs. The field is thus narrowed to Pentaho and JasperReports.
For each community resource, we need to figure out how to get the data into a usable form, and come up with appropriate queries for metrics reports, and finally present the results on a webpage.
The area of Business Intelligence is littered with acronyms. Here's a quick overview of the main ones, and how they all fit together.
So, in short, the community dashboard project will likely use an ETL to plug data into an OLAP server, and then use a business reporting engine to query that data and present it in a dashboard.
Modules available:
| Software | License | ETL | OLAP database | BI server | Reporting | Dashboard module |
|---|---|---|---|---|---|---|
| Pentaho | EPL | Kettle | Mondrian | Pentaho BI Platform | Pentaho Reporting | Community Dashboard Framework |
| Jaspersoft | AGPL v3 | JasperETL (Talend Open Studio) | JasperOLAP | JasperReports Server | iReports editor | No (commercial only) |
Pentaho is used as the basis of Mozilla's metrics project, and provides a very strong community software option for both the dashboard and for managing the BI server. Since Mozilla metrics work overlaps what we are trying to achieve, particularly their work on SQR, the Software Quality Reports analytic module for Bugzilla and JIRA, Pentaho is my preference for the dashboard project. In general, I have observed that the Pentaho community provides very good support.
Pentaho runs as a webapp in Tomcat6. It can use a variety of databases for its internal data structures, the default (Hypersonic) is a Java database. However, because it's both standard & well understood and to allow consolidation of databases under one DB server, I prefer to use MySQL. The configuration of Pentaho with a MySQL database is a little tricky, but almost all of the steps are covered well in this tutorial.
The data which is useful for metrics will be copied into a local database from each of the services we query. The copying of data will be accomplished by a set of Kettle "xactions", which can be created and edited easily with the Spoon tool.
A number of reports will be generated using the Pentaho Report Designer, including a static HTML/Flash dashboard which will be published regularly. Other reports can be created for the community managers, and a more advanced dashboard, allowing detailed analysis of basic metrics, can be provided via the Community Dashboard Framework.
We will need to see how much load the dashboard will generate on the server. I suspect that it will not be practical to expose the dashboard in public.
Related: Bug to create the service, describing required access to existing services
| 13588 | Undecided | ASSIGNED | normal | unspecified | General Requests | Request installation of Tomcat6 and access to install Pentaho, and access to MySQL databases | [1] |
For SQL databases, this implies that the server where the dashboard will run should have access to the database server for MediaWiki, Bugzilla, and Drupal.
For the forum, we will integrate the CSV files currently being exported, which provide the basic analytics we need.
Individual mailing lists will be parsed by MLStats. We will use the resulting database directly in the dashboard.
Git repositories will be queried with "git log", and parsed with the parser module from gitdm, before being stored directly in a database. we will be able to run analytics on the results from there. gitdm can also do basic analytics of git logs, and we may decide to simply reuse gitdm's analytics. However, if we want to extend them, we will want to have the raw data.
IRC logs will be parsed with superseriousstats, a PHP command line tool that parses IRC logs and stores the results in an SQL database.
We still need to figure out how to do data interchange with Transifex and OBS. Dimitris tells me that there are already some analytics available on Transifex, and that there is a RESTful API available to query this data.
For each of the resources, the following statistics (at a minimum) should be extracted:
select subject,year(first_date) as y, monthname(first_date),count(*) as c from messages group by subject, month(first_date) order by y, month(first_date), c;
select p.email_address,year(m.first_date) as y, monthname(m.first_date),count(*) as c from messages as m,messages_people as p where m.message_id=p.message_ID group by p.email_address, month(m.first_date) order by y, month(m.first_date), c;
where month(first_date)=3 and year(first_date)=2010 for March 2010. For the current month, month(m.first_date)=month(NOW()) and year(first_date)=year(NOW()) works.
Stats are exported from the Forum in CSV format monthly.
A MediaWiki extension exists to provide "user scores" for MediaWiki users, ordered by number of edits and number of pages changed. The guts of the query is:
SELECT COUNT(wr.rev_id) as value,
COUNT(DISTINCT wr.rev_page) as page_value,
wu.user_name as name,
wu.user_real_name as real_name
FROM $user wu,
$revision wr,
$page wp
WHERE wu.user_id = wr.rev_user
and wp.page_id = wr.rev_page
and wp.page_namespace = 0
GROUP BY wu.user_name
ORDER BY value desc;
where $user, $revision and $page are the names of the respective MediaWiki tables (MediaWiki tables have a prefix associated with them for a given instance, specified by $wgDBprefix in LocalSettings.php).
For the following group-by-month queries, I did a cross join of (2008,2009,2010,2011) and (01-12) to generate a "year and month" data table.
Top editors by month:
SELECT mon.timestamp_year AS yyyy,
mon.timestamp_month AS mm,
rev_user_text AS user,
COUNT(*) AS c
FROM $revision AS rev,
years_months AS mon
WHERE rev.rev_timestamp LIKE concat(concat(mon.timestamp_year,mon.timestamp_month),'%')
GROUP BY yyyy,mm,user
HAVING c>5
ORDER BY yyyy,mm,c desc;
Number of edits by month:
SELECT mon.timestamp_year as yyyy,
mon.timestamp_month as mm,
COUNT(*) AS edits
FROM $revision AS rev,
years_months as mon
WHERE rev.rev_timestamp LIKE concat(concat(mon.timestamp_year,mon.timestamp_month),'%')
GROUP BY yyyy,mm;
New pages per month: To get the number of new pages per month is a bit trickier - first we need to query $revision to get the page_ids and their date of creation, then group by date. The query is O(n²) on the number of pages, although it should be possible to make it O(n) by grouping the result of the subquery without doing in() on the list of timestamps.
SELECT mon.timestamp_year as yyyy,
mon.timestamp_month as mm,
COUNT(*)
FROM mw_revision as rev,
years_months as mon
WHERE rev.rev_timestamp LIKE CONCAT(CONCAT(mon.timestamp_year,mon.timestamp_month),'%')
AND rev.rev_timestamp in (
SELECT MIN(rev_timestamp)
FROM mw_revision
GROUP BY rev_page)
GROUP BY yyyy,mm;
To get just the list of pages & timestamps (this is used as the subquery for above):
SELECT rev_page as p,
MIN(rev_timestamp) as t
FROM mw_revision
GROUP BY rev_page;
This all depends on what is available from Transifex.
I have not yet considered how I might get web analytics and download stats.