Not sure if this is my last post for GSoC’12, but I am happy to announce that my contribution to Semantic MediaWiki over the summer, as part of my Google Summer of Code project GreenSMW, has now been released as SMW 1.8 (http://semantic-mediawiki.org/wiki/Semantic_MediaWiki_1.8.0_released).

Overall, this summer has been a life-changing time for me, from being introduced to a new world of open source software to introducing others to the same. I also had the opportunity to give a talk on my work for SMW at SMWCon in October (slides can be found at http://semantic-mediawiki.org/wiki/SMWCon_Fall_2012/Improvements_in_SQLStore3/Presentation). Later I attended the Wikipedia DevCamp and got to meet a lot of developers from WMF and other volunteers; it’s an amazing world out there. Hopefully my contributions to open source continue like this forever 🙂


This post is an after-completion summary of my GSoC project GreenSMW.

What was the idea of this project?

The original proposal can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/nischayn22/1

The main deliverables proposed there were

  • Validation of writes using a hash
  • Caching of Special Pages
  • Identification and caching of frequently made queries in Special:ExportRDF
  • Improvements to SMW’s accesses to the database.
  • Identification and caching of large inline queries or complex templates using memcache
  • Profiling and documentation

What part of this has been achieved, what was left behind?

  • Validation of writes using a hash: Done (this was very easy and was completed very early).
  • Improvements to SMW’s accesses to the database: Done (this was the most complicated task, as it involved a lot of refactoring of old code).
  • Caching of Special Pages: An alternative strategy is being applied here. The Special page methods have been made efficient enough that they no longer need caching as such (I have yet to commit this change).
  • Identification and caching of frequently made queries in Special:ExportRDF: This was later identified as very low priority, as many more important places to improve were found.
  • Identification and caching of large inline queries or complex templates using memcache: This task turned out to be far from trivial. Memcache uses time-based caching, which is not a good fit for queries because they involve a lot of invalidation. We planned to work on a different technique that invalidates queries by storing their metadata; this is a bigger task and we decided to do it post-GSoC. Until then, users can use a memcache-based approach, as MWJames has been doing: http://wikimedia.7.n6.nabble.com/Re-Query-result-caching-and-invalidation-Jeroen-De-Dauw-td4981469.html#none
  • Profiling and documentation: Mostly done, with more to be done when SMW 1.8 is released.
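To give an idea of what "validation of writes using a hash" means, here is a minimal sketch. It is written in Python rather than SMW's PHP, and the class and function names (`HashCheckedStore`, `semantic_hash`) are made up for illustration; they are not SMW's actual code. The idea is just to hash a page's semantic data and skip the database write when the hash has not changed.

```python
import hashlib
import json


def semantic_hash(data):
    """Return a stable digest of a page's semantic data (property -> list of values)."""
    # Sort keys and values so the same data always serialises identically.
    canonical = json.dumps({p: sorted(v) for p, v in data.items()}, sort_keys=True)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class HashCheckedStore:
    """Toy store (hypothetical) that skips a write when the data hash is unchanged."""

    def __init__(self):
        self.hashes = {}  # page title -> hash of the data stored last time
        self.writes = 0   # number of writes that actually reached the "database"

    def update(self, page, data):
        new_hash = semantic_hash(data)
        if self.hashes.get(page) == new_hash:
            return False  # nothing changed, so skip the expensive write
        self.hashes[page] = new_hash
        self.writes += 1
        return True
```

Saving a page twice with the same annotations (even in a different order) then costs one hash comparison instead of a full rewrite of its rows.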

What was not in the plan (we don’t have plans for everything, do we?)

  • Unit Tests: We covered some parts of SMW’s code with PHPUnit tests.
  • Fixed Properties: A side product of reorganizing the DB layout. Wiki admins can assign separate tables to heavily used properties, so queries on those take little time.
  • Migration Script: A script that lets users switch to SMW 1.8 without disrupting their site’s activity.
  • Semantic diff and site stats: Not fully mature yet, but SMW can now produce a diff of the semantic data and also store statistics on property usage.

What do you consider the best aspect of participating in GSoC?

The best aspect of participating was contributing to a project that hundreds of people use. Besides, this opportunity gave me immense exposure to the process of software development in open source.

What do you consider the most challenging part of your summer?

Working with existing code was a challenge. I would change something here and it would break something there; this happened many times.

How were your mentors?

Awesome. Having two mentors was really beneficial.

Which tips would you give to future students?

Talk to students from previous years, and talk to mentors as early as possible. Don’t be intimidated by big codebases 😛

What one thing did the Wikimedia community do that you consider very helpful for your project and would suggest they continue to do?

Developers at Wikimedia have been very helpful throughout; they maintain a friendly atmosphere that welcomes more contributors. I am also thankful to Wikimedia Deutschland for funding my travel to SMWCon in Germany.

 

This week I wrote the migration scripts so that users of SMW can use all my work without the hassle of running refreshData.php and waiting hours for it to complete. The script lets users migrate all their data directly from SMW’s older version to the newer version while the site keeps running uninterrupted on the older version, so one can safely switch to the newer version at the end, with full control on the user’s part. I have yet to document this migration process, but I promise it is very straightforward and hopefully not time consuming.

Besides, this week I made a few small tweaks that had gone unnoticed for some time or were marked as TODO.

In the following weeks we plan to start working on caching of inline queries. This brings many challenges, and we expect it to be available only in a later version (for more information, browse the semanticmediawiki-devel thread with the subject “Query caching and Invalidation”).

This week I worked on obtaining a diff of the semantic data for a wiki page. It shows you what *semantics* got added and what *semantics* got removed on a page write. This is somewhat similar to what SemanticWatchList does, but now in core and more efficient.
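Conceptually, such a diff is just a pair of set differences over (property, value) pairs. The snippet below is a rough Python sketch of that idea, not the actual SMW code (which is PHP); the function name `semantic_diff` and the example property names are assumptions made up for illustration.

```python
def semantic_diff(old, new):
    """Return (added, removed) as sets of (property, value) pairs.

    `old` and `new` map a property name to the list of values a page
    had before and after the write.
    """
    old_pairs = {(prop, value) for prop, values in old.items() for value in values}
    new_pairs = {(prop, value) for prop, values in new.items() for value in values}
    return new_pairs - old_pairs, old_pairs - new_pairs


# Example: the city annotation changes, the email annotation stays.
before = {"Has email": ["a@example.org"], "Has city": ["Berlin"]}
after = {"Has email": ["a@example.org"], "Has city": ["Hamburg"]}
added, removed = semantic_diff(before, after)
```

Here `added` holds only the new city pair and `removed` only the old one; the unchanged email annotation appears in neither set.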

SWL view

As a result, SemanticWatchList will use this from core SMW in newer versions, and so SMW+SWL will be more efficient to use together.
However, improving SWL wasn’t the initial plan. This is a result of my work on improving SMW’s Special pages and maintaining more statistics: we will now have usage counts for Properties in the database, so Special:Properties and Special:UnusedProperties will just query this table instead of querying all the tables as they do now (we always keep realizing how stupid we were before :P).
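A small sketch of how such usage counts could be kept up to date from what a page write added and removed. This is illustrative Python, not SMW's PHP; the `PropertyStats` class and its method names are hypothetical. The point is that the "which properties are unused" question becomes a lookup in one small structure instead of a scan over every data table.

```python
from collections import Counter


class PropertyStats:
    """Toy usage-count table (hypothetical): property name -> number of uses."""

    def __init__(self):
        self.usage = Counter()

    def apply_write(self, added, removed):
        """Update the counts from the (property, value) pairs a write added or removed."""
        for prop, _value in added:
            self.usage[prop] += 1
        for prop, _value in removed:
            self.usage[prop] -= 1

    def unused(self, known_properties):
        """Properties that exist but are not used anywhere."""
        return sorted(p for p in known_properties if self.usage[p] <= 0)

    def most_used(self, limit=10):
        """Properties with at least one use, most used first."""
        return [(p, n) for p, n in self.usage.most_common(limit) if n > 0]
```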

I also plan to use these Property usage counts (since we already have them now :)) to make a few more special pages for SMW. Right now I only have RecentSemanticChanges in mind; however, this is currently out of scope for the project and so might only be suitable as post-GSoC work.

In the last few days I have made some major improvements to the Special pages for SMW.

The old way of generating SpecialProperties involved unnecessary joins and a limited result (some results were omitted). The new method is much simpler: I only query the required information in multiple simple queries, which should be much faster than the old way.

After this I also did some analysis of using indexes on the DB tables to see how MySQL worked with the new method, and got amazing results.

The following query is run with no indexes (as things are now). It runs very slowly, scanning about 10,724 rows to produce the 250 rows that LIMIT 200,50 needs (200 skipped, 50 returned). Slow indeed 😉

mysql> explain select * from store3.smw_ids where smw_namespace=102 order by smw_sortkey limit 200,50;
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
| id | select_type | table   | type | possible_keys | key  | key_len | ref  | rows  | Extra                       |
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
|  1 | SIMPLE      | smw_ids | ALL  | NULL          | NULL | NULL    | NULL | 10724 | Using where; Using filesort |
+----+-------------+---------+------+---------------+------+---------+------+-------+-----------------------------+
1 row in set (0.00 sec)

Next is my new way, using the index NS_SK on (smw_namespace, smw_sortkey). It runs faster, scanning only about 1010 rows to produce the same result.

mysql> explain select * from store3.smw_ids where smw_namespace=102 order by smw_sortkey limit 200,50;
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
| id | select_type | table   | type | possible_keys | key   | key_len | ref   | rows | Extra       |
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
|  1 | SIMPLE      | smw_ids | ref  | NS_SK         | NS_SK | 4       | const | 1010 | Using where |
+----+-------------+---------+------+---------------+-------+---------+-------+------+-------------+
1 row in set (0.00 sec)
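If you want to play with the same effect without a MySQL server, here is a small self-contained sketch using Python's built-in sqlite3 module. Note the assumptions: it uses SQLite instead of MySQL, a made-up cut-down smw_ids table filled with 10,000 fake rows, and SQLite's EXPLAIN QUERY PLAN, whose wording differs from MySQL's EXPLAIN. The principle is the same: without the composite index the database scans the table and sorts; with it, it walks the index in order.

```python
import sqlite3

QUERY = (
    "SELECT * FROM smw_ids WHERE smw_namespace = 102 "
    "ORDER BY smw_sortkey LIMIT 50 OFFSET 200"
)


def query_plan(conn):
    """Return SQLite's plan for QUERY as one string (the last column holds the detail text)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE smw_ids (smw_id INTEGER PRIMARY KEY, smw_namespace INTEGER, "
    "smw_title TEXT, smw_sortkey TEXT)"
)
# Fake data: every tenth row lives in namespace 102, sort keys are unique.
fake_rows = [
    (i, 102 if i % 10 == 0 else 0, "Page%05d" % i, "Page%05d" % i)
    for i in range(1, 10001)
]
conn.executemany("INSERT INTO smw_ids VALUES (?, ?, ?, ?)", fake_rows)

plan_before = query_plan(conn)             # full scan plus a temporary sort
result_before = conn.execute(QUERY).fetchall()

conn.execute("CREATE INDEX ns_sk ON smw_ids (smw_namespace, smw_sortkey)")

plan_after = query_plan(conn)              # index search, no separate sort step
result_after = conn.execute(QUERY).fetchall()

print(plan_before)
print(plan_after)
```

The two result lists are identical; only the way the database gets to them changes.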

Still more

While the new method is really fast, we still need to cache stuff, as this is still expensive for really large wikis.

SMW now has about 28+ DB tables. Yes, you heard it right, and that’s where my bragging rights for DB sharding come from 😉 Read ahead to know more.

SMW now has about 28 DB tables, a major step up from what we had initially. The more tables we have, the less there is to load into memory at a time, the lower the storage requirement (no foreign keys) and the faster the querying (again, no foreign keys to compare). Plus there’s an awesome feature that lets you shard the DB even more (the + in 28+): you can assign a separate table to each of your properties (yes, you, the admin, can control it now 😉). This is suggested for properties that are heavily used in your wiki (don’t worry about how to figure this out yet). This way you can have as many tables as you want. But beware, there might still be bugs to fix.

OK, now back to how you know which properties will be used extensively in the future. This can be an estimate: say, a property email is likely to be used extensively, so it should be marked as a fixed property (I have yet to document this). But if you don’t want all this gibberish, you can mark a property as a fixed property even when it’s already in use; you would then have to run a SQL script afterwards so your site keeps working properly (btw, this script is yet to be created, but is rumored to be fairly simple to write).
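To picture what a fixed property does, think of it as a routing decision made on every read and write: a fixed property goes to its own table, everything else goes to the shared table for its datatype. The Python below is only a sketch of that idea. The `FIXED_PROPERTIES` setting, the `table_for` function and the table names are all invented for illustration and are not SMW's real configuration or naming.

```python
# Hypothetical admin setting: properties that get a table of their own.
FIXED_PROPERTIES = {"Email", "Has city"}


def table_for(property_name, datatype):
    """Pick the table that stores values of this property (illustrative naming only)."""
    if property_name in FIXED_PROPERTIES:
        # Dedicated table: queries on this property never touch other properties' rows.
        return "smw_fixed_" + property_name.lower().replace(" ", "_")
    # Shared table: one per datatype, holding all non-fixed properties of that type.
    return "smw_shared_" + datatype
```

This also shows why a property that is already in use needs a follow-up script when it is marked as fixed: its existing rows still sit in the shared table and have to be moved to the dedicated one.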

OK, now the testing part. I have done some intensive testing using my own tests for SMW plus a local wiki (which has a massive import from semanticweb.org). Most bugs have been identified, but a few might still be hiding for you to find. If you are reading this and want to help us test, please do! All you need to do is set up a local wiki with SMW installed (check out master and make SQLStore3 the default store), then import from a heavily used SMW wiki (which I assume you own if you have read this far :P), and once the import is done, run refresh.php to refresh all your SMW content. After that I know you are gonna comment here about the bugs 😀

Hi, the midterm evaluation has come (time flies, doesn’t it?) and I am finally able to put my code on the master branch. I had expected this to be a tiresome job with lots of “Git merge conflicts”, but my mentor Markus suggested keeping the older code alongside the newer one to make upgrades easier for users, so it only involved copying the new files into a separate folder in the store. Here’s the link to my latest commit: https://gerrit.wikimedia.org/r/#/c/14773/

But beware when you read the line on that page which says “+6647, -7”. That isn’t really true :p as the whole lot of code isn’t my work; it’s mostly code inherited from the older store, with my tweaks in various places. I am looking forward to the next tasks, which I shall discuss with my mentor. Maybe it’s time for some caching now 🙂

This update is a little late, but let me sum it up.

New Stuff – UnitTests, DIHandler classes, new table for Booleans

And now the rants start 😉

Before I applied to GSoC I had shown interest in Jeroen’s proposal for writing unit tests for SMW. Now that I chose GreenSMW, he is still making me write tests ;), which is actually a much better way to catch broken stuff than our initial attempts at testing by running a wiki. So when I get stuck for a while waiting for some commit to be merged, I write tests. This way I am able to make very good use of my “paid time”. I recommend this to all GSoCers 😀

Now back to the Booleans. After a little research on Boolean datatypes I finally settled on using TINYINT(1) for MySQL and BOOLEAN for PgSQL; I checked this and it works. Performance-wise, we are now using very little space for Booleans, which also makes reading and writing such values faster.

After discovering a bug in SMW’s handling of Booleans, I was surprised to realize that this is not a popular datatype among SMW users.

Also, we discussed releasing the storerewrite version. Though Markus suggested releasing it as SQLStore3 while also keeping SQLStore2, Jeroen and I pressed for releasing it as just SQLStore, without any redundant code from SQLStore2. There might be a migration script to help users upgrade without any side effects, but only if time permits.

Wow, it’s been a month already, time flies!

This week I abstracted (read: separated) the DB layout for SMW. This will make it easier to add new DI types in the future and to modify the existing ones. From here follows a series of updates to the DB schema.

Until now, as new DI types were added, the code was modified to support them *somehow* by using similar field names for most DI types and some unnecessary fields in many places (the Boolean type uses two columns where one is enough, and Geo uses three fields where again one is enough). But now that the DB is abstracted at each DI layer, DB access can easily be modified to enhance performance for each DI type separately, and the DI type String can be merged with Blob (I will elaborate on this in future posts).
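Roughly, the abstraction means every DI type gets a small handler that says which columns it needs and how a value turns into a row, so changing one type's layout does not touch the others. Here is a rough Python sketch of that shape. It is not the real PHP DIHandler code: the class names, column names and the `create_table_sql` helper are invented for illustration, and the only detail taken from the project itself is the Boolean column type (TINYINT(1) on MySQL, BOOLEAN on PostgreSQL).

```python
class DIHandler:
    """Describes how one data item (DI) type maps to table columns (hypothetical sketch)."""

    columns = {}  # column name -> abstract column kind

    def to_row(self, value):
        raise NotImplementedError


class BooleanHandler(DIHandler):
    columns = {"o_value": "boolean"}  # a single column is enough

    def to_row(self, value):
        return {"o_value": 1 if value else 0}


class GeoHandler(DIHandler):
    columns = {"o_serialized": "text"}  # one serialized field instead of three

    def to_row(self, value):
        latitude, longitude = value
        return {"o_serialized": "%s,%s" % (latitude, longitude)}


# Concrete column types per database engine.
COLUMN_TYPES = {
    "mysql": {"boolean": "TINYINT(1)", "text": "VARCHAR(255)"},
    "postgres": {"boolean": "BOOLEAN", "text": "TEXT"},
}


def create_table_sql(table, handler, engine):
    """Build a CREATE TABLE statement from a handler's column description."""
    value_columns = ", ".join(
        "%s %s" % (name, COLUMN_TYPES[engine][kind])
        for name, kind in handler.columns.items()
    )
    return "CREATE TABLE %s (s_id INTEGER NOT NULL, p_id INTEGER NOT NULL, %s)" % (
        table,
        value_columns,
    )
```

Adding a new DI type then means writing one more handler class; the table-creation code stays as it is.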

All the changes to SMW as part of my GSoC project are done on a separate branch ‘storerewrite’. You can see my changes on Gerrit: https://gerrit.wikimedia.org/r/#/q/owner:Nischayn22,n,z