Wednesday, February 29, 2012

Connection timeouts in a mirrored SQL Server

Recently, one of my teams was facing a connection timeout issue when we tried to implement 'parallelism' in a data-driven application.
A colleague of mine pointed out that there was a bug in ADO.NET (with a mirrored SQL Server) that could result in this weird behavior. More details are available at this link.

A quick resolution is to increase the connection timeout and allocate a greater number of connections in the pool at start-up.
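
For illustration, a connection string along these lines would do both (the server and database names are made up); 'Failover Partner' is the standard ADO.NET keyword for a mirrored setup:

    // Illustrative only -- server and database names are made up.
    // 'Failover Partner' identifies the mirror server; 'Connect Timeout'
    // and 'Min Pool Size' address the timeout issue described above.
    string connStr =
        "Data Source=PrimarySrv;Failover Partner=MirrorSrv;" +
        "Initial Catalog=SalesDb;Integrated Security=True;" +
        "Connect Timeout=60;Min Pool Size=20;";

    using (var conn = new System.Data.SqlClient.SqlConnection(connStr))
    {
        conn.Open();  // the pool is primed up to Min Pool Size on first open
    }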

Monday, February 20, 2012

Business Intelligence vs Analytics

My colleague Sandeep Raut has a very simple blog post explaining the differences between traditional BI and Analytics. I have summarized a few key points from the blog below.

"BI traditionally is concerned with creating reports on past data or even current live data. We create OLAP cubes using which we can slice & dice the data, even do a drill down. Analytics is about analyzing the data using mathematics/statistics to identify patterns. These patterns can then be used to predict what may happen in the future. Analytics is about identifying relationships between key data variables that were unknown before. It is about surfacing unknown patterns."

But in my humble opinion, should Analytics not be a subset of BI? I can understand the hype that product vendors create to differentiate their products in the market, but can Analytics exist in isolation from BI? Even predictive data analysis using real-time data/text mining techniques would logically fall under BI....
After all, BI is all about meeting business needs through actionable information!
Maybe it is just a game of words and semantics. I remember a few years back, the term DSS (Decision Support Systems) was more widely used than BI :)

Wednesday, February 15, 2012

Using Parallelism in .NET WinForm applications

We have all gone through the travails of multi-threaded programming in WinForm applications. The challenge in WinForm applications is that the UI controls are bound to the thread that created/rendered them; i.e. a UI control can only be updated by the main thread or the GUI thread that created it.

But to keep the UI responsive, we cannot execute any long-running task (>0.5 sec) on the UI thread, else the GUI would hang or freeze. If we run the business logic asynchronously on another thread, how do we pass the results back to the main GUI thread to update the UI?

Traditionally this has been done using the Control.Invoke() method. More details on this approach are available at this link: http://msdn.microsoft.com/en-gb/magazine/cc300429.aspx
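
As a quick refresher, the classic pattern looks something like this ('resultLabel' and FetchData() are illustrative placeholders, not from the article):

    // Classic approach: do the work on a ThreadPool thread and marshal
    // the UI update back via Control.Invoke.
    System.Threading.ThreadPool.QueueUserWorkItem(_ =>
    {
        string result = FetchData();                 // long-running work, off the UI thread
        resultLabel.Invoke(new Action(() =>
            resultLabel.Text = result));             // update marshalled back to the UI thread
    });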

But with the introduction of the TPL, there is an alternative way of doing this. We can use the TaskScheduler and SynchronizationContext classes to run the heavy lifting on a background thread and then pass the results to the main GUI thread.

For example:

    TaskScheduler uiScheduler =
        TaskScheduler.FromCurrentSynchronizationContext();

    // DoHeavyWork() and UpdateUi() are placeholders for your own code.
    Task.Factory.StartNew(() => DoHeavyWork())               // heavy work on a pool thread
        .ContinueWith(t => UpdateUi(t.Result), uiScheduler); // result marshalled to the UI thread

Given below are two excellent articles elaborating on this in detail:
http://www.codeproject.com/Articles/152765/Task-Parallel-Library-1-of-n

http://reedcopsey.com/2010/03/18/parallelism-in-net-part-15-making-tasks-run-the-taskscheduler/

Sacha Barber has an excellent six-part article series on the intricacies of the TPL, which I loved reading.

Parallelism in .NET

In one of my previous blogs, I had pointed to an interesting article that shows how the TPL controls the number of threads in the thread pool using hill-climbing heuristics.

In order to understand why the TPL (Task Parallel Library) is far superior to simple multi-threading, we need to understand the concepts of the global queue, the local queue on each thread, work-stealing algorithms, etc.
Given below are some interesting links that explain these concepts with good illustrations.

http://www.danielmoth.com/Blog/New-And-Improved-CLR-4-Thread-Pool-Engine.aspx

http://blogs.msdn.com/b/jennifer/archive/2009/06/26/work-stealing-in-net-4-0.aspx

http://udooz.net/blog/2009/08/net-4-0-work-stealing-queue-plinq/

A few important points to remember:
  • There is one global queue for the default thread pool in .NET 4.0.
  • There is also a local queue for each thread. The task scheduler distributes tasks from the global queue to the local queues on each thread. Sub-tasks created by a thread are queued on that thread's own local queue. This improves performance, as there is no contention to pick up work items (tasks) from the global queue, especially in a multi-core scenario.
  • If a thread is free and there are no tasks in its local queue or the global queue, then it will steal work from other threads. This ensures that all cores are optimally utilized. This concept is called 'work stealing'.
  • Tasks from the global queue are picked up in 'FIFO' order. Tasks from the local queue are picked up in 'LIFO' order, based on the assumption that the last-in task is still hot in the cache. Work stealing again happens in 'FIFO' order. A short sketch of these queues in action follows this list.
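
A minimal sketch of the above (names and output are illustrative; actual thread assignment will vary by machine):

    // Tasks started from a non-pool thread go on the global queue (FIFO);
    // tasks started from inside another task go on that worker thread's
    // local queue (LIFO) and can be stolen by idle workers.
    Task parent = Task.Factory.StartNew(() =>          // queued on the global queue
    {
        for (int i = 0; i < 4; i++)
        {
            int n = i;                                 // capture the loop variable
            Task.Factory.StartNew(
                () => Console.WriteLine("child " + n + " on thread "
                          + Thread.CurrentThread.ManagedThreadId),
                TaskCreationOptions.AttachedToParent); // queued on the worker's local queue
        }
    });
    parent.Wait();                                     // also waits for the attached children
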
There is a wonderful book on parallel computing available on MSDN that is a must-read for everyone.

Monday, February 13, 2012

Data Services in the Microsoft world

In my previous blog, I ranted about the concept of Data Services in creating a data virtualization layer. In the .NET world, data services equate to WCF Data Services (formerly known as ADO.NET Data Services).

Microsoft is promoting the use of an open standard called OData for building REST-style data services. A good article describing OData is available on MSDN. OData essentially leverages JSON/ATOM and HTTP semantics to build a simple data services layer across disparate data sources.
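
For a flavour of what this looks like, here is a hypothetical OData query (the service URL and entity set are made up); $filter, $top and the Accept header are standard OData/HTTP semantics:

    // Hypothetical OData query -- the service URL and entity set are made up.
    var client = new System.Net.WebClient();
    client.Headers["Accept"] = "application/json";   // ask for the JSON flavour
    string json = client.DownloadString(
        "http://services.example.com/Sales.svc/Customers" +
        "?$filter=Country eq 'India'&$top=10");      // standard OData query options
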
But it looks like, besides M$, no big vendors are jumping on the OData bandwagon. It's interesting to note that WebSphere eXtreme Scale servers also expose an OData service.

Ruminating on Data Virtualization

The industry is flooded with confusing terms when it comes to understanding 'Data Virtualization'. We have IaaS (Information as a Service), Data Services, EII (Enterprise Information Integration), Data Federation, and so on! The point is that there are no industry-standard definitions for these analyst-coined terms, and there is a lot of overlap between them.

Rick van der Lans tries to clear the air with some simple definitions here. Another interesting post by Barry Devlin throws more light on the concept of data virtualization.

The core concept behind data virtualization is to create an abstraction layer (Data Access Layer) that hides the complexities of the underlying disparate data sources and provides a unified view of the enterprise data to applications. This can be implemented using "SOA-style" Data Services or by creating a virtual data layer that can be queried using SQL-like semantics. More info can be found at these links: Link1 & Link2
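
A minimal sketch of the second approach, assuming the virtual layer is exposed through an ODBC DSN (the DSN, view and column names are all made up); the client issues plain SQL and the virtualization layer federates it across the underlying sources:

    // Sketch only -- the DSN and view/column names are hypothetical.
    using (var conn = new System.Data.Odbc.OdbcConnection("DSN=VirtualDataLayer"))
    {
        conn.Open();
        var cmd = conn.CreateCommand();
        cmd.CommandText =
            "SELECT c.Name, o.Total " +
            "FROM Crm.Customers c " +                   // view backed by the CRM database
            "JOIN Erp.Orders o ON o.CustomerId = c.Id"; // view backed by the ERP system
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
                Console.WriteLine(reader["Name"] + " : " + reader["Total"]);
        }
    }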

Red Hat has a nice whitepaper explaining the concept of Data Services in a SOA environment. This post explains the benefits of data virtualization. Composite Software is a leader in data virtualization techniques and has shared a couple of interesting case studies that demonstrate the use of their data virtualization platform.

One thought that came to my mind was regarding the challenges in accessing NoSQL data from the data virtualization layer. While some types of NoSQL data stores, such as XML documents and key/value pairs, can be exposed as relational SQL views, it may not be possible to have a uniform query interface for unstructured data. Most NoSQL data stores expose some kind of (often Java) API that can be used for querying. Would it be possible to create a common set of meta-data for both structured and unstructured data?
In such scenarios, IMHO, the only strategy for data virtualization is to use Data Services.

Thursday, February 09, 2012

Google Protocol Buffers

Just found a good post by the Google engineering team ranting about the historical context of Google Protocol Buffers.
My first reaction to GPB was: "Why on earth another binary serialization format?"
I think the reason behind the popularity of GPB has been its simplicity and ease of use.

This site has an interesting discussion comparing GPB to XML/JSON. A few snippets from the site's comments/discussions:

  • A major difference between protocol buffers and JSON is that protocol buffers use a binary format, while JSON is plain text. Because it's binary, the format is more compact and easier for a computer to interpret, which makes protocol buffers faster than JSON.
  • Another reason GPB is so fast is that it uses positional binding (see the sketch after this list). While JSON is less bloated than XML (which is hugely bloated), it still sends the name of each attribute with every record, and that creates an enormous amount of overhead. PB, on the other hand, uses positional binding and does not send the attribute names at all.
  • Binary protocols have to deal with portability issues like byte order (little/big-endian), but there are advantages when it comes to parsing dates, timestamps, etc.
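
To make the positional-binding point concrete, here is a tiny illustrative .proto definition (the message and field names are made up):

    // Illustrative .proto definition (proto2 syntax) -- names are made up.
    // On the wire each value is tagged only with its field number (1, 2, 3),
    // never with the field name -- hence the compact payload.
    message Order {
      required int64  id       = 1;
      optional string customer = 2;
      repeated double amounts  = 3;
    }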

Alternatives to XML Serialization

Today, there are a lot of alternatives to XML serialization of data structures. These data interchange formats are smaller and faster to process than XML.
The most popular are Google Protocol Buffers, Thrift (from Facebook), Avro and MessagePack. A good article comparing these alternatives is available here:
http://www.igvita.com/2011/08/01/protocol-buffers-avro-thrift-messagepack/

Wikipedia also has an interesting article comparing various data serialization formats.