Wednesday, September 04, 2013

NoSQL for Customer Hub MDM

The following article on InformationWeek is an interesting read on using MongoDB (a NoSQL database) to build a customer MDM solution.

http://www.informationweek.com/software/information-management/metlife-uses-nosql-for-customer-service/240154741

MongoDB, being a document-oriented NoSQL database, has its core strength in flexible schemas and in storing data as JSON/BSON documents. Let's look at the pros and cons of using MongoDB as an MDM solution.
  1. One of the fundamental challenges in creating a customer hub is the aggregation of disparate data from a variety of sources. For e.g. a customer could have bought a number of products from an insurance firm. Using a traditional RDBMS would entail the complexity of joining table records and satisfying all the referential constraints of the data. Also, each insurance product may have different fields and dimensions. Should we create a table for each product type? In MongoDB, you can store all the policies of a customer in one JSON document (see the sketch after this list). You can store different types of policies for each customer with full flexibility and maintain a natural parent-child hierarchy of relationships. 

  2. Another problem that insurance firms face is that of legacy policy records. Certain insurance products, such as annuities, have a long life period, but regulations and business needs change over the years, and your old policy records may not have all the fields that are captured in new policy records. How do you handle such cases? Having a strict schema would not help, and hence a solution like MongoDB offers the necessary flexibility to store sparse data. 

  3. MongoDB also has an edge in terms of low TCO for scalability and performance. Its auto-sharding capabilities enable massive horizontal scalability. It also uses memory-mapped files out of the box, which is of tremendous help given the prominence of 64-bit computing and the abundance of available RAM. 
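
A minimal sketch of the point 1 idea (the collection name and fields below are made up for illustration): a customer and all of their policies can live in a single MongoDB document, and heterogeneous policy types simply carry different fields.

// Hypothetical 'customers' collection - one document per customer,
// with policies of different types embedded as an array.
db.customers.insert({
  customerId: "C1001",
  name: { first: "Jane", last: "Doe" },
  policies: [
    { type: "auto",    policyNo: "AU-778", premium: 450,  vehicle: { make: "Honda", year: 2011 } },
    { type: "annuity", policyNo: "AN-109", premium: 2000, payoutStartAge: 65 },
    { type: "home",    policyNo: "HO-314", premium: 900,  floodCover: true }
  ]
});

// Fetch everything the customer holds in a single read - no joins required.
db.customers.findOne({ customerId: "C1001" });
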
On the negative side, I am a bit concerned about the integrity of data in the above solution. Since there is no referential integrity, are we 100% confident in the accuracy of the data? We would still need to use data profiling, data cleansing and data matching tools to identify unique customers and remove duplicates. 
MetLife is using this customer hub only for agents and has not exposed the data to customers, as there are concerns about data integrity and accuracy. But what if we need to enable customers to self-service all their policies from a single window on the organization's portal? We cannot show invalid data to the customer. 
Also, from a skills perspective, MongoDB needs specialized resources. It's easy to use and develop with, but performance tuning and monitoring require niche skills. 

Gender Diversity in the Architecture Field

Being passionate about gender diversity, I have always been concerned about the under-representation of women in the software architecture field. Over the years, I have endeavored to motivate and inspire my female colleagues to take up leadership roles in the technology stream; but in vain.

I have often introspected on the reasons why women don't take up, or don't make it to, senior leadership roles in the enterprise architecture domain. Popular opinions range between the polarized extremes of "lack of interest" and "lack of competence", or both. I strongly beg to differ on the false assumption that women lack the logical skills to make good architects. In my career, I have seen brilliant women intellectuals with very strong programming and design skills. Women also tend to have better "EQ" (emotional intelligence) than men in general, and this tremendously helps in areas such as decision-making, stakeholder communication and collaboration, conflict management, etc. So the "lack of competence" excuse is only for lame male chauvinists.

 I have mixed opinions on the "lack of interest" argument. Today we have compelling scientific evidence that there are fundamental differences in the way the brains of men and women are hardwired. If you are not convinced of this, please peruse the books of John Gray (http://www.marsvenus.com). Many of his books were an eye-opener for me :). Considering these gender differences in the way our brains are structured, can we make a generalized statement that most women are not passionate enough about cutting-edge technology or software architecture? For e.g. when you get a new Blu-ray player, media server or any electronic gadget at home, who is the one to fiddle with it until all the functions are known - the husband or the wife? The son or the daughter? Who watches family soap operas and who watches hi-tech action movies? Are men in general more interested in technology than women? Or is it because of lack of opportunities and gender bias? I don't have a clear answer, but I know for sure that mother nature has hardwired our brains differently. Family responsibilities and the upbringing of children are another challenge that must be forcing many women to make a choice about what's most important to them.

Maybe it’s time to change our preconceived notions about leadership and not equate it with aggressiveness and other ‘alpha-male’ characteristics? Lack of role models also proves to be detrimental in motivating women to pursue a technical career path in the architecture field. But this is a “chicken-n-egg” problem and an initial momentum is required to correct this.

Today’s world needs software architects with versatile skills and not just hard-core technical skills. We need architects who are better at brainstorming and collaboration, who can build on the ideas of others rather than aggressively push one’s own idea. In the Agile world, collaboration and communication is a key skill and women have a natural advantage in these areas.

What should be done to encourage more women to take up careers in software architecture and design field? What proactive steps can be taken to bridge this diversity gap? Your thoughts are welcome.

Friday, August 23, 2013

Agile Survey

VersionOne has released the results of the survey it conducted on Agile practices in the industry. The report is available at: http://www.versionone.com/pdf/7th-Annual-State-of-Agile-Development-Survey.pdf

Quite an interesting read and worth a perusal. 

Tuesday, August 06, 2013

Ruminating on Single Page Applications

In the past, I had blogged about the categorization of UI frameworks and the design philosophy behind them.
Of late, the open source market has been flooded with a plethora of SPA (Single Page Application) JavaScript frameworks. Most of these frameworks implement the MVC, MVP or MVVM patterns on the client side and integrate with back-end services using REST and JSON.

We are evaluating the following frameworks, which have gained popularity over the past few months. IMHO, Gmail is still the best SPA application out there :)

http://spinejs.com/
http://knockoutjs.com/
http://durandaljs.com/ (based on knockout.js)
http://angularjs.org/ (Google's baby for SPA)
http://backbonejs.org/
http://marionettejs.com/ (based on backboneJS)
http://emberjs.com/

A good comparison of the various frameworks is given here. IMHO, so many frameworks cause a lot of confusion and developers spend a lot of time comparing the features and choosing the best fit.

In an SPA, all of the HTML, JS and CSS is downloaded in the first request. All subsequent requests are AJAX calls that only retrieve data from services and update the UI. The browser loads the JS files, requests data from the services and then generates the HTML DOM dynamically.
The obvious advantages of an SPA are the high performance of the web application and a seamless, app-like look-and-feel. For an SPA, there are challenges around SEO and browser history (the back button) that need to be addressed within the app. 
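
A rough sketch of this flow (the /api/policies endpoint and the element IDs below are made up for illustration): the page shell is loaded once, and subsequent interactions only fetch JSON and re-render a fragment of the DOM.

// Loaded once with the initial page; after that, only data travels over the wire.
function loadPolicies() {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "/api/policies", true);          // hypothetical REST endpoint returning JSON
  xhr.onload = function () {
    var policies = JSON.parse(xhr.responseText);
    var list = document.getElementById("policy-list");
    list.innerHTML = "";                           // re-render just this part of the DOM
    policies.forEach(function (p) {
      var li = document.createElement("li");
      li.textContent = p.policyNo + " - " + p.type;
      list.appendChild(li);
    });
  };
  xhr.send();
}
document.getElementById("refresh").addEventListener("click", loadPolicies);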

Amazon S3 for websites

I was always under the impression that the Amazon AWS S3 service could only be used for storing media content on the cloud, e.g. images, JS, CSS, video files, etc.

But what surprised me is that we can host an entire static web site on an Amazon S3 bucket. No need to use a web server on an EC2 instance. Caveat Emptor: S3 does not support dynamic web sites.

The following links are handy for understanding how to configure an S3 bucket for static web hosting. You can also pick up a good domain name for your static site using Amazon's "Route 53" DNS service. Amazon also offers a Content Delivery Network service (CloudFront CDN) that replicates content across Amazon's network of global edge locations. So the S3 bucket would serve as the origin server and the CDN would provide the edge servers.
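
For reference, the bucket policy that makes the objects of a static-site bucket publicly readable typically looks like the following (the bucket name is a placeholder; the AWS documentation linked below has the exact steps):

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "PublicReadForStaticSite",
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-bucket/*"
  }]
}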

http://docs.aws.amazon.com/AmazonS3/latest/dev/WebsiteHosting.html
http://chadthompson.me/2013/05/06/static-web-hosting-with-amazon-s3/
http://www.labnol.org/internet/web-hosting-with-amazon-s3/18742/
http://www.labnol.org/internet/lower-amazon-s3-bill-improve-website-loading-time/5193/

Tuesday, July 23, 2013

Financial Advisors and Social Media

A lot of financial advisors are using social media to interact with their customers and engage them. But as pointed out on this link, there are certain SEC regulatory constraints that financial advisors should be wary of while using social media.

First, all disclosures (earnings, operations, etc.) should be made available to all parties at the same time. We cannot share information only on social media and not through other channels.
The second challenge is around handling negative comments on social media sites. How to officially respond? Can anyone put comments on the Facebook page?
The third important facet is that of customer privacy. If an advisor responds to a customer's question on Twitter or Facebook and inadvertently discloses some kind of financial information about the customer, it could be considered a privacy violation by the SEC.

Also, from a regulatory perspective, it could be required to store all communications and retain them for the future. There are a number of players in the market that provide services for archiving all social media interactions. E.g.
https://www.backupify.com/products/personal-apps-backup
http://www.socialware.com/products-services/socialware-compliance/
http://www.symantec.com/advisormail/
http://www.smarsh.com/social-media-archiving


Sunday, July 14, 2013

How to get geographical location from IP address?

There are a lot of free services available that enable you to roughly determine the geographical location of your ISP from your IP address. To get the exact street address, you would need to contact the ISP for further details.

There is a popular IP address mapping software used by enterprises called Quova, which has now been renamed Neustar IP Intelligence. More information can be found at this link.

IP geo-location software can be used to detect fraud and target advertising. Almost all online e-commerce stores use such services or products. 

Street address normalization for MDM solutions

Quite often we need to verify street addresses or normalize them to check for duplicate customers. The following article gives a good overview of the various techniques we can use for this problem context.

http://brizzled.clapper.org/blog/2012/02/14/simple-address-standardization/

There are also a lot of commercial and open source software for address verification and geo-coding. For e.g.
http://smartystreets.com/
http://www.addressdoctor.com/en/
http://www.melissadata.com/

Saturday, July 13, 2013

Ruminating on "Headless" concepts

This afternoon, within a span of 2 hours, I heard the term 'headless' put in front of multiple words: headless system, headless app, headless service and finally headless UI testing!

A headless system is essentially a system with no monitor and IO components (mouse, keyboard, etc.).
A headless app is an application that does not have a UI to interact with - analogous to background daemon processes/threads. A headless service only has backend logic and no frontend UI.

So essentially the term headless is used to describe the concept of not having any user interface. So what does headless UI testing mean?
Check out PhantomJS - a headless WebKit that makes this possible. A good article summarizing this concept is available here
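
A minimal PhantomJS sketch (the URL is just a placeholder) - the page is loaded and rendered in a headless WebKit, with no browser window at all:

// save as check.js and run with:  phantomjs check.js
var page = require('webpage').create();
page.open('http://example.com/login', function (status) {   // placeholder URL
  if (status !== 'success') {
    console.log('Failed to load the page');
  } else {
    // Runs inside the page context, just like a script in a real browser.
    console.log('Page title: ' + page.evaluate(function () { return document.title; }));
  }
  phantom.exit();
});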

Thursday, July 11, 2013

Importance of Geocoding for business

Geocoding is the process of finding out the geographical coordinates (latitude/longitude) from street address, post code, etc. A lot of organizations are interested in geocoding their customer addresses, because it enables them to serve the customer better. For e.g.
  • A healthcare provider can use the geocoding information of its customers, to help them locate the nearest physician or pharmacy. 
  • An insurance firm can use geocoding information to find out the actual physical location of an insured property and determine the underwriting risk for floods, earthquakes, etc.
  • E-commerce sites usually have a find-a-nearby-store option that enables customers to find the nearest store to pick up their goods from, based on their GPS coordinates (see the distance sketch at the end of this post). 
Thus geocoding can help a business answer many questions that drive growth. For e.g.
  • Which geographical areas do most of our customers come from?
  • Are there geographical areas where we have not penetrated? If yes, Why?
  • Is our sales force aligned with our customer territories? 
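
As a small illustration of the find-a-nearby-store use case mentioned above (the store list below is made up), once addresses are geocoded the nearest store is just a distance calculation over latitude/longitude pairs:

// Great-circle (haversine) distance in km between two lat/long points.
function distanceKm(lat1, lon1, lat2, lon2) {
  var toRad = function (d) { return d * Math.PI / 180; };
  var R = 6371; // mean Earth radius in km
  var dLat = toRad(lat2 - lat1), dLon = toRad(lon2 - lon1);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Hypothetical geocoded store list; pick the store closest to the customer.
var stores = [
  { name: "Downtown", lat: 40.7128, lon: -74.0060 },
  { name: "Uptown",   lat: 40.8075, lon: -73.9626 }
];
function nearestStore(custLat, custLon) {
  return stores.reduce(function (best, s) {
    var d = distanceKm(custLat, custLon, s.lat, s.lon);
    return (!best || d < best.distance) ? { store: s, distance: d } : best;
  }, null);
}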

Householding and Hierarchy Management

Found this rather interesting article on householding concepts in MDM. Many organizations struggle to define business rules to identify customers belonging to the same home or household.

http://www.information-management.com/news/1010001-1.html?zkPrintable=1&nopagination=1

Another interesting read is this blog post on how "http://www.muckety.com" is using hierarchy management to link VIPs/actors together. The graphs on muckety.com are interactive and worth a perusal if you are a Hollywood buff :)


Wednesday, July 10, 2013

Ruminating on Digital Transformation Initiatives

A lot of organizations are embarking on multi-year, multi-million dollar value, digital transformation initiatives.
Many large organizations grew by M&A and this has resulted in multiple brands and disparate web properties with inconsistent user experience. Consolidation of the various web domains/properties and brands is the primary business driver behind digital initiatives. The philosophy of having "One Face to the Customer" is at play here.

For e.g. an insurance firm may have different LOBs across life, auto & property, and retirement. In large organizations, each LOB operates as a separate entity and has its own IT team. In such cases, creating a shared services team for digital initiatives makes sense.

Each LOB could have its own web presence and different corporate branding. Many times, end customers do not realize that the different products/policies they have bought belong to the same insurance firm. The customers are always routed to different websites and have separate logins for each site.

From the insurance firm's perspective, there is no 360-degree view of the customer. This severely limits its ability to service customers and aggressively cross-sell or up-sell to them. To resolve this business challenge, organizations should embrace a paradigm shift in their thought process - from being product/policy centric to being customer centric. From an IT perspective, this could entail creating a Customer Hub (MDM) and a consolidated customer self-service portal that would serve as a single window for servicing all of the customer's policies or products.

Having a customer MDM solution would also enable organizations to run better analytics around demographic information and past customer behaviour. This in turn, would help in delivering a more personalized user experience and fine grained marketing.

Another important driver for digital transformation is the need to support multi-channel delivery. Today, content delivery to end users on mobiles and tablets is considered table stakes. Defining and executing an effective mobile strategy is of paramount importance in any digital initiative.

Organizations are also actively looking at leveraging social media and gamification techniques to engage better with customers. It's also important to choose a powerful content management tool that enables faster go-to-market for digital content changes, controlled and owned by the business rather than IT. 

Thursday, June 20, 2013

WS-Security Username Token Implementation using WCF

The following article on the Microsoft site is an excellent tutorial for beginners looking to use open standards such as WS-Security to secure their WCF services. Perusal highly recommended.

WS-Security with Windows Communication Foundation

Tuesday, June 18, 2013

Contracts in REST based services

Traditionally, REST based services did not have formal contracts between the service consumer and the service provider. There used to be an out-of-band agreement between them on the content and semantics of the messages being passed.

Also, the service provider (e.g. Amazon) would publish API libraries and sample code for popular languages such as Java, C#.NET, etc. Most developers would easily understand how to use the service by looking at the examples.

Sometime back, there was a debate on InfoQ on the topic of having standards for describing contracts for REST based services. There were interesting differences of opinion on this.

There is a standard called WADL (Web Application Description Language) that is the equivalent of WSDL for REST based services. Apache CXF supports WADL, but I have not seen many enterprises embracing it. Also, WADL supports only XML payloads. What about JSON payloads?

I like the DataContract abstraction in .NET WCF. Using WCF configuration, we can specify whether a REST service should serialize messages as XML or JSON. 

Monday, June 17, 2013

Ruminating on Claims based Identity

Most folks still stick with RBAC (Role Based Access Control) mechanisms for enabling security in their applications. A Claims based Identity solution is more comprehensive than RBAC and offers much more flexibility in implementing security.

In RBAC, the onus of authenticating users and checking permissions typically lies on the application itself. In claims based solutions, the security constraints of the application are decoupled from the application's business logic. The application receives a security token from an STS (Security Token Service) it trusts, and thus does not have to worry about authenticating the user or extracting security-related information about the user. All the required information is available in the STS security token as a set of claims.

Thus a claims based identity solution decouples the application from the complexities of authentication and authorization, and isolates it from any changes to the security policies that need to be applied.

The following articles are of great help to any newbie in understanding the fundamentals of Claim based Identity solutions.

A Guide to Claims Based Identity - An excellent guide to understand some fundamental concepts around tokens, claims and STS.

Microsoft Windows Identity Foundation (WIF) Whitepaper for Developers - A very good article around WIF basics, which also includes sample code to extend IPrincipal objects and intercept security token processing.

Claims Based Architectures - One of the best online articles that explains how Web SSO and thick client SSO can be implemented using Claims. 

Tuesday, June 11, 2013

Ruminating on Data Masking

A lot of organizations are interested in 'Data Masking' and are actively looking out for solutions in this space. IBM's and Informatica's data masking tools are leaders in Gartner's Magic Quadrant.

The need for masking data is very simple - How do we share enterprise data that is sensitive with the development teams, testing teams, training teams and even the offshore teams?
Besides masking data, there are other potential solutions for the above problem - i.e. using Test Data Creation tools and UI playback tools. But data masking and subsetting continue to remain popular means of scrambling data for non-production use.

Some of the key requirements for any Data Masking Solution are:
  1. Meaningful Masked Data: The masked data has to be meaningful and realistic, and should still satisfy all the business rules - e.g. post codes, credit card numbers, SSNs, bank account numbers, etc. For instance, if we change the DOB, should we also change the 'Age'? 
  2. Referential Integrity: If we are scrambling primary keys, then we need to ensure that the relationships are maintained. One technique is to make sure that the same scramble functions are applied to all of the related columns. If we are masking data across databases, then we also need to ensure integrity across those databases.
  3. Irreversible Masking: The masked data should be irreversible; it should be impossible to recreate the sensitive data from it. 
A good architecture strategy for building a data-masking solution is to design a Policy driven Data Masking Rule Engine. The business users can then define policies for masking different data-sets.
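
A toy sketch of such a policy-driven engine (the policy format, column names and masking functions below are all made up for illustration, not any vendor's API): each sensitive column maps to a masking rule, and deterministic rules are reused wherever referential integrity across tables has to hold.

// Hypothetical masking policy: column -> masking rule.
var maskingPolicy = {
  "customer.ssn":       { rule: "hashDigits", length: 9 },  // deterministic: same input -> same output
  "order.customer_ssn": { rule: "hashDigits", length: 9 },  // same rule, so the join key still matches
  "customer.email":     { rule: "fakeEmail" }
};

var rules = {
  // Simple deterministic hash -> fixed-length digit string (a toy, not cryptographically strong).
  hashDigits: function (value, opts) {
    var h = 0;
    for (var i = 0; i < value.length; i++) { h = (h * 31 + value.charCodeAt(i)) >>> 0; }
    var out = String(h);
    while (out.length < opts.length) { out += out; }   // pad by repetition
    return out.substring(0, opts.length);
  },
  fakeEmail: function (value) {
    return "user" + rules.hashDigits(value, { length: 6 }) + "@example.com";
  }
};

function mask(column, value) {
  var p = maskingPolicy[column];
  return p ? rules[p.rule](String(value), p) : value;   // unlisted columns pass through unchanged
}

// mask("customer.ssn", "123-45-6789") and mask("order.customer_ssn", "123-45-6789")
// return the same masked value, so the relationship between the two tables is preserved.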

A lot of data masking tool vendors are now venturing beyond static data masking. Dynamic Data Masking is a new concept that masks data in real time. Also there is a growing demand for masking data in unstructured content such as PDF, Word or Excel files.

Wednesday, June 05, 2013

Data Privacy Regulations

As architects, we often have to design solutions within the constraints of data privacy regulations such as HIPAA, PCI, UK Data Protection Act, SOX, etc.

The exact data privacy requirements differ from one regulatory act to the other. But there are some common themes or patterns, as defined below, that would help us to structure our thoughts, when we think of protecting sensitive data using technology solutions.
  • Data at Rest: Protecting all data at rest in databases, files, etc. Most databases offer TDE features. Data in flat files needs to be encrypted, either using file encryption or disk volume encryption. Another important aspect is data on portable mobile/tablet devices. Data on portable media such as USB drives, CDs and DVDs also needs to be considered. 
  • Data in Motion: Use secure protocols such as Secure-FTP, HTTPS, VPN, etc. Never use public FTP servers. All remote access to IT systems should be secure and encrypted.
  • Data in Use: Data in OLTP databases that is created and updated regularly. E.g. Online data entry using portals, data entry in excel sheets, data in generated reports, etc. 
  • Data that is archived: Data could be archived either in an online archive or an offline archive, and needs to be protected as per the privacy requirements. Here is the link to an interesting HIPAA violation incident. 
Besides Data Security, most of these Regulatory Acts also cover rules around physical security, network security, etc. 

Tuesday, May 21, 2013

Keeping pace with technology innovations in the Travel industry

I have been following these two sites for the last few months to keep myself updated on the interesting trends in the travel industry - especially around technology innovations.

http://www.tnooz.com/
http://www.travopia.com/

Found another interesting strategy presentation on the Thomas Cook website that is worth a perusal. The group seems to be on track to deliver results based on a sound business strategy. Found their strategy of exclusive 'Concept Hotels' quite intriguing. 

Friday, April 19, 2013

Portals vs Web Apps

I have often debated the real value of portal servers (and the JSR 286 portlet specification). IMHO, portal development should be as simple and clean as possible, and I personally have always found designing and developing portlets to be comparatively complex.

Kai Wähner has a good article on DZone that challenges the so-called advantages of portal servers. Jotting down some excerpts from the article along with my thoughts.
Let's start by dissecting the advantages of portals one-by-one.

  • SSO: With so many proven solutions and open standards for SSO, I think there is little value in utilizing the SSO capabilities of a portal server.
  • Aggregation of multiple applications on a single page: This can easily be achieved using iFrames or any other mashup technology. For e.g. in SharePoint, we have a page-viewer web part that renders any remote web page as an iFrame.
  • Uniform appearance: You just need a good CSS3 developer to create some good style sheets. Also, all web application frameworks have the concept of master pages and page templates.
  • Personalization: Depending on the complexity of personalization, we can achieve it using role based APIs or some custom development. 
  • Drag and Drop Panels: Again, easily done using jQuery UI widgets (pure JavaScript) - see the sketch after this list. Just check out the cool http://gridster.net/
  • Unified Dashboard: Again, can be done using iFrames or JS components from Ext JS or jQuery.
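
As a rough sketch of the drag-and-drop point above (it assumes jQuery and jQuery UI are loaded, and a hypothetical <ul id="dashboard"> whose <li> children are the dashboard panels):

// jQuery UI 'sortable' turns the list items into drag-and-drop panels.
$(function () {
  $("#dashboard")
    .sortable({ placeholder: "panel-drop-target" })   // CSS class shown where the panel will land
    .disableSelection();                              // avoid accidental text selection while dragging
});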

Hence I feel we really need to think hard and ask the right questions before we blindly jump on the portal bandwagon and spend millions of dollars on commercial portal servers.
This link also lists down some questions that are handy during the decision making process.

Marketing folks often tout the personalization features of portal servers. I would like to remind them that the most personalized website in the world - Facebook - runs on PHP :)

Tuesday, April 16, 2013

Web Application Performance Optimization Tips

The following link on the Yahoo Developer Network on web app performance is timeless! I remember using these techniques around 8 years ago, and all of them are still valid. A must read for any web application architect.
http://developer.yahoo.com/performance/rules.html

Another cool utility provided by Yahoo for optimizing image file size on a web page is "Smush.it".
Just upload any image to this site and it would optimize the image size and allow the new image file to be downloaded for your use. 

Tuesday, April 02, 2013

Ruminating on Availability and Reliability

High availability is a function of both hardware + software combined. In order to design a highly available infrastructure, we have to ensure that all the components are made highly available and not just the database or app servers. This includes the network switches, SSO servers, power supply, etc.

The availability of each component is calculated, and for components connected in series we typically multiply their availabilities together to get the overall availability, usually expressed as a percentage.

Common patterns for high availability are: Clustering & load-balancing, data replication (near real time), warm standby servers, effective DR strategy, etc. From an application architecture perspective availability would depend on effective caching, memory management, hardened security mechanisms, etc.

Application downtime occurs not just because of hardware failures, but could be due to lack of adequate testing (including unit testing, integration testing, performance testing, etc.) It's also very important to have proper monitoring mechanisms in place to proactively detect failures, performance issues, etc.

So how is availability typically measured? It is expressed as a percentage; for e.g. 99.9% availability.
To calculate the availability of a component, we need to understand the following 2 concepts:

Mean Time Between Failures (MTBF): The average length of time the application runs before failing. Formula: total hours of operation / number of failures.

Mean Time To Recovery (MTTR): The average length of time needed to repair and restore service after a failure. Formula: total hours spent on repair / number of failures.

Formula: Availability = (MTBF / (MTBF + MTTR)) X 100

Using the above formula, we get the following percentages:

3 nines (99.9% availability) represents about 9 hours of service outage in a single year. 
4 nines (99.99% availability) comes to about 1 hour of outage in a year. 
5 nines (99.999% availability) represents only about 5 minutes of outage per year.
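
The downtime figures above are simple arithmetic over the hours in a year; a quick sketch:

// Availability from MTBF/MTTR, and yearly downtime from an availability percentage.
function availabilityPercent(mtbfHours, mttrHours) {
  return (mtbfHours / (mtbfHours + mttrHours)) * 100;
}
function downtimeHoursPerYear(availabilityPct) {
  return (1 - availabilityPct / 100) * 24 * 365;
}

downtimeHoursPerYear(99.9);    // ~8.76 hours ("3 nines")
downtimeHoursPerYear(99.99);   // ~0.88 hours ("4 nines")
downtimeHoursPerYear(99.999);  // ~0.09 hours, i.e. about 5 minutes ("5 nines")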

Monday, March 25, 2013

Ruminating on Server Side Push Techniques

The ability to push data from the server to the web browser has always been a pipe-dream for many web architects. Jotting down the various techniques that I used in the past and the new technologies on the horizon that would enable server side push.
  • Long Polling (Comet): For the last few years, this technique has been the most popular and is used behind the scenes by multiple Ajax frameworks, such as Dojo, the WebSphere Ajax toolkit, etc. The fundamental concept behind this technique is for the server to hold on to the request and not respond until there is some data. Once the data is ready, it is pushed to the browser as the HTTP response. After getting the response, the client immediately makes a new poll request and waits for the response - hence the term "long polling". 
  • Persistent Connections / Incomplete Response: Another technique in which the server never ends the response stream, but always keeps it open. There is a special MIME type called multipart/x-mixed-replace, which is supported by most browsers except IE :) This MIME type enables the server to keep the response stream open and send data in deltas to the browser.
  • HTML5 WebSockets: The new HTML5 specification brings us the power of WebSockets, which enable full-duplex, bidirectional data flow between browsers and servers (see the sketch after this list). Very soon, we should have all browsers and servers supporting this. 
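
A minimal browser-side WebSocket sketch (the endpoint URL and message format are placeholders) shows how little code full-duplex push needs once the browser and server support it:

// Open a socket and react to messages pushed by the server.
var socket = new WebSocket("ws://example.com/updates");   // placeholder endpoint

socket.onopen = function () {
  socket.send(JSON.stringify({ subscribe: "policy-updates" }));  // client -> server
};
socket.onmessage = function (event) {
  var update = JSON.parse(event.data);                           // server -> client push
  console.log("Received update:", update);
};
socket.onclose = function () {
  console.log("Connection closed - fall back to long polling if needed");
};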

Monday, March 18, 2013

What is the actual service running behind svchost.exe

I knew that a lot of Windows services (available as DLLs) run under the host process svchost.exe during start-up. But is there any way to find out the actual service mapped to each svchost.exe instance? Sometimes an svchost.exe process occupies a lot of CPU/memory resources and we need to know the actual service behind it.

The answer is very simple on Windows 7. Press "Ctrl+Shift+Esc" to open the Task Manager.
Click on "Show processes from all users". Just right-click on any svchost.exe process and, in the context menu, select "Go To Service". You will be redirected to the Services tab, where the appropriate service will be highlighted.

Another nifty way is to use the following command on the cmd prompt:
tasklist /svc /fi "imagename eq svchost.exe"

Tuesday, March 12, 2013

Behind the scenes..using OAuth

Found the following cool article on the web that explains how OAuth works behind the scenes..
http://marktrapp.com/blog/2009/09/17/oauth-dummies

OAuth 2.0 is essentially an authorization framework that enables a third-party application to obtain limited access to an HTTP service (web application or web service). It is essentially a protocol specification (a token-passing mechanism) that allows users to control which applications have access to their data without revealing their passwords or other credentials. Thus it can also be used for delegated authentication, as mentioned here.

OAuth is also very useful when you are exposing APIs that third-party applications may use. For e.g. all Google APIs can now be accessed using the OAuth 2.0 protocol. In fact, for websites and mobile apps running on Android/iOS, Google has released a solution called Google+ Sign-In for delegating authentication to Google. More information is available here:
https://developers.google.com/accounts/docs/OAuth2

The basic steps for any application using OAuth are to first register/create a client ID (client key) and a secret on the OAuth authorization server (e.g. Google, Facebook). (This is the crux of the solution, which I had missed in my earlier understanding :) Since the application is registered with the service provider, it can now make requests for access to services.) The application then obtains a request token / authorization code that the user authorizes, and finally exchanges it for access tokens that are used to access the services.
To understand these concepts, Google has also made a cool web app called the OAuth Playground, where developers can play around with OAuth requests.
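
A hand-wavy sketch of the OAuth 2.0 authorization-code flow (the endpoint URLs and parameter values below are placeholders; the real ones come from the provider's documentation):

// Step 1: redirect the user to the provider's authorization endpoint.
var authorizeUrl = "https://provider.example.com/oauth2/authorize" +      // placeholder endpoint
    "?response_type=code" +
    "&client_id=YOUR_CLIENT_ID" +                                         // obtained when registering the app
    "&redirect_uri=" + encodeURIComponent("https://myapp.example.com/callback") +
    "&scope=profile";
window.location = authorizeUrl;

// Step 2 (on the callback, usually server side): exchange the returned code for an access token.
//   POST https://provider.example.com/oauth2/token
//        grant_type=authorization_code&code=...&client_id=...&client_secret=...&redirect_uri=...
//
// Step 3: call the protected API with the access token.
//   GET  https://api.provider.example.com/userinfo
//        Authorization: Bearer <access_token>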

A good illustration for OAuth is provided on the Magento website here

Thursday, March 07, 2013

Jigsaw puzzles in PowerPoint and Visio

Found this cool tutorial on the web that can be used to make jigsaw puzzles in PowerPoint or Visio. One of my friends actually used this technique to create a good visualization on technology building blocks.

http://www.wiseowl.co.uk/blog/s132/how_to_draw_jigsaw_puzzle_shapes_in_microsoft_powerpoint_pt1.htm

Monday, March 04, 2013

Long file names on Windows

Just spent the last 30 minutes in total frustration over the way Windows 7 handles long file names. I was essentially trying to copy the "Liferay Social Office" portal folder structure from one location to another.

On my Windows 7 desktop, the copy command from Explorer just won't work! No error message, no warning - the window just disappears. I did a remote desktop to the server and tried to copy from there. On the Windows Server 2000 box, I at least got an error message - "Cannot copy file" - but that's it, no information on why the copy did not work.

I debugged further and tried to copy each individual file, and only then did I get a meaningful error message - "The file name(s) would be too long for the destination folder." So essentially the total path lengths of the files were long enough to make Windows go awry.

A quick Google search showed that this is a core Windows problem. Windows Explorer (File Explorer in Windows 8 and Windows Server 2012) uses ANSI API calls, which are limited to 260 characters in a path. There are some hot fixes available as Windows patches, but I have not tried them yet.

So what are the options then? MS has released a tool called Robocopy that can handle this problem. Another popular tool is LongPathTool. In my case, fortunately, I had a JDK installed on my box. I used the jar command to deflate/inflate the folder structure around the copy and it worked like a charm :) Strangely, WinZip on Windows 7 did not work, as it threw some weird error about long file names.

There is another headache due to long file names: you also cannot delete such directories from Windows Explorer! I tried using the rmdir command from the command prompt and thankfully that worked!

Monday, February 11, 2013

Ruminating on Visualization Techniques

The following link contains a good illustration of the various kinds of visualization techniques one can use to communicate ideas or clarify the business value behind the data.

http://www.visual-literacy.org/periodic_table/periodic_table.html

We are also experimenting with a new cool JS library called D3.js. Some pretty good visualization samples are available here.

This library can be used for basic charting as well as for impressive custom visualizations. We found this tutorial invaluable for understanding the basics of D3. 
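
A tiny sketch of D3's core data-join (enter/append) pattern - the numbers and the #chart element are arbitrary:

// Bind an array of numbers to <div> bars and size each bar by its value.
var data = [4, 8, 15, 16, 23, 42];

d3.select("#chart")              // assumes an empty <div id="chart"> on the page
  .selectAll("div")
  .data(data)
  .enter()                       // one new <div> per datum
  .append("div")
  .style("width", function (d) { return d * 10 + "px"; })
  .style("background", "steelblue")
  .style("margin", "2px")
  .text(function (d) { return d; });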

Anscombe's Quartet

We often use statistical properties such as "average", "mean", "variance" and "std. deviation" during performance measurement of applications/services. Recently a friend of mine pointed out that relying only on calculated stats can be quite misleading. He pointed me to the following article on Wikipedia.

http://en.wikipedia.org/wiki/Anscombe's_quartet

Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed.
By just looking at the data sets, it's impossible to predict that the graphs would be so different. Only when we plot the data points on a graph can we see how the data behaves. Another testimony to the power of data visualization!

Useful Command Line Tools for Windows 7

Jotting down some Win 7 commands that I often use for quick access to information. One can directly type these commands in the 'Run' window.

  • msconfig: Get to know the programs that are configured to start during boot. Disable those programs that you are not interested in.
  • msinfo32: Quick summary of your system information. Gives detailed info on hardware resources, OS details, system properties, etc.
  • control: Quick access to the Control Panel
  • eventvwr: Quick access to the Event Viewer
  • perfmon: Useful tool to monitor the performance of your system using performance counters.
  • resmon: Great tool to check out the resource utilization of CPU, Memory and Disk IO.
  • taskmgr: Quick access to Task Manager
  • cmd: Opens the command prompt
  • inetcpl.cpl : Opens the internet settings for proxy, security etc. 

Ruminating on Big Data

Came across an interesting infodeck on Big Data by Martin Fowler. There is a lot of hype around Big Data and there are tens of pundits defining Big Data in their own terms :) IMHO, right now we are at the "peak of inflated expectations" and "height of media infatuation" in the hype cycle.

But I agree with Martin that there is considerable fire behind the smoke. Once the hype dies down, folks will realize that we don't need another fancy term, but actually need to rethink the basic principles of data management.

There are 3 fundamental changes that would drive us to look beyond our current understanding around Data Management.
  1. Volume of Data: Today the volume of data is so huge that the traditional data management technique of creating a centralized database system is no longer feasible. Grid based distributed databases are going to become more and more common.
  2. Speed at which Data is growing: Due to Web 2.0, the explosion in electronic commerce, social media, etc., the rate at which data (mostly user generated content) is growing is unprecedented in the history of mankind. According to Eric Schmidt (Google CEO), every two days we now create as much information as we did from the dawn of civilization up until 2003. Walmart is clocking 1 million transactions per hour and Facebook has 40 billion photos! This image would give you an idea of the amount of Big Data generated during the 2012 Olympics. 
  3. Different types of data: We no longer have the liberty to assume that all valuable data would be available to us in a structured format - well defined using some schema. There is going to be a huge volume of unstructured data that needs to be exploited. For e.g. emails, application logs, web click stream analysis, messaging events, etc. 
These 3 challenges are also popularly called the 3 Vs of Big Data (volume, velocity and variety of data). To tackle these challenges, Martin urges us to focus on the following 3 aspects:
  1. Extraction of Data: Data is going to come from a lot of structured and unstructured sources. We need new skills to harvest and collate data from multiple sources. The fundamental challenge would be to understand how valuable some data could be? How do we discover such sources of data?
  2. Interpretation of Data: Ability to separate the wheat from the chaff. What data is pure noise? How to differentiate between signal and noise? How to avoid probabilistic illusions?
  3. Visualization of Data: Usage of modern visualization techniques that would make the data more interactive and dynamic. Visualization can be simple with good usability in mind. 
As this blog entry puts it - "Data is the new oil! Data is just like crude. It's valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc. to create a valuable entity that drives profitable activity; so must data be broken down and analyzed for it to have value."

NoSQL databases are also gaining popularity. Application architects would need to consider polyglot persistence for datasets with different characteristics - for e.g. columnar data stores (aggregate oriented), graph databases, key-value stores, etc.