Tuesday, May 21, 2013

Keeping pace with technology innovations in the Travel industry

I have been following the following 2 sites for the last few months to keep myself updated on the interesting trends happening in the Travel Industry - especially around technology innovations.

http://www.tnooz.com/
http://www.travopia.com/

Found another interesting strategy presentation on the Thomas Cook website that is worth a perusal. The group seems to be on track to deliver results based on a sound business strategy. Found their strategy of exclusive 'Concept Hotels' quite intriguing. 

Friday, April 19, 2013

Portals vs Web Apps

I have often debated on the real value of Portal servers (and the JSR 268 portlet specification). IMHO, portal development should be as simple and clean as possible and I personally have always found designing and developing portlets to be very complex comparatively.

Kai Wähner has a good article on Dzone that challenges the so-called advantages of portal servers. Jotting down some of the excerpts from the article and also sharing my thoughts.
Let's start by dissecting the advantages of portals one-by-one.

  • SSO:  With so many proven solutions and open standards for SSO, I think there is little value is utilizing the SSO capabilities of a portal server.
  • Aggregation of multiple applications on a single page: This can easily be achieved using iFrames or any other MashUp technology. For e.g. In SharePoint, we have a page-viewer web part that renders any remove web page as a IFrame.
  • Uniform appearance: Just need a good CSS3 developer to create some good style-sheets. Also all web application frameworks have the concept of Master Page and page templates.
  • Personalization: Depending on the complexity of personalization, we can achieve it using role based APIs or some custom development. 
  • Drag and Drop Panels: Again easily done using JQuery UI widgets (pure-javascript). Just check out the cool http://gridster.net/
  •  Unified Dashboard: Again can be done using IFrames or JS components from Ext-JS or JQuery

Hence I feel we really need to think hard and ask the right questions before we blindly jump on the portal bandwagon and spend millions of dollars on commercial portal servers.
This link also lists down some questions that are handy during the decision making process.

Marketing folks of portal servers often tout on the personalization features of Portal servers. I would like to remind them that the most personalized website in the world - "Facebook" runs on PHP :)

Tuesday, April 16, 2013

Web Application Performance Optimization Tips

The following link on Yahoo Developer network on Web App Performance is timeless ! I remember having used these techniques around 8 yrs ago and all of them are still valid. A must read for any web application architect.
http://developer.yahoo.com/performance/rules.html

Another cool utility provided by Yahoo for optimizing image file size on a web page is "Smush.it".
Just upload any image to this site and it would optimize the image size and allow the new image file to be downloaded for your use. 

Tuesday, April 02, 2013

Ruminating on Availability and Reliability

High availability is a function of both hardware + software combined. In order to design a highly available infrastructure, we have to ensure that all the components are made highly available and not just the database or app servers. This includes the network switches, SSO servers, power supply, etc.

The availability of each component is calculated and then we typically multiply the availabilities of all components together to get the overall availability, usually expressed as a percentage.

Common patterns for high availability are: Clustering & load-balancing, data replication (near real time), warm standby servers, effective DR strategy, etc. From an application architecture perspective availability would depend on effective caching, memory management, hardened security mechanisms, etc.

Application downtime occurs not just because of hardware failures, but could be due to lack of adequate testing (including unit testing, integration testing, performance testing, etc.) It's also very important to have proper monitoring mechanisms in place to proactively detect failures, performance issues, etc.

So how is availability typically measured? It is expressed as a percentage; for e.g. 99.9% availability.
To calculate the availability of a component, we need to understand the following 2 concepts:

Mean Time Between Failure (MTBF): It is defined as the average length of time the application runs before failing. Formula: Total Hours Ran / No. of failures (count)

Mean Time To Recovery (MTTR): It is defined as the average length of time needed to repair and restore service after a failure. Formula: Hours spend on repair / Failure Count

Formula: Availability = (MTBF / (MTBF + MTTR)) X 100

Using the above formula, we get the following percentages:

3 nines (99.9% availability) represents about ~ 9 hours of service outage in a single year. 
4 nines (99.99% availability) come to ~ 1 hour of outage in a year. 
5 nines (99.999% availability) represents only about 5 minutes of outage per year.

Monday, March 25, 2013

Ruminating on Server Side Push Techniques

The ability to push data from the server to the web browser has always been a pipe-dream for many web architects. Jotting down the various techniques that I used in the past and the new technologies on the horizon that would enable server side push.
  • Long Polling (Comet): For the last few years, this technique has been most popular and is used behind the scenes by multiple ajax frameworks, such as DoJo, WebSphere Ajax toolkit, etc. The fundamental concept behind this technique is for the server to hold on to the request and not respond till there is some data. Once the data is ready, push the data to the browser as the HTTP Response. After getting the response, the client would again make a new poll request and wait for the response. Hence the term - "long polling". 
  • Persistent Connections / Incomplete Response: Another technique in which the server never ends the response stream, but always keeps it open. There is a  special MIME type called multipart/x-mixed-replace, that is supported by most browsers expect IE :) This MIME type enables the server to keep the response stream open and send data in deltas to the browser.
  • HTML 5 WebSockets: The new HTML 5 specification brings to us the power of WebSockets that enable full-duplex bidirectional data flow between browsers and servers. Very soon, we would have all browsers/servers supporting this. 

Monday, March 18, 2013

What is the actual service running behind svchost.exe

I knew that a lot of Windows Services (available as DLLs) run under the host process svchost.exe during start-up. But is there any way to find out what is the actual mapping service behind each svchost.exe? Sometimes, the svchost.exe process occupies a lot of CPU/memory resources and we need to know the actual service behind it.

The answer is very simple on Windows 7. Press "Ctr+Shift+Esc" to open the Task Manager.
Click on "Show processes from all users". Just right click on any svchost.exe process and in the context menu, select "Go To Service". You would be redirected to the Services Tab, wherein the appropriate service would be highlighted.

Another nifty way is to use the following command on the cmd prompt:
tasklist /svc /fi "imagename eq svchost.exe"

Tuesday, March 12, 2013

Behind the scenes..using OAuth

Found the following cool article on the web that explains how OAuth works behind the scenes..
http://marktrapp.com/blog/2009/09/17/oauth-dummies

OAuth 2.0 is essentially an authentication & authorization framework that enables a third-party application to obtain limited access to any HTTP service (web application or web service). It essentially is a protocol specification (a token-passing mechanism) that allows users to control which applications have access to their data without revealing their passwords or other credentials.Thus it can also be used for delegated authentication as mentioned here.

OAuth is also very useful when you are exposing APIs that third party applications may use. For e.g. all Google APIs can now be accessed using OAuth 2.0 protocol specification. In fact, for web-sites and mobile apps running on Android/iOS, Google has released a solution called as Google+ Sign-In for delegating authentication to Google. More information is available here:
https://developers.google.com/accounts/docs/OAuth2

The basic steps for any application to use OAuth is to first register/create a Client ID (client key) on the OAuth Authorization Server (e.g. Google, Facebook) along with a secret. (This is the crux of the solution, which I had missed in my earlier understanding :) Since the application is registered with the Service Provider, it can make requests now for access to services.) Then create a request token that would be authorized. Finally create a new pair of access tokens that would be used to access the services.
To understand these concepts, Google has also made a cool web app called OAuth PlayGround, where developers can play around with OAuth requests.

A good illustration for OAuth is provided on the Magento website here

Thursday, March 07, 2013

Jigsaw puzzles in PowerPoint and Visio

Found this cool tutorial on the web that can be used to make jigsaw puzzles in PowerPoint or Visio. One of my friends actually used this technique to create a good visualization on technology building blocks.

http://www.wiseowl.co.uk/blog/s132/how_to_draw_jigsaw_puzzle_shapes_in_microsoft_powerpoint_pt1.htm

Monday, March 04, 2013

Long file names on Windows

Just spend the last 30 mins in total frustration on the way Windows 7 handles long file names. I was essentially trying to copy "LifeRay Social Office" portal folder structure from one location to the other.

On my Windows 7 desktop, the copy command from Explorer won't just work ! No error message, no warning, just that the window disappears. I did a remote desktop to the server and tried to copy from there. On the Windows Server 2000 box, I atleast got an error message - "Cannot copy file". But that's it, no information on why copy did not work.

I debugged further and tried to copy each individual file and only then did I get a meaningful error message - "The file Name(s) would be too long for the destination folder." So essentially the total path (string-length) of the files were long enough for Windows to go awry.

A quick google search showed that this is a core windows problem. Windows Explorer (File Explorer in Windows 8 and Windows Server 2012) uses ANSI_API calls which is limited to 260 characters in the paths. There are some hot fixes available as a patch on windows, but did not try them yet.

So what are the options then? MS has released a tool called as RoboCopy that can handle this problem. Another popular tool is LongPathTool. In my case, fortunately I had JDK installed on my box. I used the jar command to inflate/deflate the folder structure between copying and it worked like a charm :) Strangely WinZip on Windows 7 did not work as it threw some weird error long file names.

There is another headache due to long file names. You also cannot delete such directories form the windows explorer !. I tried using the rmdir command from the command prompt and thankfully that worked !!!

Monday, February 11, 2013

Ruminating on Visualization Techniques

The following link contains a good illustration of the various kinds of visualization techniques one can use to communicate ideas or clarify the business value behind the data.

http://www.visual-literacy.org/periodic_table/periodic_table.html

We are also experimenting with a new cool JS library called D3.js. Some pretty good visualization samples are available here.

This library can be used for basic charting and also can be used for impressive visualizations. We found this tutorial to be invaluable in understanding the basics of D3. 

Anscombe's Quartet

We often use statistical properties such as "average", "mean", "variance", "std. deviation" during performance measurement of applications/services. Recently a friend of mine pointed out that only relying on calculated stats can be quite misleading. He pointed me to the following article on Wikipedia.

http://en.wikipedia.org/wiki/Anscombe's_quartet

Anscombe's quartet comprises four datasets that have nearly identical simple statistical properties, yet appear very different when graphed.
By just looking at the data sets, its impossible to predict that the graphs would be so different. Only when we plot the data points on a graph, we can see the way the data behaves. Another testimony to the power of data visualization !

Useful Command Line Tools for Windows 7

Jotting down some Win 7 commands that I often use for quick access to information. One can directly type these commands in the 'Run' window.

  • msconfig: Get to know the programs that are configured to start during boot. Disable those programs that you are not interested in.
  • msinfo32: Quick summary of your system information. Give detailed info on hardware resources, OS details, system properties, etc.
  • control: Quick access to the Control Panel
  • eventvwr: Quick access to the Event Viewer
  • perfmon: Useful tool to monitor the performance of your system using performance counters.
  • resmon: Great tool to check out the resource utilization of CPU, Memory and Disk IO.
  • taskmgr: Quick access to Task Manager
  • cmd: Opens the command prompt
  • inetcpl.cpl : Opens the internet settings for proxy, security etc. 

Ruminating on Big Data

Came across an interesting infodeck on Big Data by Martin Fowler. There is a lot of hype around Big Data and there are tens of pundits defining Big Data in their own terms :) IMHO, right now we are at the "peak of inflated expectations" and "height of media infatuation" in the hype cycle.

But I agree with Martin on the fact that there is considerable fire behind the smoke. Once the hype dies down, folks would realize that we don't need another fancy term, but actually need to rethink about the basic principles of data-management.

There are 3 fundamental changes that would drive us to look beyond our current understanding around Data Management.
  1. Volume of Data: Today the volume of data is so huge, that traditional data management techniques of creating a centralized database system is no longer feasible. Grid based distributed databases are going to become more and more common.
  2. Speed at which Data is growing: Due to Web 2.0, explosion in electronic commerce, Social Media, etc. the rate at which data (mostly user generated content) is growing is unprecedented in the history of mankind.  According to Eric Schmidt (Google CEO), every two days now we create as much information as we did from the dawn of civilization up until  2003. Walmart is clocking 1 million transactions per hour and Facebook has 40 billion photos !!! This image would give you an idea on the amount of Big Data generated during the 2012 Olympics. 
  3. Different types of data: We no longer have the liberty to assume that all valuable data would be available to us in a structured format - well defined using some schema. There is going to be a huge volume of unstructured data that needs to be exploited. For e.g. emails, application logs, web click stream analysis, messaging events, etc. 
These 3 challenges of data are also popularly called as the 3 Vs of Big Data (volume of data, velocity of data and variety of data). To tackle these challenges, Martin urges us to focus on the following 3 aspects:
  1. Extraction of Data: Data is going to come from a lot of structured and unstructured sources. We need new skills to harvest and collate data from multiple sources. The fundamental challenge would be to understand how valuable some data could be? How do we discover such sources of data?
  2. Interpretation of Data: Ability to separate the wheat from the chaff. What data is pure noise? How to differentiate between signal and noise? How to avoid probabilistic illusions?
  3. Visualization of Data: Usage of modern visualization techniques that would make the data more interactive and dynamic. Visualization can be simple with good usability in mind. 
As this blog entry puts it in words - "Data is the new oil ! Data is just like crude. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value."

NoSQL databases are also gaining popularity. Application architects would need to consider polyglot persistence for datasets having different characteristics. For e.g. columnar data stores (aggregate oriented), graph databases, key-value stores, etc. 

Tuesday, December 18, 2012

Generating Alphanumeric Random Strings

In one of my previous blog post, I had mentioned about the excellent Apache Commons RandomNumberGenerator utility class that is very handy for common use.

Sometimes, we need to generate an alphanumeric ramdom string for specific use cases. For e.g. file names, registry keys, etc. The Apache Commons module has one more class called RandomStringUtils for this.
If you are looking for a more simpler copy-n-paste code, then the following snippet should suffice for most non-secure requirements.

static final String AB = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static Random rnd = new Random();

String randomString( int len ) 
{
   StringBuilder sb = new StringBuilder( len );
   for( int i = 0; i < len; i++ ) 
      sb.append( AB.charAt( rnd.nextInt(AB.length()) ) );
   return sb.toString();
}

Monday, December 03, 2012

Ruminating on Schematron and XML Validation

Schematron is a rule-based XML validation language. It can be used for making assertions on patterns in XML nodes. For e.g. if a person (element) has "Mr" as prefix (attribute), then gender (child element) should be "male". If the assertion fails, an error message that is supplied by the author of the schema can be displayed.

Thus Schematron is capable of expressing complex constraints that cannot be expressed using XML Schema or DTD. Using XML schemas, you can only put constraints on the document structure and basic datatype validation.  The following sites give good intro about this language.

 http://www.schematron.com/
 http://www.ldodds.com/papers/schematron_xsltuk.html

So how do we apply Schematron schema to validate XML document instances? Given below are the simple steps one should follow:
  1. Create a schematron schema file
  2. Use a meta-stylesheet to convert this schema file into a XSLT stylesheet. ( A meta-stylesheet is a stylesheet which generates other stylesheets)
  3. The above generated XSLT stylesheet can be used as a XML validator against the XML instance document (using a XSLT transformation engine)
  4. The output of the transformation would be a XML document with validation errors.
Since the fundamental technologies used as XML Schema, XSLT and XPath; most of the XML processing APIs of Java/.NET can be used for the same.

Difference between XSL, XSL-FO and XSLT

The XSL (Extensible StyleSheet Language) specification has a long history and has gone through multiple revisions. Many folks are confused between the difference between XSL, XSL-FO and XSLT.

Found a good link that explains the history of XSL - http://www.dpawson.co.uk/xsl/sect1/history.html
W3Schools has a good explanation - snippet below:

XSL-FO is Formally Named XSL !!  Why this confusion? Is XSL-FO and XSL the same thing?
Yes it is, but we will give you an explanation:
Styling is both about transforming and formatting information. When the World Wide Web Consortium (W3C) made their first XSL Working Draft, it contained the language syntax for both transforming and formatting XML documents.
Later, the Working Group at W3C split the original draft into separate Recommendations:

  •     XSLT, a language for transforming XML documents
  •     XSL or XSL-FO, a language for formatting XML documents
  •     XPath, a language for navigating in XML documents

Sunday, November 25, 2012

Free Data Modeling tools

There are a plethora of free and opensource UML modeling tools available in the market. But there is very little awareness regarding free database modeling tools. Jotting down a few free tools that are very good for ER modeling and support most popular databases for script generation.

  1. Oracle SQL Developer Data Modeler: This cool modeling tool from Oracle is actually free! It supports physical model creation for SQL Server, IBM DB2 and ofcourse Oracle DBs.
  2. DBDesigner: In the past, I have used this extensively when working with MySQL databases.
  3. MySQL Workbench: Another good tool if you are using MySQL.
  4. Aris Express: Software AG's popular Enterprise Architecture tool ARIS now comes with a express (community) edition that can be used for business process modeling and ER modeling.

Friday, November 02, 2012

UI Framework Classification

Today there are a plethora of frameworks and technologies available for create RIAs. Our customers often ask us to advice them on the best fit technology framework for their needs.

To help our customers, we have classified UI frameworks into the following broad categories:

1. Action based frameworks: These frameworks rely on the HTTP request-response paradigm and are designed around it. Many popular frameworks such as Struts, Spring MVC, ASP.NET MVC belong to this category. These frameworks typically implement the MVC design pattern and are light-weight in nature. They also have a easy learning curve for developers familiar with HTTP semantics.

2. Server Side UI component frameworks: These frameworks abstract away the HTTP protocol semantics and allow developers to work with UI components. The developer drags-n-drops UI controls on the page and writes event handling code for the components. Thus the paradigm is similar to thick client programming (e.g. VB, Power Builder). The most popular server side UI frameworks are JSF based open source projects such as ICEFaces, PrimeFaces, etc. Also classic ASP.NET is server side component based. These frameworks emit JavaScript and AJAX code during the rendering process.

3. Client Side UI component frameworks: In this category, the entire client is downloaded into the browser during the first request and then-after communication with the server is through AJAX requests. Either JSON or XML is the preferred data format for the payload. Examples of these frameworks are Flex/Flash, Ext-JS, MS Silverlight, etc.

Tuesday, October 30, 2012

Ruminating on Transaction Logs

Understanding the working of transaction logs for any RDBMS is very important for any application design. Found the following good articles that explain the important concepts in a simple language.

http://www.simple-talk.com/sql/learn-sql-server/managing-transaction-logs-in-sql-server/
http://www.techrepublic.com/article/understanding-the-importance-of-transaction-logs-in-sql-server/5173108

Any database consists of log files and data files. In the MS world, they are known as *.ldf and *.mdf files respectively. All database transactions (modifications) are first written to the log file. There is a separate thread (or bunch of threads) that writes from the buffer cache to the data file periodically.  Once data is written to a data-file, a checkpoint is written to the transaction log. This checkpoint is used as a reference to "roll forward" all transactions. i.e. All transactions after the last checkpoint are applied to the datafile when the server is restarted after a failure. This prevents transactions from being lost that were in the buffer but not yet written to the data file.

Transaction logs are required for rollback, log shipping, backup, etc. The transaction log files should be managed by a DBA or else we would run into problems if the log file fills up all the available hard disk space. The DBA should also periodically back-up the log files. The typical back-up commands also truncate the log files. In some databases, the truncation process just marks old records as inactive so they can be overwritten. In such cases, even after truncation, the size of the log file does not reduce and we may have to use different commands to compact or shrink the log files.

A good post on truncation options for SQL Server 2008 is given below:
http://www.codeproject.com/Articles/380879/About-transaction-log-and-its-truncation-in-SQL-Se

Wednesday, October 17, 2012

Peformance impact of TDE at database level

I was always under the opinion that column-level encryption is better and more performant than database or tablespace level encryption. But after much research and understanding the internal working on TDE (Transparent Data Encryption) on SQLServer and Oracle, it does not look to be a bad deal !

In fact, if we have a lot of columns that need to be encrypted and also need to fire queries against the encrypted columns, then a full database (tablespace) level encryption using TDE seems to be the best option.

I was a bit skeptical on the issue of performance degradation in using full database TDE, but it may not be so. First and foremost, column-level (cell) encryption can severely affect the database query optimization functions and result in significantly worse performance than encrypting the entire database.
When we use TDE at the database (tablespace) level, then the DB engine can use bulk encryption for entire blocks of data as they are written to or read from the disk.

It is important to note that full database TDE actually works at the data-file level and not at each table/column level. To put it in other words, the data is not encrypted but rather entire data files (index files, log files, etc.) are encrypted.

Microsoft states that the performance degradation of using database level TDE is a mere 3-6%.
Oracle states that in 11g, if we use Intel XEON processesor with AES instruction set, then there is a "near-zero" impact on database performance.

It is important to note the terminology differences regarding TDE used by Microsoft and Oracle. Microsoft refers to full database encryption as TDE (not column-level). Oracle calls it TDE-tablespace and TDE-column level.

Also TDE is a proven solution from a regulatory perspective - e.g. PCI. Auditors are more comfortable approving a proven industry solution that any custom logic that is implemented in application code.

Tuesday, October 16, 2012

Column Level Encryption in SQLServer

We were exploring the option of using cell-level or column-level encryption in SQL Server. The option of using TDE (Transaparent Data Encryption) was dismissed due to performance reasons and we just wanted to encrypt a few columns.

Found this nice tutorial that quickly explains how to create a symmetric key for SQL Server encryption. Excerpts from the blog:

1. Create a certificate that would be used to encrypt our symmetric key.
CREATE CERTIFICATE MyCertificateName
WITH SUBJECT = 'A label for this certificate'
  
2. Create a symmetric key by giving a passphrase (KEY_SOURCE) and GUID seed (IDENTITY_VALUE).
CREATE SYMMETRIC KEY MySymmetricKeyName WITH
IDENTITY_VALUE = 'a fairly secure name',
ALGORITHM = AES_256,
KEY_SOURCE = 'a very secure strong password or phrase'
ENCRYPTION BY CERTIFICATE MyCertificateName;
 
To ensure we can replicate the key on another server, or rebuild the key if it is corrupted, you must very safely keep note of the KEY_SOURCE and IDENTITY_VALUE parameters, as these are what is used to create the key. These can be used to regenerate the key.

3. Encrypt the data
EncryptByKey(Key_GUID(‘MySymmetricKeyName’), @ValueToEncrypt)  

4. Decrypt the data
DecryptByKey(@ValueToDecrypt)
 
The only parameter that the decrypt function needs is the data you wish to decrypt. We do not need to pass the key name to the decryption function, SQLServer will determine which open key needs to be used.

Monday, October 15, 2012

Where is the private key of a Digital Cert stored?

Today, one of my team members asked an innocuous question that whether the private key is stored in a digital certificate?

The answer is an obvious 'NO' and its unfortunate that so many folks still struggle to understand the basics of PKI. A digital certificate is nothing but a container for your public key and the digital signature of your public key (hashed and encrypted) by the CA's private key. More information is available here.

I think a lot of people get confused because the use-cases for encryption are different. For e.g. if I want to send sensitive data to a receiver, then I will encrypt the data with the public key of the receiver. But if I want to sign a document, I will use my private key to digitally sign the document.

This site contains a neat table that gives a good overview of what key is used when.


Key Function
Key Type
Whose Key Used
Encrypt data for a recipient
Public key
Receiver
Decrypt data received
Private key
Receiver
Sign data
Private key
Sender
Verify a signature
Public key
Sender


In the past, I have blogged about digital certs here that would be good for a quick perusal :-
http://www.narendranaidu.com/2007/09/formats-for-digital-certificates.html
http://www.narendranaidu.com/2009/06/creating-self-signed-certificate.html

So where is the private key stored? The private key is always stored securely on the server and never revealed to anyone. When digital certs are used for SSL (https) enablement on a server, then the server programs for cert management typically abstract the user from the challenges of manual key management as stated below:-

IIS Windows
On IIS, we use the web based console can be used to generate CSR (Certificate Signing Request). The private key is generated at the same time and stored securely by IIS. When we receive the digital cert from the CA, we open the "Pending Request" screen of the console. Here based on the attributes of the digital certificate, the same gets associated with the private key.

WebSphere
WebSphere 6.0 and earlier versions come with a tool called as iKeyman that is used to generate a private key file and a certificate request file. WebSphere 6.1 and later versions the web based admin console can be used to manage certs.

OpenSSL
OpenSSL also works with 2 files - the digital certificate (server.crt) and private key (server.key)
If we want to verify that a Private key matches a Certificate, then we can use the commands given here.

Monday, October 08, 2012

Random Number Generator Code

We were writing a common utility to help the development team is generating random numbers. The utility would allow the developer to choose between PRNG and TRNG. Also the developer can specify the range between which the random number needs to be generated.

We stopped our efforts mid-way, when we saw the excellent RandomDataGenerator class in the Apache commons Math library. This library has all the functions that would be required for most use-cases.

For e.g. to generate a random number between 1 and 1000 use the following method:
//not secure for cryptography, but super-fast
int value = randomNumberGenerator.nextInt(1,1000);
int value = randomNumberGenerator.nextSecureInt(1,1000); //secure

To generate random positive integers:
int value = randomNumberGenerator.nextInt(1,Integer.MAX_VALUE);

UUID vs SecureRandom in Java

Recently, one of my team members was delibrating on using the java.util.UUID class or the java.util.SecureRandom class for a use-case to generate a unique random number.

When we digged open the source code of UUID, we were suprised to see that it uses SecureRandom behind the scenes to create a 128-bit (16 bytes) GUID.
The second question was the probability of a collision using the UUID class. Wikipedia has a good discussion on this point available here. What is states is that - even if we were to generate 1 million random UUIDs a second, the chances of a duplicate occurring in our lifetime would be extremely small.
Another challenge is that to detect the duplicate, we will have to write a powerful algorighm running on a super-computer that would compare 1 million new UUIDs per second against all of the UUIDs that we have previously generated... :)

Java's UUID is a type-4 GUID, hence the first 6 bits are not random and used for type (2 bits) and version number (4 bits). So the period of UUID is 2 'raised to' 122 - enough for all practical uses..

Ruminating on random-ness

Anyone who has dabbled at cryptograpy would know the difference between a PRNG and a TRNG :)
PRNG - Psuedo Random Number Generator
TRNG - True Random Number Generator

So, what's the real fuss on randomness? And why is it important to understand how random-number generators work?

First and foremost, we have to understand that most random-number generators depend on some mathematical formula to derive a "random" number - based on an input (called as seed).
Since a deterministic algorithm is used to arrive at the random number, these numbers are called as "pseudo-random"; i.e. they appear random to the casual observer, but can be hacked.

For e.g. the typical algorithm used by the java.util.Random() class is illustrated here. Such a PRNG would always produce the same sequence of random numbers for a given seed - even if run on different computers. Hence if someone can guess the initial seed, then it would be possible to predict the next sequence of random numbers. The default constructor of Random() class uses the "system time" as the seed and hence this option is a little bit more secure that manually providing the 'seed'. But still it is  possible for an attacker to synchronize into the stream of such random numbers and therefore calculate all future random numbers!

So how can we generate true random numbers? There are multiple options here:

1) Use a hardware random number generator. Snippet from Wikipedia:

"Such devices are often based on microscopic phenomena that generate a low-level, statistically random 'noise' signal, such as thermal noise or the photoelectric effect or other quantum phenomena. 
A hardware random number generator typically consists of a transducer to convert some aspect of the physical phenomena to an electrical signal, an amplifier and other electronic circuitry to increase the amplitude of the random fluctuations to a macroscopic level, and some type of analog to digital converter to convert the output into a digital number, often a simple binary digit 0 or 1. 
By repeatedly sampling the randomly varying signal, a series of random numbers is obtained"

2) Use cryptographically secure random number generators (TRNG): These number generators collect entropy from various inputs that are truly unpredictable. For e.g. on Linux, /dev/random works in a sophisticated manner to capture hardware interrupts, CPU clock speeds, network packets, user inputs from keyboard, etc. to arrive at a truly random seed.
On Windows, many parameters such as process ID, thread ID, the system clock, the system time, the system counter, memory status, free disk clusters, the hashed user environment bloc, etc. are used to seed the PRNG.

On the Java platform, it is recommended to use the java.security.SecureRandom class, as it delegates the entropy finding to a CSP (Cryptography Service Provider) that typically delegates the calls to the OS.

3) Use a third-party webservice that would return true random numbers. For e.g. http://www.random.org/ would provide you with random numbers generated from 'atmospheric noise'.
The site Hotbits would provide you with random numbers derived from unpredictable radioactive decay.

Wednesday, October 03, 2012

Troubleshooting XML namespaces binding in SOAP request using JAXB

Recently helped a team resolve an issue regarding namespace handling in JAX-WS / CXF. Jotting down the solution as it might help other folks breaking their heads on this issue :)

Our Java application needed to consume a .NET webservice and were facing challenges in SOA interoperability. We created the client stubs using the WSDL2Java tool of CXF. The WSDL had - elementFormDefault="qualified"

The problem we ran across was that the complex types were all being returned as 'null'. Enabling the network sniffer, we checked the raw SOAP reponse reaching the client. The response was OK with all fields populated. Hence the real issue was in the data-binding that was happening to the Java objects.

A quick google search revealed that the default behavior of JAXB when it encounters marshelling/unmarshalling errors is to ingore the exception and put it as a warning messages !!. This was a shock, as there was no way to debug the issue on the server.

We then wrote a sample JAXB application and tried to un-marshall the XML/SOAP message. It is here that we started getting the following error stack:
org.apache.cxf.interceptor.Fault: Unmarshalling Error: unexpected element

Hence, it was proved that the real culprit was the JAXB unmarshaller that was not generating the Java classes with the correct namespace.

First we checked the generated Java classes and found a file called as "package-info.java" that had a XML namespace 'annotated' on the package name - essentially a package annotation. So this was supposed to work, but why is the unmarshaller throwing an exception?

We tried adding the following attributes to the annotation and then the unmarshaller started working !!!
@javax.xml.bind.annotation.XmlSchema (
    namespace = "http://com.mypackage",
    elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED,
    attributeFormDefault = javax.xml.bind.annotation.XmlNsForm.UNQUALIFIED

  ) 


Not sure why the WSDL2Java tool did not add these automatically based on the elementFormDefault="qualified" present in the WSDL.  But this trick worked and we could consume the .NET webservice. We had to modify the build script to replace the "package-info.java" file everytime it was recreated.

Another option is to manually add the "namespace" attribute to all XMLTypes in the generated Java classes; but this is very tedious.

Tuesday, September 25, 2012

Ruminating on Hibernate Entities and DTO

Today we had a lengthy debate on the hot topic of Hibernate entities vs DTO to be used across layers of a n-tiered application. Summarizing the key concepts that were clarified during our brainstorming :)

Q. Is is possible to use Hibernate entities in SOA style web services?

Ans. Yes. It is possible, but with a few important caveats and serious disadvantages.

If you are using Hibernate annotations to decorate your entities, then it would mandate that your service clients also have the necessary Hibernate/JPA jar files. This can be a big issue if you have non-Java clients (e.g. a .NET webservice client). If you use a mapping file (*.xml), then you are good to go as then there are no dependencies on Hibernate jars. But any change in your data model will affect your entities and this will result in changes to all your webservice consumers. So a lot of tight coupling :(

Also you have to understand that Hibernate has used Javaassist to create new dynamic proxies of your Hibernate entities. So the entities that you are referencing are actually that of the generated dynamic proxy - that contains a reference to the actual entity object. So U would need to deproxy the entity and then use it - either using dozer or a similar mapping library.

Person cleanPerson = mapper.map(person, Hibernate.getClass(person));

Note: In the above code, Hibernate.getClass() returns the actual entity class of the proxy object. 

If you are using lazy loading (which most applications do), then you might encounter the famous "LazyInitializationException". This occurs when you detact the Hibernate entity and the serialization process would trigger the lazy loading for a property. To understand this exception, you need to understand the concept of sessions and transactions as given here.
If you do not have lazy loading and do a eager load of all your entities, then it would work - but this is only a short term solution that may work only for small applications.


Q. Ok. I understand the pitfalls. So if I go with DTO, will my problems be solved?

Ans. Yes, DTO is the best option, but you may encounter a few issues again.

Many developers encounter the "LazyInitializationException" when they try to use libraries such as a Dozer to copy properties. This happens because Dozer uses reflection behind the scenes to copy properties and again attempts to access uninitialized lazy collections. There are 2 ways to resolve this problem. Use a custom field mapper as shown here or use the latest version of Dozer that has a HibernateProxyResolver class to get the real entity class behind the Javaassist proxy. This is explained in the proxy-handling section of Dozer site.

Annotations are used at compile-time or runtime?

There is a lot of confusion among folks on the scope of annotations - whether annotations are used only at compile-time or also at runtime?

The answer is that it depends on the RETENTION POLICY of the annotation. An annotation can have one of the three retention policies.

RetentionPolicy.SOURCE: Available only during the compile phase and not at runtime. So they aren't written to the bytecode. Example: @Override, @SuppressWarnings

RetentionPolicy.CLASS: Available in the *.class file, but discarded by the JVM during class loading. Useful when doing bytecode-level post-processing. But not available at runtime.

RetentionPolicy.RUNTIME: Available at runtime in the JVM loaded class. Can be assessed using reflection at runtime. Example: Hibernate / Spring annotations.

Serializing JPA or Hibernate entities over the wire

In my previous post, we discussed about the need to map properties between Hibernate entities and DTOs. This is required, because Hibernate instuments the byte-code of the Java entity classes using tools such as JavaAssist.

Many developers often encounter the LazyInitializationException of Hibernate when you try to use the Hibernate/JPA entities in your webservices stack. To enable you to serialize the same Hibernate entites across the wire as XML, we have 2 options -
  1. Use the entity pruner library - This library essentially removes all the hibernate dependencies from your entity classes by pruning them. Though this library is quite stable, developers should be careful about when to prune and unprune.
  2. Use AutoMapper libraries - Using libraries such as Dozer makes it very easy to copy properties from the domain hibernate entity to a DTO class. For e.g.
PersonDTO cleanPerson = mapper.map(person, PersonDTO.class);

Out of the above 2 approaches, I would recommed to use the second one as it is clean and easy to use. It also enforces clean separation of concerns. A good discussion on StackOverFlow is available here.

Friday, September 21, 2012

How RSA Protected Configuration Provider works behind the scenes?

We were using the "RSA Protected Configuration provider" to encrypt sensitive information in our config files. I was suprised to see that the generated config file also had a triple-DES encrypted key.

So that means the config section is actually encrypted/decrypted using this symmetric key. But where is the key that has encrypted this key. It is here that the RSA public/private key pair come into picture. The public key in the RSA container is used to encrypt the DES key and the private key is used to decrypt the DES key. A good forum tread discussing this is available here.

There is also a W3C standard for XML encryption available here.