Tuesday, December 18, 2012

Generating Alphanumeric Random Strings

In one of my previous blog posts, I had mentioned the excellent Apache Commons RandomDataGenerator utility class that is very handy for common use cases.

Sometimes we need to generate an alphanumeric random string for specific use cases, e.g. file names, registry keys, etc. The Apache Commons library has another class called RandomStringUtils for this.
If you are looking for simpler copy-n-paste code, then the following snippet should suffice for most non-secure requirements.

import java.util.Random;

static final String AB = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";
static Random rnd = new Random();

// Returns a random alphanumeric string of the given length,
// e.g. randomString(8) for a temporary file-name suffix.
static String randomString( int len )
{
   StringBuilder sb = new StringBuilder( len );
   for( int i = 0; i < len; i++ )
      sb.append( AB.charAt( rnd.nextInt(AB.length()) ) );
   return sb.toString();
}

Monday, December 03, 2012

Ruminating on Schematron and XML Validation

Schematron is a rule-based XML validation language. It can be used for making assertions on patterns in XML nodes. For example, if a person (element) has "Mr" as its prefix (attribute), then the gender (child element) should be "male". If the assertion fails, an error message supplied by the author of the schema can be displayed.

Thus Schematron is capable of expressing complex constraints that cannot be expressed using XML Schema or DTD. Using XML schemas, you can only constrain the document structure and perform basic datatype validation. The following sites give a good intro to this language.

 http://www.schematron.com/
 http://www.ldodds.com/papers/schematron_xsltuk.html

So how do we apply a Schematron schema to validate XML document instances? Given below are the simple steps one should follow:
  1. Create a Schematron schema file
  2. Use a meta-stylesheet to convert this schema file into an XSLT stylesheet. (A meta-stylesheet is a stylesheet that generates other stylesheets.)
  3. The generated XSLT stylesheet can then be used as an XML validator against the XML instance document (using an XSLT transformation engine)
  4. The output of the transformation is an XML document listing the validation errors.
Since the fundamental technologies used are XML Schema, XSLT and XPath, most of the XML processing APIs of Java/.NET can be used for this, as sketched below.
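To make steps 2-4 concrete, here is a minimal sketch using the standard JAXP transformation API in Java. The file names (the meta-stylesheet, the Schematron schema, the generated validator and the instance document) are hypothetical placeholders, not files from this post.

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SchematronValidation {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();

        // Step 2: run the meta-stylesheet over the Schematron schema
        // to generate the validating stylesheet (validator.xsl)
        Transformer metaTransformer =
                factory.newTransformer(new StreamSource("schematron-meta.xsl"));
        metaTransformer.transform(new StreamSource("rules.sch"),
                new StreamResult("validator.xsl"));

        // Steps 3 and 4: apply the generated validator to the instance document;
        // the output is a report of the assertion failures
        Transformer validator =
                factory.newTransformer(new StreamSource("validator.xsl"));
        validator.transform(new StreamSource("person.xml"),
                new StreamResult(System.out));
    }
}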

Difference between XSL, XSL-FO and XSLT

The XSL (Extensible Stylesheet Language) specification has a long history and has gone through multiple revisions. Many folks are confused about the difference between XSL, XSL-FO and XSLT.

Found a good link that explains the history of XSL - http://www.dpawson.co.uk/xsl/sect1/history.html
W3Schools has a good explanation - snippet below:

XSL-FO is Formally Named XSL !!  Why this confusion? Is XSL-FO and XSL the same thing?
Yes it is, but we will give you an explanation:
Styling is both about transforming and formatting information. When the World Wide Web Consortium (W3C) made their first XSL Working Draft, it contained the language syntax for both transforming and formatting XML documents.
Later, the Working Group at W3C split the original draft into separate Recommendations:

  •     XSLT, a language for transforming XML documents
  •     XSL or XSL-FO, a language for formatting XML documents
  •     XPath, a language for navigating in XML documents

Sunday, November 25, 2012

Free Data Modeling tools

There are a plethora of free and open-source UML modeling tools available in the market, but there is very little awareness of free database modeling tools. Jotting down a few free tools that are very good for ER modeling and support the most popular databases for script generation.

  1. Oracle SQL Developer Data Modeler: This cool modeling tool from Oracle is actually free! It supports physical model creation for SQL Server, IBM DB2 and of course Oracle DBs.
  2. DBDesigner: In the past, I have used this extensively when working with MySQL databases.
  3. MySQL Workbench: Another good tool if you are using MySQL.
  4. Aris Express: Software AG's popular Enterprise Architecture tool ARIS now comes with an express (community) edition that can be used for business process modeling and ER modeling.

Friday, November 02, 2012

UI Framework Classification

Today there are a plethora of frameworks and technologies available for creating RIAs. Our customers often ask us to advise them on the best-fit technology framework for their needs.

To help our customers, we have classified UI frameworks into the following broad categories:

1. Action-based frameworks: These frameworks rely on the HTTP request-response paradigm and are designed around it. Many popular frameworks such as Struts, Spring MVC and ASP.NET MVC belong to this category. These frameworks typically implement the MVC design pattern and are light-weight in nature. They also have an easy learning curve for developers familiar with HTTP semantics.

2. Server-side UI component frameworks: These frameworks abstract away the HTTP protocol semantics and allow developers to work with UI components. The developer drags-n-drops UI controls onto the page and writes event-handling code for the components. Thus the paradigm is similar to thick-client programming (e.g. VB, PowerBuilder). The most popular server-side UI frameworks are JSF-based open source projects such as ICEfaces, PrimeFaces, etc. Classic ASP.NET is also server-side component based. These frameworks emit JavaScript and AJAX code during the rendering process.

3. Client-side UI component frameworks: In this category, the entire client is downloaded into the browser during the first request, and thereafter communication with the server is through AJAX requests. Either JSON or XML is the preferred data format for the payload. Examples of these frameworks are Flex/Flash, Ext JS, MS Silverlight, etc.

Tuesday, October 30, 2012

Ruminating on Transaction Logs

Understanding how transaction logs work in any RDBMS is very important for application design. Found the following good articles that explain the important concepts in simple language.

http://www.simple-talk.com/sql/learn-sql-server/managing-transaction-logs-in-sql-server/
http://www.techrepublic.com/article/understanding-the-importance-of-transaction-logs-in-sql-server/5173108

Any database consists of log files and data files. In the MS world, they are known as *.ldf and *.mdf files respectively. All database transactions (modifications) are first written to the log file. A separate thread (or group of threads) periodically writes from the buffer cache to the data file. Once data is written to a data file, a checkpoint is written to the transaction log. This checkpoint is used as a reference point to "roll forward" transactions, i.e. all transactions after the last checkpoint are applied to the data file when the server is restarted after a failure. This prevents the loss of transactions that were in the buffer but not yet written to the data file.

Transaction logs are required for rollback, log shipping, backup, etc. The transaction log files should be managed by a DBA, or else we will run into problems when the log file fills up all the available disk space. The DBA should also periodically back up the log files. The typical backup commands also truncate the log files. In some databases, the truncation process just marks old records as inactive so they can be overwritten. In such cases, even after truncation, the size of the log file does not reduce and we may have to use separate commands to compact or shrink the log files.

A good post on truncation options for SQL Server 2008 is given below:
http://www.codeproject.com/Articles/380879/About-transaction-log-and-its-truncation-in-SQL-Se

Wednesday, October 17, 2012

Performance impact of TDE at database level

I was always under the opinion that column-level encryption is better and more performant than database or tablespace level encryption. But after much research and understanding the internal working of TDE (Transparent Data Encryption) on SQL Server and Oracle, it does not look to be a bad deal!

In fact, if we have a lot of columns that need to be encrypted and also need to fire queries against the encrypted columns, then a full database (tablespace) level encryption using TDE seems to be the best option.

I was a bit skeptical about the performance degradation of using full-database TDE, but it may not be significant. First and foremost, column-level (cell) encryption can severely affect the database query optimization functions and result in significantly worse performance than encrypting the entire database.
When we use TDE at the database (tablespace) level, the DB engine can use bulk encryption for entire blocks of data as they are written to or read from the disk.

It is important to note that full database TDE actually works at the data-file level and not at each table/column level. To put it another way, individual columns are not encrypted separately; rather, entire files (data files, index files, log files, etc.) are encrypted as they are written to disk.

Microsoft states that the performance degradation of using database level TDE is a mere 3-6%.
Oracle states that in 11g, if we use Intel Xeon processors with the AES instruction set, then there is a "near-zero" impact on database performance.

It is important to note the terminology differences regarding TDE used by Microsoft and Oracle. Microsoft refers to full database encryption as TDE (not column-level). Oracle calls it TDE-tablespace and TDE-column level.

Also, TDE is a proven solution from a regulatory perspective - e.g. PCI. Auditors are more comfortable approving a proven industry solution than any custom logic implemented in application code.

Tuesday, October 16, 2012

Column Level Encryption in SQLServer

We were exploring the option of using cell-level or column-level encryption in SQL Server. The option of using TDE (Transparent Data Encryption) was dismissed for performance reasons, and we just wanted to encrypt a few columns.

Found this nice tutorial that quickly explains how to create a symmetric key for SQL Server encryption. Excerpts from the blog:

1. Create a certificate that would be used to encrypt our symmetric key.
CREATE CERTIFICATE MyCertificateName
WITH SUBJECT = 'A label for this certificate'
  
2. Create a symmetric key by giving a passphrase (KEY_SOURCE) and GUID seed (IDENTITY_VALUE).
CREATE SYMMETRIC KEY MySymmetricKeyName WITH
IDENTITY_VALUE = 'a fairly secure name',
ALGORITHM = AES_256,
KEY_SOURCE = 'a very secure strong password or phrase'
ENCRYPTION BY CERTIFICATE MyCertificateName;
 
To ensure we can replicate the key on another server, or rebuild it if it is corrupted, we must safely keep note of the KEY_SOURCE and IDENTITY_VALUE parameters, as these are what is used to create (and hence regenerate) the key.

3. Encrypt the data
EncryptByKey(Key_GUID('MySymmetricKeyName'), @ValueToEncrypt)

4. Decrypt the data
DecryptByKey(@ValueToDecrypt)
 
The only parameter that the decrypt function needs is the data you wish to decrypt. We do not need to pass the key name to the decryption function; SQL Server will determine which open key needs to be used.

Monday, October 15, 2012

Where is the private key of a Digital Cert stored?

Today, one of my team members asked an innocuous question: is the private key stored in the digital certificate?

The answer is an obvious 'NO', and it's unfortunate that so many folks still struggle with the basics of PKI. A digital certificate is nothing but a container for your public key, along with a digital signature of that public key (hashed and then encrypted with the CA's private key). More information is available here.

I think a lot of people get confused because the use cases for encryption are different. For example, if I want to send sensitive data to a receiver, then I will encrypt the data with the public key of the receiver. But if I want to sign a document, I will use my private key to digitally sign the document.

This site contains a neat table that gives a good overview of what key is used when.


Key Function                    Key Type       Whose Key Used
Encrypt data for a recipient    Public key     Receiver
Decrypt data received           Private key    Receiver
Sign data                       Private key    Sender
Verify a signature              Public key     Sender


In the past, I have blogged about digital certs here that would be good for a quick perusal :-
http://www.narendranaidu.com/2007/09/formats-for-digital-certificates.html
http://www.narendranaidu.com/2009/06/creating-self-signed-certificate.html

So where is the private key stored? The private key is always stored securely on the server and never revealed to anyone. When digital certs are used for SSL (https) enablement on a server, the server programs for cert management typically shield the user from the challenges of manual key management, as described below:

IIS Windows
On IIS, the web-based console can be used to generate a CSR (Certificate Signing Request). The private key is generated at the same time and stored securely by IIS. When we receive the digital cert from the CA, we open the "Pending Request" screen of the console. Here, based on the attributes of the digital certificate, IIS associates it with the corresponding private key.

WebSphere
WebSphere 6.0 and earlier versions come with a tool called iKeyman that is used to generate a private key file and a certificate request file. In WebSphere 6.1 and later versions, the web-based admin console can be used to manage certs.

OpenSSL
OpenSSL also works with 2 files - the digital certificate (server.crt) and the private key (server.key).
If we want to verify that a private key matches a certificate, we can use the commands given here.

Monday, October 08, 2012

Random Number Generator Code

We were writing a common utility to help the development team in generating random numbers. The utility would allow the developer to choose between a PRNG and a TRNG, and to specify the range within which the random number needs to be generated.

We stopped our efforts mid-way when we saw the excellent RandomDataGenerator class in the Apache Commons Math library. This library has all the functions that would be required for most use cases.

For example, to generate a random number between 1 and 1000, use the following methods:
RandomDataGenerator randomNumberGenerator = new RandomDataGenerator(); // create once and reuse
int value = randomNumberGenerator.nextInt(1, 1000); // not secure for cryptography, but super-fast
int secureValue = randomNumberGenerator.nextSecureInt(1, 1000); // cryptographically secure

To generate random positive integers:
int positiveValue = randomNumberGenerator.nextInt(1, Integer.MAX_VALUE);

UUID vs SecureRandom in Java

Recently, one of my team members was deliberating on using the java.util.UUID class or the java.security.SecureRandom class for a use case that needed a unique random number.

When we dug into the source code of UUID, we were surprised to see that it uses SecureRandom behind the scenes to create a 128-bit (16-byte) GUID.
The second question was the probability of a collision using the UUID class. Wikipedia has a good discussion on this point available here. What it states is that even if we were to generate 1 million random UUIDs a second, the chance of a duplicate occurring in our lifetime would be extremely small.
Another challenge is that to detect a duplicate, we would have to write a powerful algorithm running on a super-computer that compares 1 million new UUIDs per second against all of the UUIDs we have previously generated... :)

Java's UUID is a type-4 GUID, hence 6 bits are not random - they hold the variant (2 bits) and the version number (4 bits). That leaves 2 raised to 122 possible values - enough for all practical uses.
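For quick reference, a minimal sketch of both approaches (the class and variable names below are mine; only JDK classes are used):

import java.security.SecureRandom;
import java.util.UUID;

public class RandomIdDemo {
    public static void main(String[] args) {
        // Type-4 (random) UUID - internally backed by SecureRandom
        UUID uuid = UUID.randomUUID();
        System.out.println("UUID: " + uuid);

        // Raw alternative: 16 random bytes from SecureRandom, rendered as hex.
        // Unlike UUID, no bits are reserved for version/variant information.
        byte[] bytes = new byte[16];
        new SecureRandom().nextBytes(bytes);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("SecureRandom token: " + hex);
    }
}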

Ruminating on random-ness

Anyone who has dabbled in cryptography would know the difference between a PRNG and a TRNG :)
PRNG - Pseudo Random Number Generator
TRNG - True Random Number Generator

So, what's the real fuss on randomness? And why is it important to understand how random-number generators work?

First and foremost, we have to understand that most random-number generators depend on some mathematical formula to derive a "random" number, based on an input (called a seed).
Since a deterministic algorithm is used to arrive at the random number, these numbers are called "pseudo-random"; i.e. they appear random to a casual observer, but can be predicted.

For example, the typical algorithm used by the java.util.Random class is illustrated here. Such a PRNG will always produce the same sequence of random numbers for a given seed - even when run on different computers. Hence, if someone can guess the initial seed, it is possible to predict the entire sequence of random numbers. The default constructor of the Random class uses the system time as the seed, and hence this option is a little more secure than manually providing the seed. But it is still possible for an attacker to synchronize into the stream of such random numbers and therefore calculate all future random numbers! The snippet below demonstrates this determinism.
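A tiny demonstration of this point (the seed value 42 is arbitrary):

import java.util.Random;

public class SeededRandomDemo {
    public static void main(String[] args) {
        // Two java.util.Random instances with the same seed produce exactly the
        // same sequence - which is why a guessable seed makes the output predictable.
        Random first = new Random(42L);
        Random second = new Random(42L);
        for (int i = 0; i < 5; i++) {
            System.out.println(first.nextInt(100) + " == " + second.nextInt(100));
        }
    }
}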

So how can we generate true random numbers? There are multiple options here:

1) Use a hardware random number generator. Snippet from Wikipedia:

"Such devices are often based on microscopic phenomena that generate a low-level, statistically random 'noise' signal, such as thermal noise or the photoelectric effect or other quantum phenomena. 
A hardware random number generator typically consists of a transducer to convert some aspect of the physical phenomena to an electrical signal, an amplifier and other electronic circuitry to increase the amplitude of the random fluctuations to a macroscopic level, and some type of analog to digital converter to convert the output into a digital number, often a simple binary digit 0 or 1. 
By repeatedly sampling the randomly varying signal, a series of random numbers is obtained"

2) Use cryptographically secure random number generators (TRNG): These number generators collect entropy from various inputs that are truly unpredictable. For example, on Linux, /dev/random works in a sophisticated manner to capture hardware interrupts, CPU clock speeds, network packets, user input from the keyboard, etc. to arrive at a truly random seed.
On Windows, many parameters such as the process ID, thread ID, the system clock, the system time, the system counter, memory status, free disk clusters, the hashed user environment block, etc. are used to seed the PRNG.

On the Java platform, it is recommended to use the java.security.SecureRandom class, as it delegates the entropy gathering to a CSP (Cryptography Service Provider) that typically delegates the calls to the OS.

3) Use a third-party webservice that returns true random numbers. For example, http://www.random.org/ provides random numbers generated from 'atmospheric noise'.
The site HotBits provides random numbers derived from unpredictable radioactive decay.

Wednesday, October 03, 2012

Troubleshooting XML namespaces binding in SOAP request using JAXB

Recently helped a team resolve an issue regarding namespace handling in JAX-WS / CXF. Jotting down the solution as it might help other folks breaking their heads on this issue :)

Our Java application needed to consume a .NET webservice and we were facing challenges in SOA interoperability. We created the client stubs using the WSDL2Java tool of CXF. The WSDL had elementFormDefault="qualified".

The problem we ran across was that the complex types were all being returned as 'null'. Enabling a network sniffer, we checked the raw SOAP response reaching the client. The response was OK, with all fields populated. Hence the real issue was in the data binding to the Java objects.

A quick Google search revealed that the default behavior of JAXB, when it encounters marshalling/unmarshalling errors, is to ignore the exception and log it as a warning message! This was a shock, as there was no easy way to debug the issue on the server.

We then wrote a sample JAXB application and tried to un-marshall the XML/SOAP message. It is here that we started getting the following error stack:
org.apache.cxf.interceptor.Fault: Unmarshalling Error: unexpected element

Hence, it was proved that the real culprit was the JAXB data binding - the generated Java classes did not carry the correct namespace information, so the unmarshaller could not match the elements.

First we checked the generated Java classes and found a file called "package-info.java" that had the XML namespace 'annotated' on the package name - essentially a package annotation. So this was supposed to work; why then was the unmarshaller throwing an exception?

We tried adding the following attributes to the annotation and then the unmarshaller started working!
@javax.xml.bind.annotation.XmlSchema (
    namespace = "http://com.mypackage",
    elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED,
    attributeFormDefault = javax.xml.bind.annotation.XmlNsForm.UNQUALIFIED
)


Not sure why the WSDL2Java tool did not add these automatically based on the elementFormDefault="qualified" present in the WSDL. But this trick worked and we could consume the .NET webservice. We had to modify the build script to replace the "package-info.java" file every time it was regenerated.

Another option is to manually add the "namespace" attribute to all the @XmlType annotations in the generated Java classes, but this is very tedious.

Tuesday, September 25, 2012

Ruminating on Hibernate Entities and DTO

Today we had a lengthy debate on the hot topic of Hibernate entities vs DTOs used across the layers of an n-tiered application. Summarizing the key concepts that were clarified during our brainstorming :)

Q. Is it possible to use Hibernate entities in SOA-style web services?

Ans. Yes. It is possible, but with a few important caveats and serious disadvantages.

If you are using Hibernate annotations to decorate your entities, then your service clients must also have the necessary Hibernate/JPA jar files. This can be a big issue if you have non-Java clients (e.g. a .NET webservice client). If you use a mapping file (*.xml), then you are good to go, as there are no dependencies on the Hibernate jars. But any change in your data model will affect your entities, and this will result in changes to all your webservice consumers. So a lot of tight coupling :(

Also, you have to understand that Hibernate uses Javassist to create dynamic proxies of your Hibernate entities. So the entity you are referencing may actually be a generated dynamic proxy that holds a reference to the actual entity object. So you would need to de-proxy the entity before using it - either with Dozer or a similar mapping library.

Person cleanPerson = mapper.map(person, Hibernate.getClass(person));

Note: In the above code, Hibernate.getClass() returns the actual entity class of the proxy object. 

If you are using lazy loading (which most applications do), then you might encounter the famous "LazyInitializationException". This occurs when you detach the Hibernate entity and the serialization process triggers lazy loading for a property. To understand this exception, you need to understand the concept of sessions and transactions as given here.
If you do not use lazy loading and instead do an eager load of all your entities, then it would work - but this is only a short-term solution that may work only for small applications.


Q. Ok. I understand the pitfalls. So if I go with DTO, will my problems be solved?

Ans. Yes, DTO is the best option, but you may encounter a few issues again.

Many developers encounter the "LazyInitializationException" when they try to use libraries such as Dozer to copy properties. This happens because Dozer uses reflection behind the scenes to copy properties and again attempts to access uninitialized lazy collections. There are 2 ways to resolve this problem: use a custom field mapper as shown here, or use the latest version of Dozer that has a HibernateProxyResolver class to get the real entity class behind the Javassist proxy. This is explained in the proxy-handling section of the Dozer site.

Annotations are used at compile-time or runtime?

There is a lot of confusion among folks on the scope of annotations - are annotations used only at compile time, or also at runtime?

The answer is that it depends on the RETENTION POLICY of the annotation. An annotation can have one of the three retention policies.

RetentionPolicy.SOURCE: Available only during the compile phase and not at runtime. So they aren't written to the bytecode. Example: @Override, @SuppressWarnings

RetentionPolicy.CLASS: Available in the *.class file, but discarded by the JVM during class loading. Useful when doing bytecode-level post-processing. But not available at runtime.

RetentionPolicy.RUNTIME: Available at runtime in the JVM-loaded class. Can be accessed using reflection at runtime. Examples: Hibernate / Spring annotations.
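A small illustrative example (the @Audited annotation below is made up for this post) showing why RetentionPolicy.RUNTIME matters for reflection:

import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Method;

public class RetentionDemo {

    // A custom annotation retained in the bytecode and visible at runtime
    @Retention(RetentionPolicy.RUNTIME)
    @interface Audited {
        String value();
    }

    @Audited("payments")
    public void processPayment() {
    }

    public static void main(String[] args) throws Exception {
        Method method = RetentionDemo.class.getMethod("processPayment");
        Audited audited = method.getAnnotation(Audited.class);
        // Prints "payments" - possible only because the retention policy is RUNTIME;
        // with SOURCE or CLASS retention, getAnnotation() would return null.
        System.out.println(audited.value());
    }
}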

Serializing JPA or Hibernate entities over the wire

In my previous post, we discussed the need to map properties between Hibernate entities and DTOs. This is required because Hibernate instruments the byte-code of the Java entity classes using tools such as Javassist.

Many developers encounter Hibernate's LazyInitializationException when they try to use Hibernate/JPA entities in their webservices stack. To serialize the same Hibernate entities across the wire as XML, we have 2 options -
  1. Use the entity pruner library - This library essentially removes all the Hibernate dependencies from your entity classes by pruning them. Though this library is quite stable, developers should be careful about when to prune and unprune.
  2. Use auto-mapper libraries - Using libraries such as Dozer makes it very easy to copy properties from the Hibernate domain entity to a DTO class. For example:
PersonDTO cleanPerson = mapper.map(person, PersonDTO.class);

Out of the above 2 approaches, I would recommend the second one as it is clean and easy to use. It also enforces a clean separation of concerns. A good discussion on StackOverflow is available here.
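For completeness, a small sketch of how the recommended Dozer-based approach typically wires together. Person and PersonDTO are the hypothetical classes from the example above, and the Dozer 5.x API is assumed:

import org.dozer.DozerBeanMapper;
import org.dozer.Mapper;

public class PersonAssembler {

    // Dozer copies matching properties via reflection, so the DTO stays
    // free of any Hibernate/Javassist dependencies.
    private final Mapper mapper = new DozerBeanMapper();

    public PersonDTO toDto(Person person) {
        return mapper.map(person, PersonDTO.class);
    }
}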

Friday, September 21, 2012

How RSA Protected Configuration Provider works behind the scenes?

We were using the "RSA Protected Configuration Provider" to encrypt sensitive information in our config files. I was surprised to see that the generated config file also had a triple-DES encrypted key.

So that means the config section is actually encrypted/decrypted using this symmetric key. But where is the key that encrypted this key? It is here that the RSA public/private key pair comes into the picture. The public key in the RSA container is used to encrypt the DES key and the private key is used to decrypt it. A good forum thread discussing this is available here.

There is also a W3C standard for XML encryption available here.

How does DPAPI work behind the scenes?

In my previous post, we saw how DPAPI can be used on Windows platforms to encrypt sensitive information without having to worry about key management. The advantage of using DPAPI is that the data protection API is a core part of the OS and no additional libraries are required.

DPAPI is essentially a password-based data protection service and hence requires a password to provide protection. By default, the logged-on user's password (hash) is used for this. A good explanation of the internal working of DPAPI is given here: http://msdn.microsoft.com/en-us/library/ms995355.aspx

It's interesting to see how MS has used concepts such as the MasterKey and a PRNG to generate the keys that are actually used for encryption.
I was intrigued to understand how DPAPI behaves when the user/administrator changes the password. Snippet from the article:
"DPAPI hooks into the password-changing module and when a user's password is changed, all MasterKeys are re-encrypted under the new password."

Thursday, September 20, 2012

Can we use DPAPI from Java?

In my previous post, we discussed how DPAPI makes it simple to encrypt sensitive information without worrying about key generation and management.

I wondered if there was a Java API through which we could use DPAPI on Windows machines. It turns out there is an open-source JNI wrapper available for this:
http://jdpapi.sourceforge.net/index.html

Also worth reading is this excellent post on encryption key management. 

Encrypting sensitive information in the database

Recently, for PCI compliance, we needed to encrypt credit card information (PAN) before storing it in the database. We decided to use AES 256-bit encryption and developed a .NET component in the middle tier for encryption/decryption.

After this, we faced the chicken-and-egg problem of storing the encryption key :)
In the past, we had used techniques such as storing the key in the Windows registry and using RBAC to control access. Another option was to split the key into multiple files and use file-based OS permissions to control access.

But in the latest version of .NET, you have another good option - DPAPI (Data Protection API). Using DPAPI, we can delegate key management to the operating system.
We do not provide a key with which to encrypt the data. Rather, the data is encrypted with a key derived from the logged-in user or machine credentials. Thus we can pass any "sensitive" information to DPAPI and it will encrypt it using the "password" of the logged-in user or the machine-level credentials.

The following MSDN link gives detailed information on how to achieve this for connection strings - a common requirement:

http://msdn.microsoft.com/en-us/library/ms998280.aspx


In our case, we placed our encryption key in a configuration section and encrypted that section using the methods described in the above link.
Because we were using a web farm scenario with two servers, we had 2 options - either use DPAPI on each server to encrypt the data using the machine-specific key, or use a separate key store/container and use RSA for encryption. With the first option, we would end up with a different 'cipher-value' for the same input data on each server. With the second option, the 'cipher-value' would be the same, but we would need to import the RSA keys on each server in the farm.

More information at this link:   http://msdn.microsoft.com/en-us/library/ms998283.aspx 

Why do we need Base64 encoding?

We often use Base64 encoding when we need to embed an image (or any binary data) within an HTML or XML file. But why do we need to encode it as Base64? What would happen if we don't? A good discussion on this topic is available here.

Base64 was originally devised as a way to allow binary data to be attached to emails as MIME. We need to use Base64 encoding whenever we want to send binary data over an old ASCII-text-based protocol. If we don't do this, then there is a risk that certain bytes may be improperly interpreted, for example:
  • Newline chars such as 0x0A and 0x0D
  • Control characters like ^C, ^D, and ^Z that are interpreted as end-of-file on some platforms
  • NULL byte as the end of a text string
  • Bytes above 0x7F (non-ASCII)
We also use Base64 encoding in HTML/XML docs to avoid characters like '<' and '>' being interpreted as tags. A quick sketch of producing such an embeddable string is given below.
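As a rough illustration (the file name is a placeholder, and Apache commons-codec is assumed to be on the classpath), this is how binary data can be turned into a markup-safe Base64 string:

import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.commons.codec.binary.Base64;

public class Base64Demo {
    public static void main(String[] args) throws Exception {
        // "logo.png" is just a placeholder file name for this example
        byte[] imageBytes = Files.readAllBytes(Paths.get("logo.png"));

        // Base64 output contains only A-Z, a-z, 0-9, '+', '/' and '=',
        // so it survives ASCII-only protocols and does not clash with markup
        String encoded = Base64.encodeBase64String(imageBytes);
        System.out.println("<img src=\"data:image/png;base64," + encoded + "\"/>");
    }
}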

Beware of using Hibernate 'increment' id generator

While reviewing one of the development projects using Hibernate as the ORM tool, I noticed that they were using the "increment" ID generation feature.

The way this feature is implemented, it checks the highest value of the ID in the database and then increments it by one. But this approach fails miserably in a clustered environment, because many threads might be writing to the database concurrently and there is no locking.

It is best to stick to database-specific features for auto-ID generation, e.g. sequences in SQL Server 2012 and Oracle 11g, and identity columns in older versions of SQL Server, as sketched below.
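For example, a JPA/Hibernate mapping that delegates key generation to the database could look like this (the entity and field names are hypothetical):

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;

@Entity
public class PurchaseOrder {

    // Delegate key generation to the database (identity column) instead of
    // Hibernate's "increment" strategy, which is unsafe in a clustered setup.
    // For Oracle, GenerationType.SEQUENCE with a named sequence would be used instead.
    @Id
    @GeneratedValue(strategy = GenerationType.IDENTITY)
    private Long id;

    private String description;

    public Long getId() {
        return id;
    }
}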

Thursday, September 13, 2012

Delegated Authentication through a Login Provider

Nowadays it is common to see many internet sites delegating the authentication process to some third-party providers such as Google, Facebook, Twitter, etc.

So essentially you do not have to register and create a new login/password for the site. You can just use your existing login credentials of google, facebook, twitter, OpenID, LiveID, etc. Behind the scenes, the application and the login provider use standards such as OAuth and OpenID to do the magic.

The advantages of delegating authentication to a popular third-party provider are:
  1. Users don't have to register and create another set of username/password for your site. Thus the user has fewer passwords to remember.
  2. The application does not have to worry about creating a Login form and SSL enabling it.
Found 2 good libraries on google-code for enabling OAuth delegated authentication on your application:
http://code.google.com/p/socialauth/

http://code.google.com/p/socialauth-net/

Wednesday, September 12, 2012

How does Oracle RAC manage locks?

Typically, databases store locks in memory, and hence it is challenging to load-balance or cluster 'write' operations to a shared database across multiple database instances.
For example, in my previous blog I mentioned how SQL Server does not have an Oracle RAC equivalent.

So how does Oracle RAC manage to maintain locks & data consistency across the multiple nodes of a RAC cluster?

The secret sauce is the high-speed private network called the "interconnect". RAC has something called the "Cache Fusion" mechanism that allows for inter-instance block transfers and cache consistency. The Global Cache Service (GCS) is a set of processes that ensures only one instance modifies a block at any given time. GCS also flags in-memory blocks as "invalid" whenever the blocks are changed on other nodes.

The following links have a good explanation on the RAC consistency mechanism:
http://www.rampant-books.com/art_rac_global_block_management.htm

http://www.rampant-books.com/art_burleson_rac_cache_coherency.htm

Tuesday, September 11, 2012

Ruminating on SQLServer HA and DR strategy

Recently for one of our customers, we were evaluating the options for configuring SQLServer 2008 for High Availability and Disaster Recovery.

Having successfully used Oracle RAC technology many times in the past, I was surprised to realize that SQL Server does not have any equivalent to Oracle RAC. There is essentially no concept of load-balancing "read-write" requests between server instances working on a shared database storage system (e.g. SAN, RAID-10 array).

The only near equivalent to the RAC concept is SQL Server 2008 peer-to-peer transactional replication with updatable subscriptions. This is horribly complex to configure and maintain, with data being replicated to peer servers. Also, MS has finally decided not to support this feature anymore.

The clustering techniques of SQL Server 2008 use confusing terms such as "active/active" and "active/passive". But in reality, there is no load-balancing of database requests. The secondary server is typically in stand-by mode (as in active/passive). The 'active/active' concept essentially means that the SQL Servers are accessing two separate databases or a partitioned database. If one server fails, the processing is offloaded to the second machine. So now the second machine is running 2 server instances, and this bogs down the resources of that server.

In SQL Server 2012, MS has tried to move closer to the RAC feature set. There is a new feature called "AlwaysOn" that enables us to create availability (HA) groups and define replication strategies between the nodes in a group. The 'secondary' server is always 'read-only' and can be used for reporting purposes or for 'read' operations. This is far better, as we can at least offload our 'read' operations to a different server, but we still need to funnel all 'write' operations through the 'primary' server. A good discussion on this topic is available here.

After HA, we moved on to define our DR strategy. We first compared log shipping vs mirroring. In log shipping, the secondary database is marked as 'read-only' and can be used for reporting, but the replication lag can be up to 30 mins. Hence it cannot be used if you desire instantaneous failover.
In mirroring, the update to the secondary failover database is almost instantaneous, but since that database is always in recovery mode, it cannot be used for any other purpose.

Again, in mirroring we have 'synchronous' and 'asynchronous' replication. IMHO, synchronous replication is a disaster for any OLTP application and should actually never be used. Found a good case study on the MS site that details the strategy for "identity columns" when there is a potential for data loss using async mirroring. Another option that can be considered for DR is SQL replication.

Spring Security SecurityContextHolder

We have been happily using the static methods of SecurityContextHolder in our application. But as architects, it is important to understand what happens behind the scenes.

Where does Spring store the Principal object? My rough guess was that it would be either in the session or in a ThreadLocal container.
Found this great write-up that explains how the "Strategy" design pattern is used to store the UserDetails (Principal) object in the HTTP session, and how a filter then assigns it to a ThreadLocal container.

http://stackoverflow.com/questions/6408007/spring-securitys-securitycontextholder-session-or-request-bound

The advantage of this approach is that one can obtain the current 'Principal' anywhere in the code using simple static methods, rather than being dependent on the request or session object. A minimal sketch of such a helper is shown below.
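The helper class itself is mine; only the SecurityContextHolder calls are Spring Security API:

import org.springframework.security.core.Authentication;
import org.springframework.security.core.context.SecurityContextHolder;

public class CurrentUser {

    // Obtain the logged-in principal's name from anywhere in the call stack;
    // by default the context is held in a ThreadLocal, populated per request by a filter.
    public static String loggedInUsername() {
        Authentication authentication =
                SecurityContextHolder.getContext().getAuthentication();
        return (authentication != null) ? authentication.getName() : null;
    }
}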

Monday, September 10, 2012

Connection Strings Reference.

Found this cool site containing a consolidated listing of "connection-strings" that can be applied to most of the databases out there. Quite valuable for quick reference :)

http://www.connectionstrings.com/

Friday, September 07, 2012

Estimating the size of SQL Server Database

MSDN has a very good link that explains in simple language how we can estimate the size of a database. The size of the database depends on the size of the tables and the indexes. The link below gives easy formulas that can be used to calculate the database size.
http://msdn.microsoft.com/en-us/library/ms187445

A good Samaritan has also created Excel templates that use these formulas for database sizing.

On Oracle, you can use PL/SQL stored procedures to calculate table and index sizes. For example, the CREATE_TABLE_COST procedure of the DBMS_SPACE package can be used to find out the bytes required for a table. You just have to input the potential number of rows expected.

Thursday, September 06, 2012

Ruminating on RACI matrix

The RACI matrix (Responsible, Accountable, Consulted, Informed) is an excellent tool for mapping between processes/functions and roles.
It brings immediate clarity on who is Responsible for something, who is Accountable for the final result, who needs to be Consulted, and who is kept Informed.

There are several templates available on the net that can be used for the RACI matrix. One good template is available here. Once data is collated on such a matrix, then we can analyse questions such as -

a) Are there too many A's? It's usually best to have only 1 person/role accountable for a process/function.
b) Are there too many I's? Too much information that is not needed? etc.

We can use RACI matrices anywhere. You could "RACI" any deliverable in your project. Also, in EA governance, we can use RACI to clearly separate the responsibilities between the centralized EA team and project architects. Given below is an example of an EA governance RACI matrix.





Friday, August 31, 2012

Centralized Clustered ESB vs Distributed ESB

Often folks ask me about the difference between traditional EAI (Enterprise Application Integration) products and an ESB (Enterprise Service Bus). Traditionally, EAI products followed the hub-and-spoke model, and if you look at the topology options of many ESBs, you will see that the hub-and-spoke model is still followed!

The fact that almost all EAI product vendors have metamorphosed their old products into 'ESBs' adds to the confusion. For example, the popular IBM Message Broker product line is being branded as 'Advanced ESB', Microsoft has released an ESB Guidance package that depends on the BizTalk platform, etc.

In theory, there is a difference between EAI (hub-n-spoke) architecture and ESB (distributed bus) architecture. In a hub/spoke model, the hub 'broker' becomes the single point of failure as all traffic is routed through it. This drawback can be addressed by clustering brokers for high availability.

In a distributed ESB, there is a network of brokers that collaborate to form a messaging fabric. So in essence, there is a small lightweight ESB engine running on every node where a SOA component is deployed. This lightweight ESB engine does the message transformation and routing.

For example, in Apache ServiceMix you can create a network of message brokers (ActiveMQ). Multiple instances of ServiceMix can be networked together using ActiveMQ brokers. The client application sees it as one logical normalized message router (NMR). But here again, the configuration info (e.g. routing information, service names, etc.) is centralized somewhere.


So then, what is the fundamental difference between hub-n-spoke and ESB? IBM did a good job of clearing this confusion by bringing in 2 concepts/terms - "centralization of control config" and "distribution of infrastructure". A good blog explaining these concepts from IBM is here. Loved the diagram flow in this post. Point made :)

Snippet from IBM site:
"Both ESB and hub-and-spoke solutions centralize control of configuration, such as the routing of service interactions, the naming of services, etc. 
Similarly, both solutions might deploy in a simple centralized infrastructure, or in a more sophisticated, distributed manner. In fact, if the Hub-and-spoke model has these features it is essentially an ESB."

As explained earlier, some open-source ESBs such as Apache ServiceMix and Petals ESB are lightweight and have the core ESB engine (aka service engine) deployed on each node. These folks call themselves a "distributed ESB". Other vendors, such as IBM, use the concept of "federated ESBs" for distributed topologies across ESB domains.

Wednesday, August 29, 2012

Translate Chinese unicode code points to English

One of our legacy applications had its UI in Chinese and it was required to convert it to English.
Instead of hiring a translator, we decided to use Google Translation Services.

But the application was picking up the Chinese labels/messages from a properties file. The properties file had the Chinese characters expressed as Unicode code points. The Google Translate webpage expected Chinese characters to be typed or copy-pasted onto the form. We searched for a similar translation service that would accept Unicode code points, but in vain.

Finally, we decided to write a simple program that would write the Chinese Unicode code points to a file, and then open the file using a program such as Notepad++ or MS Word. These programs support Chinese characters and allow you to copy-paste them onto the Google Translate page.

Given below is the simple Java code snippet to write to a file. Please open this file using MS Word (or any other program that supports UTF-8 font rendering).
-------------------------------------------------
import java.io.File;
import com.google.common.io.Files;

public class Chinese_Chars {
    public static void main(String[] args) throws Exception {

        // Chinese characters expressed as Unicode escape sequences
        String str = "\u6587\u4EF6";
        byte[] array = str.getBytes("UTF-8");

        // Write the UTF-8 bytes to a file using Google Guava's Files utility
        File file = new File("d:/temp.txt");
        Files.write(array, file);
    }
}
---------------------------------------------------

Shown below are some screenshots of the Google Translate page and MS Word opening the file.



Tuesday, August 21, 2012

Understanding REST

In one of my old posts, I had elaborated on the differences between REST and simple POX/HTTP.

Recently came across an interesting post by Ryan Tomayko, wherein he tries to explain REST in a simple narrative style. A must read :)    http://tomayko.com/writings/rest-to-my-wife

Another interesting discussion thread on REST is available at StackOverFlow - regarding verbs and error codes.

File Upload Security - Restrict file types

In web applications, we often have to restrict the file types that can be uploaded to the server. One way to restrict them is by checking the file extension. But what if someone changes the file extension and tries to upload the file?

For common file types such as GIF, PDF and JPEG, we can check the contents of the file for a "signature" or "magic number". More information is given in this blog post - http://hrycan.com/2010/06/01/magic-numbers/ A minimal sketch of such a check follows below.
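The sketch below checks for the standard PNG signature; the class and method names are mine, and other formats would need their own signatures.

import java.io.IOException;
import java.io.InputStream;

public class MagicNumberCheck {

    // PNG files start with these 8 bytes; JPEG, for comparison, starts with FF D8 FF
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A};

    public static boolean looksLikePng(InputStream in) throws IOException {
        byte[] header = new byte[PNG_MAGIC.length];
        int read = in.read(header);
        if (read < PNG_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PNG_MAGIC.length; i++) {
            if (header[i] != PNG_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}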

The Apache Tika project can be used to quickly extract meta-data information from a file stream. 

List of names...

Very often, we quickly need to populate a sample database with a list of names. In the past, we often did this by using random numbers appended to some common names.

But found this cool site on the web that gives us thousands of sample names that can be used to populate our databases for demo purposes.

http://www.listofnames.info/

Monday, August 13, 2012

Ruminating on JSF

In the past, I always hated JSF, the same way I hated EJB 2.x. But of late, I am seeing a renewed interest in JSF, especially since a lot of pain areas were resolved in the JSF 2.0 specification.

Over the last couple of days, I have been evaluating PrimeFaces - an open-source component suite for JSF 2.0 - and I would say that I am pretty impressed. Going through the sample demo pages, I was mighty pleased with the neat and clean code in both the XHTML files and the POJO bean Java code.

Also PrimeFaces has a plethora of components that should suffice for 90-95% of all basic web application requirements.

In general, if you do not require heavy UI customization and can hence sacrifice absolute control over the generated HTML, CSS and JS, then I would say that using PrimeFaces would greatly increase the productivity of an average development team. IMHO, the productivity gain could be as high as 50% over doing the conventional plumbing using MVC frameworks and JQuery.

But if there is a special requirement that cannot be satisfied by the standard UI components provided by PrimeFaces, then you are in trouble. You would then need deeper expertise to write your own JSF component or customize existing ones.

Based on my study and the common challenges we faced, I am jotting down some FAQ's that should be useful for folks embarking on using PrimeFaces.

Q) How to manipulate JSF components on the client side using plain JS or JQuery? How to use JQuery JS API or any other JS library on the client side with JSF?

Q) How to include JS file or inline JS code in a JSF XHTML page?
A) There are 3 ways to handle this.
  • Escape all special chars like 'greater than' or 'lesser than'
  • Use <![CDATA[ ... ]]> to hold your JavaScript code
  • Put the JavaScript code in a separate .js file, and reference it from the JSF page
http://www.mkyong.com/jsf2/how-to-include-javascript-file-in-jsf/
Q) How do I output HTML text in JSF? Do I need to use the 'verbatim' tag?
A) <h:outputText escape="false" value="#{daBean.markedUpString}" />

Q) Can I mix HTML tags with JSF tags?
A) You can. It is not as much of a pain as in JSF 1.x, but you need to be aware of the issues.
http://stackoverflow.com/questions/5474178/jsf-facelets-why-is-it-not-a-good-idea-to-mix-jsf-facelets-with-html-tags

Friday, August 03, 2012

SOA interoperability - .NET WCF from Java

Recently, one of our teams was struggling to integrate a Java application with a .NET WCF service. The exception that was thrown on the Java side was a SOAPFault as shown below:

SOAPFaultException: The message with Action 'https://tempService:48493/VirtualTechnician/AcceptVehicleTermsOfService' cannot be processed at the receiver, due to a ContractFilter mismatch at the EndpointDispatcher. This may be because of either a contract mismatch (mismatched Actions between sender and receiver) or a binding/security mismatch between the sender and the receiver. Check that sender and receiver have the same contract and the same binding (including security requirements, e.g. Message, Transport, None). 

After a lot of debugging using SoapUI and Wireshark, we found out that the problem was not in the SOAP message, but in the HTTP headers. The SOAPAction HTTP header needs to be set in the HTTP POST request.

On JAX-WS, it can be done with the following code snippet:

BindingProvider bp = (BindingProvider) smDispatch;
bp.getRequestContext().put(BindingProvider.SESSION_MAINTAIN_PROPERTY, Boolean.TRUE);
bp.getRequestContext().put(BindingProvider.SOAPACTION_USE_PROPERTY, Boolean.TRUE);
bp.getRequestContext().put(BindingProvider.SOAPACTION_URI_PROPERTY, "http://tempuri.org/IVtOwl/AcceptVehicleTermsOfService"); // the correct SOAP Action

Custom PMD rules using XPath

Writing custom rules in PMD using XPath is an exciting concept, but unfortunately there are not many good tutorials or reference guides available on the internet for this.

Recently, we wanted to write custom PMD rules to extract Spring JDBC calls from the code base. We utilized the PMD designer that is provided OOTB in Eclipse to easily write the rules.

Just open Eclipse -> Preferences -> PMD -> Rule Designer.
In the Rule Designer, copy-paste your source code and check the AST (Abstract Syntax Tree) that is formed. You can also copy the AST XML from the menu bar and paste it into a text editor. Writing the XPath expression then becomes very simple!

For example, for finding the Spring JDBC query calls, the XPath was:
//PrimaryPrefix[Name[starts-with(@Image,'jdbcTemplate.query')]]

Wednesday, August 01, 2012

Ruminating on "Trickle Batch Load".

Traditionally, most batch jobs were run at end-of-day, when the entire day's transaction log was pulled out and processed in some way. In the good old days, when the volume of data was low, these batch processes could easily meet their business SLAs. Even if there was a failure, there was sufficient time to correct the data, rerun the process and still meet the SLAs.

But today, the volume of data has grown exponentially. Across many of our customers, we have seen challenges around meeting SLA's due to large data volumes. Also business stakeholders have become more demanding and want to see sales reports and spot trends early - many times during a day. To tackle these challenges, we have to resort to the 'trickle batch' load design pattern.

In trickle batch, small delta changes from source systems are sent for processing multiple times a day. There are various advantages to such a design -
  • The business gets near real-time access to critical information, e.g. sales per hour.
  • Business SLA's can be easily met, as EOD processes are no longer a bottleneck. 
  • Operational benefits include low network latency and reduced CPU/resource utilization.
  • Early detection of issues - network issues, file corruption, etc. 
The typical design strategies used to implement "trickle batch" rely on the CDC (Change Data Capture) capabilities of databases and ETL tools.
Today almost all ETL tools, such as IBM InfoSphere DataStage, Informatica, Oracle Data Integrator, etc., have in-built CDC capabilities.
Trickle batches typically feed an ODS (Operational Data Store) that can run transactional reports. Data from the ODS is then fed to a DW appliance for MIS reporting.