Friday, November 25, 2005

Microsoft Enterprise Application blocks

Recently, Microsoft's Patterns & Practices group released the Enterprise Library, a large, configurable, and extensible software library that consists of seven integrated application blocks.

For more information and download visit http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag2/html/entlib.asp

The application blocks that comprise the Enterprise Library are the following:
Caching Application Block. This application block allows developers to incorporate a local cache in their applications.
Configuration Application Block. This application block allows applications to read and write configuration information.
Data Access Application Block. This application block allows developers to incorporate standard database functionality in their applications.
Cryptography Application Block. This application block allows developers to include encryption and hashing functionality in their applications.
Exception Handling Application Block. This application block allows developers and policy makers to create a consistent strategy for processing exceptions that occur throughout the architectural layers of enterprise applications.
Logging and Instrumentation Application Block. This application block allows developers to incorporate standard logging and instrumentation functionality in their applications.
Security Application Block. This application block allows developers to incorporate security functionality in their applications. Applications can use the application block in a variety of situations, such as authenticating and authorizing users against a database, retrieving role and profile information, and caching user profile information.

Ruminating on Transactions

Most of us have worked on transactions at one time or another. Transactions can be broadly classified as local or distributed.
Local transactions are the simplest form, in which the application performs CRUD operations against a single data source/database. The simplest form of relational database access involves only the application, a resource manager, and a resource adapter. The resource manager can be a relational database management system (RDBMS), such as Oracle or SQL Server. All of the actual database management is handled by this component.
The resource adapter is the component that acts as the communications channel, or request translator, between the "outside world" (in this case the application) and the resource manager. In Java applications, this is a JDBC driver.
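
As a quick illustration, here is a minimal sketch of a local transaction handled entirely through the JDBC connection (the connection details and the 'accounts' table are hypothetical):
-----------------------------------
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class LocalTxExample {
    public void transfer(String url, String user, String pwd) throws SQLException {
        Connection con = DriverManager.getConnection(url, user, pwd);
        try {
            con.setAutoCommit(false);   // start a local transaction on this single connection

            PreparedStatement debit = con.prepareStatement(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?");
            debit.setInt(1, 100);
            debit.setInt(2, 1);
            debit.executeUpdate();

            PreparedStatement credit = con.prepareStatement(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?");
            credit.setInt(1, 100);
            credit.setInt(2, 2);
            credit.executeUpdate();

            con.commit();               // both updates succeed, or...
        } catch (SQLException e) {
            con.rollback();             // ...the data is left unaltered
            throw e;
        } finally {
            con.close();
        }
    }
}
-----------------------------------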

In a distributed transaction, the transaction accesses and updates data on two or more networked resources, and therefore must be coordinated among those resources.
These resources could consist of several different RDBMSs housed on a single server, for example, Oracle, SQL Server, and Sybase; or they could include several instances of a single type of database residing on a number of different servers. In any case, a distributed transaction involves coordination among the various resource managers. This coordination is the function of the transaction manager. The transaction manager is responsible for making the final decision either to commit or roll back any distributed transaction. A commit decision should lead to a successful transaction; a rollback leaves the data in the database unaltered.

The first step of the distributed transaction process is for the application to send a request for the transaction to the transaction manager. Although the final commit/rollback decision treats the transaction as a single logical unit, there can be many transaction branches involved. A transaction branch is associated with a request to each resource manager involved in the distributed transaction. Requests to three different RDBMSs, therefore, require three transaction branches. Each transaction branch must be committed or rolled back by the local resource manager. The transaction manager controls the boundaries of the transaction and is responsible for the final decision as to whether or not the total transaction should commit or rollback. This decision is made in two phases, called the Two-Phase Commit Protocol.

In the first phase, the transaction manager polls all of the resource managers (RDBMSs) involved in the distributed transaction to see if each one is ready to commit. If a resource manager cannot commit, it responds negatively and rolls back its particular part of the transaction so that data is not altered.

In the second phase, the transaction manager determines if any of the resource managers have responded negatively, and, if so, rolls back the whole transaction. If there are no negative responses, the transaction manager commits the whole transaction and returns the results to the application.
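
In the Java world, such a distributed transaction is usually demarcated through JTA's UserTransaction, and the application server's transaction manager drives the two-phase commit behind the scenes. A minimal sketch follows; the JNDI names, datasources and SQL are hypothetical and depend on your application server:
-----------------------------------
import java.sql.Connection;
import java.sql.Statement;

import javax.naming.InitialContext;
import javax.sql.DataSource;
import javax.transaction.UserTransaction;

public class DistributedTxExample {
    public void updateBothDatabases() throws Exception {
        InitialContext ctx = new InitialContext();
        UserTransaction utx = (UserTransaction) ctx.lookup("java:comp/UserTransaction");

        utx.begin();    // the transaction manager starts the global transaction
        try {
            // two XA-capable datasources, i.e. two resource managers
            DataSource ds1 = (DataSource) ctx.lookup("jdbc/OrdersDB");
            DataSource ds2 = (DataSource) ctx.lookup("jdbc/BillingDB");

            Connection c1 = ds1.getConnection();
            Connection c2 = ds2.getConnection();

            Statement s1 = c1.createStatement();
            s1.executeUpdate("UPDATE orders SET status = 'SHIPPED' WHERE id = 42");

            Statement s2 = c2.createStatement();
            s2.executeUpdate("INSERT INTO invoices (order_id) VALUES (42)");

            c1.close();
            c2.close();

            utx.commit();    // phase 1: prepare both branches; phase 2: commit them
        } catch (Exception e) {
            utx.rollback();  // any failure rolls back every transaction branch
            throw e;
        }
    }
}
-----------------------------------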

Making web service proxy URL dynamic in .NET

In VS.NET, whenever we add a web reference to a web service, what happens behind the scenes is that a proxy class is generated which contains the URL of the web service.

Now if we need to move the web service to a production server, do we have to regenerate the proxy? (The address of the web service server would have changed.)

Fortunately there is an easy way out. Just right-click on the proxy in VS.NET and change its URL Behavior property from 'static' to 'dynamic'. Voila! VS.NET automatically adds code to the proxy class and an entry to web.config. The newly added code in the proxy will first check for a URL key in web.config and use it to bind to the web service.
So when we move the web service to a different server, we just have to change the URL in the web.config file.

Thursday, November 24, 2005

Object pooling in Java

Quite often, we may have to use object pooling to conserve resources and improve performance.
I came across some pretty good abstract components that can be used to develop an object pooling mechanism.

Check out the following link to see all the classes that are available for use:
http://jakarta.apache.org/commons/pool/

An interesting aspect of the package was the clear separation of object creation from the way the objects are pooled. There is a PoolableObjectFactory interface that provides a generic interface for managing the lifecycle of a pooled instance.

By contract, when an ObjectPool delegates to a PoolableObjectFactory (see the sketch after this list):

makeObject is called whenever a new instance is needed.
activateObject is invoked on every instance before it is returned from the pool.
passivateObject is invoked on every instance when it is returned to the pool.
destroyObject is invoked on every instance when it is being "dropped" from the pool (whether due to the response from validateObject, or for reasons specific to the pool implementation.)
validateObject is invoked in an implementation-specific fashion to determine if an instance is still valid to be returned by the pool. It will only be invoked on an "activated" instance.
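
To illustrate that separation, here is a minimal sketch (assuming commons-pool 1.x on the classpath) that pools StringBuffer instances. Only makeObject and passivateObject are overridden; the other lifecycle methods keep the no-op defaults inherited from BasePoolableObjectFactory:
-----------------------------------
import org.apache.commons.pool.BasePoolableObjectFactory;
import org.apache.commons.pool.ObjectPool;
import org.apache.commons.pool.impl.GenericObjectPool;

public class StringBufferFactory extends BasePoolableObjectFactory {

    // called by the pool whenever a new instance is needed
    public Object makeObject() throws Exception {
        return new StringBuffer();
    }

    // called when an instance is returned to the pool
    public void passivateObject(Object obj) throws Exception {
        ((StringBuffer) obj).setLength(0);   // wipe state before the object is re-used
    }

    public static void main(String[] args) throws Exception {
        // the pool handles pooling policy, the factory handles the object lifecycle
        ObjectPool pool = new GenericObjectPool(new StringBufferFactory());

        StringBuffer buf = (StringBuffer) pool.borrowObject();   // activated and handed out
        try {
            buf.append("pooled!");
            System.out.println(buf);
        } finally {
            pool.returnObject(buf);                               // passivated and returned
        }
    }
}
-----------------------------------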

Doug Lea's Home Page

I have been impressed with the genius of Doug Lea ever since I saw his concurrency package for Java. His home page contains a lot of interesting links: http://gee.cs.oswego.edu/dl/

Chief among them are his concurrency utilities and his papers on concurrent programming. The page is worth a perusal.

Thread pools in Java

I have often used thread pools in my applications. My favourite open source thread pool was the PooledExecutor class written by Doug Lea (part of his util.concurrent library).

But it's also important to understand when to use thread pools and when not to. I recently came across a good article on the web - http://www-128.ibm.com/developerworks/java/library/j-jtp0730.html

In this article the author argues that using thread pools makes sense whenever you have a large number of tasks that need to be processed and each task is short-lived, e.g. web servers, FTP servers, etc.
But when the tasks are long-running and few in number, it may make sense to actually spawn a thread for each task. Another common threading model is to have a single background thread and task queue for tasks of a certain type. AWT and Swing use this model, in which there is a GUI event thread, and all work that causes changes in the user interface must execute in that thread.

Java 5.0 has come up with a new package, java.util.concurrent, that contains a lot of cool classes if you need to implement threading in your applications.
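
For example, a fixed-size pool of worker threads servicing a shared task queue takes only a few lines (a minimal sketch):
-----------------------------------
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ThreadPoolExample {
    public static void main(String[] args) throws InterruptedException {
        // a pool of 4 worker threads servicing a shared task queue
        ExecutorService pool = Executors.newFixedThreadPool(4);

        for (int i = 0; i < 20; i++) {
            final int taskId = i;
            pool.execute(new Runnable() {
                public void run() {
                    // a short-lived task - exactly the kind of work that suits a pool
                    System.out.println("Task " + taskId + " on " + Thread.currentThread().getName());
                }
            });
        }

        pool.shutdown();                              // stop accepting new tasks
        pool.awaitTermination(10, TimeUnit.SECONDS);  // wait for the queued tasks to finish
    }
}
-----------------------------------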

Thursday, November 17, 2005

StringBuilder in Java 5.0

There is a new class in Java 5.0 - the StringBuilder class - that can be used whenever we are doing some heavy-duty string manipulation and we do not wish to get bogged down by the immutability of strings.
But then what happened to the good old StringBuffer class? Well, it's still there.
Here's what the JavaDoc says:

This StringBuilder class provides an API compatible with StringBuffer, but with no guarantee of synchronization. This class is designed for use as a drop-in replacement for StringBuffer in places where the string buffer was being used by a single thread (as is generally the case). Where possible, it is recommended that this class be used in preference to StringBuffer as it will be faster under most implementations.

Instances of StringBuilder are not safe for use by multiple threads. If such synchronization is required then it is recommended that StringBuffer be used.
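
A quick illustration of the typical single-threaded use case, building a large string inside a loop:
-----------------------------------
public class StringBuilderExample {
    public static void main(String[] args) {
        // string concatenation in a loop creates a new immutable String on every pass;
        // StringBuilder appends into one mutable buffer instead
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 10000; i++) {
            sb.append(i).append(',');
        }
        String result = sb.toString();
        System.out.println(result.length());
    }
}
-----------------------------------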

Wednesday, November 09, 2005

Understanding Unicode

I still see many people with a lot of myths about Unicode. I guess the reason for this is that a lot of people still feel that Unicode is an encoding format that uses 16 bits to represent a character.
Let's put a few things in perspective here:

Unicode is a standard that defines a character code for every character in most of the world's written languages. It also defines character codes for items such as scientific, mathematical, and technical symbols, and even musical notation. These character codes are also known as code points.

Unicode characters may be encoded at any code point from U+0000 to U+10FFFF, i.e. Unicode reserves 1,114,112 (= 2^20 + 2^16) code points, and currently assigns characters to more than 96,000 of those code points. The first 256 code points precisely match those of ISO 8859-1, the most popular 8-bit character encoding in the "Western world"; as a result, the first 128 characters are also identical to ASCII.

The number of bits used to encode a code point can differ: the size of the code unit used for expressing those code points may be 8 bits (for UTF-8), 16 bits (for UTF-16), or 32 bits (for UTF-32).
So what this means is that there are several formats for storing Unicode code points. When combined with the byte order of the hardware (big-endian or little-endian), these are known officially as "character encoding schemes." They are also known by their UTF acronyms, where UTF stands for "Unicode Transformation Format".

UTF-8 is widely used because the first 128 code points are encoded as single bytes identical to ASCII, and although up to four bytes can be used per character, only one byte is needed for most text in the English-speaking world. UTF-32 uses a fixed four bytes per code point, while UTF-16 uses two bytes for most characters and a pair of 16-bit code units (a surrogate pair) for supplementary characters.

To put it in other words, Unicode text can be represented in more than one way, including UTF-8, UTF-16 and UTF-32. So, hey...what's this UTF?

A Unicode transformation format (UTF) is an algorithmic mapping from every Unicode code point to a unique byte sequence. UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-32 is used by various Unix systems. The conversions between all of them are algorithmically based, fast and lossless. This makes it easy to support data input or output in multiple formats, while using a particular UTF for internal storage or processing.
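
You can see the different UTFs at work from Java itself: the same string produces byte sequences of different lengths depending on the encoding chosen (a small sketch; note that Java strings are internally UTF-16):
-----------------------------------
import java.io.UnsupportedEncodingException;

public class EncodingExample {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String text = "H\u00E9llo";                // 5 code points, one of them ('e' with acute) non-ASCII

        byte[] utf8  = text.getBytes("UTF-8");     // the accented character takes 2 bytes, the rest 1 byte each
        byte[] utf16 = text.getBytes("UTF-16BE");  // every character here takes 2 bytes

        System.out.println("UTF-8  : " + utf8.length + " bytes");   // prints 6
        System.out.println("UTF-16 : " + utf16.length + " bytes");  // prints 10
    }
}
-----------------------------------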

For more information visit http://www.unicode.org/faq/

On Bytes and Chars...

During i18n and localization, we often come across fundamental issues such as:
- How many bytes make a character?
- How many characters/bytes are present in a string?

Each character gets encoded into bytes according to a specific charset. For example, ASCII uses a 7-bit encoding, i.e. each character is represented by 7 bits; ANSI code pages such as Cp1252 use 8-bit encodings; and UTF-16 uses 16-bit code units. UTF-8, which is a popular encoding on the internet, is a multi-byte Unicode charset. So if someone asks how many bytes make a character, the answer is: it depends on the charset used to encode the character.

Another interesting point in Java is the difference between a 'char' and a character.
When we do "String.length()" in Java, we get the number of chars in the string. But a Unicode character may be made up of more than one 'char'.
This blog entry sheds light on the concept: http://forum.java.sun.com/thread.jspa?threadID=671720

Snippet from the above blog:
---------------------------------
A char is not necessarily a complete character. Why? Supplementary characters exist in the Unicode charset. These are characters that have code points above the base set, and they have values greater than 0xFFFF. They extend all the way up to 0x10FFFF. That's a lot of characters. In Java, these supplementary characters are represented as surrogate pairs, pairs of char units that fall in a specific range. The leading or high surrogate value is in the 0xD800 through 0xDBFF range. The trailing or low surrogate value is in the 0xDC00 through 0xDFFF range. What kinds of characters are supplementary? You can find out more from the Unicode site itself.

So, if length won't tell me how many characters are in a String, what will? Fortunately, the J2SE 5.0 API has a new String method: codePointCount(int beginIndex, int endIndex). This method will tell you how many Unicode code points are between the two indices. The index values refer to code unit or char locations, so endIndex - beginIndex for the entire String is equivalent to the String's length. Anyway, here's how you might use the method:

int charLen = myString.length();                          // number of UTF-16 char code units
int characterLen = myString.codePointCount(0, charLen);   // number of Unicode code points (characters)
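
To make the difference concrete, here is a small self-contained example using the musical G clef (U+1D11E), a supplementary character that Java represents as a surrogate pair:
-----------------------------------
public class CodePointExample {
    public static void main(String[] args) {
        // "A" followed by MUSICAL SYMBOL G CLEF (U+1D11E), written as a surrogate pair
        String s = "A\uD834\uDD1E";

        int charLen = s.length();                          // 3 char code units
        int characterLen = s.codePointCount(0, charLen);   // 2 Unicode code points

        System.out.println("chars: " + charLen + ", code points: " + characterLen);
    }
}
-----------------------------------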

Tuesday, November 08, 2005

Preventing SQL injection attacks...

I still see a lot of applications that are vulnerable to SQL injection attacks because of the use of dynamic SQL built through string concatenation. It is a common myth that to circumvent SQL injection attacks we 'have' to use stored procedures.
Even if we are using dynamic SQL, it is pretty simple to avoid these attacks.

In .NET the following techniques can be used:
Step 1. Constrain input - validate input using both client-side and server-side validation (e.g. using regular expressions).
Step 2. Use parameters with stored procedures. There is one caveat with stored procedures: if your stored procedure uses the 'EXEC' command, which takes a string, then the same vulnerability exists there too.
Step 3. Use parameters with dynamic SQL. (Yepppiee...in .NET it's so simple to have named parameters even for dynamic SQL)

Code snippet :
-----------------------------------
SqlDataAdapter myDataAdapter = new SqlDataAdapter(
    "SELECT au_lname, au_fname FROM Authors WHERE au_id = @au_id", connection);
myDataAdapter.SelectCommand.Parameters.Add("@au_id", SqlDbType.VarChar, 11);
myDataAdapter.SelectCommand.Parameters["@au_id"].Value = SSN.Text;
-----------------------------------

Other important points to be considered are using a least-privileged database account and avoiding disclosing error information to the user.
A good article discussing this is at: http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag2/html/paght000002.asp

In Java, one can use 'Prepared Statements' and 'Stored Procedures' to prevent SQL injection attacks.
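
Here is a minimal sketch of the Java equivalent, binding the user input as a parameter of a PreparedStatement instead of concatenating it into the SQL string (the Authors table mirrors the .NET snippet above):
-----------------------------------
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class SafeQuery {
    public void printAuthor(Connection connection, String authorId) throws SQLException {
        // the user-supplied value is bound as a parameter, never concatenated into the SQL text
        PreparedStatement ps = connection.prepareStatement(
            "SELECT au_lname, au_fname FROM Authors WHERE au_id = ?");
        try {
            ps.setString(1, authorId);
            ResultSet rs = ps.executeQuery();
            while (rs.next()) {
                System.out.println(rs.getString("au_lname") + ", " + rs.getString("au_fname"));
            }
        } finally {
            ps.close();
        }
    }
}
-----------------------------------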

Common terms in Datawarehousing.

I have been interested in data warehousing concepts for a long time, but unfortunately never got a chance to work on them. Here are some common concepts you need to know to understand any data warehousing jargon.

Data Warehousing: An enterprise-wide implementation that replicates data from the same publication table on different servers/platforms to a single subscription table. This implementation effectively consolidates data from multiple sources.

Data Warehouse: A data warehouse is a subject-oriented, integrated, non-volatile, time-variant collection of data. The data warehouse contains atomic-level data and summarized data specifically structured for querying and reporting.

Data Mining: The process of finding hidden patterns and relationships in data. For instance, a consumer goods company may track 200 variables about each consumer. There are scores of possible relationships among the 200 variables. Data mining tools will identify the significant relationships.

OLAP (On-Line Analytical Processing)
Describes the systems used not for application delivery, but for analyzing the business, e.g., sales forecasting, market trends analysis, etc. These systems are also more conducive to heuristic reporting and often involve multidimensional data analysis capabilities.

OLTP (OnLine Transaction Processing)
Describes the activities and systems associated with a company's day-to-day operational processing and data (order entry, invoicing, general ledger, etc.).

Data Dictionary:
A software tool for recording the definition of data, the relationship of one category of data to another, the attributes and keys of groups of data, and so forth.