Monday, January 29, 2018

Ruminating on the V model of software testing

In the V model of software testing, the fundamental concept in to interweave testing activities into each and every step of the development cycle. Testing is NOT a separate phase in the SDLC, but rather testing activities are carried out right from the start of the requirements phase.

  • During requirements analysis, the UAT test cases are written. In fact, many teams have started using user stories and acceptance criteria as test cases. 
  • System test cases and Performance test cases are written during the Architecture definition and Design phase. 
  • Integration test cases are written during the coding and unit testing phase. 
A good illustration of this is given at

Monday, January 22, 2018

Ruminating on Netflix Conductor Orchestration

We have been evaluating the Netflix open source Conductor project and are intrigued by the design decisions made in it.

Netflix Conductor can be used for orchestration of microservices. Microservices are typically loosely coupled using pub/sub semantics and leveraging a resilient message broker such as Kafka. But this simple and proven event driven architecture did not suffice the needs for Netflix.

A good article on the challenges faced by the Netflix team and the genesis of Conductor is available here. Snippets from the article:

"Pub/sub model worked for simplest of the flows, but quickly highlighted some of the issues associated with the approach:

  • Process flows are “embedded” within the code of multiple applications - e.g. If you have a pipeline of publishers and subscribers, it becomes difficult to understand the big picture without proper design documentation. 
  • There is tight coupling and assumptions around input/output, SLAs etc, making it harder to adapt to changing needs  - i.e. How will you monitor that the response for a particular task is completed within a 3 second SLA? If not done within this timeframe, we need to mark this task as a failure. Doing this is very difficult with pub/sub unless we code this ourselves.
  • No easy way to monitor the progress - When did the process complete? At what step did the transaction fail?

To address all the above issues, the Netflix team created Conductor. Conductor servers are also stateless and can be deployed on multiple servers to handle scale and availability needs.

Ruminating on idempotency of REST APIs

Most REST services are stateless - i.e. they do not store state in memory and can be scaled out horizontally.

There is another concept called a 'idempotent' operation. An operation is considered to be idempotent if making the same call with the same input parameters repeatedly gives the same results. In other words, the service does not differentiate between one call or a million calls.

Typically all GET, HEAD, OPTIONS calls are safe and idempotent. PUT and DELETE are also considered idempotent, but they can return different responses, but with no state change. A good video describing this is here  -

POST calls are NOT idempotent, as every call creates a new record. 

Wednesday, January 17, 2018

Why GPU computing and Deep Learning are a match made in heaven?

Deep learning is a branch of machine learning that uses multi-layered neural networks for solving a number of challenging AI problems.

GPU (Graphics Processing Unit) architecture are fundamentally different from CPU architecture. A GPU chip would be significantly slower that a CPU, but a single GPU might have thousands of cores while a CPU usually has not more than 12 cores.

Hence any task that can be parallelized over multiple core is a perfect fit for GPUs. Now it so happens, that deep learning algorithms involve a lot of matrix multiplications that are an excellent candidate for parallel processing over the cores of a GPU. Hence GPUs make deep learning algorithms run faster by an order of magnitude. Even training of models is much faster and hence you can expedite the GTM of your AI solutions.

An example of how GPUs are better for AI solutions is considering AlexNet, a well known image classification deep network. A modern GPU costing about $1000 takes 2.5 days to fully train AlexNet on the very large ImageNet dataset. On the otherhand, it takes a CPU costing several thousand dollars nearly 43 days.

A good video demonstrating the difference between GPU and CPU is here:

Timeout for REST calls

Quite often, we need a REST API to respond within a specified time frame, say 1500 ms.
Implementing this using the excellent Spring RestTemplate is very simple. A good blog explaining how to use RestTemplate is available here -


Monday, January 08, 2018

The curious case of Java Heap Memory settings on Docker containers

On docker containers, developers often come across OOM (out-of-memory) errors and get baffled because most of their applications are stateless applications such as REST services.

So why do docker containers throw OOM errors even when there are no long-living objects in memory and all short-lived objects should be garbage collected?
The answer lies in how the JVM calculates the max heap size that should be allocated to the Java process.

By default, the JVM assigns 1/4 of the total physical memory as the max heap size for the Java runtime. But this is the physical memory of the server (or VM) and not of the docker container.
So let's say your server has a memory of 16 GB on which you are running 8 docker containers with each container configured for 2 GB. But your JVM has no understanding of the docker max memory size. The JVM will allocate 1/4 of 16 GB = 4 GB as the max heap size. Hence garbage collection MAY not run unless the full heap is utilized and your JVM memory may go beyond 2 GB.
When this happens, docker or Kubernetes would kill your JVM process with an OOM error.

You can simulate the above scenario and print the Java heap memory stats to understand the issue. Print the runtime.freeMemory(), runtime.MaxMemory() and runtime.totalMemory() to understand the patterns of memory allocation and garbage collection.

The diagram below gives a good illustration of the memory stats of JVM.

So how do we solve the problem? There is a very simple solution - Just explicitly set the max memory of the JVM when launching the program - e.g. java -Xmx 2000m -jar myjar.jar
This will ensure that the garbage collection runs appropriately and an OOM error does not occur.
A good article throwing more details on this is available here.

Also, it is important to understand that the total memory consumed by a Java process (called as Resident Set Size RSS) is equal to the heap size + Perm Size (MetaSpace) + native memory (required for thread stacks, file pointers). If you have a large number of threads, then native memory consumption can also be high (No. of threads * -Xss)

Max memory = [-Xmx] + [-XX:MaxPermSize/MaxMetaSpace] + [number_of_threads * (-Xss)] 

Since JDK 8, you can also utilize the following -XX parameters:
-XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap //This allows the JDK to understand the CGroup memory limits on the Linux kernel
-XX:MaxRAM //This specifies the max memory allocated to the JVM...includes both on-heap and off-heap memory. 

In Java 10, a lot of changes have been made to the JDK to make it container friendly and you would not need to specify any parameters.

Other articles worth perusing are:

If you wish to check all the -XX options for a JVM, then you can specify the following command.
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintFlagsFinal -version

If you have a debug build of Java, then you can also try:
java -XX:+UnlockDiagnosticVMOptions -XX:+UnlockExperimentalVMOptions -XX:+PrintFlagsFinal -XX:+PrintFlagsWithComments -version

There are also tools that you can utilize to understand the best memory options to configure, such as the Java Memory Build Pack.