Saturday, May 13, 2023

Ruminating on Prompt Engineering

There has been a lot of buzz in recent years about the potential of large language models (LLMs) to generate new forms of text, translate languages, compose various types of creative material, and answer your queries in an informative manner. However, one of the drawbacks of LLMs is that they can be quite unpredictable. Even small changes to the prompt might produce drastically different outcomes. This is where prompt engineering comes into play.

Prompt engineering is the technique of creating prompts that are clear, explicit, and instructive. You can maximise your chances of receiving the desired outcome from your LLM by writing your prompts carefully.

Given below are some of the techniques you can use to create better prompts:

  • Be precise and concise: The more detailed your instruction, the more likely your LLM will get the intended result. Instead of asking, "Write me a poem," you may say, "Write me a poem about peace".
  • Use keywords: Keywords are words or phrases related to the intended outcome. If you want your LLM to write a blog article about generative AI, for example, you might add keywords like "prompt engineering," "LLMs," and "generative AI."
  • Provide context: Context is information that assists your LLM in comprehending the intended outcome. If you want your LLM to write a poem about Spring, for example, you might add context by supplying a list of phrases around Spring.
  • Provide examples: Use examples to demonstrate to your LLM what you are looking for. For example, if you want your LLM to create poetry, you may present samples of poems you appreciate.
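The four techniques above can be combined when a prompt is assembled in code. Given below is a minimal Python sketch of that idea. The function name build_prompt and its parameters are hypothetical and do not belong to any LLM library; the resulting string would simply be passed to whichever model you use.

```python
def build_prompt(instruction, keywords=(), context="", examples=()):
    """Assemble a prompt from an instruction plus optional keywords, context and examples."""
    parts = [instruction.strip()]
    if keywords:
        parts.append("Keywords to cover: " + ", ".join(keywords))
    if context:
        parts.append("Context: " + context.strip())
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}:\n{example.strip()}")
    # Blank lines between sections keep the prompt easy to read.
    return "\n\n".join(parts)


prompt = build_prompt(
    "Write me a poem about peace.",
    keywords=["calm", "harmony"],
    context="The poem is for a school assembly.",
    examples=["Quiet rivers carry the morning light."],
)
print(prompt)
```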
Andrew Ng has created an online course to learn about prompt engineering here -

In fact, the rise of LLMs has resulted in new job roles like "Prompt Engineer" as highlighted in the articles below: 

Monday, January 16, 2023

API mock servers from OpenAPI specs

 If you have an OpenAPI specs file (YAML or JSON), then you can quickly create a mock server using one of the following tools. 

A list of all other OpenAPI tools is given here:

Saturday, November 19, 2022

Ruminating on the internals of K8

Today Kubernetes has become the de facto standard for deploying applications. To understand what happens behind the scenes when you fire "kubectl" commands, please have a look at this excellent tutorial series by VMware -

Given below are some key components of the K8 ecosystem. The control plane consists of the API server, Scheduler, etcd and Controller Manager. 

  • kubectl: This is a command line tool that sends HTTP API requests to the K8 API server. The config parameters in your YAML file are actually converted to JSON and a POST request is made to the K8 control plane (API server).
  • etcd: etcd (pronounced et-see-dee) is an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or machine clusters. Kubernetes stores all of its data in etcd, including configuration data, state, and metadata. Because Kubernetes is a distributed system, it requires a distributed data store such as etcd. The API server is the component that reads and writes this data; the other components reach it through the API server.
  • Scheduler: The kube-scheduler is the control plane component responsible for assigning pods to nodes in the cluster. We can give hints in the config for affinity/priority, but it is the Scheduler that decides where to create the pod based on memory/cpu requirements and other config params.
  • Controller Manager: A collection of 30+ different controllers - e.g. deployment controller, namespace controller, etc. A controller is a non-terminating control loop (daemon that runs forever) that regulates the state of the system - i.e. move the "existing state" to the "desired state" - e.g. creating/expanding a replica set for a pod template. 
  • Cloud Controller Manager: A K8 cluster has to run on some public/private cloud and hence has to integrate with the respective cloud APIs - to configure underlying storage/compute/network. The Cloud Controller Manager makes API calls to the Cloud Provider to provision these resources - e.g. configuring persistent storage/volume for your pods.  
  • kubelet: The kubelet is the "node agent" that runs on each node. It registers the node with the API server and provides an interface between the Kubernetes control plane and the container runtime on each node in the cluster. After a successful registration, the primary role of the kubelet is to listen to the API server for instructions and create pods accordingly.
  • kube-proxy: The Kubernetes network proxy (aka kube-proxy) is a daemon running on each node. It monitors changes to Services and Endpoints in the API server, and configures load balancing for each Service through iptables. Kubernetes gives pods their own IP addresses and a single DNS name for a set of Pods, and can load-balance across them.
Everything in K8 is configured using manifest files (YAML) and hence as users, we just need to use the kubectl command with the appropriate manifest files. Each YAML file represents a K8 object. A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you're effectively telling the Kubernetes system what you want your cluster's workload to look like; this is your cluster's desired state - e.g. A "deployment" K8 object (with its YAML) provides declarative updates for Pods and ReplicaSets.
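To make the "record of intent" and control loop ideas concrete, given below is a toy Python sketch. It is not Kubernetes code: the desired dictionary only loosely mimics a Deployment manifest, and the reconcile function is a hypothetical stand-in for what a controller does on each pass of its loop.

```python
# A "record of intent", loosely modelled on a Deployment manifest.
# Only the replica count is tracked in this toy example.
desired = {"kind": "Deployment", "name": "web", "replicas": 3}


def reconcile(desired, running_pods):
    """One pass of a control loop: move the existing state towards the desired state."""
    running = list(running_pods)
    # Start pods until the desired count is reached.
    while len(running) < desired["replicas"]:
        running.append(f"{desired['name']}-{len(running)}")
    # Stop surplus pods if there are too many.
    while len(running) > desired["replicas"]:
        running.pop()
    return running


pods = reconcile(desired, ["web-0"])          # scale up from 1 pod to 3
pods = reconcile(desired, pods + ["web-3"])   # a surplus pod is removed again
print(pods)
```

A real controller runs this kind of comparison forever and reads both states from the API server rather than from in-memory lists.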

Wednesday, September 07, 2022

Ruminating on Hypothesis testing

The following two articles by Rebecca Bevans are an excellent introduction to the concept of Hypothesis testing and the types of statistical tests available:

Snippet from the article on the process of hypothesis testing:

Step 1: State your null and alternate hypothesis

Step 2: Collect data

Step 3: Perform a statistical test

Step 4: Decide whether to reject or "fail to reject" your null hypothesis

Step 5: Present your findings
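The five steps can be sketched in Python with a simple permutation test, which needs only the standard library. This is an illustration under stated assumptions and not the method used in the articles: the two samples are made-up numbers, the 0.05 significance level is a common convention, and the function name permutation_test is hypothetical.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Return a two-sided p-value for the difference in means of two groups."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(group_a) - statistics.mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are interchangeable.
        rng.shuffle(pooled)
        diff = abs(statistics.mean(pooled[:n_a]) - statistics.mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Step 1: null hypothesis = both groups have the same mean.
# Step 2: collect data (made-up measurements for illustration).
group_a = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 12.0]
group_b = [13.0, 13.4, 12.9, 13.2, 13.1, 13.3, 12.8, 13.5]

# Step 3: perform a statistical test.
p_value = permutation_test(group_a, group_b)

# Step 4: reject or "fail to reject" the null hypothesis.
alpha = 0.05
decision = "reject" if p_value < alpha else "fail to reject"

# Step 5: present the findings.
print(f"p-value = {p_value:.4f}, decision: {decision} the null hypothesis")
```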

Free Stats & Finance courses

The following site has an excellent collection of 20 free courses that I would highly recommend for folks who want to learn the basics of finance and fundamentals of maths/stats in finance.

I really liked the following courses, and they helped me consolidate my understanding:

- Stats basics:

- Accounting basics:

- How to read financial statements:

- Data Science fundamentals -

Continuous, Discrete and Categorical variables

The following websites give an excellent overview for beginners of the 3 different types of variables that we encounter in feature engineering (or even in basic stats):

Snippets from the articles:

A discrete variable only allows a particular set of values, and in-between values are not included. If we are counting a number of things, that is a discrete value. A dice roll has a certain number of outcomes, and nothing else (we can roll a 4 or a 5, but not a 4.6). A continuous variable can be any value in a range. Usually, things that we are measuring are continuous variables, because it can be any value. The length of a car ride might be 2 hours, 2.5 hours, 2.555, and so on.

Categorical variables are descriptive and not numerical. So any way to describe something is a categorical variable. Hair color, gum flavor, dog breed, and cloud type are all categorical variables.

There are 2 types of categorical variables: Nominal categorical variables are not ordered. The order doesn't matter. Eye color is nominal, because there is no higher or lower eye color. There isn't a reason one is first or last.

Ordinal categorical variables do have an order. Education level is an ordinal variable, because they can be put in order. Note that there is not some exact difference between the levels of education, just that they can be put in order.
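In feature engineering, these types are usually handled differently. The Python sketch below shows one common approach: numeric variables are used as they are, ordinal categories are mapped to ranks, and nominal categories are one-hot encoded. The category lists and function names are assumptions made for illustration.

```python
# Ordinal variable: the categories have an order, so map them to ranks.
EDUCATION_ORDER = ["high school", "bachelor", "master", "doctorate"]

# Nominal variable: no order, so no category should look "bigger" than another.
EYE_COLORS = ["blue", "brown", "green"]


def encode_ordinal(value, ordered_levels):
    """Return the rank of a value within an ordered list of levels."""
    return ordered_levels.index(value)


def encode_one_hot(value, categories):
    """Return a one-hot list for a nominal variable (no order implied)."""
    return [1 if value == category else 0 for category in categories]


row = {
    "ride_hours": 2.5,   # continuous: any value in a range
    "dice_roll": 4,      # discrete: only particular values are allowed
    "education": encode_ordinal("master", EDUCATION_ORDER),
    "eye_color": encode_one_hot("brown", EYE_COLORS),
}
print(row)
```

Note that the ranks only preserve the order; as mentioned above, the difference between two education levels is not an exact quantity.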

Wednesday, August 31, 2022

Ruminating on TMForum

The TM Forum (TMF) is an organisation of over 850 telecom firms working together to drive digital innovation. They created a standard known as TMF Open APIs, which provides a standard interface for the interchange of various telco data.

TM Forum’s Open APIs are JSON-based and follow the REST paradigm. They also share a common data model for Telecom.

Any CSP (Communications service provider) can accelerate their API journey by leveraging the TMForum API contracts. The link below gives some of the examples of the API standards available:

Currently, there are around 60+ APIs defined in the Open API table of TMForum. 

A few examples of the APIs are as follows:

  • Customer Bill Management API: This API allows operations to find and retrieve one or several customer bills (also called invoices) produced for a customer.
  • Customer Management API: Provides a standardized mechanism for customer and customer account management, such as creation, update, retrieval, deletion and notification of events.
  • Digital Identity Management API: Provides the ability to manage a digital identity. This digital identity allows identification of an individual, a resource, or a party Role (a specific role - or set of roles - for a given individual).
  • Account Management API: Provides standardized mechanism for the management of billing and settlement accounts, as well as for financial accounting (account receivable) either in B2B or B2B2C contexts.
  • Geographic Address Management API: Provides a standardized client interface to an Address management system. It allows looking for worldwide addresses.
  • Geographic Site Management API: Covers the operations to manage (create, read, delete) sites that can be associated with a customer, account, service delivery or other entities.
  • Payment Management API: The Payments API provides the standardized client interface to Payment Systems for notifying about performed payments or refunds.
  • Payment Method Management API: This API supports the frequently-used payment methods for the customer to choose and pay the usage, including voucher card, coupon, and money transfer.
  • Product Ordering Management API: Provides a standardized mechanism for placing a product order with all the necessary order parameters.
  • Promotion Management API: Used to provide the additional discount, voucher, bonus or gift to the customer who meets the pre-defined criteria.
  • Recommendation Management API: Recommendation API is used to recommend offering quickly based on the history and real-time context of a customer.
  • Resource Function Activation Management API: This API introduces Resource Function which is used to represent a Network Service as well as a Network Function.

The GitHub repository of TMForum is a great place to get acquainted with the APIs -

Since the TMForum defines the data model in JSON format, any NoSQL datastore that stores data as JSON documents becomes an easy option to quickly implement an API strategy. For example, the TMF data model of an API can be persisted 1:1 in MongoDB without the need for additional mappings as shown here -

Monday, August 29, 2022

mAP (mean Average Precision) and IoU (Intersection over Union) for Object Detection

mAP (mean Average Precision) is a common metric used for evaluating the accuracy of object detection models. The mAP computes a score by comparing the ground-truth bounding box to the detected box. The higher the score, the more precise the model's detections.

The following articles give a good overview of the concepts of precision, recall, mAP, etc.

Some snippets from the above article:

"When a model has high recall but low precision, then the model classifies most of the positive samples correctly but it has many false positives (i.e. classifies many Negative samples as Positive). When a model has high precision but low recall, then the model is accurate when it classifies a sample as Positive but it may classify only some of the positive sample.

Higher the precision, the more confident the model is when it classifies a sample as Positive. The higher the recall, the more positive samples the model correctly classified as Positive.

As the recall increases, the precision decreases. The reason is that when the number of positive samples increases (high recall), the accuracy of classifying each sample correctly decreases (low precision). This is expected, as the model is more likely to fail when there are many samples.

The precision-recall curve makes it easy to decide the point where both the precision and recall are high. The f1 metric measures the balance between precision and recall. When the value of f1 is high, this means both the precision and recall are high. A lower f1 score means a greater imbalance between precision and recall.

The average precision (AP) is a way to summarize the precision-recall curve into a single value representing the average of all precisions. The AP is the weighted sum of precisions at each threshold where the weight is the increase in recall. 

The IoU is calculated by dividing the area of intersection between the 2 boxes by the area of their union. The higher the IoU, the better the prediction.

The mAP is calculated by finding Average Precision(AP) for each class and then average over a number of classes."
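The definitions in the snippets above translate into a few lines of Python. The sketch below assumes boxes are given as (x_min, y_min, x_max, y_max) tuples and that the precision/recall pairs are listed in order of increasing recall; the function names are chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def average_precision(precisions, recalls):
    """Sum of precisions, each weighted by the increase in recall at that threshold."""
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap


def mean_average_precision(ap_per_class):
    """Average the per-class AP values over the number of classes."""
    return sum(ap_per_class) / len(ap_per_class)


ground_truth = (0, 0, 2, 2)
detection = (1, 1, 3, 3)
print(iou(ground_truth, detection))  # overlap area 1, union area 7
```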

Thursday, August 25, 2022

Ruminating on Bloom's Taxonomy

I was trying to help my kids understand the importance of deeply understanding a concept, instead of just remembering facts. 

I found the knowledge pyramid of Bloom an excellent illustration to help my kids understand how to build skills and knowledge. The following article on Vanderbilt University site is a good read to understand the concepts -

Thursday, August 11, 2022

Handling Distributed Transactions in a microservices environment

 In a distributed microservices environment, we do not have complex 2-phase commit transaction managers. 

We need a simpler approach and we have design patterns to address this issue. A good article explaining this strategy is available here -

The crux of the idea is to issue compensating transactions whenever there is a failure in one step of the flow. A good sample usecase for a banking scenario (with source code) is available here -
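The idea of compensating transactions can be sketched in Python as follows. This is a simplified, in-process illustration and not the code from the linked sample: the run_saga helper, the account names and the simulated failure are all hypothetical. In a real system each step would be a call to a separate microservice.

```python
def run_saga(steps):
    """Run (action, compensation) pairs; on failure, undo completed steps in reverse order."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception as error:
            # Issue compensating transactions for every step that already succeeded.
            for compensate in reversed(completed):
                compensate()
            return f"rolled back: {error}"
    return "committed"


balances = {"alice": 100, "bob": 50}


def debit_alice():
    balances["alice"] -= 30


def refund_alice():
    balances["alice"] += 30


def credit_bob():
    # Simulated failure of the second step in the flow.
    raise RuntimeError("credit service unavailable")


def reverse_credit_bob():
    balances["bob"] -= 30


result = run_saga([(debit_alice, refund_alice), (credit_bob, reverse_credit_bob)])
print(result, balances)
```

Because the credit step fails, only the debit is compensated and both balances end up where they started.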

Wednesday, August 10, 2022

API contract first design

 I have always been a great fan of the 'Contract-First' API design paradigm. The Swagger (OpenAPI) toolset makes it very simple to design new API contracts and object schemas. 

There is a Swagger Pet Store demo available here, wherein we can design API contracts using a simple YAML file: 

After the API contract has been designed and reviewed, we can quickly generate stubs and client code in 20 languages using Swagger Codegen.

Wednesday, July 20, 2022

Ruminating on Data Lakehouse

In my previous blog posts, we discussed Data Lakes and the Snowflake architecture.

Since Snowflake combines the abilities of a traditional data warehouse and a Data Lake, it also markets itself as a Data Lakehouse.

Another competing open-source alternative, led by the company Databricks, is called Delta Lake. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes.

A good comparison between Snowflake and Databricks Delta Lake is available here:

Enterprises that are embarking on their data platform modernization strategy can ask the following questions to arrive at a best fit choice:

  • Does the data platform have separation of compute from storage? This will enable it to scale horizontally as your data volumes and processing needs increase.
  • Does the data platform support cloud native storage? Most of the cloud native storage services from the hyperscalers (e.g. AWS S3, Google Cloud Storage, Azure Data Lake Storage) have been battle tested for scalability. 
  • What are the usecases you want to run on your data platform? - e.g. Reporting/BI, Streaming Data Analytics, Low latency dashboards, etc. 

Saturday, June 11, 2022

ServiceNow Gekkobrain & Celonis for SAP S/4HANA modernization

One of the fundamental challenges in migrating SAP ECC to S/4HANA is the lack of knowledge on the plethora of customizations done on legacy SAP ECC. 

To address this challenge, there are two leading products in the market - ServiceNow Gekkobrain and Celonis.

Gekkobrain helps teams get a deeper understanding of custom code that was developed to extend the functionality of SAP. It analyses the custom code base, identifies relevant HANA issues and also tries to fix them. 

Gekkobrain flows is a component that uses ML to understand and visualize the business process. It can understand main flows, sub flows and deviations and help in identifying automation opportunities. This capability of Gekkobrain flows is also branded as a Process Mining Tool. 

ServiceNow acquired Gekkobrain to integrate it with the NOW platform and leverage its workflow engine to automate workflows for customized SAP code (during HANA migrations). 

Another popular tool used during SAP modernization is the process mining tool Celonis. Celonis is used for retrieving, visualizing and analyzing business processes from transactional data stored by the SAP ERP systems.

Wednesday, June 08, 2022

Ruminating on software defined cloud interconnect (SDCI)

When we connect on-prem applications to cloud services, we have two options - a) connect to the cloud service provider through the internet OR b) set up a dedicated network link between your data centre and the cloud provider. 

Setting up a private network connection between your site and the cloud provider enables extremely low latency, increased bandwidth and a more consistent network performance. 

Many enterprises are adopting a multi-cloud strategy, and in such cases it is imperative to keep the latency of service calls across cloud providers to a minimum. Hence it makes sense to leverage an SDCI (software defined cloud interconnect) to connect the hyperscalers. 

Such networking services are provided by hyperscalers themselves or by dedicated network companies. Given below are links to some of the providers in this space.

Wednesday, June 01, 2022

Ruminating on Data Gravity

Data Gravity is a concept that states that as more and more data gets accumulated (analogy of mass increasing), there would be greater attraction of Applications and Services towards this data. 
There are two factors that accelerate this - Latency and Throughput. The closer the app/service is to the data, the lower is the latency and higher is the throughput. 

Another facet of data gravity is that as large volumes of data get concentrated in a datastore, it becomes extremely difficult to move that data (to the cloud or another DW).
Hyperscalers understand the concept of data gravity very well and hence are actively pushing for data to move to the cloud. As more and more data gets stored in Cloud DWs, NoSQL stores, RDBMS, it would become easier for developers to leverage hyperscaler PaaS services to build apps/APIs that exploit this data. 

Monday, May 30, 2022

Hash Length of SHA-256

It is important to note that any hash algorithm always returns a hash of fixed length. 
SHA256 returns a hash value that is 256 bits - i.e. 32 bytes. 

These 32 bytes are typically represented as a hexadecimal string of 64 characters (two hex digits per byte). 
Hence to store a SHA256 hash as text, we need 64 characters (in database terms - char(64) or varchar(64)).
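The fixed length is easy to verify with Python's standard hashlib module. The input strings below are arbitrary examples; whatever the input size, the digest has 32 raw bytes and its hexadecimal form has 64 characters.

```python
import hashlib

for message in (b"a", b"hello world", b"x" * 10_000):
    digest = hashlib.sha256(message).digest()         # raw bytes
    hex_digest = hashlib.sha256(message).hexdigest()  # hexadecimal text
    print(len(message), len(digest), len(hex_digest))
```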

Sunday, May 29, 2022

Does Bitcoin encrypt transactions?

It is important to note that the core bitcoin network/ledger does not use any encryption. It is all hashing as explained here - 

But each and every transaction is digitally signed by a user (or wallet). 
Hence the bitcoin network uses digital signatures and, conceptually, a digital signature is nothing but an encrypted hash of the data. 

You can digitally sign any data in the following way (creating an encrypted hash of the data)
  • Hash the data using SHA-256.
  • Encrypt the generated hash using your private key. 
  • Package this encrypted hash and your public key together. 
The public key is part of the digital signature, whereas the private key is securely stored in a digital wallet (never to be shared with anyone). 

In the bitcoin network, this digital signature is created using the Elliptic Curve Digital Signature Algorithm (ECDSA). ECC encryption is not used! - Many folks get confused by this :)

Ruminating on Elliptic Curve Cryptography

When it comes to symmetric encryption, the most common standards are the Data Encryption Standard (DES) and the Advanced Encryption Standard (AES).

When it comes to asymmetric encryption (public key cryptography), the dominant standard is RSA (Rivest-Shamir-Adleman). Almost all the digital certificates (HTTPS/SSL) issued use RSA as the encryption standard and SHA256 as the hashing algorithm. 

Given below is a screenshot of a digital certificate of a random HTTPS site. You can see the encryption algorithm and Hash function mentioned in the certificate. 

There is another asymmetric encryption standard called ECC (Elliptic Curve Cryptography) that is very popular in the crypto world. 

ECC has the following advantages when compared to RSA:

  • It can run on low end devices (low CPU and memory).
  • It is faster - for both encryption/decryption.
  • Smaller key size - 256-bit elliptic curve private key is just as secure as a 3072-bit RSA private key. Smaller keys are easier to manage and work with.
While certificate issuers have started providing ECC-based digital certificates, it is important to note that not all browsers (mobile, desktop) support them yet. Also, a lot of legacy apps may not support the ECC standard and would have to be refactored for SSL to work again. 

Ruminating on Proof of Work

While I understood how bitcoin mining worked, there were still a few loose ends in my mind about why exactly the 'Proof-of-Work' algorithm was created that way (i.e. finding a nonce that results in a hash lower than the target value).

The following video from Khan Academy gives an excellent understanding of PoW in very simple terms -

Also, the following snippets from Investopedia would give a better idea of the challenge (finding the right nonce) that all bitcoin miners are trying to solve. 

"A nonce is an abbreviation for "number only used once," which, in the context of cryptocurrency mining, is a number added to a hashed, or encrypted, block in a blockchain that, when rehashed, meets the difficulty level restrictions. The nonce is the number that blockchain miners are solving for. When the solution is found, the blockchain miners are offered cryptocurrency in exchange.

A target hash is a numeric value that a hashed block header (which is used to identify individual blocks in a blockchain) must be less than or equal to in order for a new block to be awarded to a miner.
The Bitcoin network adjusts the difficulty of mining by raising or lowering the target hash in order to preserve an average 10-minute interval between new blocks.

The block header contains the block version number, a timestamp, the hash used in the previous block, the hash of the Merkle Root, the nonce, and the target hash. The block is generated by taking the hash of the block contents, adding a random string of numbers (the nonce), and hashing the block again.

Determining which string to use as the nonce requires a significant amount of trial-and-error, as it is a random string. A miner must guess a nonce, append it to the hash of the current header, rehash the value, and compare this to the target hash. If the resulting hash value meets the requirements (golden nonce), the miner has created a solution and is awarded the block.

It is highly unlikely that a miner will successfully guess the nonce on the first try, meaning that the miner may potentially test a large number of nonce options before getting it right. The greater the difficulty—a measure of how hard it is to create a hash that is less than the target—the longer it is likely to take to generate a solution."

Ultimately it is all about guessing a nonce and calculating the hash and comparing it. Hence the capacity of a bitcoin mining farm is calculated in terms of hash rate - i.e. number of hashes that can be computed per second. The term 'Bitcoin mining' is actually misleading as what the miners are actually doing is finding a hash that satisfies the challenge and this also validates the transactions in the block (and the block gets added to the chain). 
But as many miners jump on the bandwagon, there is a lot of wastage of compute cycles and this is a controversial topic for many people.  
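The guess-hash-compare loop can be illustrated with a toy miner in Python. This is a sketch under simplifying assumptions and not the real Bitcoin procedure: the header is an arbitrary byte string, the target is chosen to be very easy, and a single SHA-256 is used where Bitcoin applies SHA-256 twice to a structured block header. The function name mine is hypothetical.

```python
import hashlib


def mine(header: bytes, target: int, max_nonce: int = 2**32):
    """Try nonces until SHA-256(header + nonce), read as an integer, is at or below the target."""
    for nonce in range(max_nonce):
        candidate = header + nonce.to_bytes(4, "little")
        hash_value = int.from_bytes(hashlib.sha256(candidate).digest(), "big")
        if hash_value <= target:
            return nonce, hash_value
    return None  # no nonce in the range satisfied the target


# An easy target: on average about one hash in 2**12 qualifies.
easy_target = 2**244
result = mine(b"toy block header", easy_target)
print(result)
```

Lowering the target makes qualifying hashes rarer, which is how the difficulty is raised; the number of attempts per second in this loop is the hash rate.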

Numeric value of a hash

In the bitcoin network, miners have to compare the hash value during the 'Proof-of-Work' process.

The target hash value is stored in the header and is expressed as a 67-digit number. Miners must find a new hash of the transaction block  that is below the given target. 

To solve the hash puzzle, miners will try to calculate the hash of a block by adding a nonce to the block header repeatedly until the hash value yielded is less than the target.

But how is the value of the hash calculated? In the bitcoin network, the value of a hash is calculated as follows:

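  • Hashes are typically represented as a hexadecimal string. Interpret the hexadecimal string as a 256-bit integer (i.e. convert the hexadecimal value to a decimal value). 
  • Optionally, get the base-2 log of the decimal value to express its magnitude in bits. 
The integer value is what is compared against the target in the block header; because the logarithm is monotonic, comparing the base-2 logs gives the same ordering. 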
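The conversion can be tried out in Python. In this sketch the hashed input and the target are arbitrary values chosen for illustration. Note that comparing the integers directly and comparing their base-2 logs give the same ordering, since the logarithm is a monotonic function.

```python
import hashlib
import math

hash_hex = hashlib.sha256(b"toy block header").hexdigest()

hash_value = int(hash_hex, 16)   # hexadecimal string -> integer
target = 2**244                  # hypothetical target value

print(hash_hex)
print(hash_value)
print(math.log2(hash_value))     # magnitude of the hash expressed in bits
print(hash_value < target)       # the comparison a miner performs
```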

Thursday, May 26, 2022

Ruminating on dedicated instance vs. dedicated host

 Many folks get confused between the AWS terminology of 'Dedicated Instance' vs 'Dedicated Host'.

A simple way to understand the difference is to remember that a "host" is a physical machine that can host many virtual machine instances. 

Hence a "dedicated host" is a physical machine that is dedicated to your organization. On this physical machine (host), you can install many VMs/containers. So you control what VMs (instances) are going to run on that host. 

So what is a dedicated instance then? A dedicated instance is a virtual machine that runs on hardware that is not shared with other accounts. Dedicated instances are physically isolated at the host hardware level from instances that belong to other AWS accounts. Hence you can only be certain that the underlying hardware that is hosting your VM is not shared with someone else. But you have no fine-grained control over which VM would be launched on which host, etc. 

Tuesday, May 17, 2022

Cloud Native Banking Platform - Temenos

Temenos is the world's number one core banking platform. It is built entirely on the AWS cloud and uses all managed services. 

I was struck by the simplicity of the overall architecture on AWS and how it enabled elastic scalability to scale-out for peak loads and also scale-back for reducing operational costs. 

Another interesting aspect was the simple implementation of the CQRS pattern to offload queries (read-only API requests) to DynamoDB. The pipeline was built using Kinesis and Lambda. 

An excellent short video on the AWS architecture of the Temenos platform is here:

Sunday, March 27, 2022

Difference between Epoch, Batch and Iterations

In neural nets, we have to specify the number of epochs while we train the model. 

One Epoch is defined as the complete forward & backward pass of the neural net over the complete training dataset. We need to remember that Gradient Descent is an iterative process and hence we need multiple passes (or epochs) to find the optimal solution. In each epoch, the weights and biases are updated. 

Batch size is the number of records in one batch. One Epoch may consist of multiple batches. The training dataset is split into batches because of memory space requirements. A large dataset cannot be fit into memory all at once. With increase in Batch size, required memory space increases. 

Iterations is the number of batches needed to complete one epoch. So if we have 2000 records and a batch size of 500, then we will need 4 iterations to complete one epoch. 
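The relationship between these three terms can be shown with a short Python sketch. The helper names are hypothetical and the "training step" is left as a comment, since the point here is only the loop structure.

```python
import math


def iterations_per_epoch(num_records, batch_size):
    """Number of batches needed to pass over the whole dataset once."""
    return math.ceil(num_records / batch_size)


def batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


records = list(range(2000))
batch_size = 500
print(iterations_per_epoch(len(records), batch_size))  # 4, as in the example above

for epoch in range(3):                        # 3 epochs = 3 full passes over the data
    for batch in batches(records, batch_size):
        pass                                  # forward pass, backward pass, weight update
```

If the dataset size is not a multiple of the batch size, the last batch is simply smaller, which is why the count is rounded up.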

If the number of epochs is low, it results in underfitting the data. As the number of epochs increases, the weights in the neural network are updated more times and the curve goes from underfitting to optimal to overfitting. Then how do we determine the optimal number of epochs?

It is done by splitting the training data into 2 subsets - a) 80% for training the neural net  b) 20% for validating the model after each epoch. The fundamental reason we hold out a validation set is to detect when our model starts overfitting. The model is trained on the training set, and, simultaneously, the model evaluation is performed on the validation set after every epoch.

Then we use something called the "early stopping method" - we essentially keep training the neural net for an arbitrary number of epochs and monitor the performance on the validation dataset after each epoch. When there is no sign of performance improvement on your validation dataset, you should stop training your network. This helps us arrive at the optimal number of epochs. 
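Early stopping can be sketched in Python as follows. To keep the example self-contained, the validation losses are a made-up list instead of the output of a real training run, and the patience parameter (how many epochs without improvement we tolerate) is an assumption of this sketch.

```python
def train_with_early_stopping(validation_losses, patience=2):
    """Return (best_epoch, stopped_epoch) given the validation loss after each epoch."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch   # stop: no improvement for `patience` epochs
    return best_epoch, len(validation_losses)


# The loss falls (underfitting -> optimal) and then rises again (overfitting).
losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
best_epoch, stopped_epoch = train_with_early_stopping(losses)
print(best_epoch, stopped_epoch)
```

In practice the weights from the best epoch are saved and restored once training stops.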

A good explanation of these concepts is available here - Found the below illustration really useful to understand the split of data between train/validate/test sets. 

Saturday, March 26, 2022

Ruminating on Convolutional Neural Networks

Convolutional Neural Nets (CNNs) have made Computer Vision a reality. But to understand CNNs, we need to get basics right - What exactly is a convolution? What is a kernel/filter?

The kernel or filter is a small matrix that is multiplied by the source image matrix to extract features. So you can have a kernel that identifies edges or corners of a photo. Then there could be kernels that detect patterns - e.g. eyes, stripes.

A convolution is a mathematical operation where a kernel (aka filter) moves across the input image and, at each position, computes the dot product of the kernel and the image patch underneath it. These dot products are saved as a new matrix called the feature map. An excellent video visualizing this operation is available here -

Image manipulation software such as Photoshop also uses kernels for effects such as 'blur background'. 

One fundamental advantage of the convolution operation is that if a particular filter is designed to detect a specific type of feature in the input, then applying that filter systematically across the entire input image allows us to discover that feature anywhere in the image. Also note that a particular convolutional layer can have multiple kernels/filters. So after the input layer (a single matrix), the convolutional layer (having 6 filters) will produce 6 output matrices. A cool application to visualize this is here -

A suite of tens or even hundreds of other small filters can be designed to detect other features in the image. After a convolutional layer, we also typically add a pooling layer. Pooling layers are used to downsize the feature maps - keeping the important parts and discarding the rest. The output matrices of the pooling layer are smaller in size and faster to process. 
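The convolution and pooling operations described above can be written in plain Python for a tiny image. This is a teaching sketch: it uses no padding and a stride of 1, the image and the vertical-edge kernel are made-up values, and, as in most deep learning libraries, the kernel is not flipped before it is applied.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and return the feature map."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    feature_map = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            # Dot product of the kernel and the image patch underneath it.
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        feature_map.append(row)
    return feature_map


def max_pool2d(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        pooled.append(row)
    return pooled


# A 5x5 image that is dark (0) on the left and bright (1) on the right.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
vertical_edge_kernel = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

feature_map = convolve2d(image, vertical_edge_kernel)
print(feature_map)              # large values where the dark-to-bright edge is
print(max_pool2d(feature_map))  # smaller matrix keeping the strongest response
```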

So as you can see, the fundamental difference between a densely connected layer and a convolutional layer is that dense fully connected layers learn global patterns (involving all pixels) whereas convolution layers learn local features (edges, corners, textures, etc.) 

Using CNNs, we can create a hierarchy of patterns - i.e. the second layer learns from the first layer. CNNs are also reusable, so we can take an image classification model trained on a dataset and add additional layers to customize it for our purpose. 

A good introduction to CNN models is given in this article - with a good PyTorch implementation for MNIST dataset here -

Friday, March 25, 2022

Ruminating on Activation Function

 Activation functions play an important role in neural nets. An activation function transforms the weighted sum of the input into an output from a node. 

Simply put, an activation function defines the output of a neuron given a set of inputs. It is called "activation" to mimic the working of a brain neuron. Our neurons get activated due to some stimulus. Similarly the activation function will decide which neurons in our neural net get activated. 

Each hidden layer of a neural net needs to be assigned an activation function. Even the output layer of a neural net would use an activation function. 

The ReLU (Rectified Linear Unit) function is the most common function used for activation. 

The ReLU function is a simple function: max(0.0, x). So essentially it takes the larger of 0 and x. Hence all negative inputs are mapped to 0, while positive inputs pass through unchanged.

Other activation functions are Sigmoid and Tanh. The sigmoid activation function generates an output value between 0 and 1, whereas tanh generates an output value between -1 and 1.
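
As a quick illustration, the three activation functions mentioned above can be sketched in a few lines of NumPy. The function names and the sample input values are assumptions chosen for this example.

```python
import numpy as np

def relu(x):
    # max(0.0, x) applied element-wise: negative inputs become 0.
    return np.maximum(0.0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

# Sample weighted sums arriving at five neurons (made-up values).
x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = relu(x)
sigmoid_out = sigmoid(x)
tanh_out = np.tanh(x)  # squashes into (-1, 1)

print(relu_out)
print(sigmoid_out)
print(tanh_out)
```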

An excellent video explaining activation functions is here -

The activation functions that are typically used for the output layer are Linear, Sigmoid (Logistic) or Softmax. A good explanation of when to use what is available here -
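
Of the output-layer functions listed above, softmax is the one that turns a vector of raw scores into values that can be read as class probabilities. Below is a minimal sketch, assuming made-up scores for a three-class problem; subtracting the maximum score first is a common numerical-stability trick and does not change the result.

```python
import numpy as np

def softmax(scores):
    # Shift by the max score so np.exp never overflows.
    shifted = scores - np.max(scores)
    exps = np.exp(shifted)
    return exps / np.sum(exps)

# Hypothetical raw outputs of the last layer for three classes.
scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)

print(probs)        # all values are positive
print(probs.sum())  # and they add up to 1 (up to floating-point rounding)
```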

Ruminating on Gradient Descent

Gradient Descent is the most popular optimization algorithm used to train machine learning models - by minimizing the cost function (aka loss function). In deep learning, neural nets use back-propagation to compute the gradient of the cost function with respect to each weight, and an optimizer such as Gradient Descent then uses those gradients to update the weights.

Gradient descent essentially uses calculus to find the direction of travel (the direction in which the function decreases fastest) and then moves step by step in that direction to find a local minimum of the function. The following 2 videos are excellent tutorials to understand Gradient Descent and its use in neural nets.

Once you understand these concepts, it will help you also realize that there is no magic involved when a neural net learns by itself -- ultimately a neural net learning by itself just means minimizing a cost function (aka loss function).

Neural nets start with random values for their weights (of the channels) and biases. Then by using the cost function, these hundreds of weights/biases are shifted towards the optimal value - by using optimization techniques such as gradient descent. 
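
The idea can be sketched with a single weight. This toy example assumes a made-up cost function, (w - 3)^2, whose minimum is at w = 3; the learning rate and step count are also arbitrary choices. A real neural net does the same thing for all of its weights and biases at once.

```python
import random

def cost(w):
    # Toy cost function with its minimum at w = 3.
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the cost with respect to w.
    return 2.0 * (w - 3.0)

random.seed(0)                   # fixed seed so the run is repeatable
w = random.uniform(-10.0, 10.0)  # start from a random weight
learning_rate = 0.1

for step in range(100):
    # Move against the gradient, i.e. downhill on the cost curve.
    w -= learning_rate * gradient(w)

print(w, cost(w))
```

Each step shrinks the distance to the minimum by a constant factor here, so after 100 steps w is very close to 3 and the cost is close to 0.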

Ruminating on 'Fitting the line to data'

 In linear regression, we need to find the best fit line over a set of points (data). StatQuest has an excellent video explaining how we fit a line to the data using the principle of 'least squares' -

The best fit line is the one where the sum of the squares of the distances from the actual points to the line (the residuals) is the minimum. Hence this becomes an optimization problem in Maths that can be solved. We square the distances so that positive and negative differences do not cancel each other out.

In stats, the function being minimized here (the sum of the squared residuals) is also called a 'loss function'.

The equation of any line can be stated as: y = ax + b

where a is the slope of the line and b is the y-intercept. Using derivatives we can find the optimal values of 'a' and 'b' for a given dataset.
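
Setting the derivatives of the sum of squared residuals with respect to 'a' and 'b' to zero gives a well-known closed-form answer, sketched below in NumPy. The data points are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: how x and y vary together, divided by how much x varies.
    a = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the best fit line passes through the point of means.
    b = y_mean - a * x_mean
    return a, b

# Made-up data that roughly follows y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

a, b = fit_line(x, y)
residuals = y - (a * x + b)
loss = np.sum(residuals ** 2)  # the quantity the fit minimizes

print(a, b, loss)
```

NumPy's own `np.polyfit(x, y, 1)` should return the same slope and intercept, which is a handy cross-check.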

Wednesday, March 23, 2022

Ruminating on CUDA and TPU

CUDA is a parallel computing platform (or a programming model - available as an API) that was developed by Nvidia to enable developers to leverage the full parallel processing power of its GPUs.

Deep learning (training neural nets) requires a humongous amount of processing power, and it is here that HPC systems with thousands of GPU cores (e.g. the A100 GPU) are essential.

Nvidia has also released a library for use in neural nets called cuDNN. The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.

Many deep learning frameworks rely on CUDA & the cuDNN library for parallelizing work on GPUs - TensorFlow, Caffe2, Keras, PyTorch.

To address the growing demand for training ML models, Google came up with its own custom integrated circuit called the TPU (Tensor Processing Unit). A TPU is basically an ASIC (application-specific integrated circuit) designed by Google for use in ML. TPUs are tailored for TensorFlow and can handle massive multiplications and additions for neural networks at great speeds, while using less power and floor space.

Examples: Google Photos uses TPUs to process more than 100 million photos every day. TPUs also power Google RankBrain - the part of Google's algorithm that uses machine learning and artificial intelligence to better understand the intent of a search query.

Tuesday, March 22, 2022

Understanding what is a Tensor?

Found this excellent video on YouTube by Prof. Daniel Fleisch that explains tensors in a very simple and engaging way -

Tensors are generalizations of vectors & matrices to N-dimensional space.

  • A scalar is a 0 dimensional tensor
  • A vector is a 1 dimensional tensor
  • A matrix is a 2 dimensional tensor
  • An nd-array is an N dimensional tensor
The inputs, outputs, and transformations within neural networks are all represented using tensors.  A tensor can be visualized as a container which can house data in N dimensions.
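
The list above maps directly onto NumPy arrays, where `ndim` reports the number of dimensions. The values and the shape of the image batch below are made up for illustration.

```python
import numpy as np

scalar = np.array(5.0)                       # 0 dimensional tensor
vector = np.array([1.0, 2.0, 3.0])           # 1 dimensional tensor
matrix = np.array([[1.0, 2.0], [3.0, 4.0]])  # 2 dimensional tensor

# A hypothetical batch of 32 colour images, each 28x28 pixels with 3 channels.
image_batch = np.zeros((32, 28, 28, 3))      # 4 dimensional tensor

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("image_batch", image_batch)]:
    print(name, t.ndim, t.shape)
```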

Wednesday, March 09, 2022

Topic Modeling vs. Topic Classification

Topic Modeling is an unsupervised method of inferring "topics" or classification tags within a cluster of documents, whereas topic classification is a supervised ML approach wherein we define a list of topics and label a few documents with these topics for training.

A wonderful article explaining the differences between both the approaches is here -

Some snippets from the above article:

"Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, and words and expressions that appear most often. With this information, you can quickly deduce what each set of texts are talking about.

 If you don’t have a lot of time to analyze texts, or you’re not looking for a fine-grained analysis and just want to figure out what topics a bunch of texts are talking about, you’ll probably be happy with a topic modeling algorithm.

However, if you have a list of predefined topics for a set of texts and want to label them automatically without having to read each one, as well as gain accurate insights, you’re better off using a topic classification algorithm."