tag:blogger.com,1999:blog-126697972022-05-23T15:11:18.994+05:30Tech TalkDigital Transformation, Artificial Intelligence, Machine Learning, IoT, Big Data Analytics, Enterprise Architecture, Performance Engineering, Security, Design and Development tips on Java and .NET platforms.Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.comBlogger753125tag:blogger.com,1999:blog-12669797.post-54908059121494027812022-05-17T21:37:00.005+05:302022-05-17T21:47:36.377+05:30Cloud Native Banking Platform - Temenos<p>Temenos is the world's number one core banking platform. It is built entirely on the AWS cloud and uses all managed services. </p><p>I was struck by the simplicity of the overall architecture on AWS and how it enabled elastic scalability to scale out for peak loads and scale back to reduce operational costs. </p><p>An excellent short video on the AWS architecture of the Temenos platform is here: <a href="https://youtu.be/mtZvA7ARepM">https://youtu.be/mtZvA7ARepM</a></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-5190430939601705522022-03-27T00:09:00.007+05:302022-03-27T00:11:49.650+05:30Difference between Epoch, Batch and Iterations<p>In neural nets, we have to specify the number of epochs while we train the model. </p><p>One <b>Epoch </b>is defined as one complete forward & backward pass of the neural net over the entire training dataset. We need to remember that <a href="https://www.narendranaidu.com/2022/03/ruminating-on-gradient-descent.html" target="_blank">Gradient Descent </a>is an iterative process and hence we need multiple passes (or epochs) to find the optimal solution. In each epoch, the weights and biases are updated. </p><p><b>Batch size</b> is the number of records in one batch. One Epoch may consist of multiple batches. The training dataset is split into batches because of memory space constraints. 
A large dataset cannot fit into memory all at once, and as the batch size increases, the required memory also increases. </p><p><b>Iterations </b>is the number of batches needed to complete one epoch. So if we have 2000 records and a batch size of 500, then we will need 4 iterations to complete one epoch. </p><p>If the number of epochs is too low, the model underfits the data. As the number of epochs increases, the weights are updated more and more times, and the model moves from underfitting to optimal to overfitting. <i>Then how do we determine the optimal number of epochs?</i></p><p>It is done by splitting the training data into 2 subsets - a) 80% for training the neural net b) 20% for validating the model after each epoch. The fundamental reason we split off a validation set is to prevent our model from overfitting. The model is trained on the training set and, simultaneously, evaluated on the validation set after every epoch.</p><p>Then we use something called the "early stopping method" - we essentially keep training the neural net for an arbitrary number of epochs and monitor the performance on the validation dataset after each epoch. When there is no sign of performance improvement on the validation dataset, we stop training the network. This helps us arrive at the optimal number of epochs. </p><p>A good explanation of these concepts is available here - <a href="https://www.v7labs.com/blog/train-validation-test-set">https://www.v7labs.com/blog/train-validation-test-set</a>. I found the below illustration really useful to understand the split of data between train/validate/test sets. 
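The arithmetic and the early-stopping idea above can be sketched in a few lines of plain Python. This is only an illustration - the validation-loss values below are made up, and the patience value of 2 is an arbitrary choice:

```python
import math

def iterations_per_epoch(num_records: int, batch_size: int) -> int:
    """Number of batches (iterations) needed to cover the dataset once."""
    return math.ceil(num_records / batch_size)

print(iterations_per_epoch(2000, 500))  # 4 iterations per epoch, as in the example above

# A minimal early-stopping loop (hypothetical validation losses for illustration):
val_losses = [0.90, 0.70, 0.55, 0.50, 0.51, 0.53, 0.56]
patience, best, waited, stop_epoch = 2, float("inf"), 0, None
for epoch, loss in enumerate(val_losses):
    if loss < best:
        best, waited = loss, 0       # improvement on the validation set
    else:
        waited += 1                  # no improvement this epoch
        if waited >= patience:       # stop after `patience` epochs without improvement
            stop_epoch = epoch
            break
print(stop_epoch)  # stops at epoch 5, since the loss stopped improving after epoch 3
```

Deep learning frameworks provide this out of the box (e.g. as an early-stopping callback), but the logic is essentially the loop above.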
</p><div class="separator" style="clear: both; text-align: center;"><a href="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61568656a13218cdde7f6166_training-data-validation-test.png" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img border="0" data-original-height="640" data-original-width="800" height="320" src="https://assets-global.website-files.com/5d7b77b063a9066d83e1209c/61568656a13218cdde7f6166_training-data-validation-test.png" width="400" /></a></div><br /><p><br /></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-13468729078343862932022-03-26T03:38:00.003+05:302022-03-26T04:16:49.772+05:30Ruminating on Convolutional Neural Networks<p>Convolutional Neural Nets (CNNs) have made Computer Vision a reality. But to understand CNNs, we need to get the basics right - What exactly is a convolution? What is a kernel/filter?</p><p>The kernel or filter is a small matrix that is multiplied with the source image matrix to extract features. So you can have a kernel that identifies edges or corners of a photo. Then there could be kernels that detect patterns - e.g. eyes, stripes.</p><p>A convolution is a mathematical operation where a kernel (aka filter) moves across the input image and computes a dot product of the kernel and the underlying patch of the original image. These dot products are saved as a new matrix called the feature map. An excellent video visualizing this operation is available here - <a href="https://youtu.be/pj9-rr1wDhM" target="_blank">https://youtu.be/pj9-rr1wDhM</a></p><p>Image manipulation software such as Photoshop also uses kernels for effects such as 'blur background'. 
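A minimal sketch of the kernel/feature-map idea in plain Python (strictly speaking this computes cross-correlation, which is what deep learning frameworks actually do under the name "convolution"; the image and kernel values are made up for illustration):

```python
def convolve2d(image, kernel):
    """'Valid' convolution: slide the kernel over the image and take dot products."""
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(image) - kh + 1, len(image[0]) - kw + 1
    feature_map = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            # dot product of the kernel with the image patch under it
            feature_map[i][j] = sum(
                kernel[m][n] * image[i + m][j + n]
                for m in range(kh) for n in range(kw)
            )
    return feature_map

# A vertical-edge kernel applied to an image with a hard left/right boundary:
image = [[0, 0, 0, 1, 1]] * 4
kernel = [[1, 0, -1],
          [1, 0, -1],
          [1, 0, -1]]
print(convolve2d(image, kernel))  # zero in the flat region, strong response at the edge
```

The feature map is zero where the image is flat and large in magnitude where the vertical edge lies - exactly the "feature detection" behaviour described above.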
</p><p>One fundamental advantage of the convolution operation is that if a particular filter is designed to detect a specific type of feature in the input, then applying that filter systematically across the entire input image allows us to discover that feature anywhere in the image. Also note that a particular convolutional layer can have multiple kernels/filters. So after the input layer (a single matrix), a convolutional layer having 6 filters will produce 6 output matrices. A cool application to visualize this is here - <a href="https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html" target="_blank">https://www.cs.ryerson.ca/~aharley/vis/conv/flat.html</a></p><p>A suite of tens or even hundreds of other small filters can be designed to detect other features in the image. After a convolutional layer, we also typically add a pooling layer. Pooling layers are used to downsize the feature maps - keeping the important parts and discarding the rest. The output matrices of the pooling layer are smaller in size and faster to process. </p><p>So as you can see, the fundamental difference between a densely connected layer and a convolutional layer is that dense fully connected layers learn global patterns (involving all pixels), whereas convolutional layers learn local features (edges, corners, textures, etc.) </p><p>Using CNNs, we can create a hierarchy of patterns - i.e. the second layer learns from the first layer. CNNs are also reusable, so we can take an image classification model trained on the <a href="https://www.image-net.org/">https://www.image-net.org/</a> dataset and add additional layers to customize it for our purpose. </p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-43986879113695843302022-03-25T04:31:00.004+05:302022-03-25T04:31:32.467+05:30Ruminating on Activation Function<p> Activation functions play an important role in neural nets. 
An activation function transforms the weighted sum of the input into an output from a node. </p><p>Simply put, an activation function defines the output of a neuron given a set of inputs. It is called "activation" to mimic the working of a brain neuron. Our neurons get activated due to some stimulus. Similarly, the activation function decides which neurons in our neural net get activated. </p><p>Each hidden layer of a neural net needs to be assigned an activation function. Even the output layer of a neural net uses an activation function. </p><p>The ReLU (Rectified Linear Unit) function is the most common function used for activation. </p><p>The ReLU function is a simple function: max(0.0, x). So essentially it takes the max of either 0 or x. Hence all negative values are mapped to zero. </p><p>Other activation functions are Sigmoid and Tanh. The sigmoid activation function generates an output value between 0 and 1. </p><p>An excellent video explaining activation functions is here -<a href="https://youtu.be/m0pIlLfpXWE" target="_blank"> https://youtu.be/m0pIlLfpXWE</a></p><p>The activation functions that are typically used for the output layer are Linear, Sigmoid (Logistic) or Softmax. A good explanation of when to use what is available here - <a href="https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/">https://machinelearningmastery.com/choose-an-activation-function-for-deep-learning/</a></p><p><br /></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-29362744034653954262022-03-25T02:14:00.004+05:302022-03-25T02:17:06.323+05:30Ruminating on Gradient Descent<p>Gradient Descent is the most popular optimization algorithm used to train machine learning models - by minimizing the cost function. In deep learning, neural nets are trained using back-propagation, which computes the gradients of a cost function (aka loss function) that an optimizer like Gradient Descent then minimizes. 
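Going back to activation functions for a moment - the ReLU, sigmoid and tanh functions described above are simple enough to sketch in a few lines of plain Python (the sample inputs are arbitrary):

```python
import math

def relu(x):
    # ReLU: max(0, x) - negative inputs are mapped to zero
    return max(0.0, x)

def sigmoid(x):
    # Sigmoid: squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Tanh: squashes any real input into the range (-1, 1)
    return math.tanh(x)

print(relu(-3.2))   # 0.0 - the negative value is clipped
print(relu(2.5))    # 2.5 - positive values pass through unchanged
print(sigmoid(0))   # 0.5 - the midpoint of the sigmoid curve
```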
</p><p>Gradient descent essentially uses calculus to find the direction of steepest descent and thereby reach a local minimum of a function. The following 2 videos are excellent tutorials to understand Gradient Descent and its use in neural nets. </p><p><a href="https://youtu.be/IHZwWFHWa-w" target="_blank">https://youtu.be/IHZwWFHWa-w</a></p><p><a href="https://youtu.be/sDv4f4s2SB8" target="_blank">https://youtu.be/sDv4f4s2SB8</a></p><p>Once you understand these concepts, it will help you also realize that there is no magic involved when a neural net learns by itself -- ultimately <b><i>a neural net learning by itself just means minimizing a cost function (aka loss function).</i></b></p><p>Neural nets start with random values for their weights (of the channels) and biases. Then by using the cost function, these hundreds of weights/biases are shifted towards their optimal values - by using optimization techniques such as gradient descent. </p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-30811206920491474712022-03-25T01:03:00.003+05:302022-03-25T01:08:29.752+05:30Ruminating on 'Fitting the line to data'<p> In linear regression, we need to find the best fit line over a set of points (data). StatQuest has an excellent video explaining how we fit a line to the data using the principle of 'least squares' - <a href="https://www.youtube.com/watch?v=PaFPbb66DxQ">https://www.youtube.com/watch?v=PaFPbb66DxQ</a></p><p>The best fit line is the one where the sum of the squares of the distances from the actual points to the line is the minimum. Hence this becomes a mathematical optimization problem that can be solved. We square the distances to take care of negative differences. </p><p>In stats, the function that we minimize - the sum of the squared residuals - is also called a 'loss function'. 
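These two ideas combine nicely: the following plain-Python sketch uses gradient descent to minimize the sum of squared residuals and fit a line y = ax + b to a tiny dataset. The data values, learning rate and iteration count are arbitrary choices for illustration:

```python
# Fit y = ax + b by gradient descent on the mean squared error.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]          # generated from y = 2x + 1, so we expect a ~ 2 and b ~ 1

a, b = 0.0, 0.0            # start from arbitrary (here zero) parameter values
lr = 0.01                  # learning rate (step size) - an arbitrary choice
for _ in range(10000):
    # gradients of the mean squared error with respect to a and b
    grad_a = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
    grad_b = sum(2 * (a * x + b - y) for x, y in zip(xs, ys)) / len(xs)
    # step in the direction opposite to the gradient
    a -= lr * grad_a
    b -= lr * grad_b

print(round(a, 3), round(b, 3))  # converges close to 2.0 and 1.0
```

A neural net does exactly this, just with far more parameters than two and a more complicated loss surface.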
</p><p>The equation of any line can be stated as: y = ax + b</p>where a is the slope of the line and b is the y-intercept. Using derivatives we can find the optimal values of 'a' and 'b' for a given dataset. Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-68025689541409914892022-03-23T01:12:00.005+05:302022-03-23T01:12:36.051+05:30Ruminating on CUDA and TPU<p> <a href="https://developer.nvidia.com/about-cuda" target="_blank">CUDA </a> is a parallel computing platform (or a programming model - available as an API) that was developed by Nvidia to enable developers to leverage the full parallel processing power of its GPUs. </p><p>Deep learning (training neural nets) requires a humongous amount of processing power, and it is here that HPCs with thousands of GPU cores (e.g. the A100 GPU) are essential. </p><p>Nvidia has also released a library for use in neural nets called <a href="https://developer.nvidia.com/cudnn" target="_blank">cuDNN</a>. The CUDA Deep Neural Network library (cuDNN) is a GPU-accelerated library of primitives for deep neural networks. cuDNN provides highly tuned implementations for standard routines such as forward and backward convolution, pooling, normalization, and activation layers.</p><p>Many deep learning frameworks rely on CUDA & the cuDNN library for parallelizing work on GPUs - Tensorflow, Caffe2, H2O.ai, Keras, PyTorch. </p><p>To address the growing demand for training ML models, Google came up with its own custom integrated circuit called the TPU (Tensor Processing Unit). A TPU is basically an ASIC (application-specific integrated circuit) designed by Google for use in ML. TPUs are tailored for TensorFlow and can handle massive multiplications and additions for neural networks at great speeds, while consuming less power and floor space. </p><p>Examples: Google Photos uses TPUs to process more than 100 million photos every day. 
TPUs also power Google RankBrain - the part of Google's algorithm that uses machine learning and artificial intelligence to better understand the intent of a search query. </p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-37979781124813050812022-03-22T21:14:00.004+05:302022-03-22T21:14:45.071+05:30Understanding what is a Tensor?<p>Found this excellent video on YouTube by Prof. Daniel Fleisch that explains tensors in a very simple and engaging way - <a href="https://youtu.be/f5liqUk0ZTw">https://youtu.be/f5liqUk0ZTw</a></p><p>Tensors are generalizations of vectors & matrices to N-dimensional space.</p><p></p><ul style="text-align: left;"><li>A scalar is a 0 dimensional tensor</li><li>A vector is a 1 dimensional tensor</li><li>A matrix is a 2 dimensional tensor</li><li>An nd-array is an N dimensional tensor</li></ul><div>The inputs, outputs, and transformations within neural networks are all represented using tensors. A tensor can be visualized as a container which can house data in N dimensions.</div><div><br /></div><p></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-20628550852808762362022-03-09T22:53:00.005+05:302022-03-09T22:54:12.188+05:30Topic Modeling vs. Topic Classification<p>Topic Modeling is an unsupervised method of inferring "topics" or classification tags within a cluster of documents. Topic classification, on the other hand, is a supervised ML approach wherein we define a list of topics and label a few documents with these topics for training. 
</p><p>A wonderful article explaining the differences between the two approaches is here - <a href="https://monkeylearn.com/blog/introduction-to-topic-modeling/" target="_blank">https://monkeylearn.com/blog/introduction-to-topic-modeling/</a></p><p>Some snippets from the above article:</p><p><i>"Topic modeling involves counting words and grouping similar word patterns to infer topics within unstructured data. By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, and words and expressions that appear most often. With this information, you can quickly deduce what each set of texts are talking about.</i></p><p><i> If you don’t have a lot of time to analyze texts, or you’re not looking for a fine-grained analysis and just want to figure out what topics a bunch of texts are talking about, you’ll probably be happy with a topic modeling algorithm.</i></p><p><i>However, if you have a list of predefined topics for a set of texts and want to label them automatically without having to read each one, as well as gain accurate insights, you’re better off using a topic classification algorithm."</i></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-25464780909027519482022-03-09T22:17:00.001+05:302022-03-09T22:17:42.230+05:30Ruminating on the log scale<p>Today, I was trying to explain to a colleague the use cases where a log scale could be more useful than a linear scale, and I found this wonderful video that explains this concept like a boss! - <a href="https://youtu.be/eJF9hiv3c-A">https://youtu.be/eJF9hiv3c-A</a></p><p>One of the fundamental advantages of using the log scale is that you can illustrate both small and large values on the same graph. Log scales are also consistent because each unit increase signifies the same multiplier effect - e.g. each unit on a log10 scale is a 10x multiplier. 
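A quick Python sketch of the multiplier effect (the sample values are arbitrary; the decibel formula is the standard 10·log10 power ratio):

```python
import math

# On a linear scale, equal steps mean equal *additions*;
# on a log10 scale, equal steps mean equal *multiplications* (10x each).
values = [1, 10, 100, 1000, 10000]
log_positions = [math.log10(v) for v in values]
print(log_positions)  # [0.0, 1.0, 2.0, 3.0, 4.0] - evenly spaced on a log axis

# Decibels are a log scale of power ratios: every +10 dB is a 10x increase.
def db_to_power_ratio(db):
    return 10 ** (db / 10)

print(db_to_power_ratio(60) / db_to_power_ratio(50))  # 10.0
```

This is why values spanning many orders of magnitude fit comfortably on a single log-scaled chart.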
</p><p>I also never knew that the following popular scales are actually log scales:</p><p></p><ul style="text-align: left;"><li>Audio volume (decibels) - a 10x increase from 50 dB to 60 dB</li><li>Acidity (pH scale)</li><li>Richter scale (earthquakes)</li><li>Radioactivity (measure of radiation)</li></ul><p></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-69741266023152505692022-03-09T00:40:00.000+05:302022-03-09T00:40:02.214+05:30Ruminating on AI Knowledge Base<p>Many AI systems need a knowledge base that enables logical inference over the knowledge stored in it. This knowledge base is typically encoded using open standards such as RDF (Resource Description Framework). </p><p>To query RDF data, we typically use SPARQL - an RDF query language (i.e. a semantic query language for databases) able to retrieve and manipulate data stored in RDF. </p><p>A simple Python library to use SPARQL - <a href="https://sparqlwrapper.readthedocs.io/">https://sparqlwrapper.readthedocs.io/</a></p><p>Also some other helper classes if you are not comfortable with SPARQL - <a href="https://github.com/qcri/RDFframes">https://github.com/qcri/RDFframes</a></p><p>Other popular datastores that support RDF formats are <a href="https://graphdb.ontotext.com/" target="_blank">GraphDB</a> and <a href="https://neo4j.com/" target="_blank">Neo4J</a></p><p>The best example of a knowledge base (or graph) is the "Google Knowledge Graph" - a knowledge base used by Google and its services to enhance its search engine's results with information gathered from a variety of sources. 
The information is presented to users in an infobox next to the search results.</p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-50667937388682923362022-03-08T21:41:00.003+05:302022-03-23T01:13:48.049+05:30Ruminating on Audio Sample Rate and Bit Depth<p> Whenever we are dealing with AI-driven speech recognition, it is important to understand the fundamental concepts of sample rate and bit depth. The following article gives a good overview with simple illustrations. </p><p><a href="https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html">https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html</a></p><p>To convert a sound wave into data, we have to measure the amplitude of the wave at different points in time. The number of samples per second is called the sample rate. So a sample rate of 16kHz means 16,000 samples were taken in one second. </p><p>The bit depth determines the number of possible amplitude values we can record for each sample - e.g. 16-bit, 24-bit, and 32-bit. With a higher audio bit depth, more amplitude values are available for us to record. The following diagram from the above article illustrates this concept. 
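The arithmetic behind these two parameters is straightforward - a small sketch (the 16 kHz / 16-bit values mirror the example above; CD-quality 44.1 kHz stereo is a standard reference point):

```python
# Uncompressed audio size: sample_rate * (bit_depth / 8) bytes per second per channel.
def amplitude_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values a sample can take."""
    return 2 ** bit_depth

def bytes_per_second(sample_rate: int, bit_depth: int, channels: int = 1) -> int:
    return sample_rate * (bit_depth // 8) * channels

print(amplitude_levels(16))            # 65536 possible amplitude values at 16-bit depth
print(bytes_per_second(16000, 16))     # 32000 bytes/sec for 16 kHz, 16-bit mono speech
print(bytes_per_second(44100, 16, 2))  # 176400 bytes/sec for CD-quality stereo
```

This is why speech-recognition systems typically work with 8 or 16 kHz mono audio - it carries enough information for speech at a fraction of the data rate of music-quality audio.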
</p><div class="separator" style="clear: both; text-align: center;"><div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/a/AVvXsEg9-Gk2ukf5Tl-F0jnOQO1vFBvJPwUWEgC_KgXHha_jpJwc7HHu2uTrfTLpGo59AxQimhzO4GrwbBRrVNBDSG9zY-9xS0jzwlSTKqqJVXdf9q7ydTsdOzPDZnM2dQyb7_r7psgkhWr9Bf3g20tHRpRwCSudLY_wsp72hqhXT9gt0pPiVk7d-wg" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"><img alt="" data-original-height="334" data-original-width="590" height="252" src="https://blogger.googleusercontent.com/img/a/AVvXsEg9-Gk2ukf5Tl-F0jnOQO1vFBvJPwUWEgC_KgXHha_jpJwc7HHu2uTrfTLpGo59AxQimhzO4GrwbBRrVNBDSG9zY-9xS0jzwlSTKqqJVXdf9q7ydTsdOzPDZnM2dQyb7_r7psgkhWr9Bf3g20tHRpRwCSudLY_wsp72hqhXT9gt0pPiVk7d-wg=w445-h252" width="445" /></a></div><br /></div>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-44077792873130136982022-03-08T20:29:00.001+05:302022-03-08T20:29:13.726+05:30Free audiobooks<p>The below site has more than 16,000 audio books that are completely free for everyone!</p><p><a href="https://librivox.org/">https://librivox.org/</a></p><p>Would recommend everyone to check this out. </p><p><br /></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-12701404817221707382022-03-08T19:38:00.002+05:302022-03-08T19:38:25.479+05:30Ruminating on the Turing Test<p> The Turing Test was made famous by Alan Turing in the year 1950. The Turing Test essentially tests a computer's ability to communicate indistinguishably from a human. Turing originally called it the 'Imitation Game'. </p><p>A good introduction to the Turing Test can be found here - <a href="https://youtu.be/3wLqsRLvV-c">https://youtu.be/3wLqsRLvV-c</a></p><p>Though many claim that the Turing Test was passed by an AI chatbot called Eugene Goostman, in reality it is not so. 
No computer has ever passed the Turing Test - <a href="https://isturingtestpassed.github.io/">https://isturingtestpassed.github.io/</a></p><p>Intelligent chatbots have really come a long way - e.g. the Google Duplex demo: <a href="https://youtu.be/0YaAFRirkfk">https://youtu.be/0YaAFRirkfk</a></p><p>Maybe when we achieve AGI (Artificial General Intelligence), the Turing Test will truly be passed :)</p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-4188446951929538752022-03-06T20:32:00.002+05:302022-03-06T20:32:22.757+05:30Ruminating on n-gram models<p>The N-gram is a fundamental concept in NLP and is used in many language models. In simple terms, an N-gram is nothing but a sequence of N words - e.g. San Francisco (a 2-gram) and The Three Musketeers (a 3-gram). </p><p>N-grams are very useful because they can be used for making next-word predictions and correcting spelling or grammar. Ever wondered how Gmail is able to suggest auto-completion of sentences? This is possible because Google has created a language model that can predict next words.</p><p>N-grams are also used for correcting spelling errors - e.g. “drink cofee” could be corrected to “drink coffee” because the language model can predict that 'drink' and 'coffee' occurring together have a high probability. Also, the 'edit distance' between 'cofee' and 'coffee' is 1, hence it is likely a typo.</p><p>Thus N-grams are used to create probabilistic language models called n-gram models. N-gram models predict the occurrence of a word based on the previous N – 1 words. </p><p>The 'N' depends on the type of analysis we want to do - e.g. research has shown that trigrams and 4-grams work best for spam filtering. 
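Extracting N-grams and building a toy bigram "next word" model takes only a few lines of Python (the tiny corpus below is made up for illustration; real language models are trained on billions of words):

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """All contiguous sequences of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

corpus = "i like coffee . i like tea . i drink coffee".split()

# Bigram model: count how often each word follows another
follows = defaultdict(Counter)
for w1, w2 in ngrams(corpus, 2):
    follows[w1][w2] += 1

def predict_next(word):
    """Most likely next word under the bigram counts."""
    return follows[word].most_common(1)[0][0]

print(ngrams(["san", "francisco", "is", "foggy"], 2))  # the 2-grams of a sentence
print(predict_next("i"))  # 'like' - it followed 'i' twice, 'drink' only once
```

A real n-gram model would normalize these counts into probabilities and apply smoothing for unseen word pairs, but the counting idea is the same.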
</p><div>Some good info on N-grams is available at the Stanford University site - <a href="https://web.stanford.edu/~jurafsky/slp3/slides/LM_4.pdf">https://web.stanford.edu/~jurafsky/slp3/slides/LM_4.pdf</a></div><div><br /></div><div>Google Books also has an "Ngram Viewer" that displays a graph showing how phrases have occurred in a corpus of books (e.g., "British English", "English Fiction", "French") over the selected years. I found this useful in understanding which topics were popular in which years: <a href="https://books.google.com/ngrams">https://books.google.com/ngrams</a></div><div><br /></div>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-70053042939577173652022-03-06T18:52:00.003+05:302022-03-06T18:52:33.745+05:30Ruminating on Text Normalization<p>Text normalization is the process of converting text to a standard form before we use it for training AI NLP models. The following techniques are typically used for normalizing text. </p><p><b>Tokenization</b>: Tokenization is the process of breaking down sentences into words. In many Latin-derived languages, "space" is considered to be a word delimiter. But there are special cases such as 'New York', 'Rock-n-Roll' etc. Also, Chinese and Japanese do not have spaces between words. We may also want to tokenize emoticons and hashtags. </p><p><b>Lemmatization</b>: In this process, we check if words have the same 'root' - e.g. sings, sang. We then normalize the words to the common root word. </p><p><b>Stemming</b>: Stemming can be considered a form of Lemmatization wherein we just strip the suffixes from the end of the word - e.g. troubled and troubles can be stemmed to 'troubl'.</p><p>Lemmatization is more computationally intensive than Stemming because it actually maps the word to a dictionary and finds the root word. 
Stemming, on the other hand, just uses a crude heuristic process that chops off the ends of words in the hope of getting the root word. Stemming is thus much faster when you are dealing with a large corpus of text. The following examples will make the difference clear. </p><p></p><ul style="text-align: left;"><li>The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.</li><li>The word "walk" is the base form of the word "walking", and hence this is matched in both stemming and lemmatization.</li><li>If you lemmatize the word 'Caring', it would return 'Care'. If you stem it, it would return 'Car', which is erroneous.</li></ul><p></p><p><b>Sentence Segmentation</b>: This entails breaking up a long sentence into smaller sentences using characters such as '! ; ?'. </p><p><b>Spelling Correction and UK/US differences</b>: As part of the normalization activity, we may also want to correct some common spelling mistakes and normalize the spelling differences between UK and US English, like neighbour/neighbor.</p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-18019326142686838632022-02-22T02:32:00.004+05:302022-02-22T02:35:04.549+05:30Ruminating on Fourier Transformation<p> In my AI learning journey, I had to refresh my understanding of the Fourier transformation. The following video on YouTube is hands-down the best tutorial for understanding this maths concept.</p><p>Fourier transformations are used in AI Computer Vision models for use cases such as edge detection, image filtering, image reconstruction, and image compression. 
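To make the Fourier idea concrete, here is a naive discrete Fourier transform (DFT) in plain Python. It is O(N²) - real code would use an FFT library - and the 8-sample cosine signal is an arbitrary choice for illustration:

```python
import cmath
import math

def dft(samples):
    """Naive discrete Fourier transform: O(N^2), fine for illustration."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# A pure cosine completing 3 cycles over 8 samples...
n = 8
signal = [math.cos(2 * math.pi * 3 * t / n) for t in range(n)]
spectrum = [abs(x) for x in dft(signal)]
print([round(m, 2) for m in spectrum])
# ...shows energy only in bins 3 and 5 (5 = 8 - 3 is the mirrored frequency)
```

This decomposition of a signal (or an image row) into frequency components is exactly what the filtering and compression use cases above build upon - e.g. discarding high-frequency bins is the essence of lossy image compression.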
</p><div class="separator" style="clear: both; text-align: center;"><iframe allowfullscreen="" class="BLOG_video_class" height="266" src="https://www.youtube.com/embed/spUNpyF58BY" width="535" youtube-src-id="spUNpyF58BY"></iframe></div><br /><p><br /></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-1545583947330497882022-02-02T23:25:00.001+05:302022-02-02T23:25:17.175+05:30Fuzzy matching of Strings<p>Quite often during document processing or email processing tasks, we need to compare strings or search for keywords. While the traditional way of doing this would be using string comparison or RegEx, there are a number of other techniques available. </p><p>Fuzzy matching is an approximate string matching technique that is typically used to identify typos or spelling mistakes. Fuzzy matching algorithms try to measure how close two strings are to one another using a concept called '<a href="https://en.m.wikipedia.org/wiki/Edit_distance" target="_blank">Edit Distance</a>'. In simple words, the 'edit distance' is the number of edits required to make both strings the same. There are different types of edit distance measurements, as described in the Wikipedia article above. </p><p><a href="https://github.com/seatgeek/thefuzz" target="_blank">TheFuzz</a> is a cool Python library that can be used to measure the Levenshtein distance between sequences. The following articles will help you quickly grasp the basics of using the library. 
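Under the hood, these libraries compute the Levenshtein distance, which can be sketched with the classic dynamic-programming algorithm (the sample strings echo the typo example discussed earlier):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions and substitutions
    needed to turn string a into string b (classic dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(edit_distance("cofee", "coffee"))    # 1 - a single insertion fixes the typo
print(edit_distance("kitten", "sitting"))  # 3 - the classic textbook example
```

Libraries like TheFuzz then convert this raw distance into a similarity ratio (e.g. a percentage), which is usually more convenient for matching thresholds.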
</p><p><a href="https://www.activestate.com/blog/how-to-implement-fuzzy-matching-in-python/">https://www.activestate.com/blog/how-to-implement-fuzzy-matching-in-python/</a></p><p><a href="https://www.analyticsvidhya.com/blog/2021/07/fuzzy-string-matching-a-hands-on-guide/">https://www.analyticsvidhya.com/blog/2021/07/fuzzy-string-matching-a-hands-on-guide/</a></p><p><a href="https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe">https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe</a></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-58005761555484061502022-01-14T02:34:00.002+05:302022-01-14T02:41:40.724+05:30Ruminating on Snowflake Architecture<p> The following video is an excellent tutorial to understand how Snowflake can perform both as a Data Lake and a Data Warehouse.</p><p><a href="https://www.youtube.com/watch?v=jmVnZPeClag">https://www.youtube.com/watch?v=jmVnZPeClag</a></p><p>The following articles on Snowflake are also worth a perusal:</p><p><a href="https://www.snowflake.com/workloads/data-warehouse-modernization/">https://www.snowflake.com/workloads/data-warehouse-modernization/</a></p><p><a href="https://www.snowflake.com/guides/data-lake">https://www.snowflake.com/guides/data-lake</a></p><p>The following key concepts are important to understand to appreciate how Snowflake works:</p><p></p><ul style="text-align: left;"><li>Snowflake separates compute from storage and each can be scaled out independently</li><li>For storage, Snowflake leverages distributed cloud storage services like AWS S3, Azure Blob and Google Cloud Storage. This is cool since these services are already battle-tested for reliability, scalability and redundancy. Snowflake compresses the data in these cloud storage buckets. </li><li>For compute, Snowflake has a concept called a "virtual warehouse". 
A virtual warehouse is a simple bundle of compute (CPU) and memory (RAM) with some temporary storage. All SQL queries are executed in the virtual warehouse. </li><li>Snowflake can be queried using plain simple SQL - so no specialized skills are required. </li><li>If a query is fired frequently, the data is cached in memory. This "cache" is the magic that enables fast ad-hoc queries to be run against the data. </li><li>Snowflake enables a unified data architecture for the enterprise since it can be used as a Data Lake as well as a Data Warehouse. The 'variant' data type can store JSON, and this JSON can also be queried. </li></ul><div>Virtual warehouses provide dynamic scalability to the Snowflake DW. Snippets from the <a href="https://docs.snowflake.com/en/user-guide/warehouses-overview.html" target="_blank">Snowflake documentation</a>:</div><div><br /></div><div><i>"The number of queries that a warehouse can concurrently process is determined by the size and complexity of each query. As queries are submitted, the warehouse calculates and reserves the compute resources needed to process each query. If the warehouse does not have enough remaining resources to process a query, the query is queued, pending resources that become available as other running queries complete. If queries are queuing more than desired, another warehouse can be created and queries can be manually redirected to the new warehouse. In addition, resizing a warehouse can enable limited scaling for query concurrency and queuing; however, warehouse resizing is primarily intended for improving query performance. </i></div><div><i>With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, additional warehouses to make a larger pool of compute resources available". 
</i></div><p></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-43464698966552787412022-01-09T05:20:00.002+05:302022-01-09T05:20:24.923+05:30Ruminating on Joint Probability vs Conditional Probability vs Marginal Probability<p>The concept of conditional probability is very important for understanding Bayesian networks. An excellent introduction to these concepts is available in this video - <a href="https://youtu.be/5s7XdGacztw">https://youtu.be/5s7XdGacztw</a></p><p>As we know, probability is calculated as the number of desired outcomes divided by the total possible outcomes. Hence if we roll a die, the probability of getting a 4 is 1/6 (P = 0.166 = 16.66%).</p><p>Similarly, the probability of an event not occurring is called the complement (1 - P). Hence the probability of not rolling a 4 would be 1 - 0.166 = 0.833 ~ 83.33%.</p><p>While the above is true for a single variable, we also need to understand how to calculate the probability of two or more variables - e.g. the probability of lightning and thunder happening together when it rains. </p><p>When two or more variables are involved, we have to consider 3 types of probability:</p><p>1) <b>Joint probability</b> calculates the likelihood of two events occurring together and at the same point in time. For example, the joint probability of event A and event B is written formally as: P(A and B) or P(A ^ B) or P(A, B)</p><p>2) <b>Conditional probability</b> measures the probability of one event given the occurrence of another event. It is typically denoted as P(A given B) or P(A | B). For complex problems involving many variables, it is difficult to calculate the joint probability of all possible permutations and combinations. Hence conditional probability becomes a useful and easy technique to solve such problems. Please check the video link above.
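These definitions can be sanity-checked with a small enumeration sketch (my own toy example with two fair dice, not from the video):

```python
from fractions import Fraction

# Enumerate all 36 equally likely outcomes of rolling two fair dice.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]

def prob(event):
    """Probability = desired outcomes divided by total possible outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

# Event A: the first die shows 4. Event B: the total of the two dice is 7.
p_a = prob(lambda o: o[0] == 4)                        # P(A) = 1/6
p_b = prob(lambda o: sum(o) == 7)                      # P(B) = 1/6
p_a_and_b = prob(lambda o: o[0] == 4 and sum(o) == 7)  # joint P(A, B) = 1/36

# Conditional probability: P(A | B) = P(A, B) / P(B)
p_a_given_b = p_a_and_b / p_b                          # = 1/6

# Complement: probability of NOT rolling a 4 on the first die = 1 - P(A)
p_not_a = 1 - p_a                                      # = 5/6

print(p_a, p_b, p_a_and_b, p_a_given_b, p_not_a)
```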
</p><p>3) <b>Marginal probability</b> is the probability of an event irrespective of the outcome of another variable.</p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-67341843280068656172022-01-04T02:42:00.003+05:302022-01-04T02:42:17.317+05:30Confusion Matrix for Classification Models<p>Classification models are supervised ML models used to classify information into various classes - e.g. binary classification (true/false) or multi-class classification (facebook/twitter/whatsapp).</p><p>When it comes to classification models, we need a better metric than accuracy for evaluating the holistic performance of the model. The following article gives an excellent overview of the Confusion Matrix and how it can be used to evaluate classification models (and also tune their performance).</p><p><a href="https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/">https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/</a></p><p>Some snippets from the above article:</p><p><i>A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model.
This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.</i></p><p><i></i></p><div class="separator" style="clear: both; text-align: center;"><i><a href="https://lh3.googleusercontent.com/-EuNgJfwP8ts/YdNksB-tk1I/AAAAAAABtg4/H0C1G_zn2PQG4oA5luL-OO9gkGZvpeIYACNcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="280" data-original-width="396" height="226" src="https://lh3.googleusercontent.com/-EuNgJfwP8ts/YdNksB-tk1I/AAAAAAABtg4/H0C1G_zn2PQG4oA5luL-OO9gkGZvpeIYACNcBGAsYHQ/image.png" width="320" /></a></i></div><div class="separator" style="clear: both; text-align: left;"><i>A binary classification model will have false positives (aka Type 1 error) and false negatives (aka Type 2 error). A good example would be an ML model that predicts whether a person has COVID based on symptoms, and the confusion matrix would look something like the one below. </i></div><div class="separator" style="clear: both; text-align: left;"><i><br /></i></div><div class="separator" style="clear: both; text-align: left;"><div class="separator" style="clear: both; text-align: center;"><i><a href="https://lh3.googleusercontent.com/-Ix9_P7143fw/YdNlLcq9xWI/AAAAAAABthA/0BFGWpKylAIXR4JRMMhTHRfVB1HVzJ1xACNcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="268" data-original-width="363" height="236" src="https://lh3.googleusercontent.com/-Ix9_P7143fw/YdNlLcq9xWI/AAAAAAABthA/0BFGWpKylAIXR4JRMMhTHRfVB1HVzJ1xACNcBGAsYHQ/image.png" width="320" /></a></i></div><i><br />Based on the confusion matrix, we can calculate other metrics such as 'Precision' and 'Recall'.
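As a quick sketch (hypothetical labels of my own, not from the article), these counts and metrics can be computed in a few lines of plain Python:

```python
# Toy binary labels: 1 = COVID positive, 0 = negative (hypothetical data).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Confusion matrix counts.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)  # true positives
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # false positives (Type 1 error)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # false negatives (Type 2 error)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)  # true negatives

precision = tp / (tp + fp)   # of all predicted positives, how many were real
recall    = tp / (tp + fn)   # of all real positives, how many were caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean

print(tp, fp, fn, tn, precision, recall, f1)
```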
</i></div><div class="separator" style="clear: both; text-align: left;"><i>Precision is a useful metric in cases where False Positive is a higher concern than False Negatives.</i></div><p></p><div class="separator" style="clear: both;"><i>Recall is a useful metric in cases where False Negative trumps False Positive.</i></div><div class="separator" style="clear: both;"><div class="separator" style="clear: both;"><i>Recall is important in medical cases where it doesn’t matter whether we raise a false alarm but the actual positive cases should not go undetected!</i></div></div><p><i></i></p><div class="separator" style="clear: both; text-align: left;"><i>F1-score is a harmonic mean of Precision and Recall, and so it gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall.</i></div><i><br />Another illustration of a multi-class classification confusion matrix that predicts the social media channel. </i><p></p><p><i></i></p><div class="separator" style="clear: both; text-align: center;"><i><a href="https://lh3.googleusercontent.com/-oMT9Bb0_Byg/YdNmnSKqwbI/AAAAAAABthI/E8aall61n_87gnkBZrKOLttbXTZZIu-kgCNcBGAsYHQ/image.png" style="margin-left: 1em; margin-right: 1em;"><img alt="" data-original-height="347" data-original-width="396" height="261" src="https://lh3.googleusercontent.com/-oMT9Bb0_Byg/YdNmnSKqwbI/AAAAAAABthI/E8aall61n_87gnkBZrKOLttbXTZZIu-kgCNcBGAsYHQ/w298-h261/image.png" width="298" /></a></i></div><p></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-54484172564880015932022-01-03T19:53:00.005+05:302022-01-03T19:53:58.705+05:30Bias and Variance in MLBefore we embark on machine learning, it is important to understand basic concepts around bias, variance, overfitting and underfitting. 
<div><br /></div><div>The video below from StatQuest gives an excellent overview of these concepts: <a href="https://youtu.be/EuBBz3bI-aA">https://youtu.be/EuBBz3bI-aA</a></div><div><br /></div><div>Another good article explaining the concepts is here - <a href="https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229">https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229</a></div><div><br /></div><div>Bias is the difference (aka error) between the average prediction made by the ML model and the real data in the training set. The bias shows how well the model matches the training dataset. A low bias model will closely match the training data set. </div><div><br /></div><div>Variance refers to the amount by which the predictions would change if we fit the model to a different training data set.<i> </i>A low variance model will produce consistent predictions across different datasets. </div><div><i><br /></i></div><div>Ideally, we would want a <i>model with both a low bias and low variance</i>. But we often need to make a <i>trade-off between bias and variance. Hence we need to find a sweet spot between a simple model and a complex model. </i></div><div><b>Overfitting </b>means your model has a low bias but a high variance. It overfits the training dataset. </div><div><b>Underfitting </b>means your model has a high bias but a low variance. </div><div>If our model is too simple and has very few parameters, then it may have high bias and low variance (underfitting). On the other hand, if our model has a large number of parameters, then it is going to have high variance and low bias (overfitting).
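A minimal pure-Python sketch of these two extremes, using an assumed toy dataset: the underfit "model" always predicts the training mean (too simple, high bias), while the overfit "model" is a 1-nearest-neighbour lookup that memorises the training set (zero training error, high variance):

```python
import random

random.seed(0)

# Hypothetical noisy quadratic data, split into training and validation sets.
f = lambda x: 3 * x * x
train = [(x / 10, f(x / 10) + random.gauss(0, 0.2)) for x in range(11)]
val   = [((x + 0.5) / 10, f((x + 0.5) / 10) + random.gauss(0, 0.2)) for x in range(10)]

def mse(model, data):
    """Mean squared error of a prediction function over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

# Underfit (high bias): always predict the training mean - too simple.
mean_y = sum(y for _, y in train) / len(train)
underfit = lambda x: mean_y

# Overfit (high variance): 1-nearest-neighbour - memorises the training set,
# so its training error is exactly zero, but it chases the noise.
def overfit(x):
    return min(train, key=lambda p: abs(p[0] - x))[1]

print("underfit train/val MSE:", mse(underfit, train), mse(underfit, val))
print("overfit  train/val MSE:", mse(overfit, train), mse(overfit, val))
```

The overfit model's training error is zero while its validation error is not - that gap between training and validation performance is exactly what the validation-set check is meant to expose.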
So we need to find the right balance, without overfitting or underfitting the data.</div>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-279805492169128072022-01-03T19:07:00.000+05:302022-01-03T19:07:13.107+05:30Ruminating on high dimensional data<p> In simple terms, dimensions of a dataset refer to the number of attributes (or features) that a dataset has. This concept of dimensions of data is not new and is quite common in the data warehousing world, as explained <a href="https://www.narendranaidu.com/2005/12/what-are-multi-dimensional-databases.html" target="_blank">here</a>. </p><p>Many datasets can have a large number of features (variables/attributes) - e.g. in healthcare, signal processing and bioinformatics. When the number of dimensions is staggeringly high, <i>ML calculations become extremely difficult</i>. It is also possible for the number of features to exceed the number of observations (or records in a dataset) - e.g. microarrays, which measure gene expression, can contain hundreds of samples/records, but each record can contain tens of thousands of genes.</p><p>With such high-dimensional data, we experience something called the "<b>Curse of dimensionality</b>" - i.e. all records appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. The more dimensions we add to a data set, the more sparse the data becomes, and this results in an exponential decrease in the ML model's performance (i.e. predictive capabilities). </p><p>A typical rule of thumb is that there should be at least 5 training examples for each dimension in the dataset.
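The sparsity can be made concrete with a small sketch (my own illustration, not from the linked articles): the fraction of a hypercube's volume that lies inside its inscribed ball collapses towards zero as dimensions grow, so points scattered in the cube end up far from any "centre":

```python
import math

def ball_to_cube_volume_ratio(d):
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit ball."""
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)  # volume of the d-ball
    cube = 2 ** d                                      # volume of the d-cube
    return ball / cube

# The ratio shrinks rapidly: 1.0 in 1D, ~0.79 in 2D, ~0.52 in 3D,
# and essentially zero by 20 dimensions.
ratios = [ball_to_cube_volume_ratio(d) for d in (1, 2, 3, 5, 10, 20)]
print(ratios)
```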
Another interesting <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality" target="_blank">excerpt </a>from Wikipedia is given below: </p><p><i>In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon, which is also known as Hughes phenomenon. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily.</i></p><p>To handle high-dimensional datasets, data scientists typically apply various dimensionality reduction techniques - e.g. feature selection, feature projection, etc. More information about dimension reduction can be found <a href="https://www.analyticsvidhya.com/blog/2018/08/dimensionality-reduction-techniques-python/" target="_blank">here</a>. </p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-11304262172251213332021-12-30T02:32:00.002+05:302021-12-30T02:32:29.685+05:30Ruminating on ONNX format<p>Open Neural Network Exchange (<a href="https://onnx.ai/index.html" target="_blank">ONNX</a>) is an open standard format for representing machine learning models. While there are proprietary formats such as pickle (for Python) and MOJO (for H2O AI), there was a need to drive interoperability. </p><p>ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. ONNX also provides a definition of an extensible computation graph model, and each computation dataflow graph is structured as a list of nodes that form an acyclic graph.
Nodes have one or more inputs and one or more outputs. Each node is a call to an operator. The entire source code of the standard is available here - <a href="https://github.com/onnx/">https://github.com/onnx/</a></p><p><i>Thus ONNX enables an open ecosystem for interoperable AI models.</i> The <a href="https://github.com/onnx/models" target="_blank">ONNX Model Zoo</a> is a collection of pre-trained, state-of-the-art models in the ONNX format that can be easily reused in a plethora of AI frameworks. All popular AI frameworks such as TensorFlow, CoreML, Caffe2, PyTorch, Keras, etc. support ONNX. </p><p>There are also open-source tools that enable us to convert existing models into the ONNX format - <a href="https://github.com/onnx/onnxmltools">https://github.com/onnx/onnxmltools</a></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0tag:blogger.com,1999:blog-12669797.post-90905878417006774022021-12-30T02:09:00.002+05:302021-12-30T02:09:15.116+05:30ML Data Preparation - Impute missing values<p> In any AI project, data plumbing/engineering takes up 60-80% of the effort. Before training a model on the input dataset, we have to cleanse and standardize the dataset.</p><p>One common challenge is around missing values (features) in the dataset. We can either ignore these records or try to "impute" the value. The word "impute" means to assign a value to something by inference. </p><p>Since most AI models cannot work with blanks or NaN values (for numerical features), it is important to impute values of missing features in a recordset. </p><p>Given below are some of the techniques used to impute the value of missing features in the dataset. </p><p></p><ul style="text-align: left;"><li><b>Numerical features</b>: We can use the MEAN value of the feature or the MEDIAN value.
</li><li><b>Categorical features</b>: We can use the most frequent value in the training dataset, or a constant like 'NA' or 'Not Available'. </li></ul><p></p>Narendra Naiduhttp://www.blogger.com/profile/14883940950404721626noreply@blogger.com0
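The two imputation strategies above can be sketched with just the Python standard library (the data here is a hypothetical toy example):

```python
from statistics import mean, mode

# Hypothetical dataset with missing values represented as None.
ages     = [25, 30, None, 45, None, 38]                       # numerical feature
channels = ["email", None, "phone", "email", "email", None]   # categorical feature

# Numerical feature: impute with the MEAN (or MEDIAN) of the observed values.
observed_ages = [a for a in ages if a is not None]
age_mean = mean(observed_ages)                                # 34.5 for this data
ages_imputed = [a if a is not None else age_mean for a in ages]

# Categorical feature: impute with the most frequent observed value
# (alternatively, a constant like 'NA').
most_frequent = mode(c for c in channels if c is not None)
channels_imputed = [c if c is not None else most_frequent for c in channels]

print(ages_imputed)
print(channels_imputed)
```

In practice, scikit-learn's SimpleImputer wraps these same strategies behind its strategy parameter ('mean', 'median', 'most_frequent', 'constant').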