Tuesday, March 08, 2022

Ruminating on Audio Sample Rate and Bit Depth

When we are dealing with AI-driven speech recognition, it is important to understand the fundamental concepts of sample rate and bit depth. The following article gives a good overview with simple illustrations. 

https://www.izotope.com/en/learn/digital-audio-basics-sample-rate-and-bit-depth.html

To convert a sound wave into data, we have to measure the amplitude of the wave at different points in time. The number of samples taken per second is called the sample rate. So a sample rate of 16 kHz means 16,000 samples were taken in one second. 

The bit depth determines the number of possible amplitude values we can record for each sample - e.g. 16-bit, 24-bit and 32-bit. With a higher audio bit depth, more amplitude values are available for us to record. The diagram in the above article illustrates this concept well. 
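To make this concrete, here is a minimal Python sketch (using only the standard-library wave module) that prints the sample rate and bit depth of a WAV file; the file name 'sample.wav' is just a placeholder.

import wave

# Assumption: 'sample.wav' is any PCM WAV file on the local disk.
with wave.open("sample.wav", "rb") as wav:
    sample_rate = wav.getframerate()    # samples per second, e.g. 16000 for 16 kHz
    sample_width = wav.getsampwidth()   # bytes per sample, e.g. 2 bytes = 16-bit depth
    channels = wav.getnchannels()       # 1 = mono, 2 = stereo
    duration = wav.getnframes() / sample_rate
    print(f"{sample_rate} Hz, {sample_width * 8}-bit, {channels} channel(s), {duration:.2f}s")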


Free audiobooks

The site below has more than 16,000 audiobooks that are completely free for everyone!

https://librivox.org/

I would recommend everyone check this out. 


Ruminating on the Turing Test

The Turing Test was introduced by Alan Turing in 1950. It essentially tests a computer's ability to communicate indistinguishably from a human. Turing originally called it the 'Imitation Game'. 

A good introduction to the Turing Test can be found here - https://youtu.be/3wLqsRLvV-c

Though many claim that the Turing Test was passed by an AI chatbot called Eugene Goostman, in reality it was not. No computer has ever passed the Turing Test - https://isturingtestpassed.github.io/

Intelligent chatbots have really come a long way - e.g. the Google Duplex demo: https://youtu.be/0YaAFRirkfk

Maybe when we achieve AGI (Artificial General Intelligence), the Turing Test will finally be passed :)

Sunday, March 06, 2022

Ruminating on n-gram models

The N-gram is a fundamental concept in NLP and is used in many language models. In simple terms, an N-gram is nothing but a sequence of N words - e.g. 'San Francisco' (a 2-gram) and 'The Three Musketeers' (a 3-gram). 

N-grams are very useful because they can be used for making next-word predictions and for correcting spelling or grammar. Ever wondered how Gmail is able to suggest auto-completion of sentences? This is possible because Google has created a language model that can predict the next words.

N-grams are also used for correcting spelling errors - e.g. “drink cofee” could be corrected to “drink coffee” because the language model knows that 'drink' followed by 'coffee' has a high probability. Also, the 'edit distance' between 'cofee' and 'coffee' is 1, so it is most likely a typo.

Thus N-grams are used to create probabilistic language models called n-gram models. An n-gram model predicts the occurrence of a word based on the previous N-1 words, as the small sketch below shows. 
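Here is a tiny, purely illustrative bigram (2-gram) model built over a made-up toy corpus; it simply counts word pairs and turns the counts into next-word probabilities.

from collections import defaultdict, Counter

# Toy corpus purely for illustration
corpus = "i love coffee . i love tea . i drink coffee every morning .".split()

# Count bigrams: how often word2 follows word1
bigram_counts = defaultdict(Counter)
for w1, w2 in zip(corpus, corpus[1:]):
    bigram_counts[w1][w2] += 1

def next_word_probability(w1, w2):
    # P(w2 | w1) = count(w1, w2) / count(w1, *)
    total = sum(bigram_counts[w1].values())
    return bigram_counts[w1][w2] / total if total else 0.0

print(next_word_probability("i", "love"))        # 2/3 in this toy corpus
print(next_word_probability("drink", "coffee"))  # 1.0 in this toy corpus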

The choice of 'N' depends on the type of analysis we want to do - e.g. research has shown that trigrams and 4-grams work best for spam filtering. 

Some good info on N-grams is available on the Stanford University site - https://web.stanford.edu/~jurafsky/slp3/slides/LM_4.pdf

Google Books also has an N-Gram Viewer that displays a graph showing how phrases have occurred in a corpus of books (e.g. "British English", "English Fiction", "French") over the selected years. I found this useful for understanding which topics were popular in which years: https://books.google.com/ngrams

Ruminating on Text Normalization

Text normalization is the process of converting text to a standard form before we use it for training NLP models. The following techniques are typically used for normalizing text. 

Tokenization: Tokenization is the process of breaking down sentences into words. In many Latin-derived languages, the "space" is considered to be the word delimiter. But there are special cases such as 'New York', 'Rock-n-Roll', etc., and Chinese and Japanese do not use spaces between words. We may also want to tokenize emoticons and hashtags. 

Lemmatization: In this process, we check whether words share the same 'root' - e.g. sings, sang. We then normalize the words to the common root word. 

Stemming: Stemming can be considered a crude form of lemmatization wherein we just strip the suffixes from the end of the word - e.g. 'troubled' and 'troubles' are both stemmed to 'troubl'.

Lemmatization is more computationally intensive than stemming because it actually looks the word up in a dictionary to find the root word, whereas stemming just uses a crude heuristic that chops off the ends of words in the hope of getting the root. Stemming is thus much faster when you are dealing with a large corpus of text. The following examples (and the short code sketch after them) make the difference clear. 

  • The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
  • The word "walk" is the base form for word "walking", and hence this is matched in both stemming and lemmatisation.
  • If you lemmatize the word 'Caring', it would return 'Care'. If you stem, it would return 'Car' and this is erroneous.
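To see the difference in code, here is a small sketch using the NLTK library (one common choice; it assumes the WordNet data has been downloaded):

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # lemmatization needs the WordNet dictionary

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["troubled", "troubles", "walking", "caring"]:
    print(word,
          "| stem:", stemmer.stem(word),                     # crude suffix stripping
          "| lemma:", lemmatizer.lemmatize(word, pos="v"))   # dictionary look-up (as a verb)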

Sentence Segmentation: This entails breaking up a block of text into individual sentences using punctuation characters such as '.', '!', '?' and ';'. 

Spelling Correction and UK/US differences: As part of the normalization activity, we may also want to correct common spelling mistakes and normalize the spelling differences between UK and US English, like neighbour/neighbor.

Tuesday, February 22, 2022

Ruminating on Fourier Transformation

In my AI learning journey, I had to refresh my understanding of the Fourier transform. The following video on YouTube is hands-down the best tutorial for understanding this maths concept.

Fourier transforms are used in AI computer vision models for use cases such as edge detection, image filtering, image reconstruction and image compression. 
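As a flavour of how this works, here is a minimal NumPy sketch (on a synthetic 1-D signal rather than an image) that uses the FFT to filter out high-frequency noise - the same idea underlies frequency-domain image filtering.

import numpy as np

# Synthetic signal: a 5 Hz sine wave plus noise (sampled at 1 kHz for 1 second)
fs = 1000
t = np.arange(0, 1, 1 / fs)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.random.randn(t.size)

# Transform to the frequency domain, zero out everything above 10 Hz, transform back
spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)
spectrum[freqs > 10] = 0
filtered = np.fft.irfft(spectrum)   # a crude low-pass-filtered version of the signal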



Wednesday, February 02, 2022

Fuzzy matching of Strings

Quite often during document processing or email processing tasks, we need to compare strings or search for keywords. While the traditional way of doing this would be using String comparison or RegEx, there are a number of other techniques available. 

Fuzzy matching is an approximate string matching technique that is typically used to identify typos or spelling mistakes. Fuzzy matching algorithms try to measure how close two strings are to one another using a concept called 'edit distance'. In simple words, the 'edit distance' can be considered the number of edits required to make two strings the same. There are different types of edit distance measurements, as described on Wikipedia. 

TheFuzz is a cool Python library that can be used to measure the Levenshtein distance between sequences. The following articles will help you quickly grasp the basics of using the library. 

https://www.activestate.com/blog/how-to-implement-fuzzy-matching-in-python/

https://www.analyticsvidhya.com/blog/2021/07/fuzzy-string-matching-a-hands-on-guide/

https://towardsdatascience.com/fuzzy-string-matching-in-python-68f240d910fe
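A minimal sketch of TheFuzz in action (assuming the package has been installed, e.g. via pip install thefuzz):

from thefuzz import fuzz, process

# Similarity scores (0-100) based on Levenshtein distance
print(fuzz.ratio("drink cofee", "drink coffee"))        # high score for a single-character typo
print(fuzz.partial_ratio("coffee", "I love coffee"))    # substring-style matching

# Pick the best match for a misspelt keyword from a list of candidates
candidates = ["coffee", "toffee", "copy", "tea"]
print(process.extractOne("cofee", candidates))          # returns (best_match, score)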

Friday, January 14, 2022

Ruminating on Snowflake Architecture

The following video is an excellent tutorial for understanding how Snowflake can serve as both a data lake and a data warehouse.

https://www.youtube.com/watch?v=jmVnZPeClag

The following articles on Snowflake are also worth a perusal:

https://www.snowflake.com/workloads/data-warehouse-modernization/

https://www.snowflake.com/guides/data-lake

The following key concepts are important to appreciate how Snowflake works:

  • Snowflake separates compute from storage, and each can be scaled out independently.
  • For storage, Snowflake leverages distributed cloud storage services (AWS S3, Azure Blob Storage, Google Cloud Storage). This is cool since these services are already battle-tested for reliability, scalability and redundancy. Snowflake compresses the data in these cloud storage buckets. 
  • For compute, Snowflake has a concept called a "virtual warehouse". A virtual warehouse is simply a bundle of compute (CPU) and memory (RAM) with some temporary storage. All SQL queries are executed in a virtual warehouse. 
  • Snowflake can be queried using plain SQL - so no specialized skills are required. 
  • If a query is fired frequently, the data is cached. This cache is the magic that enables fast ad-hoc queries to be run against the data. 
  • Snowflake enables a unified data architecture for the enterprise since it can be used as a data lake as well as a data warehouse. The 'variant' data type can store JSON, and this JSON can also be queried directly with SQL. 
Virtual warehouses provide a kind of dynamic scalability to the Snowflake platform. Snippets from the Snowflake documentation:

"The number of queries that a warehouse can concurrently process is determined by the size and complexity of each query. As queries are submitted, the warehouse calculates and reserves the compute resources needed to process each query. If the warehouse does not have enough remaining resources to process a query, the query is queued, pending resources that become available as other running queries complete. If queries are queuing more than desired, another warehouse can be created and queries can be manually redirected to the new warehouse. In addition, resizing a warehouse can enable limited scaling for query concurrency and queuing; however, warehouse resizing is primarily intended for improving query performance. 
With multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, additional warehouses to make a larger pool of compute resources available". 
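To give a flavour of the 'plain SQL' experience, here is a minimal sketch using the snowflake-connector-python package; the account details and the 'events' table with its 'payload' variant column are placeholders for illustration.

import snowflake.connector

# All connection details and the 'events' table below are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="MY_VIRTUAL_WH", database="MY_DB", schema="PUBLIC",
)
cur = conn.cursor()

# Plain SQL, including querying JSON stored in a 'variant' column
cur.execute("SELECT payload:city::string AS city, COUNT(*) FROM events GROUP BY 1")
for row in cur:
    print(row)

cur.close()
conn.close()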

Sunday, January 09, 2022

Ruminating on Joint Probability vs Conditional Probability vs Marginal Probability

The concept of conditional probability is very important for understanding Bayesian networks. An excellent introduction to these concepts is available in this video - https://youtu.be/5s7XdGacztw

As we know, probability is calculated as the number of desired outcomes divided by the total number of possible outcomes. Hence if we roll a die, the probability that it shows a 4 is 1/6 (P ~ 0.166, i.e. 16.66%).

Similarly, the probability of an event not occurring is called the complement (1 - P). Hence the probability of not rolling a 4 is 1 - 0.166 = 0.833, i.e. ~83.33%.

While the above is true for a single variable, we also need to understand how to calculate probabilities involving two or more variables - e.g. the probability of lightning and thunder happening together when it rains. 

When two or more variables are involved, we have to consider three types of probability (a small worked example follows the list):

1) Joint probability calculates the likelihood of two events occurring together and at the same point in time. For example, the joint probability of event A and event B is written formally as:  P(A and B) or  P(A ^ B) or P(A, B)

2) Conditional probability measures the probability of one event given the occurrence of another event.  It is typically denoted as P(A given B) or P(A | B). For complex problems involving many variables, it is difficult to calculate joint probability of all possible permutations and combinations. Hence conditional probability becomes a useful and easy technique to solve such problems. Please check the video link above. 

3) Marginal probability is the probability of an event irrespective of the outcome of another variable.
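Here is a small worked example with two dice that computes all three probabilities simply by counting outcomes:

from itertools import product

# All 36 equally likely outcomes of rolling two dice
outcomes = list(product(range(1, 7), repeat=2))

A = {o for o in outcomes if o[0] == 4}           # event A: the first die shows 4
B = {o for o in outcomes if o[0] + o[1] >= 9}    # event B: the total is at least 9

p_a = len(A) / len(outcomes)                     # marginal probability P(A) = 6/36
p_a_and_b = len(A & B) / len(outcomes)           # joint probability P(A and B) = 2/36
p_a_given_b = len(A & B) / len(B)                # conditional probability P(A | B) = 2/10

print(p_a, p_a_and_b, p_a_given_b)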

Another excellent article explaining conditional probability with real life examples is here - https://www.investopedia.com/terms/c/conditional_probability.asp 


Tuesday, January 04, 2022

Confusion Matrix for Classification Models

Classification models are supervised ML models used to classify information into various classes - e.g. binary classification (true/false) or multi-class classification (Facebook/Twitter/WhatsApp).

When it comes to classification models, we need better metrics than plain accuracy for evaluating the holistic performance of the model. The following article gives an excellent overview of the confusion matrix and how it can be used to evaluate classification models (and also tune their performance).

https://www.analyticsvidhya.com/blog/2020/04/confusion-matrix-machine-learning/

Some snippets from the above article:

A Confusion matrix is an N x N matrix used for evaluating the performance of a classification model, where N is the number of target classes. The matrix compares the actual target values with those predicted by the machine learning model. This gives us a holistic view of how well our classification model is performing and what kinds of errors it is making.

A binary classification model will have false positives (aka Type 1 errors) and false negatives (aka Type 2 errors). A good example would be an ML model that predicts whether a person has COVID based on symptoms. 


Based on the confusion matrix, we can calculate other metrics such as 'Precision' and 'Recall'. 
Precision is a useful metric in cases where false positives are a higher concern than false negatives.

Recall is a useful metric in cases where false negatives are a bigger concern than false positives.
Recall is important in medical cases where it doesn’t matter if we raise a false alarm, but actual positive cases should not go undetected!

F1-score is a harmonic mean of Precision and Recall, and so it gives a combined idea about these two metrics. It is maximum when Precision is equal to Recall.
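Here is a minimal scikit-learn sketch (with made-up labels) that computes the confusion matrix and these metrics:

from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Toy ground truth and predictions for a binary COVID classifier (1 = positive)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(confusion_matrix(y_true, y_pred))                 # rows = actual, columns = predicted
print("Precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("Recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN)
print("F1 score: ", f1_score(y_true, y_pred))           # harmonic mean of the two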

The same idea extends to a multi-class confusion matrix - e.g. for a model that predicts the social media channel. 

Monday, January 03, 2022

Bias and Variance in ML

Before we embark on machine learning, it is important to understand basic concepts around bias, variance, overfitting and underfitting. 

The below video from StatQuest gives an excellent overview of these concepts: https://youtu.be/EuBBz3bI-aA

Another good article explaining the concepts is here - https://towardsdatascience.com/understanding-the-bias-variance-tradeoff-165e6942b229

Bias is the difference (i.e. error) between the average prediction made by the ML model and the real data in the training set. The bias shows how well the model matches the training dataset. A low-bias model will closely match the training data set. 

Variance refers to the amount by which the predictions would change if we fit the model to a different training data set. A low variance model will produce consistent predictions across different datasets. 

Ideally, we would want a model with both low bias and low variance, but we often need to trade off one against the other. Hence we need to find a sweet spot between a simple model and a complex model. 
Overfitting means your model has a low bias but a high variance - it overfits the training dataset. 
Underfitting means your model has a high bias but a low variance. 
If our model is too simple and has very few parameters, it may have high bias and low variance (underfitting). On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias (overfitting). So we need to find the right balance without overfitting or underfitting the data.
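A small scikit-learn sketch (on synthetic data) that makes the trade-off visible: an underfit, a balanced and an overfit polynomial model compared on training vs. test error.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Noisy samples of a sine curve (synthetic data purely for illustration)
rng = np.random.RandomState(0)
X = np.sort(rng.uniform(0, 6, 80)).reshape(-1, 1)
y = np.sin(X).ravel() + 0.3 * rng.randn(80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):   # underfit (high bias), balanced, overfit (high variance)
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(degree,
          "train MSE:", mean_squared_error(y_train, model.predict(X_train)),
          "test MSE:", mean_squared_error(y_test, model.predict(X_test)))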

Ruminating on high dimensional data

In simple terms, the dimensions of a dataset refer to the number of attributes (or features) that the dataset has. This concept of data dimensions is not new and is quite common in the data warehousing world, as explained here. 

Many datasets - e.g. in healthcare, signal processing and bioinformatics - have a large number of features (variables/attributes). When the number of dimensions is staggeringly high, ML calculations become extremely difficult. It is also possible for the number of features to exceed the number of observations (records) in a dataset - e.g. microarrays, which measure gene expression, may contain only hundreds of samples/records, but each record can contain tens of thousands of genes.

With such high-dimensional data, we experience something called the "curse of dimensionality" - i.e. all records appear to be sparse and dissimilar in many ways, which prevents common data organization strategies from being efficient. The more dimensions we add to a dataset, the sparser the data becomes, and this results in a sharp decrease in ML model performance (i.e. predictive capability). 

A typical rule of thumb is that there should be at least 5 training examples for each dimension in the dataset. Another interesting excerpt from Wikipedia is given below:   

In machine learning and insofar as predictive performance is concerned, the curse of dimensionality is used interchangeably with the peaking phenomenon, which is also known as Hughes phenomenon. This phenomenon states that with a fixed number of training samples, the average (expected) predictive power of a classifier or regressor first increases as the number of dimensions or features used is increased but beyond a certain dimensionality it starts deteriorating instead of improving steadily.

To handle high-dimensional datasets, data scientists typically apply dimensionality reduction techniques - e.g. feature selection, feature projection, etc. More information about dimensionality reduction can be found here. 
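As an illustration of feature projection, here is a minimal scikit-learn sketch that uses PCA to project a synthetic 50-dimensional dataset down to 10 dimensions:

import numpy as np
from sklearn.decomposition import PCA

# Synthetic dataset: 200 records with 50 features (a stand-in for high-dimensional data)
rng = np.random.RandomState(0)
X = rng.randn(200, 50)

# Project onto the 10 directions that capture the most variance
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (200, 10)
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained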

Thursday, December 30, 2021

Ruminating on ONNX format

Open Neural Network Exchange (ONNX) is an open standard format for representing machine learning models. While there are framework-specific formats such as pickle (for Python) and MOJO (for H2O.ai), there was a need to drive interoperability. 

ONNX defines a common set of operators - the building blocks of machine learning and deep learning models - and a common file format to enable AI developers to use models with a variety of frameworks, tools, runtimes, and compilers. ONNX also provides a definition of an extensible computation graph model and each computation dataflow graph is structured as a list of nodes that form an acyclic graph. Nodes have one or more inputs and one or more outputs. Each node is a call to an operator. The entire source code of the standard is available here - https://github.com/onnx/

Thus ONNX enables an open ecosystem for interoperable AI models. The ONNX Model Zoo is a collection of pre-trained, state-of-the-art models in the ONNX format that can be easily reused in a plethora of AI frameworks. All popular AI frameworks such as TensorFlow, CoreML, Caffe2, PyTorch, Keras, etc. support ONNX. 

There are also open-source tools that enable us to convert existing models into the ONNX format - https://github.com/onnx/onnxmltools
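As an illustration, here is a small sketch that exports a scikit-learn model to ONNX using the skl2onnx package (one such open-source converter) and then runs it with ONNX Runtime:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort

# Train a small scikit-learn model
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=500).fit(X, y)

# Convert it to the ONNX format and save it
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 4]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Load and run it with the framework-neutral ONNX Runtime
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name
print(session.run(None, {input_name: X[:3].astype(np.float32)}))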

ML Data Preparation - Impute missing values

In any AI project, data plumbing/engineering takes up 60-80% of the effort. Before training a model on the input dataset, we have to cleanse and standardize the dataset.

One common challenge is around missing values (features) in the dataset. We can either ignore these records or try to "impute" the value. The word "impute" means to assign a value to something by inference. 

Since most AI models cannot work with blanks or NaN values (for numerical features), it is important to impute values of missing features in a recordset.  

Given below are some of the techniques used to impute the value of missing features in the dataset (a small sketch follows the list). 

  • Numerical features: We can use the MEAN value of the feature or the MEDIAN value. 
  • Categorical features: For categorical features, we can use the most frequent value in the training dataset or we can use a constant like 'NA' or 'Not Available'. 
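A minimal scikit-learn sketch of both cases, using a tiny made-up DataFrame:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],                    # numerical feature with a missing value
    "city": ["Pune", "Mumbai", np.nan, "Pune"],     # categorical feature with a missing value
})

# Numerical feature: impute with the median (MEAN is another common choice)
df["age"] = SimpleImputer(strategy="median").fit_transform(df[["age"]]).ravel()

# Categorical feature: impute with the most frequent value (or a constant such as 'NA')
df["city"] = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]]).ravel()

print(df)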

Wednesday, December 29, 2021

Computer Vision Use Cases

viso.ai has published a series of interesting posts that list a number of potential use cases of computer vision (CV) across various industries. 

Posting a few links below: 

https://viso.ai/applications/computer-vision-in-manufacturing/

https://viso.ai/applications/computer-vision-in-retail/

https://viso.ai/applications/computer-vision-in-healthcare/

Jotting down a few use cases where we see traction in the industry: 

  • Quality inspection: CV can be used to inspect the final OEM product for quality defects - e.g. cracks on a surface, missing components on a PCB, dents on a metal surface, etc. Google has launched a specialized service for this called Visual Inspection AI.
  • Process monitoring: CV can ensure that proper guidelines are being followed during the manufacturing process - e.g. in meat plants, are knives sterilized after each cut? Are workers wearing helmets and gloves? Do workers put tools back in the correct location? You can also measure the time spent in specific factory-floor areas to implement 'lean manufacturing'. 
  • Worker safety: Has a worker been immobile for some time (a potential accident)? Are there too many workers in a dangerous factory area? Are workers wearing proper protective gear? During Covid times, monitor social distancing between workers and ensure compliance. 
  • Inventory management: Using drones (fitted with HD cameras) or even a simple smartphone, CV can be used to count the inventory of items in a warehouse. They can also be used for detecting unused space in the warehouse and automate many tasks in the supply chain.
  • Predictive maintenance: Deep learning has been used to find cracks in industrial components such as spherical tanks and pressure vessels.
  • Patient monitoring during surgeries: Just have a look at the cool stuff done by Gauss Surgical (https://www.gausssurgical.com/). With AI-enabled, real-time blood loss monitoring, Triton identifies four times more hemorrhages.
  • Machine-assisted Diagnosis: CV can help radiologists in more accurate diagnosis of ailments and detect patterns and minor changes far better than humans. Siemens Healthineers is doing a lot of interesting AI work to assist radiologists. More info can be found here -- https://www.siemens-healthineers.com/digital-health-solutions/artificial-intelligence-in-healthcare
  • Home-based patient monitoring: Detect human fall scenarios in old patients - at home or in elderly care centers. 
  • Retail heat maps: Test new merchandising strategies and experiment with various layouts. 
  • Customer behavior analysis: Track how much time a user has spent looking at a display item. What is the conversion rate?
  • Loss prevention: Detect potential theft scenarios - e.g. suspicious behavior. 
  • Cashierless retail store: Amazon Go !
  • Virtual mirrors: The equivalent of Snapchat filters, where you can virtually try on new fashion. 
  • Crowd management: Detect unusual crowds in public spaces and proactively take action. 
A huge number of interesting CV case studies are also available on roboflow's site here - https://blog.roboflow.com/tag/case-studies/

Thursday, December 23, 2021

Ruminating on Stochastic vs Deterministic Models

The below article is an excellent read on the difference between stochastic forecasting and deterministic forecasting. 

https://blog.ev.uk/stochastic-vs-deterministic-models-understand-the-pros-and-cons

Snippets from the article:

"A Deterministic Model allows you to calculate a future event exactly, without the involvement of randomness. If something is deterministic, you have all of the data necessary to predict (determine) the outcome with certainty. 

Most financial planners will be accustomed to using some form of cash flow modelling tool powered by a deterministic model to project future investment returns. Typically, this is due to their simplicity. But this is fundamentally flawed when making financial planning decisions because they are unable to consider ongoing variables that will affect the plan over time.

Stochastic models possess some inherent randomness - the same set of parameter values and initial conditions will lead to an ensemble of different outputs. A stochastic model will not produce one determined outcome, but a range of possible outcomes, this is particularly useful when helping a customer plan for their future.

By running thousands of calculations, using many different estimates of future economic conditions, stochastic models predict a range of possible future investment results showing the potential upside and downsides of each."
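Here is a small, purely illustrative NumPy sketch that contrasts the two approaches: one deterministic projection vs. thousands of simulated outcomes (all return assumptions below are made up).

import numpy as np

initial, years = 100_000, 20

# Deterministic model: one fixed return assumption gives one exact answer
fixed_return = 0.05
deterministic_outcome = initial * (1 + fixed_return) ** years

# Stochastic model: draw random annual returns thousands of times and look at the spread
rng = np.random.default_rng(42)
annual_returns = rng.normal(loc=0.05, scale=0.12, size=(10_000, years))
stochastic_outcomes = initial * np.prod(1 + annual_returns, axis=1)

print(round(deterministic_outcome))
print(np.percentile(stochastic_outcomes, [10, 50, 90]).round())   # downside / median / upside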

Tuesday, December 14, 2021

Ruminating on Memory Mapped Files

Today, both Java and .NET have full support for memory-mapped files. Memory-mapped files are not only very fast for read/write operations, but they also serve another important function: they can be shared between different processes. This enables us to use memory-mapped files as a shared data store for IPC (inter-process communication). 

The following excellent articles by Microsoft and MathWorks explain how to use memory-mapped files. 

https://docs.microsoft.com/en-us/dotnet/standard/io/memory-mapped-files

https://www.mathworks.com/help/matlab/import_export/overview-of-memory-mapping.html

Snippets from the above articles:

A memory-mapped file contains the contents of a file in virtual memory. This mapping between a file and memory space enables an application, including multiple processes, to modify the file by reading and writing directly to the memory. 

Memory-mapped files can be shared across multiple processes. Processes can map to the same memory-mapped file by using a common name that is assigned by the process that created the file.

Accessing files via memory map is faster than using I/O functions such as fread and fwrite. Data is read and written using the virtual memory capabilities that are built in to the operating system rather than having to allocate, copy into, and then deallocate data buffers owned by the process. 

Memory-mapping works best with binary files, and in the following scenarios:

  • For large files that you want to access randomly one or more times
  • For small files that you want to read into memory once and access frequently
  • For data that you want to share between applications
  • When you want to work with data in a file as if it were an array

Due to limits set by the operating system, the maximum amount of data you can map with a single instance of a memory map is 2 gigabytes on 32-bit systems, and 256 terabytes on 64-bit systems.
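Python also has memory-mapped file support in its standard library; here is a minimal sketch (the file name is a placeholder) just to show the programming model - the Java and .NET APIs follow the same idea.

import mmap

# Create a small data file to map (for illustration)
with open("shared.dat", "wb") as f:
    f.write(b"\x00" * 1024)

# Map the file into memory and read/write it like a byte array
with open("shared.dat", "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)   # 0 = map the whole file
    mm[0:5] = b"hello"              # writes go straight to the mapped pages
    print(mm[0:5])
    mm.flush()                      # persist changes to disk
    mm.close()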

Ruminating on TTS and ASR (STT)

Advances in deep neural networks have helped us build far more accurate voice bots. While adding a voice channel to a chatbot, two important services are required:

1) TTS (text to Speech)

2) ASR/STT (Automatic Speech Recognition or Speech to Text)

For TTS, there are a lot of commercial cloud services available - a few examples are given below:

Similarly, for speech recognition, there are many commercial services available:


If you are looking for open source solutions for TTS & STT, then the following open source projects look promising:
A few articles that explain the setup:

Friday, September 24, 2021

Ruminating on R&D investment capitalization

When organizations invest in R&D activities, there are accounting rules for claiming tax credits. The reason governments give tax credits is to encourage innovation through R&D spending. 

Earlier, the accounting rules mandated that all R&D investments be shown as "expenses" on the Profit and Loss (P&L) statement in the same year the investments were made. But this poses a problem, as the returns of an R&D investment may accrue over multiple future years. Also, R&D investments can vary widely every year, and a company's net profit can be significantly higher or lower purely because of the timing of R&D spending. 

Hence under the new accounting rules, instead of treating an R&D investment as an "expense" in the P&L statement, it is treated as an "asset" on the balance sheet. This is referred to as "capitalizing the R&D expense". 

Once the R&D investment is treated as an "asset", the company can claim amortization (or depreciation) benefits on it over multiple years. For example, a software firm invested 1 MM USD in R&D in 2021 to build a product, and the economic life of the product is 5 years. The company can then amortize/depreciate the capitalized R&D expense equally over the five-year life. This amortized amount per year is a line item in the yearly P&L statement. 
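The arithmetic of the example is simply straight-line amortization:

# Straight-line amortization of the capitalized R&D investment (figures from the example above)
investment = 1_000_000       # USD spent on R&D in 2021
useful_life_years = 5        # economic life of the product

amortization_per_year = investment / useful_life_years
print(amortization_per_year)   # 200,000 USD hits the P&L in each of the five years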

A good explanation of this concept is given here - https://corporatefinanceinstitute.com/resources/knowledge/accounting/capitalizing-rd-expenses/

Wednesday, September 08, 2021

Ruminating on Systems Thinking

Systems Thinking is an approach to problem solving that emphasizes the fact that most systems are complex and do not have a linear cause-effect relationship. We need to view the problem as part of a wider dynamic system and understand how different parts interact with or influence one another. 

The crux of systems thinking is the ability to zoom out and look at the big picture (holistic view) - to understand the various factors or perspectives that play a role. 

By steering away from linear thinking into systems thinking, we will make better decisions in all aspects of business. Systems thinking also helps us to think in terms of ecosystems and not just be confined by traditional boundaries. 

The following video is an excellent tutorial on Systems Thinking - https://youtu.be/GPW0j2Bo_eY

Other good links showing examples of system thinking: 

- Use of DDT for mosquitos: https://thesystemsthinker.com/making-the-jump-to-systems-thinking/

- Systems thinking for water services - e.g. understanding the impact of rainfall, educating customers to improve the quality of wastewater, external factors such as new building development, rising river levels, etc : https://www.waterindustryjournal.co.uk/united-utilities-rolls-out-systems-thinking-approach-to-wastewater-network-management

Wednesday, September 01, 2021

Ruminating on Data Drift and Concept Drift

Quite often, the performance (prediction accuracy) of an AI model degrades over time. One of the reasons this happens is a phenomenon called "data drift". 

So what exactly is Data Drift? 

Data drift can be defined as any change to the structure, semantics or statistical properties of the data - i.e. the model's input data.  

  • Changes to the structure of data: new fields being added or old fields deleted. This could happen because of an upgrade to an upstream system, a new sensor, etc. 
  • Changes to the semantics of data: a new upstream system starts sending temperature in Fahrenheit instead of Celsius. 
  • Changes to the statistical properties of data: e.g. shifts in atmospheric pressure levels due to environmental changes. There could also be data quality issues, such as a bug in an upstream system that delivers junk data. 

To maintain the accuracy of our AI models, it is imperative that we measure and monitor data drift. Our machine learning infrastructure needs to have tools that automatically detect data drift and can pinpoint the features that are causing the drift. 
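One simple, illustrative way to detect drift in a single numerical feature is a two-sample Kolmogorov-Smirnov test comparing the training distribution with live data (the numbers below are synthetic):

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature distribution seen at training time vs. what the live system sees today (synthetic)
training_temperatures = rng.normal(loc=25.0, scale=3.0, size=5_000)
live_temperatures = rng.normal(loc=28.5, scale=3.0, size=5_000)   # the mean has shifted

# A small p-value suggests the two distributions differ, i.e. possible drift
statistic, p_value = ks_2samp(training_temperatures, live_temperatures)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic={statistic:.3f}, p={p_value:.2e})")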

Changes to the underlying relationships and patterns in the data are also called "concept drift". A classic example is the current pandemic: the "behaviour" or "concept" changed after the pandemic - e.g. models that predict commute times are no longer valid, and models that forecasted the number of cosmetic surgeries in 2021 are no longer valid. 

Most of the hyperscalers provide services that enable us to monitor data drift and take proactive actions. The below links provide some examples of cloud services for data drift:

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-monitor-datasets?tabs=python

https://aws.amazon.com/sagemaker/model-monitor/

https://cloud.google.com/blog/topics/developers-practitioners/event-triggered-detection-data-drift-ml-workflows

Thursday, February 04, 2021

Process Discovery vs. Process Mining

The first step in automation is to gain a complete understanding of the current-state business processes. Often, enterprises do not have a knowledge base of all their existing processes, and hence identifying automation opportunities becomes a challenge. 

There are two techniques to address this challenge:

Process Discovery

Process Discovery tools typically record an end-user's interactions with the business applications. This recording is then intelligently transformed into process maps, BPMN diagrams and other technical documentation. The output from the Process Discovery tool can also serve as a foundation for building automation bots. Given below are some examples of Process Discovery tools:

Process Mining

Process Mining is all about extracting knowledge from event data. Today, most IT systems and applications emit event logs that can be captured and analysed centrally using tools such as Splunk or the ELK stack. As one can imagine, the biggest drawback of Process Mining is that it does not work if your legacy IT systems do not emit logs/events. 

By analysing event data, Process Mining tools can model business processes and also capture information on how each process performs in terms of latency, cost and other KPIs (from the log/event data). 

Given below are some examples of Process Mining tools:


Tuesday, February 02, 2021

Ticket Triaging with AI

The link below is a comprehensive article that introduces "ticket triaging" and explains how AI can be used to automate this important step. 

https://monkeylearn.com/blog/how-to-implement-a-ticket-triaging-system-with-ai/

Some snippets from the article:

"Automated ticket triaging involves evaluating and directing support tickets to the right person, quickly and effectively, and even sorting tickets in order of importance, topic or urgency. Triage powered by Natural Language Processing (NLP) technology is well-equipped to understand support ticket content to ensure the right ticket gets to the right agent.

NLP can process language like a human, by reading between the lines and detecting language variations, to make sense of text data before it categorizes it and allocates it to a customer support agent. The big advantages of using NLP is that it can triage faster, more accurately and more objectively than a human, making it a no-brainer for businesses when deciding to switch from manual triage to auto triage."
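As a toy illustration of the idea (not the specific approach described in the article), here is a minimal scikit-learn text classifier that routes tickets to queues; the tickets and queues below are made up.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up training set: ticket text -> the queue it should be routed to
tickets = [
    "I cannot log in to my account",        "password reset link not working",
    "I was charged twice this month",       "need a copy of my invoice",
    "app crashes when I open settings",     "error 500 when uploading a file",
]
queues = ["Access", "Access", "Billing", "Billing", "Technical", "Technical"]

# TF-IDF features + a simple linear classifier as the triage model
triage_model = make_pipeline(TfidfVectorizer(), LogisticRegression())
triage_model.fit(tickets, queues)

print(triage_model.predict(["forgot my password again", "refund for duplicate charge"]))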

Wednesday, December 16, 2020

Ruminating on JAB (Java Access Bridge)

Quite often, we need to build automation around Java desktop apps. The default Microsoft UI Automation framework is not able to identify Java UI components (based on Swing). 

Hence we need to use a bridge, and this technology is called JAB (Java Access Bridge). Using the JAB DLL on Windows, you can access Java client UI controls and build your automation use case. Commercial tools such as UiPath and Automation Anywhere (AA) also use JAB behind the scenes for Java desktop automation. 

Google has also released a cool tool called "Access Bridge Explorer" that allows exploring, as well as interacting with, the accessibility tree of any Java application that uses the Java Access Bridge to expose its accessibility features.


Wednesday, October 28, 2020

Finding the .NET version on windows

 In Java, you can execute a simple command "java -version", but unfortunately it is not so straightforward in .NET.

The below stackoverflow thread shows some commands that can be leveraged to find out the .NET version - https://stackoverflow.com/questions/1565434/how-do-i-find-the-net-version

The commands which worked for me on Win10 are as follows:

Command Prompt (cmd):

dir /b /ad /o-n %systemroot%\Microsoft.NET\Framework\v?.*

The above command will list down all versions of .NET except v4.5 and above. 

reg query "HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\full" /v version

The above command will work for .NET versions v4.5 and above. 

PowerShell:

gci 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP' -recurse | gp -name Version,Release -EA 0 | where { $_.PSChildName -match '^(?!S)\p{L}'} | select PSChildName, Version, Release

The above powershell command is the most versatile and will list down all versions. 

Saturday, October 24, 2020

Ruminating on Ontology

An ontology is nothing but a structure for knowledge (a language to represent knowledge). The primary objective of an ontology is to describe the knowledge about a domain (its concepts and relationships). 

In data science, ontologies can be used to create a semantic layer on top of existing data to make existing organizational data FAIR (Findable, Accessible, Interoperable and Reusable). 

Protégé (https://protege.stanford.edu/) is an excellent free open-source tool for creating ontologies. A fun example of a pizza ontology can be found here.

The Web Ontology Language (OWL) is a suite of knowledge representation languages for authoring ontologies. For UML lovers, there is also an OWL profile for UML. A good example of using the OWL profile for UML is here - https://www.researchgate.net/figure/A-first-sample-ontology-depicted-using-the-UML-profile-for-OWL-ontologies_fig4_228867697
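For a quick hands-on feel, here is a small sketch using the rdflib Python library to define a couple of OWL classes and a relationship; the pizza namespace below is made up for illustration.

from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/pizza#")   # a made-up namespace for illustration
g = Graph()
g.bind("ex", EX)

# Concepts (classes) and a relationship between them
g.add((EX.Pizza, RDF.type, OWL.Class))
g.add((EX.Margherita, RDF.type, OWL.Class))
g.add((EX.Margherita, RDFS.subClassOf, EX.Pizza))
g.add((EX.Margherita, RDFS.label, Literal("Margherita pizza")))

print(g.serialize(format="turtle"))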

Friday, October 02, 2020

Ruminating on Automation (RPA) Security Risks

 Intelligent automation & RPA can drive operational efficiencies at organizations and help boost the productivity of enterprise resources. But there is also a risk of cyber-attacks as bots introduce a new attack surface for hackers. 

Without proper measures, enterprises may face increased risk exposure due to bots. The following recommendations would enable organizations to mitigate the risk of such security attacks.

  1. Secure Vault: All credentials required by the bot to execute tasks on applications should be securely stored in a vault (e.g. CyberArk or HashiCorp Vault). This ensures that the target application credentials are not stored by the bot and are only accessed at runtime by the bot from the vault (see the sketch after this list). 

  2. Least Privilege Access: Bots should not be given blanket access to perform all operations, but only the access required for their automation use cases - e.g. many automation use cases need only 'read-only' access to databases/applications. 

  3. Selecting appropriate automation use cases: While down-selecting automation use cases, it is good to have 'security risk' as an assessment parameter. If a bot needs admin access across multiple applications to perform critical business functions, the organization can decide NOT to automate that use case and instead hand the case over to a knowledge worker. The bot can enable this smooth transition to humans (via a workflow or case management tool).

  4. Change Bot passwords/secure keys: As a security best practice, change the passwords and secure keys for the bots (and machines where bots run) regularly (e.g. once a month). 

  5. Security Testing of Bots: Ultimately bots are also software components and we need to make sure that the bots undergo both static code security analysis and runtime security testing. 

  6. Audit Trail & Proactive Monitoring: The automation framework should provide a detailed audit log of all bot activities. Each and every step executed by the bot should be available for forensic audit if required. Proactive monitoring of this audit log can also be automated to quickly alert users of any anomaly pattern or security breach. 

  7. Governance Framework: Last but not least, it is important to set up a proper governance framework for bot lifecycle management. The governance framework should clearly define the roles and responsibilities and the proper process to be followed for the entire bot lifecycle. 
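To illustrate recommendation 1, here is a minimal sketch using the hvac client for HashiCorp Vault (one possible option); the Vault address, token handling and secret path are placeholders.

import hvac

# Placeholders: the Vault address, token handling and secret path depend on your setup
client = hvac.Client(url="https://vault.example.com:8200", token="s.xxxxxxxx")

# Fetch the target application's credentials at runtime (KV v2 secrets engine)
secret = client.secrets.kv.v2.read_secret_version(path="bots/claims-app")
username = secret["data"]["data"]["username"]
password = secret["data"]["data"]["password"]

# The bot uses the credentials immediately and never persists them to disk or logs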

Thursday, August 27, 2020

Automating API creation over existing data-sources

If you need to quickly build REST APIs over existing data in Excel sheets, RDBMSs or NoSQL data stores, then the DreamFactory toolkit is immensely helpful. 

DreamFactory is available under the Apache 2 license, and you can host it on your public/private cloud platforms. With just a few clicks, you will have a complete, secure API service available to perform CRUD operations on your data. DreamFactory runs on the LAMP stack (Linux, Apache, MySQL, PHP/Perl/Python). 

Besides generating the APIs, DreamFactory can also dynamically generate a Client SDK for HTML5 frameworks like jQuery, AngularJS, and Sencha, and a code library for native clients like iOS, Windows 8, and Android.

DreamFactory also has a toolkit to convert your SOAP webservices into REST APIs - https://www.dreamfactory.com/resources/video/how-turn-any-soap-web-service-rest-api/

It also has a cool feature called DataMesh wherein we can create virtual foreign key relationships between tables in the same database or between completely different databases without altering your schema or writing any code. Create, read, update, or delete objects and related objects with a single API call.

DreamFactory also automatically creates Swagger API documentation and has basic API Management capabilities built in (e.g. rate limiting)

Displaying PDF, Word, PPT docs on web pages

 If you need to display documents in PDF, Word or PPT formats on your web application, then the following JavaScript toolkit will come to your rescue. 

ViewerJS - https://viewerjs.org/

ViewerJS uses 2 other JS libraries behind the scenes - PDF.js (http://mozilla.github.io/pdf.js/) and WebODF (https://webodf.org/)

Any document that follows the Open Document Format (ODF) can be rendered using ViewerJS. Microsoft Office documents can also be saved in the ODF format and then rendered. 

Wednesday, August 12, 2020

Ruminating on the hype around hyper-automation

 So what exactly is hyper-automation?

In simple terms, 'hyper-automation' is going beyond plain RPA. In RPA, the bot just mimics what a human would do. 

But in hyper-automation, along with the RPA tool, we use other technologies like AI/ML, workflow engines and rules engines to automate end-to-end (E2E) process flows. Thus we make the business process more streamlined and robust - essentially better than what the human was doing. 

Even Process Discovery tools are considered to be an important part of the hyper-automation journey. Process Discovery tools enable us to understand the current-state business processes and offer suggestions on streamlining them. It does not make sense to automate a fragmented process; we should first streamline the process, remove redundancies and then automate it. 

Example - a bot reading an unstructured email, understanding its intent and extracting relevant data points from it (AI), and then completing the requested transaction via a desktop app (RPA).