Thursday, June 20, 2024

Calculating tokens for words

For LLM applications, we often use embedding models like ada-002 or davinci models. While using these models, we need to often estimate the number of tokens that would required for our application. 

For the English language, a good thumb rule is that 3 to 4 chars make up a token. 

A nifty online tool that can help you estimate the number of tokens is: https://www.quizrise.com/token-counter


Wednesday, June 19, 2024

Database Manager - DBGate

 If you are looking for a web-based database manager that is open-source and commercial friendly, then please have a look at DBGate: https://github.com/dbgate/dbgate

Since it is web-based, you can see a good demo here: https://demo.dbgate.org/


Tuesday, June 11, 2024

#no-build for JS: Embracing the Era of Import Maps and HTTP/2

Complex build processes have dominated the web development world for years. JavaScript module bundling, transpiling, and management have become critical tasks for tools such as Webpack. However, what if there was an easier method? 

It is time for us to "un-learn" old tricks and rid ourselves of the baggage of old knowledge! Modern browsers support HTTP/2 and "import maps" for javascript modules and this is an absolute game changer!

In the past, web browsers had trouble comprehending contemporary JavaScript features and had trouble loading many small files quickly. Webpack and other build tools addressed these problems by:

  • Bundling: Putting several JavaScript files together into a single, bigger file to cut down on HTTP requests.
  • Transpiling:transforming current JavaScript syntax into an earlier iteration that works with an earlier browser.

Although these technologies were helpful to us, they increased complexity:
  • Build Configuration: Build procedures frequently cause disruptions to the development workflow, necessitating continuous rebuilding and waiting. 
  • Development Workflow Disruption: Build configurations can be time-consuming to set up and manage.

But all modern browsers (Chrome, Firefox, Safari, Edge) support HTTP/2 and "Import Maps" for JS modules. 
  • Import maps enables us to use alias for full JS module paths - it is a method of creating a central registry that associates module names with their real locations is. This makes things clearer and more versatile by removing the requirement for you to write whole file paths in your code (across multiple files). 
  • HTTP/2 is a more effective and rapid method of sending data across the internet. It makes it unnecessary to bundle files because browsers can handle several tiny files well. Instead of opening many connections to the server (like having many waiters running around), HTTP/2 uses one connection. Thus, multiple JS files can be downloaded concurrently, speeding up page load times.

Friday, June 07, 2024

Should we serve JS files from a CDN?

An excellent article that states why we need to actually use a CDN for serve popular JS libraries:  https://blog.wesleyac.com/posts/why-not-javascript-cdn

Excerpt from the article: 

"This means that one of these CDNs going down, or an attacker hacking one of them would have a huge impact all over society — we already see this category of problem with large swaths of the internet going down every time cloudflare or AWS has an outage."

Thursday, May 09, 2024

Ruminating on Core Web Vitals

Have you ever clicked on a webpage only to spend a long time staring at a blank screen? Yes, it is frustrating. That's bad for the website (since you could just click away) and bad for you (because you're waiting).

This is where Core Web Vials are useful -- they are a set of metrics defined by Google to measure a website’s loading speed, interactivity and visual stability.  In essence, there are three things that websites must do well in order to ensure that users have a positive experience. 

  • Quick loading (also known as Largest Contentful Paint, or LCP): This refers to how quickly a webpage's major content loads. A decent load time is defined as 2.5 seconds or less. 
  • Smooth interactions (First Input Delay or FID): This is about how responsive a webpage feels. If you click a button and nothing happens for a while, that's a sign of bad FID. We want those clicks to feel instant, just like if you were pushing a real button. A decent speed is one that is less than 100 milliseconds.
  • Stable visuals (Cumulative Layout Shift or CLS): This one's about how much the content on a webpage jumps around as it loads. Imagine you start reading a recipe and then all the ingredients suddenly jump to different places on the page - that's bad CLS! We want the content to stay put so you can focus on what you're reading. A score of under 1.0 is good.
More information on Core Web Vitals can be found here -- https://developers.google.com/search/docs/appearance/core-web-vitals



The top recommentations to improve the core web vitals are as follows:
  • Optimize Image size and Image loading: Large, poorly optimized pictures are a primary cause of sluggish loading speeds.  Reducing the size of your pictures, using compression techniques, and turning on lazy loading—which loads images only when the user scrolls—will all help you get a higher Largest Contentful Paint (LCP) score. Also implement lazy loading - i.e. make images load only when visitors scroll down to them. 
  • Caching:  Enable browser caching to store website elements on a user's device.  This way, the browser doesn't have to download everything all over again each time they visit your site. Leveraging a CDN would also help here. 
  • JS & CSS optimization: Minify and compress your CSS and Javascript files.  This can significantly reduce their size and improve loading times.
  • Preload: Preloading instructs the browser to fetch specific resources early, even before they're explicitly requested by the page. You can preload resources using the <link rel="preload"> tag in the <head> section of your HTML document.

Monday, January 22, 2024

Gaia-X & Catena-X: Data usage governance and Sovereignty

Unless you have been living under a rock, you must have heard about Gaia-X. The whole premise of Gaia-X was to build a fair and open data ecosystem for Europe, where everyone can share information securely and control who gets to use it. It's like a big marketplace for data, but with European values of privacy and control at its heart.

The core techical concepts that we need to understand around Gaia-X are "Data Spaces" and "Connectors".  

Data Spaces refer to secure and trusted environments where businesses, individuals, and public institutions can share and use data in a controlled and fair manner. They're like marketplaces for data, but with strict rules to ensure privacy, security, and data sovereignty (meaning individuals and companies retain control over their data).

A connector plays a crucial role in facilitating secure and controlled data exchange between different participants within the ecosystem. Think of it as a translator and bridge builder, helping diverse systems and providers communicate and share data seamlessly and safely. The Eclipse foundation has taken a lead on this and created the Eclipse Dataspace Component (EDC) initiative wherein many opensource projects have been created to build Gaia-X compliant connectors. 

These core concepts of Dataspaces and Connectors can also be used to build a modern data architecture in a federated decentralized manner. An excellent article on this approach is available on AWS here - https://aws.amazon.com/blogs/publicsector/enabling-data-sharing-through-data-spaces-aws/

An offshoot of Gaia-X is another initiative called Catena-X that aims to create a data ecosystem for the automotive industry. It aims to create a standardized way for car manufacturers, suppliers, dealers, software providers, etc. – to share information securely and efficiently through usage of standard data formats and procotols. The Eclipse Tractus-X™ project is the official open-source project in the Catena-X ecosystem under the umbrella of the Eclipse Foundation and has reference implementations of connectors to securely exchange data. 


But how do you ensure that the data is used only for the purpose that you allowed it to be used? Can you have legal rights and controls over how the data is used after you have shared it? This is the crux of the standards around Gaia-x/Catena X. 

At the heart lies the data usage contract, a legally binding agreement between data providers and consumers within the Catena-X ecosystem. This contract specifies the exact terms of data usage, including:

  • Who can access the data?: Defined by roles and permissions within the contract.
  • What data can be accessed?: Specific data sets or categories permitted.
  • How the data can be used?: Allowed purposes and restrictions on analysis, processing, or sharing.
  • Duration of access?: Timeframe for using the data.

Contracts establish a basic link between the policies and the data to be transferred; a transfer cannot occur without a contract. 

Because of the legal binding nature of this design, all users of data are required to abide by the usage policies just like they would with a handwritten contract. 

More details around data governance can be found in the official white paper --https://catena-x.net/fileadmin/_online_media_/231006_Whitepaper_DataSpaceGovernance.pdf

Besides contracts, every data access and usage event is logged on a distributed ledger, providing a transparent audit trail for accountability and dispute resolution. The connectors also enforce proper authentication/authorization through the Identity Provider and validate other policy rules. 

Thus Gaia-X/Catena-X enforce data usage policies through a combination of legal contracts, automated technical tools, independent verification, and a strong legal framework. This multi-layered approach ensures trust, transparency, and accountability within the data ecosystem, empowering data providers with control over their valuable information.

Tuesday, December 05, 2023

Generating synthetic data

 Faker is an excellent tool for generating mock data for your application. But any complex application would have tens of tables with complex relationships between them. How can we use Faker to populate all of these tables? 

We can follow two approaches here:

Option 1: Create the primary table first and then the dependent tables. Then when populating the dependent tables, you can refer to a random primary key from the first table. A good article summarizing this is here -- https://khofifah.medium.com/how-to-generate-fake-relational-data-in-python-and-getting-insight-using-sql-in-bigquery-985c5adc87d3

Code snippet: 
 #generate relational user id in account table and transaction table
trans['user_id']=random.choices(account["id"], k=len(trans))

Option 2: Use a ORM framework to insert data into the database. An ORM framework would make it easy to establish relationships between different tables. A good article on this approach is here - https://medium.com/@pushpam.ankit/generating-mock-data-for-complex-relational-tables-with-python-and-sqlite-2950ab7700f2

Another interesting opensource tool is "Synthetic Data Vault" https://sdv.dev/
In these tools, we first train the tool on real data and then use the AI model for generation of new synthetic data. Many vendors differentiate between "mock" and "synthetic" data on this aspect. 

Sunday, November 05, 2023

Fine-tuning vs RAG for LLMs

Large language models (LLMs) have revolutionized the field of natural language processing (NLP), enabling state-of-the-art performance on a wide range of tasks, including text classification, translation, summarization, and generation. 

When it comes to usecases around leveraing LLMs for extracting insights from our own knowledge repositories, we have broadly two design approaches:

  • Fine-tuning a LLM
  • RAG (Retrieval Augmented Generation)



Many fields have their own specialised terminology. This vocabulary may be missing from the common pretraining data utilised by LLMs. 

Fine Tuning a LLM
The process of fine-tuning a pre-trained LLM on a fresh domain-specific dataset is known as fine-tuning. Fine-tuning is the process of further training a previously trained LLM on a smaller, domain-specific, labelled dataset.
To fine-tune an LLM, you'll need a dataset of labelled data, with each data point representing an input and output pair. A written passage, a query, or a code snippet might be used as input. The result might be a label, a summary, a translation, or code.
Once you have a dataset, you can use a supervised learning method to fine-tune the LLM. By minimising a loss function, the algorithm will learn to map the input to the output.
It can be computationally costly to fine-tune an LLM.

Another subset of the above approach is called PEFT (Parameter-efficient fine-tuning) and LoRA is the most popular approach for PEFT today.  
LoRA (Low-Rank Adaptation of Large Language Models) is a fine-tuning approach for LLMs that is more efficient and memory-efficient than standard fine-tuning. Traditional fine-tuning entails altering all of an LLM's parameters. This can be computationally expensive and memory-intensive, particularly for big LLMs with billions of parameters.LoRA, on the other hand, merely modifies a few low-rank matrices. Because of this, LoRA is far more efficient and memory-efficient than conventional fine-tuning.

An excellent article explaining the concepts of full fine-tuning and LoRA is here -- https://deci.ai/blog/fine-tuning-peft-prompt-engineering-and-rag-which-one-is-right-for-you/

RAG (Retrieval Augmented Generation)
RAG is an effective strategy for improving the performance and relevance of LLMs by combining "prompt engineering" with "context retrieval" from external data sources.
Given below is a high level process flow for RAG. 
  1. All documents from a domain specific knoweldge source are converted into embeddings and stored in a special vector database. These vector embeddings are nothing but “N-dimensional matrices” of numbers.
  2. When the user types his query, even the query is converted into an embedding (matrix of numbers) using a AI model.
  3. Semantic search techniques are used to identify all contextual sentences in the “document-embedding” for the given query. Most popular algorithm is “cosine similarity”. This algorithm uses a cosine maths function to get all the sentences (matrices) that are ‘near’ or ‘close’ to the query (matrix). This entails matrix multiplications and other maths functions. 
  4. All retrieved “semantically similar sentences/paragraphs” from multiple documents are finally again sent to a LLM for ‘summarization’. The LLM would paraphrase all the disparate sentences into a coherent story that is readable. 
RAG along with Prompt Engineering can be used to build powerful knowlege management platforms such as this - https://www.youtube.com/watch?v=lndJ108DlBs

The table belows shows the advantages/disadvantages of both the approaches. For most usecases, a proper utilization of prompt engineering and RAG would suffice. 

Friday, November 03, 2023

Ruminating on Debezium CDC

Debezium is a distributed open source platform for change data capture (CDC). It collects real-time changes to database tables and transmits them to other applications. Debezium is developed on top of Apache Kafka, which provides a dependable and scalable streaming data infrastructure.

Debezium operates by connecting to a database and watching for table updates. When a change is identified, Debezium creates a Kafka event with the change's information. Other applications, such as data pipelines, microservices, and analytics systems, can then ingest these events.



There are several benefits of utilising Debezium CDC, including:

  • Debezium feeds updates to database tables in near real time, allowing other applications to react to changes almost quickly.
  • Debezium is built on Apache Kafka, which provides a dependable and scalable streaming data platform.
  • Debezium can stream updates to a number of databases, including MySQL, PostgreSQL, Oracle, and Cassandra using connectors. 
  • Debezium is simple to install and operate. It has connectors for major databases and may be deployed on a number of platforms, including Kubernetes/Docker.
Use cases for Debezium CDC:
  • Data pipelines and real-time analytics: Debezium can be used to create data pipelines that stream changes from databases to other data systems, such as data warehouses, data lakes, and analytics systems.  For example, you could use Debezium to stream changes from a MySQL database to Apache Spark Streaming. Apache Spark Streaming can then process the events and generate real-time analytics, such as dashboards and reports.

Wednesday, October 11, 2023

Mock data and APIs

Mocking APIs and synthetic mock data generation are invaluable techniques to speed up development. We recently used the Mockaroo platform and found it quite handy to generate dummy data and mock APIs. 

https://www.mockaroo.com/

IBM has also kindly released ~25M records of synthetic financial transacation data that can be used during application development or ML training.

https://github.com/IBM/TabFormer

Other examples of mock data generation tools are:

Leveraging Graph Databases for Fraud Detection

 There are many techniques for building Fraud Detection systems. It can be:

  • Rule Based (tribal knowledge codified)
  • Machine Learning (detect anomalies, patterns, etc.)
There is a third technique using Graph Databases such as Neo4J, TigerGraph or Amazon Neptune
A graph network can assist identify hidden aspects of transactions that would otherwise be missed just by looking at data in a relational table.

Lets consider the example of indenfying fraud in a simple financial transaction. Every financial transaction has thousands of attributes associated with is - e.g. amount, IP address, browser, OS, cookie data, bank, geo-location, card details, recepient,etc.
Using a graph database, we can build a graph network where each transaction is a node and the line connections (aka edges) represent the attributes of the transaction. The following article gives a good primer on how this kind of network would look like - https://towardsdatascience.com/fraud-through-the-eyes-of-a-machine-1dd994405e6e

Once the graph is created, there are many techniques that can be used to detect patterns and relationships between the different attributes. 
  • Link Analysis: This approach is used to detect unusual links between network items. In a financial network, for example, you may check for linkages between accounts engaged in fraudulent activities.
  • Anomaly detection: This approach is used to identify entities or transactions that differ from usual behaviour in a network. In a credit card network, for example, you may watch for transactions performed from strange areas or for abnormally big sums.
  • Cluster Analysis:  This technique is used to identify groups of entities in a network that are closely connected to each other. Clustering may also be used to surface commercial ties or social circles in a transaction banking graph.
Thus, by employing graph analytics, we may detect clusters and links in their data, revealing previously unknown possible fraud connections. More information on such techniques can be found on this blog: https://www.cylynx.io/blog/network-analytics-for-fraud-detection-in-banking-and-finance/

Because of their capacity to track complicated chains of transactions, graph databases are particularly useful in financial crime use cases and fraud detection graph analysis. Traditional RDBMS struggle with these sequence of connections because multiple recursive inner joins are necessary to accomplish this sort of traversal query in SQL, which is very challenging. 

A few articles that give good illustrations on this topic:

Friday, October 06, 2023

Defensive measures for LLM prompts

To prevent abusive prompts and prompt hacking, we need to leverage certain techniques such as Filtering, Post-Prompting, random enclosures, content moderation, etc.

A good explanation of these techniques is given here -- https://learnprompting.org/docs/category/-defensive-measures

Sunday, September 10, 2023

Ruminating on Clickjacking

Clickjacking is a sort of cyberattack in which people are tricked into clicking on something they did not plan to click on. This can be accomplished by superimposing a malicious frame on top of a legal website or injecting a malicious link within an apparently innocent piece of content.

When a user clicks on what appears to be a legitimate website or link, they are in fact clicking on a malicious frame or link. This can then redirect users to a bogus website or run malicious programmes on their PC.

Clickjacking attacks are sometimes difficult to detect because they frequently depend on social engineering tactics to deceive users. For example, the attacker may develop a phoney website that appears to be the actual one, or they could give the victim a link that appears to be from a valid source.

To protect yourself against clickjacking, make use of a pop-up blocker (default in Chrome and many modern browsers).  Any website that asks you to enable Flash or JavaScript should be avoided. Hover your cursor over a link before clicking on it if you are unsure whether it is authentic. If the URL of the link changes, it is most likely malicious.

If you are a developer, please check out the following links to what can be done in your code to reduce the risk of clickjacking. 

https://cheatsheetseries.owasp.org/cheatsheets/Clickjacking_Defense_Cheat_Sheet.html

Tuesday, August 15, 2023

Ruminating on Shadow Testing or Shadow Mirroring

Shadow testing is a software testing technique that involves sending production traffic to a duplicate or shadow environment. This allows testers to compare the behavior of the new feature in the shadow environment to the behavior of the old feature in the production environment. This can help to identify any potential problems with the new feature before it is released to all users.

The following diagram from the Microsoft GitHub site illustrates this concept.


The following blogs/articles explain this concept in good detail:

Monday, July 31, 2023

Ruminating on Differential Privacy

Differential privacy (DP) is a mathematical paradigm for protecting individuals' privacy in datasets. By allowing data to be analysed without disclosing sensitive information about any individual in the dataset, it can protects the privacy of individuals. Thus, it is a method of protecting the privacy of people in a dataset while maintaining the dataset's overall usefulness.

To protect privacy, the most easy option is anonymization, which removes identifying information. A person's name, for example, may be erased from a medical record. Unfortunately, anonymization is rarely enough to provide privacy because the remaining information might be uniquely identifiable. For example, given a person's gender, postal code, age, ethnicity, and height, it may be able to identify them uniquely even in a massive database.

The concept behind differential privacy is to introduce noise into the data in such a manner that it is hard to verify whether any specific individual's data was included in the dataset. This is accomplished by assigning a random value to each data point, which is chosen in such a manner that it has no effect on the overall statistics of the dataset but makes identifying individual data points more difficult.

The following paper by Apple gives a very good overview of how Apple uses Differential Privacy to gain insight into what many Apple users are doing, while helping to preserve the privacy of individual users - https://www.apple.com/privacy/docs/Differential_Privacy_Overview.pdf

Epsilon (ε) is a parameter in differential privacy that affects the amount of noise introduced to the data. A greater epsilon number adds more noise, which gives more privacy but affects the accuracy of the findings.

Here are some examples of epsilon values that might be used in different applications:

  • Healthcare: Epsilon might be set to a small value, such as 0.01, to ensure that the privacy of patients is protected.
  • Marketing: Epsilon might be set to a larger value, such as 1.0, to allow for more accurate results.
  • Government: Epsilon might be set to a very large value, such as 100.0, to allow for the analysis of large datasets without compromising the privacy of individuals.
Thus, the epsilon value chosen represents a trade-off between privacy and accuracy. The lower the epsilon number, the more private the data will be, but the findings will be less accurate. The greater the epsilon number, the more accurate the findings will be, but the data will be less private.
A deep dive into these techniques is illustrated in this paper - https://www.cis.upenn.edu/~aaroth/Papers/privacybook.pdf

Thursday, July 20, 2023

Ruminating on nip.io and Let's Encrypt

nip.io is a free, open-source service that allows you to use wildcard DNS for any IP address. This implies you may build a hostname that resolves to any IP address, no matter where it is. This may be beneficial for a number of things, including:

  • Testing local machine applications. When constructing a local application, you may utilise nip.io to provide it a hostname that can be accessed from anywhere. This makes it simpler to test and distribute the application with others. This service has been made free by a company called as Powerdns. Examples: 
    • 10.0.0.1.nip.io maps to 10.0.0.1
    • 192-168-1-250.nip.io maps to 192.168.1.250
    • 0a000803.nip.io maps to 10.0.8.3  (hexadecimal format)
  • Many online services expect a hostname and do not accept an IP address. In such cases, you can simple append *.nip.io at the end of the public IP address and get a OOTB domain name :)
  • Creae a SSL certificate using letsencrypt:  If you use the "dash" and "hexadecimal" notation of nip.io, then you can easily create a public SSL certificate using "Let's Encrypt" that would be honoured by all browsers. No need of struggling with self-signed certificates. 
ngrok is another great tool that should be in the arsenal of every developer. 

Monday, July 03, 2023

Ruminating on Observability

It is more critical than ever in today's complex and dispersed IT settings to have a complete grasp of how your systems are performing. This is where the concept of observability comes into play. The capacity to comprehend the condition of a system by gathering and analysing data from various sources is referred to as observability.

Observabilty has three critical pillars: 

  • Distributed Logging (using ELK, Splunk)
  • Metrics (performance instrumentation in code)
  • Tracing (E2E visibility across the tech stack)

Distributed Logging: Logs keep track of events that happen in a system. They may be used to discover problems, performance bottlenecks, and the flow of traffic through a system. In a modern scalable distributed architecture, we need logging frameworks that support collection and ingestion of logs across the complete tech stack. Platforms such as Splunk and ELK (Elastic, Logstash, Kibana) support this and are popular frameworks for distributed logging. 

Metrics (performance instrumentation in code): Metrics are numerical measures of a system's status. They may be used to monitor CPU use, memory consumption, and request latency, among other things. Some of the most popular frameworks for metrics are Micrometer , Prometheus and DropWizard Metrics

Tracing (E2E visibility across the tech stack): Traces are a record of a request's route through a system. They may be utilised to determine the core cause of performance issues and to comprehend how various system components interact with one another. A unique Trace-ID is used to corelate the request across all the components of the tech stack. 

Platforms such as Dynatrace, AppDynamics and DataDog provide comprehensive features to implement all aspects of Observability. 

The three observability pillars operate together to offer a complete picture of a system's behaviour. By collecting and analysing data from all three sources, you can acquire a thorough picture of how your systems operate and discover possible issues before they affect your consumers.

There are a number of benefits to implementing the three pillars of observability. These benefits include:

  • The ability to identify and troubleshoot problems faster
  • The ability to improve performance and reliability
  • The ability to make better decisions about system design and architecture

If you want to increase the observability of your systems, I recommend that you study more about the three pillars of observability and the many techniques to apply them. You can take your IT operations to the next level if you have a thorough grasp of observability.

Saturday, May 13, 2023

Ruminating on Prompt Engineering

There has been a lot of buzz in recent years about the potential of large language models (LLMs) to develop new text forms, translate languages, compose various types of creative material, and answer your queries in an instructive manner. However, one of the drawbacks of LLMs is that they may be quite unexpected. Even little changes to the prompt might provide drastically different outcomes. This is where quick engineering comes into play.

The technique of creating prompts that are clear, explicit, and instructive is known as prompt engineering. You may maximise your chances of receiving the desired outcome from your LLM by properly writing your questions.

Given below are some of the techniques you can use to create better prompts:

  • Be precise and concise: The more detailed your instruction, the more likely your LLM will get the intended result. Instead of asking, "Write me a poem," you may say, "Write me a poem about peace".
  • Use keywords: Keywords are words or phrases related to the intended outcome. If you want your LLM to write a blog article about generative AI, for example, you might add keywords like "prompt engineering," "LLMs," and "generative AI."
  • Provide context: Context is information that assists your LLM in comprehending the intended outcome. If you want your LLM to write a poetry about Spring, for example, you might add context by supplying a list of phrases around Spring.
  • Provide examples: Use examples to demonstrate to your LLM what you are looking for. For example, if you want your LLM to create poetry, you may present samples of poems you appreciate.
Andrew NG has created an online course to learn about prompt engineering here - https://www.deeplearning.ai/short-courses/chatgpt-prompt-engineering-for-developers/

In fact, the rise of LLMs has resulted in new job roles like "Prompt Engineer" as highlighted in the articles below: 

Monday, January 16, 2023

API mock servers from OpenAPI specs

 If you have an OpenAPI specs file (YAML or JSON), then you can quickly create a mock server using one of the following tools. 

A list of all other OpenAPI tools is given here: https://openapi.tools/

Saturday, November 19, 2022

Ruminating on the internals of K8

Today Kubernetes has become the defacto standard to deploy applications. To understand what happens behind the scenes when you fire "kubectl" commands, please have a look at this excellent tutorial series by VMWare - https://kube.academy/courses/the-kubernetes-machine

Some key components of the K8 ecosystem. The control plane consists of the API server, Scheduler, etcd and Controller Manager. 

  • kubectl: This is a command line tool that sends HTTP API requests to the K8 API server. The config parameters in your YAML file are actually converted to JSON and a POST request is made to the K8 control plane (API server).
  • etcd: etcd (pronounced et-see-dee) is an open source, distributed, consistent key-value store for shared configuration, service discovery, and scheduler coordination of distributed systems or machine clusters. Kubernetes stores all of its data in etcd, including configuration data, state, and metadata. Because Kubernetes is a distributed system, it requires a distributed data store such as etcd. etcd allows every node in the Kubernetes cluster to read and write data.
  • Scheduler: The kube-scheduler is the Kubernetes controller responsible for assigning pods to nodes in the cluster. We can give hints in the config for affinity/priority, but it is the Scheduler that decides where to create the pod based on memory/cpu requirements and other config params.
  • Controller Manager: A collection of 30+ different controllers - e.g. deployment controller, namespace controller, etc. A controller is a non-terminating control loop (daemon that runs forever) that regulates the state of the system - i.e. move the "existing state" to the "desired state" - e.g. creating/expanding a replica set for a pod template. 
  • Cloud Controller Manager: A K8 cluster has to run on some public/private cloud and hence has to integrate with the respective cloud APIs - to configure underlying storage/compute/network. The Cloud Controller Manager makes API calls to the Cloud Provider to provision these resources - e.g. configuring persistent storage/volume for your pods.  
  • kubelet: The kubelet is the "node agent" that runs on each node. It registers the node with the apiserver. it provides an interface between the Kubernetes control plane and the container runtime on each node in the cluster.  After a successful registration, the primary role of kubelet is to create pods and listen to the API server for instructions.
  • kube-proxy: The Kubernetes network proxy (aka kube-proxy) is a daemon running on each node. It monitors the changes of service and endpoint in the API server, and configures load balancing for the service through iptables. Kubernetes gives pods their own IP addresses and a single DNS name for a set of Pods, and can load-balance across them
Everything in K8 is configured using manifest files (YAML) and hence as users, we just need to use the kubectl command with the appropriate manifest files. Each YAML file represents a K8 object. A Kubernetes object is a "record of intent"--once you create the object, the Kubernetes system will constantly work to ensure that object exists. By creating an object, you're effectively telling the Kubernetes system what you want your cluster's workload to look like; this is your cluster's desired state - e.g. A "deployment" K8 object (with its YAML) provides declarative updates for Pods and ReplicaSets.

Wednesday, September 07, 2022

Ruminating on Hypothesis testing

The following two articles by Rebecca Bevans are an excellent introduction to the concept of Hypothesis testing and the types of statistical tests available:

Snippet from the article on the process of hypothesis testing:

Step 1: State your null and alternate hypothesis

Step 2: Collect data

Step 3: Perform a statistical test

Step 4: Decide whether to reject or "fail to reject" your null hypothesis

Step 5: Present your findings

Free Stats & Finance courses

The following site has an excellent collection of 20 free courses that I would highly recommend for folks who want to learn the basics of finance and fundamentals of maths/stats in finance. 

https://corporatefinanceinstitute.com/collections/?cost=free&card=117884

I really liked the following courses and helped me consolidate my understanding:

- Stats basics:  https://learn.corporatefinanceinstitute.com/courses/take/statistics-fundamentals

- Accounting basics: https://learn.corporatefinanceinstitute.com/courses/take/learn-accounting-fundamentals-corporate-finance

- How to read financial statements: https://learn.corporatefinanceinstitute.com/courses/take/learn-to-read-financial-statements-free-course/

- Data Science fundamentals - https://learn.corporatefinanceinstitute.com/courses/take/data-science-and-machine-learning/

Continuous, Discreet and Categorical variables

The following websites gives an excellent overview for beginners of the 3 different types of variables that we encounter in feature engineering (or even in basic stats):

https://study.com/academy/lesson/continuous-discrete-variables-definition-examples.html

https://www.scribbr.com/methodology/types-of-variables/

Snippets from the articles:

A discrete variable only allows a particular set of values, and in-between values are not included. If we are counting a number of things, that is a discrete value. A dice roll has a certain number of outcomes, and nothing else (we can roll a 4 or a 5, but not a 4.6). A continuous variable can be any value in a range. Usually, things that we are measuring are continuous variables, because it can be any value. The length of a car ride might be 2 hours, 2.5 hours, 2.555, and so on.

Categorical variables are descriptive and not numerical. So any way to describe something is a categorical variable. Hair color, gum flavor, dog breed, and cloud type are all categorical variables.

There are 2 types of categorical variables: Nominal categorical variables are not ordered. The order doesn't matter. Eye color is nominal, because there is no higher or lower eye color. There isn't a reason one is first or last.

Ordinal categorical variables do have an order. Education level is an ordinal variable, because they can be put in order. Note that there is not some exact difference between the levels of education, just that they can be put in order.

Wednesday, August 31, 2022

Ruminating on TMForum

The TM Forum (TMF) is an organisation of over 850 telecom firms working together to drive digital innovation. They created a standard known as TMF Open APIs, which provides a standard interface for the interchange of various telco data.

TM Forum’s Open APIs are JSON-based and follow the REST paradigm. They also share a common data model for Telecom.

Any CSP (Communications service provider) can accelerate their API journey by leveraging the TMForum API contracts. The link below gives some of the examples of the API standards available: 

https://projects.tmforum.org/wiki/display/API/Open+API+Table

Currently, there are around 60+ APIs defined in the Open API table of TMForum. 

Few examples of the APIs are as follows:

  • Customer Bill Management API: This API allows operations to find and retrieve one or several customer bills (also called invoices) produced for a customer.
  • Customer Management API: Provides a standardized mechanism for customer and customer account management, such as creation, update, retrieval, deletion and notification of events.
  • Digital Identity Management API: Provides the ability to manage a digital identity. This digital identity allows identification of an individual, a resource, or a party Role (a specific role - or set of roles - for a given individual).
  • Account Management API: Provides standardized mechanism for the management of billing and settlement accounts, as well as for financial accounting (account receivable) either in B2B or B2B2C contexts.
  • Geographic Address Management API: Provides a standardized client interface to an Address management system. It allows looking for worldwide addresses
  • Geographic Site Management API: Covers the operations to manage (create, read, delete) sites that can be associated with a customer, account, service delivery or other entities.
  • Payment Management API: The Payments API provides the standardized client interface to Payment Systems for notifying about performed payments or refunds.
  • Payment Method Management API: This API supports the frequently-used payment methods for the customer to choose and pay the usage, including voucher card, coupon, and money transfer.
  • Product Ordering Management API: Provides a standardized mechanism for placing a product order with all the necessary order parameters.
  • Promotion Management API: Used to provide the additional discount, voucher, bonus or gift to the customer who meets the pre-defined criteria.
  • Recommendation Management API: Recommendation API is used to recommend offering quickly based on the history and real-time context of a customer.
  • Resource Function Activation Management API: This API introduces Resource Function which is used to represent a Network Service as well as a Network Function.

The GitHub repository of TMForum is a great place to get acquainted with the APIs - https://github.com/tmforum-apis

Since the TMForum defines the data model in JSON format, any noSQL datastore that stores data as JSON documents becomes an easy option to quickly implement an API strategy. For example, TMF data model of the API can be persisted 1:1 in Mongo database without the need for additional mappings as shown here - https://www.mongodb.com/blog/post/why-telcos-implement-tm-forum-open-apis-mongodb

Monday, August 29, 2022

mAP (mean Average Precision) and IoU (Intersection over Union) for Object Detection

mAP (mean Average Precision) is a common metric used for evaluating the accuracy of object detection models. The mAP computes a score by comparing the ground-truth bounding box to the detected box. The higher the score, the more precise the model's detections.

The following articles give a good overview of the concepts of precision, recall, mAP, etc. 

https://jonathan-hui.medium.com/map-mean-average-precision-for-object-detection-45c121a31173

https://blog.paperspace.com/mean-average-precision/

https://blog.paperspace.com/deep-learning-metrics-precision-recall-accuracy/

https://www.narendranaidu.com/2022/01/confusion-matrix-for-classification.html

Some snippets from the above article:

"When a model has high recall but low precision, then the model classifies most of the positive samples correctly but it has many false positives (i.e. classifies many Negative samples as Positive). When a model has high precision but low recall, then the model is accurate when it classifies a sample as Positive but it may classify only some of the positive sample.

Higher the precision, the more confident the model is when it classifies a sample as Positive. The higher the recall, the more positive samples the model correctly classified as Positive.

As the recall increases, the precision decreases. The reason is that when the number of positive samples increases (high recall), the accuracy of classifying each sample correctly decreases (low precision). This is expected, as the model is more likely to fail when there are many samples.


The precision-recall curve makes it easy to decide the point where both the precision and recall are high. The f1 metric measures the balance between precision and recall. When the value of f1 is high, this means both the precision and recall are high. A lower f1 score means a greater imbalance between precision and recall.

The average precision (AP) is a way to summarize the precision-recall curve into a single value representing the average of all precisions. The AP is the weighted sum of precisions at each threshold where the weight is the increase in recall. 

The IoU is calculated by dividing the area of intersection between the 2 boxes by the area of their union. The higher the IoU, the better the prediction.


The mAP is calculated by finding Average Precision(AP) for each class and then average over a number of classes."

Thursday, August 25, 2022

Ruminating on Bloom's Taxonomy

I was trying to help my kids understand the importance of deeply understanding a concept, instead of just remembering facts. 

I found the knowledge pyramid of Bloom an excellent illustration to help my kids understand how to build skills and knowledge. The following article on Vanderbilt University site is a good read to understand the concepts - https://cft.vanderbilt.edu/guides-sub-pages/blooms-taxonomy/



Thursday, August 11, 2022

Handling Distributed Transactions in a microservices environment

 In a distributed microservices environment, we do not have complex 2-phase commit transaction managers. 

We need a simpler approach and we have design patterns to address this issue. A good article explaining this strategy is available here - https://docs.microsoft.com/en-us/azure/architecture/reference-architectures/saga/saga

The crux of the idea is to issue compensating transactions whenever there is a failure in one step of the flow. A good sample usecase for a banking scenario (with source code) is available here - https://github.com/Azure-Samples/saga-orchestration-serverless

Wednesday, August 10, 2022

API contract first design

 I have always been a great fan of the 'Contract-First' API design paradigm. The Swagger (OpenAPI) toolset makes it very simple to design new API contracts and object schemas. 

There is a Swagger Pet Store demo available here, wherein we can design API contracts using a simple YAML file: https://editor.swagger.io/ 

After the API contract has been designed and reviewed, we can quickly generate stubs and client code in 20 languages using Swagger Codegen

Wednesday, July 20, 2022

Ruminating on Data Lakehouse

 In my previous blog posts, we had discussed about Data Lakes and Snowflake architecture

Since Snowflake combines the abilities of a traditional data warehouse and a Data Lake, they also market themselves as a Data Lakehouse

Another competing opensource alternative that is headed by the company databricks is called Delta Lake. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing on top of existing data lakes.

A good comparison between Snowflake and databricks Delta Lake is available here: https://www.firebolt.io/blog/snowflake-vs-databricks-vs-firebolt

Enterprises who are embarking on their data platform modernization strategy can ask the following questions to arrive at a best fit choice:

  • Does the data platform have separation of compute from storage? This will enable it to scale horizontally as your data volumes and processing needs increase.
  • Does the data platform support cloud native storage? Most of the cloud native storage services from the hyperscalers (e.g. AWS S3, Google Big Query, Azure Data Lake) have been battle tested for scalability. 
  • What are the usecases you want to run on your data platform? - e.g. Reporting/BI, Streaming Data Analytics, Low latency dashboards, etc. 

Saturday, June 11, 2022

ServiceNow Gekkobrain & Celonis for SAP S4 HANA modernization

One of the fundamental challenges in migrating SAP ECC to S/4 HANA is the lack of knowledge on the plethora of customizations done on legacy SAP ECC. 

To address this challenge, there are two leading products in the market - ServiceNow Gekkobrain and Celonis.

Gekkobrain helps teams in getting a deeper understanding of custom code that was developed to extend the functionality of SAP. It analyses the custom code base and identifies relevant HANA issues and also try to fix the issue. 

Gekkobrain flows is a component that uses ML to understand and visualize the business process. It can understand main flows, sub flows and deviations and help in identifying automation opportunities. This capability of Gekkobrain flows is also branded as a Process Mining Tool. 

ServiceNow acquired Gekkobrain to integrate it with the NOW platform and leverage its workflow engine to automate workflows for customized SAP code (during HANA migrations). 

Another popular tool that is used for during SAP modernization is the popular process mining tool called as Celonis. Celonis is used for retrieving, visualizing and analyzing business processes from transactional data stored by the SAP ERP systems.