https://www.aicatalyststudio.co.za/

AI Catalyst Studio, 136 Plover Avenue, KGE, Fourways (2026)

13/10/2024

Describe one approach to enable serving multiple LoRA models at once, which can support requests pointing to multiple LoRA models, while maintaining latency comparable to a request pointing to a single model?

1. First, combine the LoRA weights for every shared base layer, from all LoRA models, into a series of stacked tensors, one per base layer.
2. When handling a batched request pointing to multiple LoRA models, define a batched routing mask that assigns a weight of 1 to the indices of the target LoRA models’ weights in the stacked LoRA matrices, and 0 to the rest.
3. When deploying LoRA models at scale for online inference, serving them with non-merged weights can significantly reduce the cost of each additional LoRA model. A single base model pairs dynamically with multiple LoRA delta weights, so multiple models can be served efficiently on the same set of GPUs over a shared endpoint.
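
The stacking and routing-mask steps above can be sketched for a single linear layer with NumPy. This is a minimal illustration, not a production kernel: the shapes, the random weights and the function name multi_lora_forward are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
num_adapters, d_in, d_out, rank = 3, 8, 6, 2  # hypothetical sizes

W = rng.normal(size=(d_in, d_out))               # shared base weight
A = rng.normal(size=(num_adapters, d_in, rank))  # stacked LoRA down-projections
B = rng.normal(size=(num_adapters, rank, d_out)) # stacked LoRA up-projections


def multi_lora_forward(x, adapter_ids):
    """x: (batch, d_in); adapter_ids: (batch,) index of the LoRA model per request."""
    # Routing mask: 1 at the target adapter's index, 0 everywhere else.
    mask = np.eye(num_adapters)[adapter_ids]        # (batch, num_adapters)
    base = x @ W                                    # shared base computation
    low = np.einsum("bi,nir->bnr", x, A)            # project through every A
    delta_all = np.einsum("bnr,nro->bno", low, B)   # delta from every adapter
    delta = np.einsum("bn,bno->bo", mask, delta_all)  # keep only the routed delta
    return base + delta


x = rng.normal(size=(4, d_in))
adapter_ids = np.array([0, 2, 1, 2])  # each request targets a different adapter
out = multi_lora_forward(x, adapter_ids)
```

Each row of out matches what the same request would get from the base weight plus its own adapter's delta. Note that this sketch computes every adapter's delta and then masks it, which is simple but wasteful; a real serving system would more likely gather only the rows of the stacked tensors that the batch needs.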

02/09/2024

http://aicatalyststudio.co.za/

AI Catalyst Studio: Innovating your business with tailored AI solutions driven by engineering expertise. Start your AI journey today.

18/08/2024

https://blogs.lse.ac.uk/businessreview/2023/09/14/when-new-technologies-invalidate-management-theories/

In management, as in other disciplines, new technologies sometimes force us to reconsider the conclusions of established theoretical models. Frédéric Fréry writes that this was the case for digital…

11/07/2024

The Bayesian Trap
Let’s assume that you have tested positive for a rare disease that affects 0.1% of the population.
You have been told by the doctor that the test will correctly identify 99% of people who have the disease and incorrectly flag 1% of people who don’t have the disease.
Normally you would assume that you have a 99% chance of actually having the disease, because that is the accuracy of the test.
But if we apply Bayes’ theorem:
- P(H|E) is the probability of the hypothesis that you have the disease, given that you tested positive. This is what we want to find.
- P(H) is the prior probability that the hypothesis is true (how likely you were to have the disease before testing positive). It is extremely difficult to determine, but a good starting point is the frequency of the disease in the population (0.1%).
- P(E|H) is the probability of the event given that the hypothesis is true, i.e. the probability that you would test positive if you had the disease (99%).

The total probability of the event occurring, P(E), i.e. the probability of testing positive at all
= The probability of having the disease P(H) and correctly testing positive P(E|H)
+
The probability of not having the disease P(-H) and incorrectly testing positive P(E|-H)

P(H|E) = [P(H) x P(E|H)] / [P(H) x P(E|H) + P(-H) x P(E|-H)]
= [0.001 x 0.99] / [(0.001 x 0.99) + (0.999 x 0.01)]
= 0.00099 / 0.01098
= 0.09

This means that you really have only about a 9% chance of actually having the disease if you test positive.
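
The same arithmetic can be checked in a few lines of Python. The function name posterior and its parameter names are chosen here for illustration only; the numbers are the ones from the example above.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(H|E): probability of having the disease given a positive test."""
    true_positive = prior * sensitivity                  # P(H) x P(E|H)
    false_positive = (1 - prior) * false_positive_rate   # P(-H) x P(E|-H)
    return true_positive / (true_positive + false_positive)


p = posterior(prior=0.001, sensitivity=0.99, false_positive_rate=0.01)
print(f"{p:.4f}")  # 0.0902, i.e. about 9%
```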

Think about that 🤔

17/06/2024

Insurance fraud algorithms

A popular form of machine learning applied in the insurance industry is deep anomaly detection. Anomaly detection works by analyzing normal, genuine claims made by customers and forming a model of what a typical claim looks like. This model is then applied to large data sets to flag claims that deviate from it.

Scenario 1: The dataset has a sufficient number of fraud examples.

In this case, classic machine learning or statistics-based techniques are applied to detect fraudulent attacks. This involves training a machine learning model or employing adequate algorithms to estimate transaction legitimacy. We’ll go through the most commonly used algorithms below.

Scenario 2: The dataset has no (or very few) fraud examples.

If no information on previous fraudulent transactions was stored, the learning model is built from examples of legitimate transactions only.
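
As a rough sketch of Scenario 2, a model can be fitted on legitimate examples only and then used to flag anything that sits far from them. The example below uses a simple z-score on a single feature; the claim amounts, the threshold of 3 standard deviations and the function names are all hypothetical, and real systems would use many features and a richer model.

```python
from statistics import mean, stdev


def fit(legit_amounts):
    """Model of a 'typical' claim: mean and standard deviation of legitimate amounts."""
    return mean(legit_amounts), stdev(legit_amounts)


def is_anomalous(amount, model, threshold=3.0):
    """Flag a claim whose amount is more than `threshold` deviations from the mean."""
    mu, sigma = model
    return abs(amount - mu) / sigma > threshold


# Hypothetical legitimate claim amounts; no fraud examples are needed for training.
legit = [120, 135, 110, 150, 128, 142, 131, 125]
model = fit(legit)

print(is_anomalous(138, model))  # a claim close to the typical range
print(is_anomalous(900, model))  # a claim far outside it
```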

Random Forest, or random decision forests. This algorithm builds an ensemble of decision trees and is robust to missing data, noise, outliers and errors. It is fast to train and score and, as a consequence, has become one of the preferred algorithms among fraud detection professionals.

Artificial Neural Networks (ANN). This system loosely simulates the function of the brain to perform tasks by learning from the past, extracting rules and predicting future activity based on the current situation. It can predict whether a transaction is fraudulent or not by classifying an input into predefined groups.

Support Vector Machines (SVMs). It’s an excellent prediction tool that can resolve a wide range of learning problems, such as handwritten digit recognition, classification of web pages and face detection. This method is capable of detecting fraudulent activity at the time of transaction.

K-Nearest Neighbors (KNN). Also known as a “lazy learning” algorithm: instead of building a model once the data is introduced, it just stores the data and defers the calculations until classification. The KNN algorithm rests on feature similarity, measured as the proximity between data points. In the simplest case (k = 1), when the nearest neighbor is fraudulent, the transaction is classified as fraudulent, and when the nearest neighbor is legal, it is classified as legal. For larger k, the majority label among the k nearest neighbors decides.
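
A minimal KNN classifier fits in a few lines, which shows why it is called lazy: there is no training step, only stored examples and a distance computation at query time. The two features used here (transaction amount, number of transactions in the last hour) and the data points are invented for illustration.

```python
from collections import Counter
from math import dist


def knn_classify(query, examples, k=3):
    """Return the majority label among the k stored examples closest to `query`."""
    nearest = sorted(examples, key=lambda ex: dist(query, ex[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical stored transactions: ((amount, transactions in the last hour), label)
examples = [
    ((20.0, 1.0), "legal"),
    ((25.0, 2.0), "legal"),
    ((30.0, 1.0), "legal"),
    ((900.0, 14.0), "fraudulent"),
    ((950.0, 12.0), "fraudulent"),
    ((870.0, 15.0), "fraudulent"),
]

print(knn_classify((880.0, 13.0), examples))    # fraudulent
print(knn_classify((22.0, 1.0), examples, k=1)) # legal
```

In practice the features would usually be scaled first, since an unscaled amount dominates the Euclidean distance.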

Logistic Regression is a prediction algorithm that machine learning borrowed from the field of statistics. It's widely used for credit card fraud detection and credit scoring.

17/06/2024

There is a lot more information contained in the different documents we use on a daily basis beyond just text data.

For example, GPT-4, Gemini, and Claude are multimodal LLMs that can ingest images as well as text. The images are passed through a Vision Transformer, resulting in visual tokens. The visual tokens are then passed through a projection layer that specializes in aligning visual tokens with text tokens. The visual and text tokens are then provided to the LLM, which cannot tell the different data modes apart.

In the context of RAG, an LLM plays a role at indexing time, where it generates a vector representation of the data to index it in a vector database. It is also used at retrieval time, where it uses the retrieved documents to provide an answer to a user question. A multimodal LLM can generate embedding representations of images and text and answer questions using those same data types. If we want to answer questions that involve information in different data modes, using a multimodal LLM at indexing and retrieval time is the best option.

If you want to build your RAG pipeline using API providers like OpenAI, you can use GPT-4 for question-answering with multimodal prompts. Even if a multimodal model is available for text generation, it might not be available for embedding generation. What remains is creating embeddings for images. This can be achieved by prompting a multimodal LLM to describe in text the images we need to index. We can then index the images using the text descriptions and their vector representations.

The complexity of generating a text description of an image is not the same as answering questions using a large context of different data types. With a small multimodal LLM, we might get satisfactory results in describing images but subpar results in answering questions. For example, it is pretty simple to build an image description pipeline with LLaVA models and llama.cpp as the LLM backbone. Those descriptions can be used for indexing as well as for answering questions that may involve those images. The LLM answering questions would use the text descriptions of the images instead of the images themselves. Today, that might be the simplest option for building a multimodal RAG pipeline. It might not be as performant, but the technology is improving very fast!
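
The description-based indexing idea can be sketched without any model at all. In the toy example below, the image descriptions are hand-written stand-ins for what a multimodal LLM would generate, and a bag-of-words vector stands in for a real text embedding; the file names and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def embed(text):
    """Toy stand-in for a text embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Stand-ins for descriptions a multimodal LLM would produce for each image.
image_descriptions = {
    "chart.png": "bar chart of quarterly revenue by region",
    "diagram.png": "architecture diagram of a retrieval pipeline with a vector database",
    "photo.png": "photo of a delivery truck parked outside a warehouse",
}

# Index each image by the vector of its text description.
index = {name: embed(desc) for name, desc in image_descriptions.items()}


def retrieve(question, top_k=1):
    """Return the images whose descriptions are most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda name: cosine(q, index[name]), reverse=True)
    return ranked[:top_k]


print(retrieve("which region had the highest quarterly revenue"))  # ['chart.png']
```

The retrieved descriptions, rather than the images, would then be placed in the prompt of the LLM that answers the question.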

17/06/2024

Quantizing a model is not enough. Quantizing won't help you much in saving memory to train a model!

In the typical backpropagation algorithm, the model weights and the input tensors are stored in memory in Float16 or BFloat16. During the forward pass, we only need those. During the backward pass, we create the gradients, again in Float16 or BFloat16.

Once we have the gradients, we can update the model parameters. During the optimization steps, all the operations are done in Float32! If we consider the Adam optimizer, for example, we need to convert the gradients from Float16 to Float32. With the gradients, we compute the momentum and the variance, which need to be stored in memory in Float32. From the momentum and variance, we can compute the updated weight values in Float32 as well. We then convert the model weights back from Float32 to Float16 for the next backpropagation iteration.

So, in memory, during the optimization step, we need:

- The model parameters in Float16
- The gradients in Float16
- The gradients in Float32
- The momentum in Float32
- The variance in Float32
- The model parameters in Float32

Because Float32 takes twice as much memory as Float16, the four Float32 tensors (gradients, momentum, variance and parameters) together require 8X as much memory as the Float16 model parameters themselves.
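
To make the 8X figure concrete, here is the arithmetic for the six tensors listed above. The 7-billion-parameter model size is a hypothetical example, and activations and input tensors are deliberately left out of the count.

```python
BYTES_PER_VALUE = {"float16": 2, "float32": 4}

# The six tensors held in memory during the optimization step.
TENSORS = [
    ("parameters", "float16"),
    ("gradients", "float16"),
    ("gradients", "float32"),
    ("momentum", "float32"),
    ("variance", "float32"),
    ("parameters", "float32"),
]


def training_memory_gb(num_params):
    """Memory per tensor in GB (1 GB = 1e9 bytes) for a model with num_params weights."""
    return {
        f"{name} ({dtype})": num_params * BYTES_PER_VALUE[dtype] / 1e9
        for name, dtype in TENSORS
    }


mem = training_memory_gb(7e9)  # hypothetical 7B-parameter model
fp32_total = sum(gb for key, gb in mem.items() if "float32" in key)
ratio = fp32_total / mem["parameters (float16)"]

print(mem["parameters (float16)"])  # 14.0 GB for the Float16 weights
print(fp32_total)                   # 112.0 GB for the four Float32 tensors
print(ratio)                        # 8.0
```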

Even when we quantize the model parameters, the optimizer's memory requirements stay the same. Let's say we quantize to 4-bit floating-point numbers. During the forward pass, the input tensors still come in BFloat16 precision, so we need to dequantize the model parameters to perform the different computations. The same problem occurs during the backward pass, where we again need to dequantize the model parameters. And the optimizer computations still happen in Float32 precision, by converting the dequantized weights and gradients to Float32.

QLoRA is a good solution in that situation because the gradient updates only happen on the LoRA adapters, which minimizes the optimizer memory spike, and the optimizer updates are paged by buffering the optimizer state to the CPU RAM if need be!

16/06/2024

https://blog.aspiresys.com/robotic-process-automation-rpa/ethics-of-artificial-intelligence-can-a-machine-learn-morality/

As UNESCO says, AI must be for the greater interest of the people, not the other way around. From driverless cars to AlphaFold, (a novel…
