In this blog post, I am going to show how we can use an LLM with the RAG (Retrieval-Augmented Generation) technique to answer questions from our documents, files, web pages, and other data sitting in a private network.
With the RAG architecture, the LLM does not need to be trained on your data; instead, your data is provided to the LLM as additional context along with the question it is asked to answer.
Here is the RAG architecture at a high level: the question is embedded, the most relevant documents are retrieved from a vector store, and both the question and the retrieved context are passed to the LLM to generate the answer.
Here is a simple method to interact with an OpenAI LLM model; before calling it, you need to define your OpenAI API key:
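A minimal sketch looks something like this (assuming the openai Python package, v1.x, with the key set in the OPENAI_API_KEY environment variable; the ask_llm helper name is mine):

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

def ask_llm(prompt: str, model: str = "gpt-3.5-turbo") -> str:
    """Send a single prompt to the chat model and return its text response."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```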
Let's start with a simple prompt without context:
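For example (the question below is illustrative, reusing the ask_llm helper from above):

```python
# Ask about private data the model has never seen, with no context supplied.
print(ask_llm("What services does Musoft Tech provide?"))
# The model typically replies that it has no information about this company.
```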
As we can see, the LLM can't answer if you ask a specific question about data it has never seen.
Now let's pass some context along with the prompt to the LLM and see the response:
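A sketch of the same question with context supplied (the context text here is illustrative, not taken from my actual data):

```python
# Provide the relevant facts as context, then ask the same question.
context = (
    "Musoft Tech is an Australian software consultancy that builds "
    "custom web, cloud, and data solutions for small businesses."
)
prompt = (
    "Answer the question using only the context below.\n\n"
    f"Context: {context}\n\n"
    "Question: What services does Musoft Tech provide?"
)
print(ask_llm(prompt))
```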
We can see the LLM answered it correctly this time.
Let's see how we can introduce a database that stores all your private/intranet data, i.e. the data the LLM does not know about because it was not used during training.
To find the right context for a given question, we will be using a vector DB, which stores the data as embedding vectors. Please read more at Word embedding - Wikipedia.
Let's use a simple example to understand how we can find the relevant context. I am using OpenAI embeddings here. Assume we have two different pieces of context and one question, as shown below:
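Something like the following sketch (the sentences are illustrative, and the code reuses the openai client from the helper above):

```python
# Embed two context sentences and one question with an OpenAI embedding model.
texts = [
    "Musoft Tech is a software consultancy based in Australia.",       # context 1
    "The Great Barrier Reef is the largest coral reef in the world.",  # context 2
    "What does Musoft Tech do?",                                       # question
]

def embed(text: str) -> list[float]:
    """Return the embedding vector for a piece of text."""
    result = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return result.data[0].embedding

context1_vec, context2_vec, question_vec = (embed(t) for t in texts)
print(len(question_vec))  # each text becomes a long list of numbers (1536 for ada-002)
```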
The above code snippet shows how each sentence is converted to a numeric vector, i.e. an embedding vector. We now need to find which context is most similar to the question. We can do this with cosine similarity, using the code below:
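A small sketch using numpy (the helper function is mine):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; closer to 1 means more similar."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print("question vs context 1:", cosine_similarity(question_vec, context1_vec))
print("question vs context 2:", cosine_similarity(question_vec, context2_vec))
```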
We can see the question has a higher similarity score with context 1 than with context 2, so context 1 is the right context to pass to the LLM.
I have built a simple chatbot using LangChain, Chroma (an open-source vector DB), OpenAI embeddings, and the gpt-3.5-turbo LLM to ask any question related to my website (https://www.musofttech.com.au).
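Roughly, the chatbot wires together like this (a sketch against the classic LangChain API, pre-0.1.0; newer versions move these imports into langchain_community and langchain_openai):

```python
from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# 1. Load the website content and split it into chunks.
docs = WebBaseLoader("https://www.musofttech.com.au").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_documents(docs)

# 2. Embed the chunks and store them in a local Chroma collection.
vectordb = Chroma.from_documents(chunks, OpenAIEmbeddings())

# 3. Build a QA chain: retrieve the most similar chunks, pass them as
#    context to gpt-3.5-turbo, and return the answer.
qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0),
    retriever=vectordb.as_retriever(),
)
print(qa.run("What services does Musoft Tech provide?"))
```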
Here is the output for a few questions I have asked:
Now I am asking a question for which the website has no information or context, to see how the LLM responds:
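For example (an illustrative question, reusing the qa chain sketched above):

```python
# Ask something the website content does not cover.
print(qa.run("Who won the 2022 FIFA World Cup?"))
# With no answer in the retrieved context, the model should say it
# does not know rather than invent one.
```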
One last thing: you can use an LLM to ask about your intranet/private data, but since the context has to be passed to the LLM, make sure you are not sending any information that is not supposed to go beyond your network. A better option is to deploy the LLM within your private network, so it is safe to use.
Thanks for reading the blog, and let me know if you want to achieve something similar for yourself or your customers.