LLMs have brought significant advancements in natural language processing (NLP) and generation. They are trained on massive corpora of information, but they have no context about your needs or private data. Thankfully, there is a way to provide LLMs with context and information relevant to your task: using a vector store, LLMs can be given new information that was not part of their original training. In this article, we will discuss how to create a vector store using Langchain (an open-source framework that accelerates LLM application development) and ChromaDB (an AI-native open-source embedding database). We will cover:
- Langchain loaders
- Loading different types of documents into Chroma
- Querying Chroma for similar information
- Summary and next steps
Langchain document loaders
Langchain provides an abstraction layer and framework for working with a variety of LLMs. It offers a standard feature set that lets you swap LLM models without impacting the application code. The idea is not to focus on the specific model being used, but to build applications around models that remain interchangeable.
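For instance, a minimal sketch of this interchangeability (assuming a locally running Ollama server; the model name here is an arbitrary choice):
from langchain_community.llms import Ollama

# Swapping the model is a one-line change; the surrounding
# application code does not need to know which LLM it talks to.
llm = Ollama(model="llama2")
print(llm.invoke("What is a vector store?"))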
To load different types of documents, Langchain provides a wide variety of loaders covering CSV, HTML, JSON, Excel, PDF, and more. In this article, we will cover three of the most commonly used loaders:
- HTML Loader with Beautiful Soup 4 parser
- CSV Loader
- PDF Loader with PyPDF
One of the most common sources of data for LLMs and RAG applications is web data, which is usually structured as HTML. With this loader, we can load the information from HTML pages as documents. The parser uses Beautiful Soup 4 to scrape the web data and organise it by its tags.
from langchain_community.document_loaders import WebBaseLoader

web_doc = WebBaseLoader(
    "https://docs.getdbt.com/reference/resource-configs/bigquery-configs",
    default_parser="html.parser",
)
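Calling load() on the loader returns a list of Langchain Document objects, each carrying the extracted text in page_content and source information in metadata. A quick sketch to inspect the result (variable names are illustrative):
web_pages = web_doc.load()
print(len(web_pages))                    # number of documents extracted
print(web_pages[0].page_content[:200])   # first 200 characters of text
print(web_pages[0].metadata)             # includes the source URL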
Another common source of data is structured data in the form of CSV files. With this loader, we can configure delimiters, quoting, newline characters, etc.
from langchain_community.document_loaders import CSVLoader

csv_doc = CSVLoader(
    "../dbt_command_flags.csv",
    csv_args={
        "delimiter": ",",
        "quotechar": '"',
        "fieldnames": [
            "Flag name", "Type", "Default", "Supported in project?",
            "Environment variable", "Command line option",
            "Supported in Cloud CLI?"
        ]
    }
)
PDFs are also a very common source of information for LLMs and RAG applications, providing more context for the specific use case. PyPDF is a PDF reader that parses PDF files into text, which is then loaded in the Langchain document format.
from langchain_community.document_loaders import PyPDFLoader

pdf_doc = PyPDFLoader(
    "../dbt_adapter_reference.pdf"
)
In most use cases, data comes from many different sources and formats, and Langchain provides a way to combine all of these loaders into a single loader to streamline the retrieval process. The merged loader exposes all of the standard loading methods available on most loaders.
from langchain_community.document_loaders.merge import MergedDataLoader

loader_all = MergedDataLoader(loaders=[web_doc, pdf_doc, csv_doc])
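Calling load() on the merged loader returns the documents from all three sources as a single collection, which we will feed into the splitter in the next step:
# One list of Documents drawn from the web page, the CSV, and the PDF
doc_collection = loader_all.load()
print(len(doc_collection))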
Loading documents into ChromaDB
Now that we have a collection of documents, we can vectorise them and load them into a vector store of choice. For the sake of simplicity, we will use ChromaDB. ChromaDB can be self-hosted and can also persist the vector store to local disk. This allows you to create an end-to-end application without relying on paid services and models. Stay tuned for more vector stores.
To vectorise documents, we first need to embed them and then load them into the store. For this purpose, we will use Ollama Embeddings https://github.com/ollama/ollama (Ollama also provides a layer to serve your own open-source LLM models). This process will split the documents, embed them, and load them into ChromaDB. We could also add a step that lets Chroma persist the database as files on the machine, which can then be served and reused by other applications.
from langchain_community.embeddings import OllamaEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma

# Embedding function backed by the locally running Ollama server
embeddings = OllamaEmbeddings()

# Split the loaded documents into smaller chunks before embedding
text_splitter = RecursiveCharacterTextSplitter()
documents = text_splitter.split_documents(doc_collection)

# Embed the chunks and load them into an in-memory Chroma instance
vector_store = Chroma.from_documents(documents, embeddings)
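As mentioned above, Chroma can also persist the store to disk so other applications can serve and reuse it. A minimal sketch of this step (the persist_directory path is an arbitrary choice):
# Write the embedded collection to local disk instead of keeping it in memory
persisted_store = Chroma.from_documents(
    documents,
    embeddings,
    persist_directory="./chroma_db",
)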
Querying Chroma
Once documents are loaded into the database, we can query it for information similar to a search query using similarity_search. This might seem like a simple operation, which raises a question: why not store the vectors in any database and write a script to compute L2 distance? As the number of documents grows, the vector data becomes massive, query performance degrades, and optimising a generic database for this purpose is very complex. This is where databases like Chroma come into play: they are designed for vectors and optimised for exactly this purpose. Chroma provides a number of methods:
- .add
- .get
- .update
- .upsert
- .delete
- .peek
- .query (runs the similarity search)
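Through the Langchain wrapper, this search is available as similarity_search; if you also want to see how close each match is, similarity_search_with_score returns (document, score) pairs, where a lower score means a closer match. A small sketch (the query text is just an example):
# k limits the number of matches returned
results = vector_store.similarity_search_with_score("What is adapter_macro?", k=2)
for doc, score in results:
    print(score, doc.page_content[:100])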
We can now query the database to fetch answers to our questions. Remember, this is not a RAG pipeline or an LLM that will give you an accurate answer; it simply finds similar text. Stay tuned to learn how to integrate a vector database with LLMs for a task-specific model.
In this case, we can design three separate queries to verify that the database fetches data from the desired document.
## Question whose answer will be available only in the CSV document
query = "What is the default value for dbt full_refresh?"
docs = vector_store.similarity_search(query)
print(docs[0].page_content)
## Question whose answer will be available only in the Web document
query = "What is the merge behavior in dbt BigQuery?"
docs = vector_store.similarity_search(query)
print(docs[0].page_content)
## Question whose answer will be available only in the PDF document
query = "What is adapter_macro?"
docs = vector_store.similarity_search(query)
print(docs[0].page_content)
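To confirm that each answer really came from the intended source, we can inspect the metadata of the top match, which records where the matching chunk was loaded from:
# The source metadata shows which file or URL the matching chunk came from
print(docs[0].metadata.get("source"))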
Summary
In this article, we discussed how to load documents into a vector store, which can then be used by custom applications that need specific knowledge. This scenario takes only three documents and stores them in an in-memory DB, but it can be configured to store a massive amount of task-specific information. To learn about other vector stores like BigQuery, Postgres, Databricks, etc., check out Langchain’s documentation https://python.langchain.com/v0.1/docs/integrations/vectorstores/. Stay tuned for ways to integrate vector stores into different applications.