Matt's Little World
2024

Understand LLM, NLP, and RAG in 7 Minutes: From Basic Concepts to a Working Example

A 7-minute introduction to LLMs (large language models), NLP (natural language processing), and RAG (retrieval-augmented generation), plus a Python + Ollama + ChromaDB RAG example showing how embeddings and a vector database can improve AI responses.

Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) has become one of the key techniques in AI in recent years.
Difficulty: ★★☆☆☆

🌟 This article covers:

  • What RAG is and where it is used
  • How LLMs (large language models) work at a basic level
  • Core concepts of NLP (natural language processing)
  • A hands-on example with Python + Ollama + ChromaDB

AI x ML x NLP
Source: inwedo - ML, NLP, LLM, and Deep Learning Explained: Exploring the Business Potential of AI


❓ What is an LLM (Large Language Model)?

An LLM is a generative model that plays "word completion": it produces text one token at a time.

  • Each token in the response is chosen by a probability computation.
  • For example, when generating「台灣」(Taiwan), the model first finds that「台」has the highest probability, then continues with「灣」.

Example: how an LLM generates text

First token: candidate probabilities
我: 10%
台: 80%
灣: 15%
...
是: 55%
玉: 45%
山: 25%

👉 LLM output: 台

Second token: candidate probabilities
我: 10%
台: 25%
灣: 70%
...
是: 30%
玉: 45%
山: 35%

👉 LLM output so far: 台灣
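The two-step selection above can be sketched as greedy decoding over toy next-token distributions. The weights below are the illustrative percentages from the example, not output from a real model (a real LLM computes them with a neural network):

```python
# Toy next-token distributions mirroring the two steps above.
# The weights are illustrative and not normalized.
step1 = {"我": 0.10, "台": 0.80, "灣": 0.15, "是": 0.55, "玉": 0.45, "山": 0.25}
step2 = {"我": 0.10, "台": 0.25, "灣": 0.70, "是": 0.30, "玉": 0.45, "山": 0.35}

def pick_next(dist):
    # Greedy decoding: always choose the highest-probability token.
    return max(dist, key=dist.get)

text = pick_next(step1)   # "台"
text += pick_next(step2)  # "台灣"
print(text)               # 台灣
```

Real models usually sample from the distribution (with temperature) instead of always taking the maximum, which is why the same prompt can yield different answers.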

Limitations of LLMs

  • They can produce "hallucinations": fluent but incorrect output.
  • They have no knowledge boundary; that is, they do not know what they do not know.

❓ What is NLP (Natural Language Processing)?

NLP (Natural Language Processing)

  • Uses machine learning (ML) to train models that process natural language.
  • The output of a task-specific NLP model is typically easier to control than that of a general-purpose ML model.
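As a minimal illustration of "processing natural language", the classic first step in an NLP pipeline is turning raw text into numeric features, for example a bag-of-words count. This is a hypothetical toy example, not part of the article's demo:

```python
from collections import Counter

def bag_of_words(text):
    # Tokenize by whitespace and count word frequencies.
    # Real NLP pipelines add normalization, subword tokenization, etc.
    return Counter(text.lower().split())

features = bag_of_words("Llamas are vegetarians and llamas are friendly")
print(features["llamas"])  # 2
```

Counts like these feed simple ML classifiers; modern NLP replaces them with learned embeddings, which is exactly what the RAG demo below relies on.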

❓ What is RAG (Retrieval-Augmented Generation)?

RAG = a plug-in knowledge system for LLMs

  • It lets an LLM "read an external database", so it can answer questions beyond its original training data.
  • It is far cheaper than retraining, and flexibly integrates existing knowledge.

Basic RAG Structure

Core concepts

  1. Knowledge ingestion (embeddings): convert documents into vectors and store them in a database.
  2. Retrieval and generation: Query → Embedding → similarity search → Prompt → answer.
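Step 2 hinges on similarity search: the query embedding is compared against every stored document embedding, and the closest one wins. With toy 3-dimensional vectors (real embedding models such as mxbai-embed-large produce vectors with hundreds of dimensions), cosine similarity can be sketched as:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity = dot(a, b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical document embeddings, for illustration only.
doc_vectors = {
    "doc about llamas": [0.9, 0.1, 0.0],
    "doc about camels": [0.7, 0.6, 0.1],
    "doc about Python": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Retrieve the most similar document, as a vector database would.
best = max(doc_vectors, key=lambda k: cosine_similarity(query_vector, doc_vectors[k]))
print(best)  # doc about llamas
```

A vector database like ChromaDB does essentially this, but with indexing structures that make the search fast over millions of vectors.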

🌟 RAG Demo (Ollama + ChromaDB)

👉 Embed the documents and store them in ChromaDB

import ollama
import chromadb

documents = [
  "Llamas are members of the camelid family meaning they're pretty closely related to vicuñas and camels",
  "Llamas were first domesticated and used as pack animals 4,000 to 5,000 years ago in the Peruvian highlands",
  "Llamas can grow as much as 6 feet tall though the average llama between 5 feet 6 inches and 5 feet 9 inches tall",
  "Llamas weigh between 280 and 450 pounds and can carry 25 to 30 percent of their body weight",
  "Llamas are vegetarians and have very efficient digestive systems",
  "Llamas live to be about 20 years old, though some only live for 15 years and others live to be 30 years old",
]

client = chromadb.Client()

# Use the "docs" collection if it already exists, otherwise create it
collection = client.get_or_create_collection(name="docs")

# Collect the IDs already stored so we can skip duplicates
existing_docs = collection.get()
existing_ids = set(existing_docs['ids'])

# Embed each document and store it in the vector database
for i, d in enumerate(documents):
    if str(i) in existing_ids:
        print(f"ID {i} already exists, skipping.")
        continue

    response = ollama.embeddings(model="mxbai-embed-large", prompt=d)
    embedding = response["embedding"]
    collection.add(
        ids=[str(i)],
        embeddings=[embedding],
        documents=[d]
    )

👉 Embed the query, then retrieve the most similar document from the database

Query = "What animals are llamas related to?"

# Embed the query with the same embedding model
response = ollama.embeddings(
  prompt=Query,
  model="mxbai-embed-large"
)
results = collection.query(
  query_embeddings=[response["embedding"]],
  n_results=1
)
data = results['documents'][0][0]  # text of the best-matching document

👉 Combine the Query and the retrieved data into a single prompt, then have the LLM generate a response (output['response'])

# Make sure the llama2 model is available locally
ollama.pull(model="llama2")
# Generate a response grounded in the retrieved document
output = ollama.generate(
  model="llama2",
  prompt=f"Using this data: {data}. Respond to this prompt: {Query}"
)

print(output['response'])

👉 Sample response

Llamas are members of the camelid family, which means they are closely related to other animals such as:

1. Vicuñas: Vicuñas are small, wild relatives of llamas and alpacas. They are native to South America and are known for their soft, woolly coats.
2. Camels: As the name suggests, camels are also members of the camelid family. They are known for their large size, long eyelashes, and ability to survive in hot, dry environments.
3. Alpacas: Alpacas are domesticated animals that are closely related to llamas and vicuñas. They are native to South America and are known for their soft, luxurious fibers.

So, to summarize, llamas are related to vicuñas, camels, and alpacas. These animals share similar physical and behavioral characteristics due to their shared evolutionary history within the camelid family.


Keywords: AI, ML, NLP, RAG, Embeddings