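AI agent tools for unstructured data retrieval

Preview

This article describes how to create AI agent tools for unstructured data retrieval using the Mosaic AI Agent Framework. Unstructured retrievers let agents query unstructured data sources, such as a document corpus, using vector search indexes.

To learn more about agent tools, see AI agent tools.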

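Locally develop Vector Search retriever tools with AI Bridge

The easiest way to start developing a Databricks Vector Search retriever tool is to develop it locally. Use Databricks AI Bridge packages such as databricks-langchain and databricks-openai to quickly add retrieval capabilities to an agent and experiment with query parameters. This approach enables fast iteration during initial development.

When your local tool is ready, you can productionize it directly as part of your agent code, or migrate it to a Unity Catalog function, which offers better discoverability and governance but has certain limitations. See Vector Search retriever tool with Unity Catalog functions.

Note:

To use an external vector index hosted outside of Databricks, see Vector Search retriever using a vector index hosted outside of Databricks.

The following code prototypes a retriever tool and binds it to an LLM locally so you can chat with the agent to test its tool-calling behavior.

Install the latest version of databricks-langchain, which includes Databricks AI Bridge.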

%pip install --upgrade databricks-langchain

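The following example queries a hypothetical vector search index that fetches content from Databricks product documentation.

Provide a clear and descriptive tool_description. The agent LLM uses the tool_description to understand the tool and decide when to invoke it.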

from databricks_langchain import VectorSearchRetrieverTool, ChatDatabricks

# Initialize the retriever tool.
vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation."
)

# Run a query against the vector search index locally for testing
vs_tool.invoke("Databricks Agent Framework?")

# Bind the retriever tool to your Langchain LLM of choice
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-1-70b-instruct")
llm_with_tools = llm.bind_tools([vs_tool])

# Chat with your LLM to test the tool calling functionality
llm_with_tools.invoke("Based on the Databricks documentation, what is Databricks Agent Framework?")

Note:

When initializing a VectorSearchRetrieverTool, the text_column and embedding arguments are required for Delta Sync indexes with self-managed embeddings and for Direct Vector Access indexes. See options for providing embeddings.

For details, see the API docs for VectorSearchRetrieverTool.

from databricks_langchain import VectorSearchRetrieverTool
from databricks_langchain import DatabricksEmbeddings

embedding_model = DatabricksEmbeddings(
    endpoint="databricks-bge-large-en",
)

vs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
  num_results=5, # Max number of documents to return
  columns=["primary_key", "text_column"], # List of columns to include in the search
  filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
  query_type="ANN", # Query type ("ANN" or "HYBRID").
  tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
  tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
  text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
  embedding=embedding_model # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

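The following code prototypes a vector search retriever tool and integrates it with OpenAI's GPT models.

For more information on OpenAI's recommendations for tools, see the OpenAI function calling documentation.

Install the latest version of databricks-openai, which includes Databricks AI Bridge.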

%pip install --upgrade databricks-openai

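The following example queries a hypothetical vector search index that fetches content from Databricks product documentation.

Provide a clear and descriptive tool_description. The agent LLM uses the tool_description to understand the tool and decide when to invoke it.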

from databricks_openai import VectorSearchRetrieverTool
from openai import OpenAI
import json

# Initialize OpenAI client
client = OpenAI(api_key="<your_API_key>")  # Replace the placeholder with your OpenAI API key

# Initialize the retriever tool
dbvs_tool = VectorSearchRetrieverTool(
  index_name="catalog.schema.my_databricks_docs_index",
  tool_name="databricks_docs_retriever",
  tool_description="Retrieves information about Databricks products from official Databricks documentation"
)

messages = [
  {"role": "system", "content": "You are a helpful assistant."},
  {
    "role": "user",
    "content": "Using the Databricks documentation, answer what is Spark?"
  }
]
first_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)

# Execute function code and parse the model's response and handle function calls.
tool_call = first_response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = dbvs_tool.execute(query=args["query"])  # For self-managed embeddings, optionally pass in openai_client=client

# Supply model with results – so it can incorporate them into its final response.
messages.append(first_response.choices[0].message)
messages.append({
  "role": "tool",
  "tool_call_id": tool_call.id,
  "content": json.dumps(result)
})
second_response = client.chat.completions.create(
  model="gpt-4o",
  messages=messages,
  tools=[dbvs_tool.tool]
)
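The example above handles only the first tool call in the response. A model can return several tool calls in one message, or none at all. The following is a minimal sketch of one way to handle both cases, assuming each tool call exposes `id` and `function.arguments` as in the response objects used above. `build_tool_messages` and `execute_fn` are hypothetical names introduced for illustration; in the example above, `execute_fn` would wrap `dbvs_tool.execute`.

```python
import json
from types import SimpleNamespace
from typing import Any, Callable, Dict, List, Optional


def build_tool_messages(
    tool_calls: Optional[List[Any]],
    execute_fn: Callable[[str], Any],
) -> List[Dict[str, str]]:
    """Run every tool call and return one "tool" message per call.

    Hypothetical helper: `execute_fn` takes the query string and returns
    JSON-serializable results (for example, a wrapper around dbvs_tool.execute).
    """
    messages = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        args = json.loads(call.function.arguments)
        result = execute_fn(args["query"])
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": json.dumps(result),
        })
    return messages


# Stand-in objects that mimic the shape of tool calls, for local testing only.
fake_calls = [
    SimpleNamespace(id="call_1", function=SimpleNamespace(arguments='{"query": "What is Spark?"}')),
    SimpleNamespace(id="call_2", function=SimpleNamespace(arguments='{"query": "What is Delta Lake?"}')),
]
tool_messages = build_tool_messages(fake_calls, lambda q: [{"page_content": f"stub result for {q}"}])
```

The returned messages can be appended to `messages` after the assistant message, before the second call to the model.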

Note:

When initializing a VectorSearchRetrieverTool, the text_column and embedding arguments are required for Delta Sync indexes with self-managed embeddings and for Direct Vector Access indexes. See options for providing embeddings.

For details, see the API docs for VectorSearchRetrieverTool.

from databricks_openai import VectorSearchRetrieverTool

vs_tool = VectorSearchRetrieverTool(
    index_name="catalog.schema.index_name", # Index name in the format 'catalog.schema.index'
    num_results=5, # Max number of documents to return
    columns=["primary_key", "text_column"], # List of columns to include in the search
    filters={"text_column LIKE": "Databricks"}, # Filters to apply to the query
    query_type="ANN", # Query type ("ANN" or "HYBRID").
    tool_name="name of the tool", # Used by the LLM to understand the purpose of the tool
    tool_description="Purpose of the tool", # Used by the LLM to understand the purpose of the tool
    text_column="text_column", # Specify text column for embeddings. Required for direct-access index or delta-sync index with self-managed embeddings.
    embedding_model_name="databricks-bge-large-en" # The embedding model. Required for direct-access index or delta-sync index with self-managed embeddings.
)

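Vector Search retriever tool with Unity Catalog functions

The following example creates a retriever tool using a Unity Catalog function to query data from a Mosaic AI Vector Search index.

The Unity Catalog function databricks_docs_vector_search queries a hypothetical vector search index containing Databricks documentation. The function wraps the Databricks SQL function vector_search() and aligns its output with the MLflow retriever schema by using the page_content and metadata aliases.

Note:

To conform to the MLflow retriever schema, any additional metadata columns must be added to the metadata column using the SQL map function, rather than as top-level output keys.

Run the following code in a notebook or SQL editor to create the function.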

CREATE OR REPLACE FUNCTION main.default.databricks_docs_vector_search (
  -- The agent uses this comment to determine how to generate the query string parameter.
  query STRING
  COMMENT 'The query string for searching Databricks documentation.'
) RETURNS TABLE
-- The agent uses this comment to determine when to call this tool. It describes the types of documents and information contained within the index.
COMMENT 'Executes a search on Databricks documentation to retrieve text documents most relevant to the input query.' RETURN
SELECT
  chunked_text as page_content,
  map('doc_uri', url, 'chunk_id', chunk_id) as metadata
FROM
  vector_search(
    -- Specify your Vector Search index name here
    index => 'catalog.schema.databricks_docs_index',
    query => query,
    num_results => 5
  )
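To illustrate the output shape that the SQL function produces, here is a small Python sketch (not part of the function itself) that folds extra columns into a `metadata` map and leaves only `page_content` and `metadata` as top-level keys. The helper name `to_retriever_record` and the sample row are hypothetical. Values are cast to strings so the map has a single value type, which is an assumption of this sketch.

```python
from typing import Any, Dict


def to_retriever_record(row: Dict[str, Any], text_column: str = "chunked_text") -> Dict[str, Any]:
    """Return a record with only `page_content` and `metadata` at the top level."""
    # Every column other than the text column goes into the metadata map,
    # similar in spirit to map('doc_uri', url, 'chunk_id', chunk_id) in the SQL function.
    metadata = {key: str(value) for key, value in row.items() if key != text_column}
    return {"page_content": row[text_column], "metadata": metadata}


# Hypothetical row, used only to show the resulting shape.
sample_row = {"chunked_text": "Vector Search is ...", "doc_uri": "https://example.com/doc", "chunk_id": 12}
record = to_retriever_record(sample_row)
```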

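To use this retriever tool in your AI agent, wrap it with UCFunctionToolkit. This enables automatic tracing through MLflow.

MLflow Tracing captures detailed execution information for generative AI applications. It logs the inputs, outputs, and metadata of each step, helping you debug issues and analyze performance.

When you use UCFunctionToolkit, retrievers automatically generate RETRIEVER span types in MLflow logs if their output conforms to the MLflow retriever schema. See MLflow Tracing schema.

For more information about UCFunctionToolkit, see the Unity Catalog documentation.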

from unitycatalog.ai.langchain.toolkit import UCFunctionToolkit

toolkit = UCFunctionToolkit(
    function_names=[
        "main.default.databricks_docs_vector_search"
    ]
)
tools = toolkit.tools

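This retriever tool has the following caveats:

SQL clients may limit the number of rows or bytes returned. To prevent data truncation, truncate the column values returned by the UDF. For example, you can use substring(chunked_text, 0, 8192) to reduce the size of large content columns and avoid row truncation during execution.
Because this tool is a wrapper for the vector_search() function, it is subject to the same limitations as the vector_search() function. See Limitations.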

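Vector Search retriever using a vector index hosted outside of Databricks

If your vector index is hosted outside of Databricks, you can create a Unity Catalog connection to the external service and use the connection in your agent code. For more information, see Connect AI agent tools to external services.

The following example creates a Vector Search retriever that calls a vector index hosted outside of Databricks for a PyFunc-flavored agent.

Create a Unity Catalog connection to the external service, in this case Azure.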

CREATE CONNECTION ${connection_name}
TYPE HTTP
OPTIONS (
  host 'https://example.search.windows.net',
  base_path '/',
  bearer_token secret ('<secret-scope>','<secret-key>')
);

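Define the retriever tool in your agent code using the Unity Catalog connection you created. This example uses MLflow decorators to enable agent tracing.

Note:

To conform to the MLflow retriever schema, the retriever function should return a Document type and use the metadata field of the Document class to add additional attributes, such as doc_uri and similarity_score, to the returned document.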

import mlflow
import json

from mlflow.entities import Document
from typing import List, Dict, Any
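from dataclasses import asdict

# Name of the Unity Catalog connection created above (placeholder; replace with your own).
connection_name = "<connection_name>"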

class VectorSearchRetriever:
  """
  Class that retrieves relevant documents from a vector index hosted outside of Databricks.
  """

  def __init__(self):
    self.azure_search_index = "hotels_vector_index"

  @mlflow.trace(span_type="RETRIEVER", name="vector_search")
  def __call__(self, query_vector: List[Any], score_threshold=None) -> List[Dict[str, Any]]:
    """
    Performs vector search to retrieve relevant chunks.
    Args:
      query_vector: Embedding vector of the search query.
      score_threshold: Minimum score a result needs to be included.

    Returns:
      List of retrieved Documents, each converted to a dictionary.
    """
    from databricks.sdk import WorkspaceClient
    from databricks.sdk.service.serving import ExternalFunctionRequestHttpMethod

    # Request body for the search API. Use a name other than `json` so the
    # imported json module is not shadowed.
    payload = {
      "count": True,
      "select": "HotelId, HotelName, Description, Category",
      "vectorQueries": [
        {
          "vector": query_vector,
          "k": 7,
          "fields": "DescriptionVector",
          "kind": "vector",
          "exhaustive": True,
        }
      ],
    }

    response = (
      WorkspaceClient()
      .serving_endpoints.http_request(
        conn=connection_name,
        method=ExternalFunctionRequestHttpMethod.POST,
        path=f"indexes/{self.azure_search_index}/docs/search?api-version=2023-07-01-Preview",
        json=payload,
      )
      .text
    )

    # The response body is a JSON string; parse it before reading the results.
    documents = self.convert_vector_search_to_documents(json.loads(response), score_threshold)
    return [asdict(doc) for doc in documents]

  @mlflow.trace(span_type="PARSER")
  def convert_vector_search_to_documents(
    self, vs_results, score_threshold
  ) -> List[Document]:
    docs = []

    for item in vs_results.get("value", []):
      score = item.get("@search.score", 0)

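      # Keep every result when no threshold is given (score_threshold defaults to None).
      if score_threshold is None or score >= score_threshold: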
        metadata = {
          "score": score,
          "HotelName": item.get("HotelName"),
          "Category": item.get("Category"),
        }

        doc = Document(
          page_content=item.get("Description", ""),
          metadata=metadata,
          id=item.get("HotelId"),
        )
        docs.append(doc)

    return docs

To run the retriever, run the following Python code. You can optionally include Vector Search filters in the request to filter the results.
```
retriever = VectorSearchRetriever()
query = [0.01944167, 0.0040178085 . . .  TRIMMED FOR BREVITY 010858015, -0.017496133]
results = retriever(query, score_threshold=0.1)
```
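As a sketch of how a filter could be added to the request, the helper below builds the same request body as the retriever above and adds a `filter` field when one is given. This assumes the external service is Azure AI Search, whose search request body accepts an OData `filter` expression. `build_search_payload` and the sample filter value are hypothetical, and other services use different filter syntax.

```python
from typing import Any, Dict, List, Optional


def build_search_payload(
    query_vector: List[float],
    k: int = 7,
    odata_filter: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the search request body, optionally restricted by a filter."""
    payload: Dict[str, Any] = {
        "count": True,
        "select": "HotelId, HotelName, Description, Category",
        "vectorQueries": [
            {
                "vector": query_vector,
                "k": k,
                "fields": "DescriptionVector",
                "kind": "vector",
                "exhaustive": True,
            }
        ],
    }
    if odata_filter:
        # Assumption: OData filter syntax, as used by Azure AI Search.
        payload["filter"] = odata_filter
    return payload


# Short made-up vector and filter value, for illustration only.
payload = build_search_payload([0.1, 0.2, 0.3], odata_filter="Category eq 'Luxury'")
```

Keeping the payload construction in its own function makes it easy to check the request body locally before sending it through the Unity Catalog connection.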

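Set retriever schema

If the trace returned from the retriever or span_type="RETRIEVER" does not conform to MLflow's standard retriever schema, you must manually map the returned schema to MLflow's expected fields. This ensures that MLflow can properly trace your retriever and render traces correctly in downstream applications.

To set the retriever schema manually, call mlflow.models.set_retriever_schema when you define your agent. Use set_retriever_schema to map the column names in the returned table to MLflow's expected fields, such as primary_key, text_column, and doc_uri.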

# Define the retriever's schema by providing your column names
mlflow.models.set_retriever_schema(
    name="vector_search",
    primary_key="chunk_id",
    text_column="text_column",
    doc_uri="doc_uri"
    # other_columns=["column1", "column2"],
)

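You can also specify additional columns in your retriever's schema by providing a list of column names in the other_columns field.

If you have multiple retrievers, you can define multiple schemas by using a unique name for each retriever schema.

The retriever schema set during agent creation affects downstream applications and workflows, such as the review app and evaluation sets. Specifically, the doc_uri column serves as the primary identifier for documents returned by the retriever.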

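The review app displays the doc_uri to help reviewers assess responses and trace document origins. See Review app UI.
Evaluation sets use doc_uri to compare retriever results against predefined evaluation datasets to determine the retriever's recall and precision. See Evaluation sets.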
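To make the role of doc_uri in evaluation concrete, the following sketch computes precision and recall for a single question by comparing the doc_uri values a retriever returned with the expected ones. The function name and sample URIs are hypothetical, and these are the textbook set-based definitions, not necessarily the exact metrics that Agent Evaluation computes.

```python
from typing import Iterable, Tuple


def retrieval_precision_recall(
    retrieved_uris: Iterable[str],
    expected_uris: Iterable[str],
) -> Tuple[float, float]:
    """Compare retrieved doc_uri values with the expected ones for one question."""
    retrieved = set(retrieved_uris)
    expected = set(expected_uris)
    hits = len(retrieved & expected)
    # Precision: share of retrieved documents that were expected.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: share of expected documents that were retrieved.
    recall = hits / len(expected) if expected else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    retrieved_uris=["docs/a", "docs/b", "docs/c", "docs/d"],
    expected_uris=["docs/a", "docs/b", "docs/e"],
)
```

In this made-up case, two of the four retrieved documents are expected, and two of the three expected documents are retrieved.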

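Trace the retriever

MLflow Tracing adds observability by capturing detailed information about your agent's execution. It provides a way to record the inputs, outputs, and metadata associated with each intermediate step of a request, so you can quickly pinpoint the source of bugs and unexpected behavior.

This example uses the @mlflow.trace decorator to create a trace for the retriever and parser. For other options for setting up tracing, see MLflow Tracing for agents.

The decorator creates a span that starts when the function is invoked and ends when it returns. MLflow automatically records the function's inputs and outputs, and any exceptions raised.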
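The span lifecycle described above can be sketched with a plain Python decorator. This is a simplified stand-in written for illustration, not MLflow's implementation: `trace_sketch`, the `SPANS` list, and the `retrieve` function are all hypothetical names, and a real tracing backend records far more detail.

```python
import functools
import time
from typing import Any, Callable, Dict, List

# In-memory list standing in for a trace backend (illustration only).
SPANS: List[Dict[str, Any]] = []


def trace_sketch(span_type: str, name: str) -> Callable:
    """Simplified stand-in for a tracing decorator; not MLflow's implementation."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            span = {
                "name": name,
                "span_type": span_type,
                "inputs": {"args": args, "kwargs": kwargs},
                "outputs": None,
                "exception": None,
                "start": time.time(),  # the span starts when the function is called
            }
            try:
                span["outputs"] = func(*args, **kwargs)
                return span["outputs"]
            except Exception as exc:
                span["exception"] = repr(exc)  # record the error, then re-raise it
                raise
            finally:
                span["end"] = time.time()  # the span ends when the function returns or raises
                SPANS.append(span)

        return wrapper

    return decorator


@trace_sketch(span_type="RETRIEVER", name="vector_search")
def retrieve(query: str) -> List[Dict[str, Any]]:
    # Stub retriever that returns one hard-coded document.
    return [{"page_content": f"result for {query}", "metadata": {"doc_uri": "docs/a"}}]


retrieve("What is Agent Framework?")
```

After the call, `SPANS` holds one entry with the function's inputs, outputs, and start and end times.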

Note:

LangChain, LlamaIndex, and OpenAI library users can use MLflow autologging instead of manually defining traces with the decorator. See Use autologging to add traces to your agents.

...
@mlflow.trace(span_type="RETRIEVER", name="vector_search")
def __call__(self, query: str) -> List[Document]:
  ...

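To ensure that downstream applications such as Agent Evaluation and the AI Playground render the retriever trace correctly, make sure the decorator meets the following requirements:

Use span_type="RETRIEVER" and ensure the function returns a List[Document] object. See Retriever spans.
The trace name and the retriever_schema name must match to configure the trace correctly.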

Next steps

After you create a Unity Catalog function agent tool, add the tool to an AI agent. See Add Unity Catalog tools to agents.