Vertex Search


Using LangChain retrievers with Vertex Search on Google Cloud.

This notebook walks through an example of retrieving documents relevant to a query.

In this example, we index course PDFs from Stanford's CS224n class, which (rather aptly) covers NLP and LLMs. The dataset is available at gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224.
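If you want to inspect the source documents before indexing them, you can list the bucket contents. This is a minimal sketch assuming the gsutil CLI is available (it is preinstalled on Colab):

# List the CS224n PDFs in the public sample bucket
! gsutil ls gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224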

! pip install --upgrade google-cloud-aiplatform
! pip install google-cloud-discoveryengine
! pip install "shapely<2.0.0"
! pip install langchain
! pip install typing-inspect==0.8.0 typing_extensions==4.5.0
# Automatically restart kernel after installs so that your environment can access the new packages
import IPython

app = IPython.Application.instance()
app.kernel.do_shutdown(True)
{'status': 'ok', 'restart': True}

If you’re running this notebook on Colab, authenticate with the following cell.

from google.colab import auth
auth.authenticate_user()

Set your project ID and search engine ID. The search engine has to be created in the Google Cloud console for now; future versions of the SDK should support creating it programmatically.

PROJECT_ID = "<..>"
SEARCH_ENGINE_ID = "<..>"

Optional parameters (a configuration sketch using some of them follows this list):

max_documents - The maximum number of documents used to provide extractive segments or extractive answers.

get_extractive_answers - By default, the retriever returns extractive segments. Set this field to True to return extractive answers instead.

max_extractive_answer_count - The maximum number of extractive answers returned in each search result. At most 5 answers are returned.

max_extractive_segment_count - The maximum number of extractive segments returned in each search result.

filter - Filter the search results based on document metadata in the data store.

query_expansion_condition - The condition under which query expansion should occur. 0 - Unspecified query expansion condition; server behavior defaults to disabled. 1 - Disabled query expansion; only the exact search query is used, even if SearchResponse.total_size is zero. 2 - Automatic query expansion built by the Search API.
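As a sketch of how these optional parameters fit together, the cell below configures a retriever with a metadata filter and automatic query expansion. The filter expression is a hypothetical example; adapt it to whatever metadata fields exist in your own data store.

from langchain.retrievers import GoogleCloudEnterpriseSearchRetriever

# Hypothetical configuration: restrict results via a metadata filter and
# enable automatic query expansion (condition 2). The filter syntax must
# match the metadata schema of your data store.
filtered_retriever = GoogleCloudEnterpriseSearchRetriever(
    project_id=PROJECT_ID,
    search_engine_id=SEARCH_ENGINE_ID,
    max_documents=5,
    filter='category: ANY("lecture_notes")',  # hypothetical metadata field
    query_expansion_condition=2,  # 2 = automatic query expansion
)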

from langchain.retrievers import GoogleCloudEnterpriseSearchRetriever
retriever = GoogleCloudEnterpriseSearchRetriever(
    project_id=PROJECT_ID,
    search_engine_id=SEARCH_ENGINE_ID,
    max_documents=3,
)

query = "What are the goals of the course?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)
page_content='[draft] note 1: introduction and word2vec cs 224n: natural language processing with\ndeep learning 3\n\navoid in this course. Partly, this is historical and methodological;\nthe raw signal processing methods and expertise are generally\ncovered in other courses (224s!) and other research communities,\nthough there has been some convergence of techniques of late.\nIn all aspects of NLP, most existing tools work for precious few (usu\nally one, maybe up to 100) of the world’s roughly 7000 languages,\nand fail disproportionately much on lesser-spoken and/or marginal\nized dialects, accents, and more. Beyond this, recent successes in\nbuilding better systems have far outstripped our ability to charac\nterize and audit these systems. Biases encoded in text, from race to\ngender to religion and more, are reflected and often amplified by\nNLP systems. With these challenges and considerations in mind, but\nwith the desire to do good science and build trustworthy systems\nthat improve peoples’ lives, let’s take a look at a fascinating first\nproblem in NLP.\n\n2 Representing words\n2.1 Signifier and signified\nConsider the sentence\n\nZuko makes the tea for his uncle.\n\nThe word Zuko is a sign, a symbol that represents an entity Zuko in\nsome (real of imagined) world. The word tea is also a symbol that\nrefers to a signified thing—perhaps a specific instance of tea. If one\nwere instead to say Zuko likes to make tea for his uncle, note that the\nsymbol Zuko still refers to Zuko, but now tea refers to a broader\nclass—tea in general, not a specific bit of hot delicious water. Consider\nthe two following sentences:\n\nZuko makes the coffee for his uncle.\nZuko makes the drink for his uncle.\n\nWhich is “more like” the sentence about tea? The drink may be tea\n(or it may be quite different!) and coffee definitely isn’t tea, but is yet\nsimilar, no? And is Zuko similar to uncle because they both describe\npeople? And is the similar to his because they both pick out specific\ninstances of a class?\nWord meaning is endlessly complex, deriving from humans’ goals\nof communicating with each other and achieving goals in the world.\nPeople use continuous media—speech, signing—but produce signs\nin a discrete, symbolic structure—language—to express complex\nmeanings. Expressing and processing the nuance and wildness of\nlanguage—while achieving the strong transfer of information that' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='cs224n: natural language processing with deep learning lecture notes: part ii\nword vectors ii: glove, evaluation and training 10\n\n3. Apply spherical k-means to cluster these context representations.\n4. Finally, each word occurrence is re-labeled to its associated cluster\nand is used to train the word representation for that cluster.\nFor a more rigorous treatment on this topic, one should refer to\nthe original paper.\n\n3 Training for Extrinsic Tasks\nWe have so far focused on intrinsic tasks and emphasized their\nimportance in developing a good word embedding technique. Of\ncourse, the end goal of most real-world problems is to use the result\ning word vectors for some other extrinsic task. Here we discuss the\ngeneral approach for handling extrinsic tasks.\n3.1 Problem Formulation\n\nMost NLP extrinsic tasks can be formulated as classification tasks.\nFor instance, given a sentence, we can classify the sentence to have\npositive, negative or neutral sentiment. Similarly, in named-entity\nrecognition (NER), given a context and a central word, we want\nto classify the central word to be one of many classes. For the in\nput, "Jim bought 300 shares of Acme Corp. in 2006", we would\nlike a classified output "[Jim]Person bought 300 shares of [Acme\nCorp.]Organization in [2006]Time."\n\nFigure 5: We can classify word vectors\nusing simple linear decision boundaries\nsuch as the one shown here (2-D word\nvectors) using techniques such as\nlogistic regression and SVMs\n\nFor such problems, we typically begin with a training set of the\nform:\n\n{x (i)\n\n, y\n\n(i) }\n\nN\n1\n\nwhere x\n\n(i) is a d-dimensional word vector generated by some word\nembedding technique and y\n(i) is a C-dimensional one-hot vector\nwhich indicates the labels we wish to eventually predict (sentiments,\nother words, named entities, buy/sell decisions, etc.).\nIn typical machine learning tasks, we usually hold input data and\ntarget labels fixed and train weights using optimization techniques\n(such as gradient descent, L-BFGS, Newton’s method, etc.). In NLP\napplications however, we introduce the idea of retraining the input\nword vectors when we train for extrinsic tasks. Let us discuss when\nand why we should consider doing this.\n\nImplementation Tip: Word vector\nretraining should be considered for\nlarge training datasets. For small\ndatasets, retraining word vectors will\nlikely worsen performance.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n-2019-notes02-wordvecs2.pdf', 'id': '1ff1e859cfb0a87e309c43035be20ec9'}
page_content='2. The Benchmark Tasks\nIn this section, we briefly introduce four standard NLP tasks on which we will benchmark our\narchitectures within this paper: Part-Of-Speech tagging (POS), chunking (CHUNK), Named Entity\nRecognition (NER) and Semantic Role Labeling (SRL). For each of them, we consider a standard\nexperimental setup and give an overview of state-of-the-art systems on this setup. The experimental\nsetups are summarized in Table 1, while state-of-the-art systems are reported in Table 2.\n2.1 Part-Of-Speech Tagging\nPOS aims at labeling each word with a unique tag that indicates its syntactic role, for example, plural\nnoun, adverb, . . .\nA standard benchmark setup is described in detail by Toutanova et al. (2003).\nSections 0–18 of Wall Street Journal (WSJ) data are used for training, while sections 19–21 are for\nvalidation and sections 22–24 for testing.\nThe best POS classifiers are based on classifiers trained on windows of text, which are then fed\nto a bidirectional decoding algorithm during inference. Features include preceding and following\n\n2494' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/collobert11a.pdf', 'id': '3612967ed6bf4badc0bf4808a0b5eada'}
retriever = GoogleCloudEnterpriseSearchRetriever(
    project_id=PROJECT_ID,
    search_engine_id=SEARCH_ENGINE_ID,
    max_documents=3,
    max_extractive_answer_count=3,
    get_extractive_answers=True,
)
query = "Does the course cover transformers?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)
page_content='On faster GPUs, the pretraining can finish in around 30-40 minutes. This assignment is an investigation into Transformer self-attention building blocks, and the effects of pre training. It covers mathematical properties of Transformers and self-attention through written questions.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2', 'id': 'e45a23e879587067446c6f876341de6d'}
page_content='2. Pretrained Transformer models and knowledge access (35 points) You&#39;ll train a Transformer to perform a task that involves accessing knowledge about the world — knowledge which isn&#39;t provided via the task&#39;s training data (at least if you want to generalize outside the training set).' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2', 'id': 'e45a23e879587067446c6f876341de6d'}
page_content='CS 224N Assignment 5 Page 2 of 10 1.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/a5.pdf:2', 'id': 'e45a23e879587067446c6f876341de6d'}
page_content='Image Transformer of parameters and consequently computational performance and can make training such models more challenging.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/1802.05751.pdf:1', 'id': '0ed0222cb61a3398c75d4df1093e6562'}
page_content='Our best CIFAR-10 model with DMOL has d and feed-forward layer layer dimension of 256 and perform attention in 512 dimensions.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/1802.05751.pdf:1', 'id': '0ed0222cb61a3398c75d4df1093e6562'}
page_content='Training recurrent neural networks to sequentially predict each pixel of even a small image is computationally very challenging.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/1802.05751.pdf:1', 'id': '0ed0222cb61a3398c75d4df1093e6562'}
page_content='In Chapter 10 we explored causal (left-to-right) transformers that can serve as the basis for powerful language models—models that can eas ily be applied to autoregressive generation problems such as contextual generation, summarization and machine translation.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/11.pdf:10', 'id': '26b3bc83d90348c17ebad26a49853226'}
page_content='Every input sentence first has to be tokenized, and then all further processing takes place on subword tokens rather than words.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/11.pdf:10', 'id': '26b3bc83d90348c17ebad26a49853226'}
page_content='11.3.1 Sequence Classification Sequence classification applications often represent an input sequence with a single consolidated representation.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/11.pdf:10', 'id': '26b3bc83d90348c17ebad26a49853226'}
query = "What is word2vec?"

result = retriever.get_relevant_documents(query)
for doc in result:
    print(doc)
page_content='However, many of the details of word2vec will hold true in methods that we&#39;ll proceed to further in the course, so we&#39;ll focus our time on that. 3.2 Word2vec model and objective The word2vec model represents each word in a fixed vocabulary as a low-dimensional (much smaller than vocabulary size) vector.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='[draft] note 1: introduction and word2vec cs 224n: natural language processing with deep learning 4 language is intended to achieve—makes representing words an endlessly fascinating problem. Let&#39;s move to some methods. 2.2 Independent words, independent vectors What is a word?' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='The word2vec model is a probabilistic model specified as follows, where uw refers to the row of U corresponding to word w ∈ V (and likewise for V): pU,V(o|c) = exp u ⊤ o vc ∑w∈V exp u⊤ w vc (4) This may be familiar to you as the softmax function, which takes arbitrary scores (here, one for each word in the vocabulary, resulting from dot products) and produces a probability distribution where larger-scored things get higher probability.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/cs224n_winter2023_lecture1_notes_draft.pdf:8', 'id': '2f84b4522da1ad7216b708405a2e7fd1'}
page_content='Although they learn word embeddings by optimizing over some objective functions using stochastic gradient methods, they have both been shown to be implicitly performing matrix factorizations. Skip-gram Skip-gram Word2Vec maximizes the likelihood of co-occurrence of the center word and context words.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/NeurIPS-2018-on-the-dimensionality-of-word-embedding-Paper.pdf:2', 'id': '55b0877de28706e9b8ec530df0c60fa3'}
page_content='For three popular embedding algorithms: LSA, skip-gram Word2Vec and GloVe, we find their optimal dimensionalities k ∗ that minimize their respective PIP loss. We define the sub-optimality of a particular dimensionality k as the additional PIP loss com pared with k ∗ : EkET k − EET − Ek∗Ek∗ T − EET .' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/NeurIPS-2018-on-the-dimensionality-of-word-embedding-Paper.pdf:2', 'id': '55b0877de28706e9b8ec530df0c60fa3'}
page_content='It states that two embeddings are essentially identical if one can be obtained from the other by performing a unitary operation, eg, a rotation. A unitary operation on a vector corresponds to multiplying the vector by a unitary matrix, ie v = vU, where U TU = UUT = Id.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/NeurIPS-2018-on-the-dimensionality-of-word-embedding-Paper.pdf:2', 'id': '55b0877de28706e9b8ec530df0c60fa3'}
page_content='With word2vec, we train the skip-gram (SG† ) and continuous bag-of-words (CBOW† ) models on the 6 billion token corpus (Wikipedia 2014 + Giga word 5) with a vocabulary of the top 400000 most frequent words and a context window size of 10.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/glove.pdf:10', 'id': '2d92bcf9f62444811e6640154cbe484e'}
page_content='The most important remaining variable to con trol for is training time. For GloVe, the rele vant parameter is the number of training iterations. For word2vec, the obvious choice would be the number of training epochs. Unfortunately, the code is currently designed for only a single epoch:' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/glove.pdf:10', 'id': '2d92bcf9f62444811e6640154cbe484e'}
page_content='We construct a model that utilizes this main ben efit of count data while simultaneously capturing the meaningful linear substructures prevalent in recent log-bilinear prediction-based methods like word2vec.' metadata={'source': 'gs://cloud-samples-data/gen-app-builder/search/stanford-cs-224/glove.pdf:10', 'id': '2d92bcf9f62444811e6640154cbe484e'}