How to get all the words around a word within a fixed proximity

I have texts of variable size (1k-100k characters). I want to get all the words around a given word within a fixed proximity. The given word is found with a regex, so I have its start and end positions.

For example:

import re

PROXIMITY_LENGTH = 10  # the fixed proximity
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()

print(f'start = {start}, stop = {stop}')
print(my_text[start - PROXIMITY_LENGTH: start]) 
print(my_text[stop: stop + PROXIMITY_LENGTH])

left_limit = my_text[:start - PROXIMITY_LENGTH].rfind(' ') + 1
right_limit = stop + PROXIMITY_LENGTH + my_text[stop + PROXIMITY_LENGTH:].find(' ') 

print('\n')
print(my_text[left_limit: start]) 
print(my_text[stop: right_limit])

output:

start = 18, stop = 22
dom words 
 word1 wor


random words 
 word1 word123

The problem is at the limits: the fixed proximity can cut the last word on either side. In the example above I tried to work around this, but my solution fails if tabs or newlines separate the words. For example:

With my_text = 'some\trandom words 1123 word1 word123 a', my solution returns some random words on the left side, which is wrong.

Any help is appreciated! Thx!

Instead of looking at characters, I would look at words: find the target and take N words before and after it:

PROXIMITY_LENGTH = 2  # the fixed proximity
my_text = 'some random words 1123 word1 word123 a \t1123 this too will work'.split()

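# 0 where a word starts with the target, -1 (or a later offset) otherwise
found = [x.find('1123') for x in my_text]

# max(..., 0) keeps the left slice bound from going negative near the start of the text
k = [' '.join(my_text[max(index - PROXIMITY_LENGTH, 0):index + PROXIMITY_LENGTH + 1]) for index, item in enumerate(found) if item == 0]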


print(k)

# ['random words 1123 word1 word123', 'word123 a 1123 this too']

Using a regex, the found variable can be built like this instead:

import re

found = []
for x in my_text:
    if re.search(r'\b1123\b',x):
        found.append(0)
    else:
        found.append(-1)

The only thing I do is split the string into a list :)
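As noted in the comments, the regex may match something that spans several whitespace-separated tokens, so testing each word separately can miss it. One possible variation, sketched here under the assumption that the match span on the full text is available, is to split only the text to the left and right of the match. The function name `words_around` is hypothetical and not part of the answer above.

```python
import re

def words_around(text, pattern, n_words):
    """Return (left_words, match_text, right_words) for the first match, or None."""
    match = re.search(pattern, text)
    if match is None:
        return None
    start, stop = match.span()
    # str.split() without arguments splits on any run of whitespace
    # (spaces, tabs, newlines), so the delimiter type does not matter.
    left_words = text[:start].split()
    right_words = text[stop:].split()
    # Guard n_words == 0, because left_words[-0:] would return the whole list.
    left = left_words[-n_words:] if n_words > 0 else []
    right = right_words[:n_words]
    return left, match.group(), right

text = ' some random\twords 123 123 - 123 some other random words.'
print(words_around(text, r'\b\d((\s*|\s*-\s*)\d){8}\b', 2))
# (['random', 'words'], '123 123 - 123', ['some', 'other'])
```

Here a "word" is any run of non-whitespace characters, so punctuation attached to a word (such as the final `words.`) stays attached.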

This can be done by simply expanding your regex pattern to include the desired number of words around the target match:

import re

L = 2  # using a proximity length of just 2 for demo
my_text = 'some random words 1123 word1 word123 a'
print(re.search(r'(\w+\s+){{0,{0}}}\b1123\b(\s+\w+){{0,{0}}}'.format(L), my_text).group())

This outputs:

random words 1123 word1 word123
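Because the pattern uses `\s+` between words, it should also cope with tabs and newlines as delimiters. A minimal sketch that wraps the same idea in a reusable function follows; the name `match_with_context` and the non-capturing groups are assumptions added for illustration.

```python
import re

def match_with_context(text, target, n_words):
    """Search for `target` (a regex) with up to n_words words on each side."""
    # Doubled braces produce literal { } in the f-string, giving e.g. {0,2}.
    pattern = rf'(?:\w+\s+){{0,{n_words}}}(?:{target})(?:\s+\w+){{0,{n_words}}}'
    match = re.search(pattern, text)
    return match.group() if match else None

my_text = 'some\trandom words 1123 word1\nword123 a'
print(repr(match_with_context(my_text, r'\b1123\b', 2)))
# 'random words 1123 word1\nword123'
```

Note that `\w+` only counts runs of word characters, so a neighbouring token that carries punctuation (for example `word1,`) ends the context early.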

If you want to measure the proximity in characters (distance from the start/stop of the match) but still get the whole word when the proximity limit falls in the middle of a word:

In this case I would suggest searching for the first character that is neither a letter nor a digit. Try the following code:

import re
import string

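# characters that count as part of a word (ASCII letters and digits only)
WORD_CHARS = string.ascii_letters + string.digits

def get_left_limit(left_string, proximity):
    """Number of characters to take to the left of the match."""
    if proximity >= len(left_string):
        return len(left_string)

    # walk backwards from the cut until a non-word character is found
    start_diff = 0
    for letter in reversed(left_string[:-proximity]):
        if letter not in WORD_CHARS:
            break
        start_diff += 1
    return proximity + start_diff

def get_right_limit(right_string, proximity):
    """Number of characters to take to the right of the match."""
    if proximity >= len(right_string):
        return len(right_string)

    # walk forwards from the cut until a non-word character is found
    end_diff = 0
    for letter in right_string[proximity:]:
        if letter not in WORD_CHARS:
            break
        end_diff += 1
    return proximity + end_diff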


PROXIMITY_LENGTH = 10  # the fixed proximity


# example 1
print('Example: 1')
my_text = 'some random words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(f'start = {start}, stop = {stop}')
#
left_proximity = get_left_limit(my_text[:start], PROXIMITY_LENGTH)
right_proximity = get_right_limit(my_text[stop:], PROXIMITY_LENGTH)
print(my_text[start - left_proximity:start])
print(my_text[stop:stop + right_proximity])

# example 2
print()
print('Example: 2')
my_text = 'some\trandom words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(f'start = {start}, stop = {stop}')
#
left_proximity = get_left_limit(my_text[:start], PROXIMITY_LENGTH)
right_proximity = get_right_limit(my_text[stop:], PROXIMITY_LENGTH)
print(my_text[start - left_proximity:start])
print(my_text[stop:stop + right_proximity])

The code above prints:

Example: 1
start = 18, stop = 22
random words 
 word1 word123

Example: 2
start = 18, stop = 22
random words 
 word1 word123
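The same widening can also be sketched with regular expressions instead of explicit loops. This is an alternative illustration, not the code from the answer: the function name `expand_window` is hypothetical, and it uses `\w` (which also matches underscores and non-ASCII letters) rather than only ASCII letters and digits.

```python
import re

def expand_window(text, start, stop, proximity):
    """Return (left, right) context of `proximity` characters, widened to whole words."""
    left_cut = max(start - proximity, 0)
    right_cut = min(stop + proximity, len(text))
    # Word characters immediately before the left cut belong to a word that was cut.
    # \Z anchors at the very end of the slice (unlike $, it ignores a trailing newline).
    left_cut -= len(re.search(r'\w*\Z', text[:left_cut]).group())
    # Word characters starting at the right cut complete the word that was cut.
    right_cut = re.compile(r'\w*').match(text, right_cut).end()
    return text[left_cut:start], text[stop:right_cut]

my_text = 'some\trandom words 1123 word1 word123 a'
start, stop = re.search(r'\b1123\b', my_text).span()
print(expand_window(my_text, start, stop, 10))
# ('random words ', ' word1 word123')
```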

All the answers are really helpful, but I came up with a simpler approach: take all the words within the proximity window except those at the limits, so if the proximity limit cuts a word, that word is not taken into consideration. I find this approach more efficient:

import re

text = ' some random\twords 123 123 - 123 some other random words.'
regex = r'\b\d((\s*|\s*-\s*)\d){8}\b'
PROXIMITY_LENGTH = 10
REGEX_NO_START_END_WORD = r'\W.+\W'

start, end = re.search(regex, text).span()

left_limit = start - PROXIMITY_LENGTH
if left_limit < 0:
    left_limit = 0

right_limit = end + PROXIMITY_LENGTH
if right_limit > len(text):
    right_limit = len(text)

text_within_proximity = text[left_limit: right_limit]
re.search(REGEX_NO_START_END_WORD, text_within_proximity, flags=re.DOTALL).group()

output:

'\twords 123 123 - 123 some '
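To apply the same trimming to every match in a text, the steps could be wrapped in a function along these lines. This is only a sketch: the name `whole_words_in_window` is hypothetical, and falling back to the bare match when the window contains no two non-word characters is an assumption, not part of the approach above.

```python
import re

def whole_words_in_window(text, pattern, proximity):
    """For each match, return the character window with partial edge words trimmed."""
    results = []
    for match in re.finditer(pattern, text):
        # Clamp the window to the bounds of the text.
        left = max(match.start() - proximity, 0)
        right = min(match.end() + proximity, len(text))
        window = text[left:right]
        # Keep everything from the first to the last non-word character.
        trimmed = re.search(r'\W.+\W', window, flags=re.DOTALL)
        # Assumed fallback: return the bare match if nothing can be trimmed.
        results.append(trimmed.group() if trimmed else match.group())
    return results

text = ' some random\twords 123 123 - 123 some other random words.'
print(whole_words_in_window(text, r'\b\d((\s*|\s*-\s*)\d){8}\b', 10))
# ['\twords 123 123 - 123 some ']
```

One thing to keep in mind is that a word at the edge of the window is dropped even when the limit happens to fall exactly on a word boundary, because the trimming always looks for a non-word character inside the window.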

Comments
  • The issue with using words (the way you described) is that the regex does not match just one word (it can also include spaces and tabs), which is why I mentioned that I know the start and the end of the match in the text. Thanks for the answer! (I upvoted)
  • Thanks! We can make regex work in many cases. Can you share the hardest case you are after, with its regex, and I will try to solve it.
  • OK, the hardest will be something like: my_text = ' some random words 123 123 - 123 some other random words.' regex = r'\b\d((\s*|\s*-\s*)\d) {9}\b)'
  • regex =r'\b\d((\s*|\s*-\s*)\d){8}\b'
  • Thank you for your answer! You are looking at words instead of characters! (upvote)
  • right_limit is used for slicing the text, so there is no need to subtract 1