ws4j returns infinity for similarity measures that should return 1

I have some very simple code taken from this example, in which I use the Lin, Path and Wu-Palmer similarity measures to compute the similarity between two words. My code is as follows:

import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.impl.Path;
import edu.cmu.lti.ws4j.impl.WuPalmer;

public class Test {
    private static ILexicalDatabase db = new NictWordNet();
    private static RelatednessCalculator lin = new Lin(db);
    private static RelatednessCalculator wup = new WuPalmer(db);
    private static RelatednessCalculator path = new Path(db);

    public static void main(String[] args) {
        String w1 = "walk";
        String w2 = "trot";
        System.out.println(lin.calcRelatednessOfWords(w1, w2));
        System.out.println(wup.calcRelatednessOfWords(w1, w2));
        System.out.println(path.calcRelatednessOfWords(w1, w2));
    }
}

And the scores are as expected EXCEPT when both words are identical. If both words are the same (e.g. w1 = "walk"; w2 = "walk";), the three measures I have should each return 1.0. But instead, they are returning 1.7976931348623157E308.

I have used ws4j before (the same version, in fact), but I have never seen this behavior. Searching online has not yielded any clues. What could possibly be going wrong here?

P.S. The fact that the Lin, Wu-Palmer and Path measures should return 1 can also be verified with the online demo provided by ws4j

I had a similar problem, and here's what's going on. I hope that other people who run into this problem will find my response helpful.

As you may have noticed, the online demo lets you choose a word sense by specifying the word in the following format: word#pos_tag#word_sense. For example, the noun gender with the first word sense would be gender#n#1.

Your code snippet uses the first word sense by default. When I calculate the WuPalmer similarity between "gender" and "sex" that way, it returns 0.26. If I use the online demo with the bare words, it returns 1.0. But if I enter "gender#n#1" and "sex#n#1", the online demo also returns 0.26, so there is no discrepancy. The online demo calculates the max over all pos tag / word sense pairs. Here's a corresponding snippet of code that should do the trick:

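import java.util.List;

import edu.cmu.lti.jawjaw.pobj.POS;
import edu.cmu.lti.lexical_db.ILexicalDatabase;
import edu.cmu.lti.lexical_db.NictWordNet;
import edu.cmu.lti.lexical_db.data.Concept;
import edu.cmu.lti.ws4j.Relatedness;
import edu.cmu.lti.ws4j.RelatednessCalculator;
import edu.cmu.lti.ws4j.impl.Lin;
import edu.cmu.lti.ws4j.util.WS4JConfiguration;

ILexicalDatabase db = new NictWordNet();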
WS4JConfiguration.getInstance().setMFS(true);
RelatednessCalculator rc = new Lin(db);
String word1 = "gender";
String word2 = "sex";
List<POS[]> posPairs = rc.getPOSPairs();
double maxScore = -1D;

for(POS[] posPair: posPairs) {
    List<Concept> synsets1 = (List<Concept>)db.getAllConcepts(word1, posPair[0].toString());
    List<Concept> synsets2 = (List<Concept>)db.getAllConcepts(word2, posPair[1].toString());

    for(Concept synset1: synsets1) {
        for (Concept synset2: synsets2) {
            Relatedness relatedness = rc.calcRelatednessOfSynset(synset1, synset2);
            double score = relatedness.getScore();
            if (score > maxScore) { 
                maxScore = score;
            }
        }
    }
}

if (maxScore == -1D) {
    maxScore = 0.0;
}

System.out.println("sim('" + word1 + "', '" + word2 + "') =  " + maxScore);

Also, this will give you a similarity of 0.0 for non-stemmed word forms, e.g. 'genders' and 'sex'. If needed, you can use the Porter stemmer included in ws4j to stem the words beforehand.

Hope this helps!

I had raised this issue at the googlecode site for ws4j, and it turns out that indeed it was a bug. The reply I received is as follows:

This looks like it is due to attempting to override a protected static field (this can't be done in Java). The attached patch fixes the issue by moving the definitions of the min and max fields to non-static final members in RelatednessCalculator and adding getters. Implementations then provide their min/max values through super constructor calls.

Patch can be applied with patch -p1 < 0001-Cannot-override-static-members-replacing-fields-with.patch

And here is the (now resolved) issue on their site.
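To make the quoted explanation concrete, here is a minimal sketch of why a redeclared static field has no effect on code in the parent class, and what the constructor-based fix looks like. All class names here are hypothetical and made up for illustration; this is not the actual ws4j source or the patch itself.

```java
// Hypothetical classes for illustration only; they are not the ws4j source.

// Broken variant: the subclass tries to "override" a static field.
class StaticCalculator {
    protected static double max = Double.MAX_VALUE;

    // Field access is resolved at compile time against StaticCalculator,
    // so this always reads StaticCalculator.max.
    double scoreForIdentical() {
        return max;
    }
}

class StaticLin extends StaticCalculator {
    // This declares a second, unrelated field that hides the parent's field.
    // Methods defined in StaticCalculator never see it.
    protected static double max = 1.0;
}

// Fixed variant: max is a final instance field set through the super constructor.
class FixedCalculator {
    private final double max;

    protected FixedCalculator(double max) {
        this.max = max;
    }

    double getMax() {
        return max;
    }

    double scoreForIdentical() {
        return max;
    }
}

class FixedLin extends FixedCalculator {
    FixedLin() {
        super(1.0); // each implementation supplies its own upper bound
    }
}

class FieldHidingDemo {
    public static void main(String[] args) {
        System.out.println(new StaticLin().scoreForIdentical()); // 1.7976931348623157E308
        System.out.println(new FixedLin().scoreForIdentical());  // 1.0
    }
}
```

The first call returns Double.MAX_VALUE even though StaticLin declares max = 1.0, which matches the symptom described in the question.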

Here is why -

In jcn we have...

sim(c1, c2) = 1 / distance(c1, c2)

distance(c1, c2) = ic(c1) + ic(c2) - (2 * ic(lcs(c1, c2)))

where c1, c2 are the two concepts, ic is the information content of the concept. lcs(c1, c2) is the least common subsumer of c1 and c2.

Now, we don't want distance to be 0 (=> similarity will become undefined).

distance can be 0 in 2 cases...

(1) ic(c1) = ic(c2) = ic(lcs(c1, c2)) = 0

ic(lcs(c1, c2)) can be 0 if the lcs turns out to be the root node (information content of the root node is zero). But since c1 and c2 can never be the root node, ic(c1) and ic(c2) would be 0 only if the 2 concepts have a 0 frequency count, in which case, for lack of data, we return a relatedness of 0 (similar to the lin case).

Note that the root node ACTUALLY has an information content of zero. Technically, none of the other concepts can have an information content value of zero. We assign concepts zero values, when in reality their information content is undefined (due to zero frequency counts). To see why look at the formula for information content: ic(c) = -log(freq(c)/freq(ROOT)) {log(0)? log(1)?}

(2) The second case that distance turns out to be zero is when...

ic(c1) + ic(c2) = 2 * ic(lcs(c1, c2))

(which could have a more likely special case ic(c1) = ic(c2) = ic(lcs(c1, c2)) if all three turn out to be the same concept.)

How should one handle this?

Intuitively this is the case of maximum relatedness (zero distance). For jcn this relatedness would be infinity... But we can't return infinity. And simply returning a 0 wouldn't work... since here we have found a pair of concepts with maximum relatedness, and returning a 0 would be like saying that they aren't related at all.
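The two zero-distance cases above can be sketched as follows. This is only an illustration of the formulas in this answer, not the library's implementation: the class and method names are made up, and using Double.MAX_VALUE as the stand-in for "maximum relatedness" is an assumption of the sketch (it is the value shown for JiangConrath on identical words in the output quoted elsewhere in this thread).

```java
// Hypothetical sketch of the jcn formulas above; not taken from any library.
class JcnSketch {

    // ic(c) = -log(freq(c) / freq(ROOT)), written as log(freq(ROOT) / freq(c)).
    // A zero frequency count makes ic undefined; following the answer above,
    // such concepts are assigned 0.
    static double ic(long freq, long rootFreq) {
        if (freq <= 0) {
            return 0.0;
        }
        return Math.log((double) rootFreq / freq);
    }

    // distance(c1, c2) = ic(c1) + ic(c2) - (2 * ic(lcs(c1, c2)))
    static double distance(double ic1, double ic2, double icLcs) {
        return ic1 + ic2 - 2.0 * icLcs;
    }

    static double jcn(double ic1, double ic2, double icLcs) {
        // Case (1): both concepts have no frequency data, so report 0.
        if (ic1 == 0.0 && ic2 == 0.0) {
            return 0.0;
        }
        double d = distance(ic1, ic2, icLcs);
        // Case (2): zero distance means maximum relatedness.
        // Assumption: signal it with Double.MAX_VALUE instead of dividing by zero.
        if (d == 0.0) {
            return Double.MAX_VALUE;
        }
        return 1.0 / d;
    }

    public static void main(String[] args) {
        double x = ic(10, 1000);
        System.out.println(jcn(0.0, 0.0, 0.0)); // case (1)
        System.out.println(jcn(x, x, x));       // case (2), same concept three times
        System.out.println(jcn(2.0, 3.0, 1.0)); // regular case, 1 / 3.0
    }
}
```

Note that the sketch only treats the situation where both concepts lack frequency data as case (1); what to do when just one of them does is not covered by the answer above.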

1.7976931348623157E308 is the value of Double.MAX_VALUE, but the scores of some similarity/relatedness algorithms (Lin, WuPalmer and Path) lie between 0 and 1. So for identical synsets, the maximum value that can be returned is 1. In the version in my repo (https://github.com/DonatoMeoli/WS4J) I fixed this and other bugs.

Now, for two identical words, the values returned are:

HirstStOnge      16.0
LeacockChodorow  1.7976931348623157E308
Lesk             1.7976931348623157E308
WuPalmer         1.0
Resnik           1.7976931348623157E308
JiangConrath     1.7976931348623157E308
Lin              1.0
Path             1.0
Done in 67 msec.

Process finished with exit code 0
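If you are stuck on an unpatched jar, a caller-side workaround is to cap the raw score at the measure's known upper bound. The following is a minimal sketch under that assumption; the helper name capScore is made up, and it is only appropriate for the bounded measures (Lin, WuPalmer, Path), not for the ones whose range is unbounded.

```java
// Hypothetical caller-side workaround; capScore is not part of ws4j.
class BoundedScore {

    // Cap a raw score at the measure's known maximum (1.0 for Lin, WuPalmer, Path).
    // Values below the cap, including any negative error score, pass through unchanged.
    static double capScore(double raw, double max) {
        return Math.min(raw, max);
    }

    public static void main(String[] args) {
        System.out.println(Double.MAX_VALUE);                // 1.7976931348623157E308
        System.out.println(capScore(Double.MAX_VALUE, 1.0)); // 1.0
        System.out.println(capScore(0.26, 1.0));             // 0.26
    }
}
```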

Comments
  • Hey, please tell me how to apply the patch to the ws4j jar I'm currently using.
  • How does this answer my question? I am specifically concerned with the odd behavior in three scores: Wu-Palmer, Path and Lin.
  • Oh sorry, it is the reason why jiang-conrath gives value of 1.7976931348623157E308 for same words. I didn't see the three methods. My mistake!
  • My whole confusion was about those three methods. JCN is supposed to go to infinity as similarity increases, while the three I mentioned have range [0,1].
  • While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review
  • This looks like it is due to attempting to override a protected static field (this can't be done in Java). In my code, I fixed the issue by moving the definitions of the min and max fields to non-static final members in RelatednessCalculator and adding getters. Implementations then provide their min/max values through super constructor calls.