String Similarity Library

Today, I submitted a new open source project to Google Code. It is a Java port of a string similarity library that I wrote years ago. The API is a service that calculates a distance or similarity score between two strings. A score of 0.0 means that the two strings are completely dissimilar, and 1.0 means that they are completely similar (or equal). Anything in between is a measure of how similar the two strings are.

Here is a simple example, using the Jaro-Winkler algorithm strategy:


SimilarityStrategy strategy = new JaroWinklerStrategy();
String target = "McDonalds";
String source = "MacMahons";
StringSimilarityService service = new StringSimilarityServiceImpl(strategy);
double score = service.score(source, target); // Score is 0.90
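
To give a rough idea of how a strategy plugs into the service, here is a minimal, self-contained sketch. It is not the library's actual code: the nested `SimilarityStrategy` interface only mirrors the name used above, its `score(String, String)` signature is an assumption, and `LevenshteinStrategy` and `SimpleSimilarityService` are hypothetical names made up for this illustration. It uses a normalized Levenshtein distance instead of Jaro-Winkler because it is shorter to show, so its scores will differ from the one in the example above.

```java
import java.util.Objects;

public class SimilaritySketch {

    /** Hypothetical strategy contract (assumed signature): returns a score in [0.0, 1.0]. */
    interface SimilarityStrategy {
        double score(String first, String second);
    }

    /** Illustrative strategy: 1 - (Levenshtein distance / length of the longer string). */
    static class LevenshteinStrategy implements SimilarityStrategy {
        @Override
        public double score(String first, String second) {
            int maxLength = Math.max(first.length(), second.length());
            if (maxLength == 0) {
                return 1.0; // two empty strings are treated as equal
            }
            return 1.0 - (double) distance(first, second) / maxLength;
        }

        /** Classic dynamic-programming edit distance, keeping only two rows. */
        private static int distance(String a, String b) {
            int[] previous = new int[b.length() + 1];
            int[] current = new int[b.length() + 1];
            for (int j = 0; j <= b.length(); j++) {
                previous[j] = j; // turning "" into b[0..j) takes j insertions
            }
            for (int i = 1; i <= a.length(); i++) {
                current[0] = i; // turning a[0..i) into "" takes i deletions
                for (int j = 1; j <= b.length(); j++) {
                    int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                    current[j] = Math.min(
                            Math.min(current[j - 1] + 1, previous[j] + 1),
                            previous[j - 1] + cost);
                }
                int[] swap = previous;
                previous = current;
                current = swap;
            }
            return previous[b.length()];
        }
    }

    /** Hypothetical service that simply delegates to whichever strategy it was given. */
    static class SimpleSimilarityService {
        private final SimilarityStrategy strategy;

        SimpleSimilarityService(SimilarityStrategy strategy) {
            this.strategy = Objects.requireNonNull(strategy);
        }

        double score(String source, String target) {
            return strategy.score(source, target);
        }
    }

    public static void main(String[] args) {
        SimpleSimilarityService service =
                new SimpleSimilarityService(new LevenshteinStrategy());
        System.out.println(service.score("McDonalds", "MacMahons"));
    }
}
```

The point of the split is that the service never needs to know which algorithm it is running, so swapping in a different strategy is a one-line change at construction time.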

The next steps for this project are to write documentation and to figure out where to host binary releases.


3 Responses to “String Similarity Library”


  1. julie, April 15, 2011 at 6:01 pm

    Hi Ralph,

    I’m Julie, a Talent Agent with Rosetta Marketing, LLC, among other things. I came across your profile on Linkedin and noticed we belong to the same group. I was hoping you might be able to spare a few minutes sometime in the near term to learn more about Rosetta and in particular, our consistent success in creating value for our clients and the concurrent growth in cutting edge, innovative employment opportunities and culture. In case you missed the article, check out this link about our recent move to support the revitalization of downtown and our exciting plans for the future http://www.cleveland.com/business/index.ssf/2010/01/digital_agency_rosetta_plans_t.html.

    In any event, it would be wonderful to make your acquaintance and learn more about your talents and aspirations. I noticed on your LI profile that you were interested in new opportunities.

    Feel free to call 216-325-6063 or email julie.coghlan@rosetta.com when you would like to chat.

    Regards,

    Julie

  2. Denise W. Charles, January 12, 2013 at 2:32 am

    That’s it. The rest of the code can be found attached to this post (and that includes some initial unit tests). As a side note, I know the use of Java collections is probably pretty confusing. I really do not like the Java collection API, but did not want to add any new dependencies (like Apache Commons Collections or the Google Collections).

    Conclusion

    The Cosine Similarity algorithm is useful for determining whether the composition of two strings is similar, and does not take the order of the elements within the strings into account. There are better, though more computationally expensive, algorithms for calculating a more accurate percentage of similarity. In later posts, I hope to introduce those algorithms and how they work.

    Thank you for your time.

    Richard

    Resources

    Eclipse Project: StringSimilarity.zip

    Works Cited

    Wikipedia contributors. “Cosine similarity.” Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 3 Dec. 2010. Web. 5 Dec. 2010.

    Wikipedia contributors. “Dot product.” Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 18 Nov. 2010. Web. 5 Dec. 2010.

    Wikipedia contributors. “Magnitude (mathematics).” Wikipedia, The Free Encyclopedia. Wikipedia, The Free Encyclopedia, 20 Nov. 2010. Web. 5 Dec. 2010.


  1. Trackback from Andi's Blog » String comparison in Java, March 25, 2012 at 10:02 am



