I remember that as an undergrad I had trouble splitting long pages of text into sentences. (This was for my final-year project, which was an NLP (Natural Language Processing) application.) Not realizing that such seemingly trivial functionality was already available, I wrote the splitting code by hand.
But recently, while browsing the java.text package, I found this purely awesome class, java.text.BreakIterator. It goes beyond simply finding breaks: it also takes Locale-specific differences between languages into account when locating them.
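It supports identifying four types of boundaries, each with its own static factory method:
- Line boundaries: BreakIterator.getLineInstance()
- Sentence boundaries: BreakIterator.getSentenceInstance()
- Word boundaries: BreakIterator.getWordInstance()
- Character boundaries: BreakIterator.getCharacterInstance()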
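To get a feel for how the four boundary types differ, here is a minimal sketch that runs the same text through each kind of iterator and prints the pieces between consecutive boundaries. The class name BoundaryKinds, the helper segments() and the sample text are my own, made up for illustration; only the BreakIterator calls come from the JDK.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class, not part of the JDK.
class BoundaryKinds {

    // Cuts the text into the pieces that lie between consecutive boundaries.
    static List<String> segments(BreakIterator it, String text) {
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "Hi there. Bye now.";   // sample input, chosen arbitrarily
        Locale locale = Locale.ENGLISH;       // assumption: English text

        System.out.println("line:      " + segments(BreakIterator.getLineInstance(locale), text));
        System.out.println("sentence:  " + segments(BreakIterator.getSentenceInstance(locale), text));
        System.out.println("word:      " + segments(BreakIterator.getWordInstance(locale), text));
        System.out.println("character: " + segments(BreakIterator.getCharacterInstance(locale), text));
    }
}
```

Note that the pieces include whitespace and punctuation, because the iterator reports every boundary rather than filtering for "real" words.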
Methods in this class, such as previous(), make it feel as if we are using an Iterator, but they all return an int: the position of a boundary in the text rather than the item itself (or BreakIterator.DONE once there are no more boundaries). There are some neat code samples in the documentation page. Do try it out!
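For the sentence-splitting problem I started with, a rough sketch could look like the following. SentenceSplitter, splitSentences() and boundariesBackward() are names I made up for this post; the code assumes the text fits in a single String and that trimming the whitespace around each sentence is acceptable.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class, not part of the JDK.
class SentenceSplitter {

    // Walks forward with first()/next() and turns each pair of
    // boundary positions into a substring.
    static List<String> splitSentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Walks backward with last()/previous() and collects the raw int
    // positions, to show that the iterator hands back offsets, not text.
    static List<Integer> boundariesBackward(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<Integer> positions = new ArrayList<>();
        for (int pos = it.last(); pos != BreakIterator.DONE; pos = it.previous()) {
            positions.add(pos);
        }
        return positions;
    }

    public static void main(String[] args) {
        String text = "Hello world. How are you? Fine!";   // sample input
        for (String s : splitSentences(text, Locale.ENGLISH)) {
            System.out.println(s);
        }
        System.out.println(boundariesBackward(text, Locale.ENGLISH));
    }
}
```

Because the iterator only hands back offsets, it is up to the caller to call substring() and decide what to do with surrounding whitespace.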