Python: Stemming and Lemmatization
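In natural language processing, we often encounter two or more words that share a common root. For example, agreed, agreeing, and agreeable all have the same root. A search involving any of these words should treat them as the same word, namely the root word. It is therefore important to link each word to its root. The NLTK library provides methods that perform this linking and output the root word.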
The following program uses the Porter stemming algorithm to find the stem of each word.
import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize needs the Punkt tokenizer data; download it once if it is missing
# (recent NLTK releases may ask for "punkt_tab" instead).
# nltk.download("punkt")

porter_stemmer = PorterStemmer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"

# First, split the sentence into word tokens
nltk_tokens = nltk.word_tokenize(word_data)

# Next, find the stem of each token
for w in nltk_tokens:
    print("Actual: %s Stem: %s" % (w, porter_stemmer.stem(w)))
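Running the example code above produces the following output. Recent NLTK versions lowercase the stem by default, so the first line may show it rather than It.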
Actual: It Stem: It
Actual: originated Stem: origin
Actual: from Stem: from
Actual: the Stem: the
Actual: idea Stem: idea
Actual: that Stem: that
Actual: there Stem: there
Actual: are Stem: are
Actual: readers Stem: reader
Actual: who Stem: who
Actual: prefer Stem: prefer
Actual: learning Stem: learn
Actual: new Stem: new
Actual: skills Stem: skill
Actual: from Stem: from
Actual: the Stem: the
Actual: comforts Stem: comfort
Actual: of Stem: of
Actual: their Stem: their
Actual: drawing Stem: draw
Actual: rooms Stem: room
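As a quick check of the opening example, the same stemmer can be applied to agreed, agreeing, and agreeable. This is only a sketch, and the word list is an illustrative assumption. A stem is not guaranteed to be a dictionary word, and related forms do not always collapse to exactly the same stem, so it is worth inspecting the output rather than assuming it.

```python
from nltk.stem.porter import PorterStemmer

# Illustrative word list taken from the opening example
words = ["agreed", "agreeing", "agreeable"]

stemmer = PorterStemmer()

# Map each word to its Porter stem (no tokenizer data is required here)
stems = {w: stemmer.stem(w) for w in words}

for word, stem in stems.items():
    print("Actual: %s Stem: %s" % (word, stem))
```

Because the stemmer works on one word at a time, it can be called directly on a list like this without tokenizing a sentence first.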
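Lemmatization is similar to stemming, but it brings context to the word, such as its part of speech, and returns a dictionary form (the lemma) rather than a truncated stem. The different inflected forms of a word are thereby linked to one word; for example, if a paragraph contains both car and cars, both are linked to car. The following program uses the WordNet lexical database for lemmatization.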
import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer needs the WordNet data; download it once if it is missing.
# nltk.download("wordnet")

wordnet_lemmatizer = WordNetLemmatizer()
word_data = "It originated from the idea that there are readers who prefer learning new skills from the comforts of their drawing rooms"
nltk_tokens = nltk.word_tokenize(word_data)

# Look up the lemma of each token (treated as a noun by default)
for w in nltk_tokens:
    print("Actual: %s Lemma: %s" % (w, wordnet_lemmatizer.lemmatize(w)))
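Running the code above produces the following result. Because lemmatize treats every word as a noun unless told otherwise, verb forms such as originated and learning are left unchanged.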
Actual: It Lemma: It
Actual: originated Lemma: originated
Actual: from Lemma: from
Actual: the Lemma: the
Actual: idea Lemma: idea
Actual: that Lemma: that
Actual: there Lemma: there
Actual: are Lemma: are
Actual: readers Lemma: reader
Actual: who Lemma: who
Actual: prefer Lemma: prefer
Actual: learning Lemma: learning
Actual: new Lemma: new
Actual: skills Lemma: skill
Actual: from Lemma: from
Actual: the Lemma: the
Actual: comforts Lemma: comfort
Actual: of Lemma: of
Actual: their Lemma: their
Actual: drawing Lemma: drawing
Actual: rooms Lemma: room