Expanding English language contractions in Python

The English language has a couple of contractions. For instance:

you've -> you have
he's -> he is

These can sometimes cause headaches when you are doing natural language processing. Is there a Python library that can expand these contractions?

26
Maarten

I turned the Wikipedia contraction-to-expansion page into a Python dictionary (see below).

Note that you will definitely want to use double quotes when querying the dictionary, since the keys themselves contain an apostrophe.

I also kept multiple options where the Wikipedia page lists them. Feel free to modify it as you wish. Note that disambiguating to the right expansion could be a tricky problem!

contractions = { 
"ain't": "am not / are not / is not / has not / have not",
"aren't": "are not / am not",
"can't": "cannot",
"can't've": "cannot have",
"'cause": "because",
"could've": "could have",
"couldn't": "could not",
"couldn't've": "could not have",
"didn't": "did not",
"doesn't": "does not",
"don't": "do not",
"hadn't": "had not",
"hadn't've": "had not have",
"hasn't": "has not",
"haven't": "have not",
"he'd": "he had / he would",
"he'd've": "he would have",
"he'll": "he shall / he will",
"he'll've": "he shall have / he will have",
"he's": "he has / he is",
"how'd": "how did",
"how'd'y": "how do you",
"how'll": "how will",
"how's": "how has / how is / how does",
"I'd": "I had / I would",
"I'd've": "I would have",
"I'll": "I shall / I will",
"I'll've": "I shall have / I will have",
"I'm": "I am",
"I've": "I have",
"isn't": "is not",
"it'd": "it had / it would",
"it'd've": "it would have",
"it'll": "it shall / it will",
"it'll've": "it shall have / it will have",
"it's": "it has / it is",
"let's": "let us",
"ma'am": "madam",
"mayn't": "may not",
"might've": "might have",
"mightn't": "might not",
"mightn't've": "might not have",
"must've": "must have",
"mustn't": "must not",
"mustn't've": "must not have",
"needn't": "need not",
"needn't've": "need not have",
"o'clock": "of the clock",
"oughtn't": "ought not",
"oughtn't've": "ought not have",
"shan't": "shall not",
"sha'n't": "shall not",
"shan't've": "shall not have",
"she'd": "she had / she would",
"she'd've": "she would have",
"she'll": "she shall / she will",
"she'll've": "she shall have / she will have",
"she's": "she has / she is",
"should've": "should have",
"shouldn't": "should not",
"shouldn't've": "should not have",
"so've": "so have",
"so's": "so as / so is",
"that'd": "that would / that had",
"that'd've": "that would have",
"that's": "that has / that is",
"there'd": "there had / there would",
"there'd've": "there would have",
"there's": "there has / there is",
"they'd": "they had / they would",
"they'd've": "they would have",
"they'll": "they shall / they will",
"they'll've": "they shall have / they will have",
"they're": "they are",
"they've": "they have",
"to've": "to have",
"wasn't": "was not",
"we'd": "we had / we would",
"we'd've": "we would have",
"we'll": "we will",
"we'll've": "we will have",
"we're": "we are",
"we've": "we have",
"weren't": "were not",
"what'll": "what shall / what will",
"what'll've": "what shall have / what will have",
"what're": "what are",
"what's": "what has / what is",
"what've": "what have",
"when's": "when has / when is",
"when've": "when have",
"where'd": "where did",
"where's": "where has / where is",
"where've": "where have",
"who'll": "who shall / who will",
"who'll've": "who shall have / who will have",
"who's": "who has / who is",
"who've": "who have",
"why's": "why has / why is",
"why've": "why have",
"will've": "will have",
"won't": "will not",
"won't've": "will not have",
"would've": "would have",
"wouldn't": "would not",
"wouldn't've": "would not have",
"y'all": "you all",
"y'all'd": "you all would",
"y'all'd've": "you all would have",
"y'all're": "you all are",
"y'all've": "you all have",
"you'd": "you had / you would",
"you'd've": "you would have",
"you'll": "you shall / you will",
"you'll've": "you shall have / you will have",
"you're": "you are",
"you've": "you have"
}
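
As a rough sketch of how such a dictionary might be queried: the helper below (`candidate_expansions` is a hypothetical name, not part of the answer) looks a word up using a double-quoted key and splits the " / "-separated value into a list of candidates. It only uses a three-entry excerpt of the dictionary so that it runs on its own, and it does not attempt to pick the right candidate.

```python
# Small excerpt of the dictionary above, so the sketch runs on its own.
contractions = {
    "he's": "he has / he is",
    "I'm": "I am",
    "won't": "will not",
}


def candidate_expansions(word, table=contractions):
    """Return every listed expansion for `word`, or [word] if it is unknown."""
    # Normalise the typographic apostrophe (U+2019) to a plain one first.
    key = word.replace("\u2019", "'")
    if key not in table:
        return [word]
    # Ambiguous entries are stored as "option / option / ...".
    return [option.strip() for option in table[key].split("/")]


print(candidate_expansions("he's"))   # ['he has', 'he is']
print(candidate_expansions("won't"))  # ['will not']
print(candidate_expansions("cat"))    # ['cat']
```

Choosing between the candidates still needs context, which is the hard part mentioned above.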
43
arturomp

You don't need a library; it can be done with a regex, for example.

>>> import re
>>> contractions_dict = {
...     'didn\'t': 'did not',
...     'don\'t': 'do not',
... }
>>> contractions_re = re.compile('(%s)' % '|'.join(contractions_dict.keys()))
>>> def expand_contractions(s, contractions_dict=contractions_dict):
...     def replace(match):
...         return contractions_dict[match.group(0)]
...     return contractions_re.sub(replace, s)
...
>>> expand_contractions('You don\'t need a library')
'You do not need a library'
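
A possible variation on the snippet above, offered as a sketch rather than as part of the original answer: it escapes the keys, tries longer keys first so that "can't've" is not cut short at "can't", anchors matches at word boundaries, and matches case-insensitively. It assumes that every key is lowercase and starts and ends with a letter (so an entry like "'cause" would need separate handling).

```python
import re

# Assumption: all keys are lowercase and begin and end with a letter.
contractions_dict = {
    "can't": "cannot",
    "can't've": "cannot have",
    "didn't": "did not",
    "don't": "do not",
}

# Longest keys first, so "can't've" is tried before "can't".
_keys = sorted(contractions_dict, key=len, reverse=True)
contractions_re = re.compile(
    r"\b(%s)\b" % "|".join(re.escape(k) for k in _keys),
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    def replace(match):
        found = match.group(0)
        expanded = contractions_dict[found.lower()]
        # Preserve a leading capital letter ("Don't" -> "Do not").
        if found[0].isupper():
            expanded = expanded[0].upper() + expanded[1:]
        return expanded

    return contractions_re.sub(replace, text)


print(expand_contractions("Don't worry, you can't've missed it"))
# Do not worry, you cannot have missed it
```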
15
alko

The answers above will work perfectly well and could be better suited to ambiguous contractions (although I don't think there are that many ambiguous cases). I would use something more readable and easier to maintain:

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won\'t", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)

    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase


test = "Hey I'm Yann, how're you and how's it going ? That's interesting: I'd love to hear more about it."
print(decontracted(test))
# Hey I am Yann, how are you and how is it going ? That is interesting: I would love to hear more about it.

It might have some flaws I didn't think of.
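
To make one such flaw concrete, here is a small sketch that restates the substitutions of decontracted() as a table (the `RULES` name is only for this illustration) and runs them on sentences where `'s` is a possessive or stands for "has", and where `'d` stands for "had". The generic rules cannot tell these cases apart.

```python
import re

# The same substitutions as decontracted() above, written as a table.
RULES = [
    (r"won\'t", "will not"),
    (r"can\'t", "can not"),
    (r"n\'t", " not"),
    (r"\'re", " are"),
    (r"\'s", " is"),
    (r"\'d", " would"),
    (r"\'ll", " will"),
    (r"\'t", " not"),
    (r"\'ve", " have"),
    (r"\'m", " am"),
]


def decontracted(phrase):
    # Apply every rule in order, exactly like the chain of re.sub calls.
    for pattern, replacement in RULES:
        phrase = re.sub(pattern, replacement, phrase)
    return phrase


print(decontracted("It's a man's world"))         # It is a man is world
print(decontracted("She said she'd gone."))       # She said she would gone.
print(decontracted("Catherine's been thinking"))  # Catherine is been thinking
```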

Reposted from my other answer

5
Yann Dubois

This is a very cool and easy-to-use library for this purpose: https://pypi.python.org/pypi/pycontractions/1.0.1

Usage example (detailed in the link):

from pycontractions import Contractions

# Load your favorite Word2vec model
cont = Contractions('GoogleNews-vectors-negative300.bin')

# optional, prevents loading on first expand_texts call
cont.load_models()

out = list(cont.expand_texts(["I'd like to know how I'd done that!",
                            "We're going to the Zoo and I don't think I'll be home for dinner.",
                            "Theyre going to the Zoo and she'll be home for dinner."], precise=True))
print(out)

You will also need GoogleNews-vectors-negative300.bin; a download link is given on the pycontractions page linked above. The example code is in Python 3.

4
Joe9008

I would like to add a little to alko's answer here. If you check Wikipedia, the number of English contractions listed there is less than 100. Granted, in real scenarios that number could be higher, but I am pretty sure that 200-300 entries are all you will need for English contractions. Do you really want a separate library for that? (I don't think what you are looking for actually exists, though.) You can easily solve this problem with a dictionary and a regex. I would recommend using a nice tokenizer such as the one in the Natural Language Toolkit, and the rest should be no problem to implement yourself.
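
The tokenizer-plus-dictionary idea could look roughly like the sketch below. To keep it free of dependencies it uses a small regex as a stand-in tokenizer instead of NLTK; that swap is deliberate, because NLTK's default `word_tokenize` splits "don't" into "do" and "n't", so a whole-word lookup needs a tokenizer that keeps contractions together. The names `CONTRACTIONS`, `TOKEN_RE` and `expand_tokens` are made up for this illustration.

```python
import re

# Illustrative excerpt; keys are lowercase so the lookup can ignore case.
CONTRACTIONS = {
    "don't": "do not",
    "i'll": "I will",
    "we're": "we are",
}

# Stand-in tokenizer: a word (with inner apostrophes) or one non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)*|\S")


def expand_tokens(text):
    """Tokenize `text` and replace each known contraction by its words."""
    expanded = []
    for token in TOKEN_RE.findall(text):
        replacement = CONTRACTIONS.get(token.lower())
        if replacement is None:
            expanded.append(token)
            continue
        # Keep the capitalisation of a sentence-initial contraction.
        if token[0].isupper():
            replacement = replacement[0].upper() + replacement[1:]
        expanded.extend(replacement.split())
    return expanded


print(expand_tokens("We're going and I don't think I'll be home."))
# ['We', 'are', 'going', 'and', 'I', 'do', 'not', 'think', 'I', 'will', 'be', 'home', '.']
```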

3

I found a library for this, contractions. It is very simple.

import contractions
print(contractions.fix("you've"))
print(contractions.fix("he's"))

Output:

you have
he is
2
Hammad Hassan

Although this is an old question, I figured I might as well answer, since as far as I can see there is still no real solution to it.

I had to work on this for a related NLP project, and I decided to tackle the problem since there didn't seem to be anything here. You can check out my expander GitHub repository if you are interested.

It is a fairly badly optimized (I think) program based on NLTK, the Stanford Core NLP models (which you will have to download separately), and the dictionary in the previous answer. All the necessary information should be in the README and the lavishly commented code. I know commented code is dead code, but this is just how I write to keep things clear for myself.

The example input in expander.py is the following set of sentences:

    ["I won't let you get away with that",  # won't ->  will not
    "I'm a bad person",  # 'm -> am
    "It's his cat anyway",  # 's -> is
    "It's not what you think",  # 's -> is
    "It's a man's world",  # 's -> is and 's possessive
    "Catherine's been thinking about it",  # 's -> has
    "It'll be done",  # 'll -> will
    "Who'd've thought!",  # 'd -> would, 've -> have
    "She said she'd go.",  # she'd -> she would
    "She said she'd gone.",  # she'd -> had
    "Y'all'd've a great time, wouldn't it be so cold!", # Y'all'd've -> You all would have, wouldn't -> would not
    " My name is Jack.",   # No replacements.
    "'Tis questionable whether Ma'am should be going.", # 'Tis -> it is, Ma'am -> madam
    "As history tells, 'twas the night before Christmas.", # 'Twas -> It was
    "Martha, Peter and Christine've been indulging in a menage-à-trois."] # 've -> have

To which the output is

    ["I will not let you get away with that",
    "I am a bad person",
    "It is his cat anyway",
    "It is not what you think",
    "It is a man's world",
    "Catherine has been thinking about it",
    "It will be done",
    "Who would have thought!",
    "She said she would go.",
    "She said she had gone.",
    "You all would have a great time, would not it be so cold!",
    "My name is Jack.",
    "It is questionable whether Madam should be going.",
    "As history tells, it was the night before Christmas.",
    "Martha, Peter and Christine have been indulging in a menage-à-trois."]

For this small set of test sentences, which I came up with to cover some edge cases, it works well.
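
To show why pairs like "she'd go" and "she'd gone" need context, here is a deliberately naive sketch. It is not how the expander repository works (that relies on NLTK and the Stanford models); it is a toy heuristic that picks "had" when the next word looks like a past participle and "would" otherwise. The participle list and all names are assumptions made for this illustration, and the "ends in -ed" test misfires on words such as "need".

```python
import re

# Tiny, hand-picked list of irregular past participles; purely illustrative.
IRREGULAR_PARTICIPLES = {"been", "done", "gone", "seen", "taken", "known"}


def looks_like_participle(word):
    # Crude test: known irregular form, or a regular "-ed" ending.
    word = word.lower()
    return word in IRREGULAR_PARTICIPLES or word.endswith("ed")


# A word followed by 'd, whitespace, and the next word.
D_RE = re.compile(r"\b(\w+)'d\s+(\w+)")


def expand_d(sentence):
    def replace(match):
        subject, next_word = match.group(1), match.group(2)
        auxiliary = "had" if looks_like_participle(next_word) else "would"
        return f"{subject} {auxiliary} {next_word}"

    return D_RE.sub(replace, sentence)


print(expand_d("She said she'd go."))    # She said she would go.
print(expand_d("She said she'd gone."))  # She said she had gone.
```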

Since this project has lost importance for me for now, I am no longer actively developing it. Any help with it would be appreciated. The things to be done are written in the TODO list. And if you have any tips on improving my Python, I would be very thankful as well.

0
Yannick