Author: rod24574575 (天然呆)
Board: NTU-Exam
Title: [試題] 103下 陳信希 自然語言處理


Course name: Natural Language Processing (自然語言處理)
Course type: Elective
Instructor: 陳信希 (Hsin-Hsi Chen)
College: College of Electrical Engineering and Computer Science
Departments: Graduate Institute of Computer Science and Information Engineering, Graduate Institute of Networking and Multimedia, Knowledge Management Program
Exam date (Y/M/D): 2015.04.23
Time limit (minutes): 180 minutes

Exam questions:
Natural Language Processing Mid-Term Exam

2015/4/23

Problem 1
What is the prepositional phrase attachment problem in natural language
processing? (5 points)

Answer: (See Lec.1 p.49. See Lec.2 p.11, 12.)
In natural language processing, there are many issues involving language
ambiguity, such as lexical ambiguity and structural ambiguity. PP attachment
is one source of structural ambiguity: it refers to the circumstance where a
prepositional phrase (PP) can attach to sentences, verb phrases, noun phrases,
or adjectival phrases. For example:

I observed the boy with a telescope.

Here [with a telescope] is a PP and there are two possible attachment choices:
(1) The PP can attach to [the boy] to form an NP first, which then combines
    with [observed] to form a VP (the boy holds the telescope).
(2) [observed] and [the boy] form a VP first, and the PP attaches to this VP
    to form a new VP (the telescope is the instrument of observing).

Note that: if you mention "ambiguity" or "ambiguous", you already get one
point. If you mention "belong to" or "attachment" but your explanation is
incomplete, you may get an extra one to two points. Finally, the completeness
of your answer is worth two points. Giving a proper example usually indicates
a good understanding of the PP attachment problem and can therefore earn you
a higher score.


Problem 2
Non-local dependency is a challenge for the n-gram approach. Please explain
why. (5 points)

Answer: (See Lec.2 p.18. See Lec.4 p.16, 17, 18.)
A non-local dependency is an instance in which two words are syntactically
dependent even though they occur far apart in a sentence (e.g., subject-verb
agreement; long-distance dependencies such as wh-extraction). An n-gram model,
however, only considers the previous n-1 words because of the Markov
assumption. If n is not large enough, the model cannot capture these
linguistic phenomena, and if n is too large, the number of possible n-grams
grows exponentially, so reliable estimation becomes impossible.

Note that: the definition of non-local dependency is worth two points, the
explanation of why it challenges the n-gram approach is worth two points, and
the completeness of the answer is worth one point.
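As an illustration of the answer above (my own sketch with a made-up corpus
and a hypothetical train_bigram helper, not part of the exam), the toy bigram
model below conditions each word only on its immediate predecessor, so it
prefers "is" after "cabinet" even when the true subject is the plural "keys".

from collections import defaultdict

def train_bigram(corpus):
    """Maximum-likelihood estimate of P(w_i | w_{i-1}) from a toy corpus."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

corpus = ["the keys to the cabinet are on the table",
          "the cabinet is open",
          "the cabinet is locked"]
model = train_bigram(corpus)

# The dependency keys -> are spans four words, but the bigram model only sees
# the context "cabinet", so it assigns "is" a higher probability than "are".
print(model["cabinet"].get("is", 0.0), model["cabinet"].get("are", 0.0))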


Problem 3
Please use the sentence, "the boys grill their catches on an open fire", to
explain the semantic role labelling problem. (5 points).

Answer:
Semantic role labeling, sometimes also called shallow semantic parsing, is a
task in natural language processing consisting of the detection of the semantic
arguments associated with the predicate or verb of a sentence and their
classification into their specific roles. For example:

[Cook the boys] ... grill [Food their catches] [Heating_instrument on an
open fire].


Problem 4
Please name 3 linguistic resources. (5 points)

Answer:
Corpus
1. Academia Sinica Balanced Corpus
2. Penn Chinese Treebank
3. Sinica Treebank
4. Sinica Bilingual WordNet
5. WordNet
6. FrameNet
7. DBpedia
8. etc...
or Tools
1. NLTK
2. Jieba Chinese Segmenter
3. Stanford parser
4. CKIP parser
5. etc...


Problem 5
Pointwise mutual information (PMI) is defined as follows:
    I(x, y) = log( P(x, y) / (P(x) P(y)) )
(a) Please describe the association between x and y by PMI. (5 points)
(b) In which case is PMI arguable? Please describe how to deal with this
problem. (5 points)

Answer: (See Lec.3 p.52, 61, 63, 64, 65, 66, 69, 70.)
(a) If x and y are independent, PMI is 0; if x and y are positively
correlated, PMI is greater than 0; and if they are negatively correlated,
PMI is less than 0.

Note that: if you only mention the perfectly independent and perfectly
dependent cases, you can only get three to four points, depending on the
completeness of your answer.

(b) When x and y are perfectly dependent (P(x, y) = P(x)):
    I(x, y) = log( P(x, y) / (P(x)P(y)) ) = log( P(x) / (P(x)P(y)) ) = log( 1 / P(y) )
As perfectly dependent bigrams get rarer, their mutual information increases,
so PMI is a bad measure of dependence: the score depends on the frequency of
the individual words. Solution: weight PMI by the word count (frequency),
i.e., PMI * count.

Note that: entropy, smoothing and the log-likelihood ratio are also regarded
as correct alternatives. But the χ^2 test gets a one-point reduction, since
likelihood ratios are more appropriate for sparse data than the χ^2 test
according to the slides (Lec.3 p.52).
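A minimal numerical sketch (my own illustration with hypothetical counts, not
from the slides) of the problem in (b): a perfectly dependent but rare pair
receives a higher PMI than a frequent one, and weighting by frequency
restores the intuitive ordering.

import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information, log2( P(x,y) / (P(x)P(y)) )."""
    p_xy = count_xy / n
    p_x, p_y = count_x / n, count_y / n
    return math.log2(p_xy / (p_x * p_y))

N = 1_000_000                       # hypothetical corpus size
print(pmi(100, 100, 100, N))        # frequent, perfectly dependent pair
print(pmi(2, 2, 2, N))              # rare, perfectly dependent pair: higher PMI
print(pmi(100, 100, 100, N) * 100,  # frequency-weighted scores put the
      pmi(2, 2, 2, N) * 2)          # frequent pair back on top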


Problem 6
The log-likelihood ratio is one of the measures for relation discovery. Please
define it first and then use it to determine the polarity of a potential
opinionated word in a corpus for a specific domain such as hotels and
restaurants. (10 points)

Answer: (See Lec.3 p.52, 53, 54, 55, 56 for the first part of the answer.)
To determine the polarity of a potential opinionated word, we first choose
some seed words with known polarity, e.g. nice, bad. Afterwards, we compute
the log-likelihood ratio between each potential word and the seed words. If a
candidate is highly associated with seed words of one specific polarity, the
candidate is likely to have the same polarity as those words.

Alternative Answer: (See Lec.3 p.59, 60 for the first part of the answer.)
To determine the polarity of a potential opinionated word, we first divide
the hotel and restaurant corpus into two sets according to polarity: a
positive set and a negative set. Afterwards, we compute the likelihood ratios
for the potential word between the two corpora. Words with extreme likelihood
ratios are likely to have the same polarity as the corpus in which they occur
more often.
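A minimal sketch (with assumed counts, not from the slides) of Dunning's
log-likelihood ratio applied to the first approach: score how strongly a
candidate word co-occurs with a polarity seed word, where c12 is the joint
count, c1 the seed count, c2 the candidate count, and n the number of
co-occurrence windows; all numbers below are hypothetical.

import math

def log_l(k, n, x):
    """Log-likelihood of k successes in n trials under binomial parameter x."""
    x = min(max(x, 1e-12), 1 - 1e-12)          # guard against log(0)
    return k * math.log(x) + (n - k) * math.log(1 - x)

def llr(c12, c1, c2, n):
    """Dunning's log-likelihood ratio for the association of two events."""
    p = c2 / n                                  # H0: same rate everywhere
    p1 = c12 / c1                               # H1: rate near the seed word
    p2 = (c2 - c12) / (n - c1)                  # H1: rate elsewhere
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, n - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, n - c1, p))

# Hypothetical counts: the candidate appears in 300 of 50,000 windows, the
# seed "nice" in 500, and they co-occur in 40. A large LLR with positive
# seeds (relative to negative seeds) suggests positive polarity.
print(llr(c12=40, c1=500, c2=300, n=50_000))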


Problem 7
In authorship verification, you have to answer whether a given document was
written by a specific author. Please describe how language modelling can be
employed in this application. (5 points)

Answer: (See Lec.4 p.30, 31, 40, 41, 42. See Lec.3 p.29, 51.)
To determine whether a document was written by a specific author, we can take
either of two approaches:
(1) Train a language model on the author's previous works and use it to score
    the given document (e.g., by generating it as in Shannon's method). If
    the probability is higher than a certain threshold, or equivalently the
    perplexity is low enough, the document is verified as written by that
    author.
(2) Alternatively, train two language models, one on the author's previous
    works and the other on the given document, and then compare the distance
    between the two distributions using a measure such as KL-divergence, or a
    statistical test such as the chi-square test or t-test, to decide whether
    the two distributions are statistically different from each other.
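A minimal sketch of approach (1), using toy texts of my own and add-one
smoothing as an assumed modelling choice: train a unigram model on the
author's known works and score the disputed document by perplexity, to be
compared against a threshold tuned on held-out data.

import math
from collections import Counter

def train_unigram(text, vocab):
    """Add-one-smoothed unigram model P(w) over a fixed vocabulary."""
    counts = Counter(text.split())
    total = sum(counts.values())
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def perplexity(model, text):
    tokens = text.split()
    log_prob = sum(math.log2(model[w]) for w in tokens)
    return 2 ** (-log_prob / len(tokens))

known_works = "call me ishmael some years ago never mind how long"
disputed    = "never mind how long ago"
vocab = set((known_works + " " + disputed).split())

author_lm = train_unigram(known_works, vocab)
print(perplexity(author_lm, disputed))   # compare against a tuned threshold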


Problem 8
Please discuss why maximum likelihood estimate is not suitable for language
modelling. (5 points)

Answer:
MLE is usually unsuitable for NLP because of the sparseness of the data.
For example, if we use a trigram model with a 60,000-word vocabulary, there
are 60,000^3 = 2.16 × 10^14 possible trigrams, but even a large training set
may contain only about 200 million words. Many trigrams will therefore never
be observed in the training data, and MLE gives them zero probability.
However, many of these trigrams are reasonable sequences that have a small
but non-zero probability of occurring in the test data.
Also, when we estimate the probability of a given sentence, we multiply the
probabilities of all its trigrams, so if one trigram has zero probability,
the whole sentence gets zero probability as well.
Thus, we usually use smoothing to redistribute probability mass to unseen
events.
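A tiny sketch (with made-up counts and a hypothetical mle_trigram helper) of
the second point: under MLE, a single unseen trigram forces the whole
sentence probability to zero.

from collections import defaultdict

# Hypothetical trigram and bigram counts from a toy training corpus.
trigram_counts = defaultdict(int, {("i", "like", "green"): 3,
                                   ("like", "green", "tea"): 2})
bigram_counts = defaultdict(int, {("i", "like"): 5, ("like", "green"): 3})

def mle_trigram(w1, w2, w3):
    """MLE estimate P(w3 | w1 w2) = C(w1 w2 w3) / C(w1 w2); 0 if unseen."""
    context = bigram_counts[(w1, w2)]
    return trigram_counts[(w1, w2, w3)] / context if context else 0.0

sentence = ["i", "like", "green", "eggs"]
prob = 1.0
for w1, w2, w3 in zip(sentence, sentence[1:], sentence[2:]):
    prob *= mle_trigram(w1, w2, w3)   # ("like", "green", "eggs") is unseen
print(prob)                           # 0.0 -- hence the need for smoothing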


Problem 9
Assume you are fishing and you have caught 10 carp, 3 perch, 2 whitefish,
2 catfish, 1 trout, 1 salmon, and 1 eel.
(a) How likely is it that the next fish to be caught is a catfish if the
    Good-Turing method is adopted? (3 points)
(b) How likely is it that the next fish caught will be a member of a newly
    seen species if the Good-Turing method is adopted? (4 points)
(c) Assume bass is the only new species. How likely is it that the next
    fish to be caught is a bass if Laplace smoothing is adopted? (3 points)

Answer:
(a) c*_(GT) = (c + 1) * N_(c+1) / N_c = (2 + 1) * 1/2 = 1.5, so
    Prob = 1.5/20 = 0.075. (See Lec.5 p.41, 42.)
(b) 3/20, since three species (trout, salmon, eel) were seen exactly once, so
    P(unseen) = N_1 / N = 3/20. (See Lec.5 p.45. "To estimate total number of
    unseen species...")
(c) 1/28 = (0 + 1) / (20 + 8), with a vocabulary of 8 species (the 7 seen
    species plus bass). (See Lec.5 p.20.)
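The short sketch below simply re-derives these three numbers from the catch
counts, using Good-Turing counts of counts for (a) and (b) and add-one
smoothing for (c).

from collections import Counter

catches = {"carp": 10, "perch": 3, "whitefish": 2, "catfish": 2,
           "trout": 1, "salmon": 1, "eel": 1}
n = sum(catches.values())                 # 20 fish in total
freq_of_freq = Counter(catches.values())  # N_c: number of species seen c times

# (a) Good-Turing adjusted count for catfish (seen c = 2 times):
#     c* = (c + 1) * N_(c+1) / N_c = 3 * 1 / 2 = 1.5
c = catches["catfish"]
c_star = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
print(c_star / n)                         # 1.5 / 20 = 0.075

# (b) Probability mass reserved for unseen species: N_1 / N = 3 / 20
print(freq_of_freq[1] / n)                # 0.15

# (c) Laplace smoothing with bass as the only new species (vocabulary of 8):
#     P(bass) = (0 + 1) / (20 + 8) = 1 / 28
vocab_size = len(catches) + 1
print((0 + 1) / (n + vocab_size))         # ~0.0357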


Problem 10
Forward probability and backward probability for an HMM, λ(A, B, π), are
defined recursively as follows, respectively.
(a) Please specify how to compute the probability of a sequence,
w_1, w_2, ..., w_T, without enumerating all the paths to generate the
sequence. (5 points)
(b) Please specify how to select the most probable path without enumerating
all the paths to generate the sequence. (5 points)
(c) Assume s_i and s_j are two states in an HMM. Given an observation sequence,
w_1, w_2, ..., w_T, please specify how to compute the count transiting
from s_i to s_j without enumerating all the paths to generate the sequence.
(5 points)

Answer:
(a) Use the forward algorithm.
    (1) Initialization:
        α_1(j) = a_(0j) * b_j(o_1);  1≦j≦N
    (2) Recursion:
        α_t(j) = Σ_{i=1}^{N} α_(t-1)(i) * a_(ij) * b_j(o_t);  1≦j≦N, 1<t≦T
    (3) Termination:
        P(O|λ) = α_T(q_F) = Σ_{i=1}^{N} α_T(i) * a_(iF)
    The sequence probability is P(O|λ); each α_t(j) already sums over all
    partial paths ending in state j at time t, so no explicit path
    enumeration is needed.
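A minimal sketch of (a) on a hypothetical two-state HMM (the parameters below
are mine, not from the exam); for simplicity it terminates by summing the
final alphas instead of using an explicit final state q_F.

def forward(obs, states, start_p, trans_p, emit_p):
    """Return P(observation sequence) under the HMM (start_p, trans_p, emit_p)."""
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]   # initialization
    for t in range(1, len(obs)):                                    # recursion
        alpha.append({j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states)
                         * emit_p[j][obs[t]]
                      for j in states})
    return sum(alpha[-1][s] for s in states)                        # termination

# Hypothetical two-state weather HMM with ice-cream-count observations.
states = ("Hot", "Cold")
start_p = {"Hot": 0.8, "Cold": 0.2}
trans_p = {"Hot": {"Hot": 0.7, "Cold": 0.3}, "Cold": {"Hot": 0.4, "Cold": 0.6}}
emit_p = {"Hot": {"1": 0.2, "2": 0.4, "3": 0.4},
          "Cold": {"1": 0.5, "2": 0.4, "3": 0.1}}
print(forward(("3", "1", "3"), states, start_p, trans_p, emit_p))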

(b) Use the Viterbi algorithm.
    (1) Initialization:
        v_1(j) = a_(0j) * b_j(o_1);  1≦j≦N
        bt_1(j) = 0
    (2) Recursion:
        v_t(j) = max_{1≦i≦N} [v_(t-1)(i) * a_(ij) * b_j(o_t)];  1≦j≦N, 1<t≦T
        bt_t(j) = argmax_{1≦i≦N} [v_(t-1)(i) * a_(ij) * b_j(o_t)];  1≦j≦N, 1<t≦T
    (3) Termination:
        The best score: P* = v_T(q_F) = max_{1≦i≦N} [v_T(i) * a_(iF)]
        The start of backtrace: q_T* = bt_T(q_F) = argmax_{1≦i≦N} [v_T(i) * a_(iF)]
    To find the best path, follow the backtrace pointers from q_T*.
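A companion sketch for (b), reusing the toy HMM parameters defined in the
sketch under (a): the recursion mirrors the forward pass but takes max/argmax
instead of a sum and keeps a backpointer table for recovering the best path.

def viterbi(obs, states, start_p, trans_p, emit_p):
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]   # initialization
    bt = [{s: None for s in states}]
    for t in range(1, len(obs)):                                # recursion
        v.append({}); bt.append({})
        for j in states:
            best_i = max(states, key=lambda i: v[t - 1][i] * trans_p[i][j])
            v[t][j] = v[t - 1][best_i] * trans_p[best_i][j] * emit_p[j][obs[t]]
            bt[t][j] = best_i
    last = max(states, key=lambda s: v[-1][s])                  # start of backtrace
    path = [last]
    for t in range(len(obs) - 1, 0, -1):                        # follow backpointers
        path.append(bt[t][path[-1]])
    return v[-1][last], list(reversed(path))

# Most probable state sequence for the same observations as above.
print(viterbi(("3", "1", "3"), states, start_p, trans_p, emit_p))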

(c) We use the forward and backward algorithms to compute α and β. Then, as
    in the E-step of the forward-backward algorithm:
        ξ_t(i, j) = α_t(i) * a_(ij) * b_j(o_(t+1)) * β_(t+1)(j) / α_T(q_F)
    The estimated count of transitions from s_i to s_j is computed by
    Σ_{t=1}^{T-1} ξ_t(i, j).
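A follow-up sketch for (c), again on the toy HMM defined in the sketch under
(a): compute α and β, then sum ξ_t(i, j) over t to obtain the expected number
of transitions from one state to another (P(O|λ) is taken as the sum of the
final alphas, as before).

def backward(obs, states, trans_p, emit_p):
    beta = [{s: 1.0 for s in states}]                      # initialization at time T
    for t in range(len(obs) - 2, -1, -1):                  # recursion, right to left
        beta.insert(0, {i: sum(trans_p[i][j] * emit_p[j][obs[t + 1]] * beta[0][j]
                               for j in states)
                        for i in states})
    return beta

def expected_transitions(obs, states, start_p, trans_p, emit_p, s_i, s_j):
    # Forward pass (same recursion as in the sketch under (a)).
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, len(obs)):
        alpha.append({j: sum(alpha[t - 1][i] * trans_p[i][j] for i in states)
                         * emit_p[j][obs[t]] for j in states})
    beta = backward(obs, states, trans_p, emit_p)
    total = sum(alpha[-1][s] for s in states)              # P(O | lambda)
    # Sum xi_t(s_i, s_j) over t = 1 .. T-1.
    return sum(alpha[t][s_i] * trans_p[s_i][s_j] * emit_p[s_j][obs[t + 1]]
               * beta[t + 1][s_j] for t in range(len(obs) - 1)) / total

print(expected_transitions(("3", "1", "3"), states, start_p, trans_p, emit_p,
                           "Hot", "Cold"))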


Problem 11
Part of speech (POS) tagging is often modeled as sequence classification.
The Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM) can
be adopted to deal with the sequence labelling problem.
(a) Please formulate the POS tagging with HMM. (5 points)
(b) Please formulate the POS tagging with MEMM. (5 points)
(c) Please describe how to get the best POS tag sequence with MEMM without
enumerating all the possible paths. (5 points)

Answer: (See Lec.7 p.27, 28, 40, 56, 57, 59, 60, 61.)
(a) Using the HMM for POS tagging.
    With an HMM, each hidden state is the POS tag of a word, and out of all
    sequences of n tags t_1 ... t_n we want the single tag sequence such that
    P(t_1 ... t_n | w_1 ... w_n) is highest, that is,
        (^t)^n_1 = argmax_{t^n_1} P(t^n_1 | w^n_1).
    By Bayes' rule this can be rewritten as
        (^t)^n_1 = argmax_{t^n_1} P(w^n_1 | t^n_1) P(t^n_1).
    (Notation: t^n_1 denotes t with superscript n and subscript 1, i.e. the
    sequence t_1 ... t_n; (^t) denotes t with a hat; ≈ denotes approximate
    equality.)
    Using the Markov assumption, the likelihood and the prior can be
    approximated as
        P(w^n_1 | t^n_1) ≈ Π_{i=1}^{n} P(w_i | t_i)
        P(t^n_1) ≈ Π_{i=1}^{n} P(t_i | t_(i-1))
    Finally, the POS tagging result is
        (^t)^n_1 ≈ argmax_{t^n_1} Π_{i=1}^{n} P(w_i | t_i) * P(t_i | t_(i-1))
    The HMM is a generative model that optimizes the likelihood P(W|T).

(b) Using the MEMM for POS tagging.
    Unlike the HMM, the MEMM computes the posterior P(T|W) directly:
        (^T) = argmax_T P(T|W) = argmax_T Π_{i=1}^{n} P(t_i | w_i, t_(i-1))
    where
        P(t_i | w_i, t_(i-1)) = (1 / Z(w_i, t_(i-1))) * exp( Σ_j λ_j * f_j(t_(i-1), w_i, t_i) ),
    the λ_j are trained feature weights and the f_j are features extracted
    from the word sequence and the previous tag.
    The MEMM is a discriminative model which is trained directly to
    discriminate among the possible tag sequences.

(c) Using Viterbi with the MEMM to tag POS.
    This is similar to the HMM case: we simply replace the Viterbi recursion
        V_t(j) = max_{1≦i≦N} [V_(t-1)(i) * P(t_j | t_i) * P(w_t | t_j)];  1≦j≦N, 1<t≦T
    with
        V_t(j) = max_{1≦i≦N} [V_(t-1)(i) * P(t_j | t_i, w_t)];  1≦j≦N, 1<t≦T
    which computes the Viterbi value at word position t for tag j.

    We also use a matrix S to store, for each tag at each word position, the
    predecessor on the maximum-probability path:
        S_t(j) = argmax_{1≦i≦N} [V_(t-1)(i) * P(t_j | t_i, w_t)];  1≦j≦N, 1<t≦T

    In addition, for the corner case t = 1:
        V_1(j) = P(t_j | w_1, <s>);  1≦j≦N, where <s> is the start symbol
        S_1(j) = 0;  1≦j≦N
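A minimal sketch of (c), using a hypothetical toy_local_prob function in
place of a trained MaxEnt classifier: the recursion is the HMM Viterbi with
the product P(t_j | t_i) * P(w_t | t_j) replaced by a single locally
normalized probability P(t_j | t_i, w_t).

def memm_viterbi(words, tags, local_prob):
    """local_prob(prev_tag, word, tag) -> P(tag | prev_tag, word)."""
    v = [{t: local_prob("<s>", words[0], t) for t in tags}]     # corner case t = 1
    bt = [{t: None for t in tags}]
    for n, w in enumerate(words[1:], start=1):
        v.append({}); bt.append({})
        for t in tags:
            best = max(tags, key=lambda p: v[n - 1][p] * local_prob(p, w, t))
            v[n][t] = v[n - 1][best] * local_prob(best, w, t)
            bt[n][t] = best
    last = max(tags, key=lambda t: v[-1][t])
    path = [last]
    for n in range(len(words) - 1, 0, -1):                      # backtrace
        path.append(bt[n][path[-1]])
    return list(reversed(path))

# A toy stand-in for the trained MaxEnt classifier (uniform except one cue).
def toy_local_prob(prev_tag, word, tag):
    if word.endswith("ed"):
        return 0.9 if tag == "VERB" else 0.1
    return 0.7 if tag == "NOUN" else 0.3

print(memm_viterbi(["dogs", "barked"], ["NOUN", "VERB"], toy_local_prob))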


Problem 12
Most natural language processing is notation transformation. A sentence may
be mapped into different forms on different levels with some information added.
Given the sentence "Sheikh Mohammed, who is also the Defense Minister of the
United Arab Emirates, announced at the inauguration ceremony, we want to make
Dubai a new trading center.", what are the outputs of
(a) entity extraction and
(b) co-reference resolution? (10 points)

Answer:
(a) Entity Extraction: (See Lec.2 p.53.)
(1) Sheikh Mohammed
(2) who
(3) Defense Minister
(4) United Arab Emirates
(5) inauguration ceremony
(6) we
(7) Dubai
(8) trading center
Note that: You should write at least three of these to gain a full score.

(b) Co-reference Resolution: (See Lec.2 p.54.)
(1) Sheikh Mohammed = who
(2) who = Defense Minister
(3) United Arab Emirates = we
(4) Dubai = trading center
Note that: You should write at least three of these to gain a full score.

--
※ Posted from: PTT (ptt.cc), from: 111.250.48.19
※ Article URL: https://www.ptt.cc/bbs/NTU-Exam/M.1433732204.A.5E6.html
rod24574575: Added to the CSIE board digest! 06/08 11:00
t0444564: What a long post! 06/10 23:07
