Constructing a Parallel Corpus from Wikipedia

Reference paper: Learning To Split and Rephrase From Wikipedia Edit History

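The paper describes how to build a high-quality English parallel corpus, step by step, from Wikipedia's edit history files. Its method has four main steps.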

To construct the WikiSplit corpus, we identify edits that involve sentences being split. A list of sentences for each snapshot is obtained by stripping HTML tags and Wikipedia markup and running a sentence break detector.

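The first step is to find candidate sentences that were split. The edit history in the XML file is a series of timestamped snapshots of each page. For each snapshot, HTML tags and Wikipedia markup are stripped, and a sentence break detector, as described in the paper, segments the remaining text into a list of sentences.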
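
As a rough illustration of this step, the sketch below strips a few common kinds of markup with regular expressions and splits the result on sentence-final punctuation. The function names `strip_markup` and `split_sentences` are hypothetical, the regular expressions cover only simple cases (no templates, tables, or references), and the naive splitter will break on abbreviations such as "e.g."; a real pipeline would likely use a dedicated wikitext parser and sentence segmenter.

```python
import re
from typing import List


def strip_markup(text: str) -> str:
    """Remove HTML tags and a few simple wiki markup forms (illustrative only)."""
    text = re.sub(r"<[^>]+>", "", text)  # HTML tags such as <b>...</b>
    # [[target|label]] -> label, [[target]] -> target
    text = re.sub(r"\[\[(?:[^\]|]*\|)?([^\]]*)\]\]", r"\1", text)
    text = re.sub(r"'{2,}", "", text)  # ''italic'' and '''bold''' quotes
    return text


def split_sentences(text: str) -> List[str]:
    """Naive sentence break detector for English and Chinese punctuation."""
    # Split after . ! ? when followed by whitespace, or directly after 。！？
    parts = re.split(r"(?<=[.!?])\s+|(?<=[。！？])", text)
    return [p.strip() for p in parts if p.strip()]


if __name__ == "__main__":
    raw = "A surge in [[Britain]] and [[punk rock|punk]] <b>now</b>. It ended!"
    for sentence in split_sentences(strip_markup(raw)):
        print(sentence)
```

Running both functions over every snapshot of a page gives one sentence list per snapshot, which is the input the comparison step needs.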

Temporally adjacent snapshots of a Wikipedia page are then compared to check for sentences that have undergone a split like that shown in Figure 1.

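Next, temporally adjacent snapshots of each Wikipedia page are compared to check whether a sentence has undergone a split. What Figure 1 of the paper shows is essentially an edit in which some text is added and some text is deleted.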
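
One way to sketch this comparison: a sentence that appears in the older snapshot but not in the newer one is a possible source sentence C, and a pair of neighbouring sentences that appears only in the newer snapshot is a possible split (S1, S2). The function name `candidate_splits` is hypothetical, and the requirements that S1 and S2 be adjacent and that both be new are assumptions made for illustration; the passage quoted above does not state them.

```python
from typing import List, Tuple


def candidate_splits(
    old_sents: List[str], new_sents: List[str]
) -> List[Tuple[str, str, str]]:
    """Pair each sentence removed from the old snapshot with each pair of
    adjacent sentences that only exists in the new snapshot."""
    old_set, new_set = set(old_sents), set(new_sents)
    removed = [s for s in old_sents if s not in new_set]
    candidates = []
    for c in removed:
        for s1, s2 in zip(new_sents, new_sents[1:]):
            # Assumption: both halves of a split are new in this snapshot.
            if s1 not in old_set and s2 not in old_set:
                candidates.append((c, s1, s2))
    return candidates


if __name__ == "__main__":
    old = ["A b c d e f g.", "Keep."]
    new = ["A b c x.", "Y e f g.", "Keep."]
    print(candidate_splits(old, new))
```

This deliberately over-generates; the filtering described in the next step is what narrows the candidates down.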

To extract a full sentence C and its candidate split into S = (S1, S2), we require that C and S1 have the same trigram prefix, C and S2 have the same trigram suffix, and S1 and S2 have different trigram suffixes. To filter out misaligned pairs, we use BLEU scores (Papineni et al., 2002) to ensure similarity between the original and the split versions. We discard pairs where BLEU(C, S1) or BLEU(C, S2) is less than δ (an empirically chosen threshold). If multiple candidates remain for a given sentence C, we retain argmax_S (BLEU(C, S1) + BLEU(C, S2)).

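To ensure that the source sentence and its two split sentences are sufficiently similar, the paper requires that C and S1 share the same trigram prefix, that C and S2 share the same trigram suffix, and that S1 and S2 have different trigram suffixes. It also uses BLEU to measure the similarity between sentences: pairs whose score falls below a threshold are discarded, and when several candidates remain for the same C, the one with the highest BLEU(C, S1) + BLEU(C, S2) is kept.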
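
The filtering rules can be sketched as follows. This is a minimal illustration and not the paper's implementation: `simple_bleu` is a simplified, add-one-smoothed sentence-level BLEU written here to keep the example self-contained, C is treated as the reference and each split sentence as the hypothesis (an assumption), and `has_split_shape`, `best_split`, and the `delta` parameter are hypothetical names. Sentences are passed as token lists; for Chinese the tokens could be characters or the output of a word segmenter, which is also a choice this sketch leaves open.

```python
import math
from collections import Counter
from typing import List, Optional, Sequence, Tuple

Tokens = List[str]


def ngrams(tokens: Sequence[str], n: int) -> List[tuple]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def simple_bleu(reference: Tokens, hypothesis: Tokens, max_n: int = 4) -> float:
    """Simplified smoothed sentence-level BLEU in [0, 1] (an approximation)."""
    if not reference or not hypothesis:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp = Counter(ngrams(hypothesis, n))
        ref = Counter(ngrams(reference, n))
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        total = sum(hyp.values())
        # Add-one smoothing so a single empty n-gram order does not zero the score.
        log_precisions.append(math.log((overlap + 1) / (total + 1)))
    # Brevity penalty for hypotheses shorter than the reference.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * math.exp(sum(log_precisions) / max_n)


def has_split_shape(c: Tokens, s1: Tokens, s2: Tokens) -> bool:
    """Trigram prefix/suffix conditions on C, S1 and S2."""
    if min(len(c), len(s1), len(s2)) < 3:
        return False
    return c[:3] == s1[:3] and c[-3:] == s2[-3:] and s1[-3:] != s2[-3:]


def best_split(
    c: Tokens, candidates: List[Tuple[Tokens, Tokens]], delta: float
) -> Optional[Tuple[Tokens, Tokens]]:
    """Keep the candidate maximizing BLEU(C, S1) + BLEU(C, S2) above delta."""
    best, best_score = None, -1.0
    for s1, s2 in candidates:
        if not has_split_shape(c, s1, s2):
            continue
        b1, b2 = simple_bleu(c, s1), simple_bleu(c, s2)
        if b1 < delta or b2 < delta:
            continue  # misaligned pair: too dissimilar from C
        if b1 + b2 > best_score:
            best, best_score = (s1, s2), b1 + b2
    return best
```

The value of `delta` has to be tuned empirically, as the paper notes; a stricter threshold trades corpus size for cleaner pairs.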

Our extraction heuristic is imperfect, so we manually assess corpus quality using the same categorization schema proposed by Aharoni and Goldberg.

This step is a manual assessment of the quality of the extracted sentences.

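Strictly following the paper's procedure on English data gives results such as the following (the original sentence first, then the two split sentences):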

A surge of popular interest in anarchism occurred during the [[1970’s]] in [[Britain]] following the birth of the [[punk rock]] movement.
A surge of popular interest in anarchism occurred during the [[1960’s]] and [[1970’s]].
In [[Britain]] following the birth of the [[punk rock]] movement.

The main goal, however, is to construct a Chinese parallel corpus; the results obtained are as follows:

诗是历史最悠久的文学形式，中国是世界上诗歌最发达的国度之一。
诗是历史最悠久的文学形式之一，例如荷马史诗。
中国是世界上诗歌最发达的国度之一。

从事定性分析的部分社会学家相信这是一种更好的方法，他们认为，这是一种可以有助于我们对一个 “ 离散 ” 性的社会和独特性的人文的了解，这种方法从不寻求有一致观点，但它们却可以互相欣赏各自所采取的独特方式并互相借鉴。
从事定性分析的社会学家相信，这是一种更好的方法，因为这可以加强理解 “ 离散 ” 性的社会和独特性的人文。
这种方法从不寻求有一致观点，但却可以互相欣赏各自所采取的独特方式并互相借鉴。

十月殿试，乾隆帝亲往紫光阁校阅，见到马全，发现相识，问道：「尔马瑔耶？
十月殿试，乾隆帝亲往紫光阁校阅。
马全脱颖而出，乾隆帝竟将其认出，问道：「尔马瑔耶？

洪武年间，明朝大军进攻云南，改平缅为麓川平缅军民宣慰司，才首次使用 “ 麓川 ” 。
洪武年间，明朝大军进攻云南，其国君思伦发归顺明朝，授麓川宣慰使。
改平缅为麓川平缅军民宣慰司，才首次使用 “ 麓川 ” 。