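I. Corpus Overview
Basic information on the electronic text of the classic novel Dream of the Red Chamber (《红楼梦》) is listed below (Table 1):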

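The Dream of the Red Chamber corpus analyzed in this paper was downloaded from http://ling.ccnu.edu.cn/ylk/gudian.htm. It comprises 120 chapters and, besides Chinese characters, contains 210 special symbols, such as non-Chinese symbols, graphic symbols, structural characters, punctuation marks, Arabic numerals, Japanese characters, and Latin letters.
These special symbols are listed below in two groups according to how they display. One group is visible, that is, displayable. The other has no visible rendering: the symbols cannot be seen in the text, but each has its own code point. The first group (Table 2) contains 173 special symbols and the second group (Table 3) contains 37. Because the symbols in the second group have no visible form, we give their hexadecimal codes in parentheses.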

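All of these symbols are part of the text of Dream of the Red Chamber, but this paper is mainly concerned with the features of the displayable characters, so the non-Chinese special symbols were left out of the analysis and statistics. Because punctuation marks account for a comparatively large share of the text, we discuss them separately below.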
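II. Statistics on Punctuation Marks
Punctuation marks also carry some meaning and contribute both to the understanding of the novel and to its expression.
The statistics show that non-Chinese characters occur 137850 times in Dream of the Red Chamber, of which 137540 occurrences are punctuation marks. The novel contains 868996 characters in total, so punctuation accounts for 137540/868996≈0.158, that is, 15.8% of the novel. The two most frequent punctuation marks are the comma and the period, which occur 59357 and 29400 times respectively. They are also the most frequent characters in the novel overall, which means that the most often repeated symbols in the text are punctuation marks rather than Chinese characters. Sentence-final marks such as the period and the exclamation mark let us roughly estimate the scale of the novel, but these figures alone say nothing about its content. The text can only be understood and analyzed properly when its characters and words are considered together with its punctuation. We now turn to the individual characters of the novel.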
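The punctuation share computed above can be sketched in a few lines of Python. This is an illustration, not the authors' tool: it assumes that "punctuation" means any character whose Unicode general category starts with "P", which may differ from the classification used in the paper, and `punctuation_summary` is a hypothetical helper name.

```python
import unicodedata
from collections import Counter


def is_punctuation(ch: str) -> bool:
    # Assumption: Unicode general categories beginning with "P"
    # (Po, Pi, Pf, Ps, Pe, Pd, Pc) are treated as punctuation.
    return unicodedata.category(ch).startswith("P")


def punctuation_summary(text: str):
    """Return (punctuation count, share of all characters, two most common marks)."""
    total = len(text)  # every character counts, including whitespace
    marks = Counter(ch for ch in text if is_punctuation(ch))
    n_punct = sum(marks.values())
    share = n_punct / total if total else 0.0
    return n_punct, share, marks.most_common(2)


# Toy sentence, used only to exercise the function.
sample = "宝玉笑道:“你来了,我也来了。”"
n_punct, share, top_two = punctuation_summary(sample)
print(n_punct, share, top_two)

# The share reported in the paper, recomputed from its own totals.
paper_share = round(137540 / 868996, 3)
print(paper_share)
```

On the full corpus, `top_two` would be expected to surface the comma and the period, provided the corpus uses the same code points for them throughout.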
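III. Character Frequency Statistics and Character Association
Because non-Chinese symbols contribute little to the analysis of Dream of the Red Chamber, they were excluded from the character frequency statistics. The data we work with are therefore 4316 distinct Chinese characters and 731146 Chinese character occurrences in total. It is not practical to analyze all four thousand or so characters one by one, and some of them occur only once or a few times, so we selected a representative subset: the characters whose frequency is at least 0.1%. One possible criterion would have been to take the most frequent characters until their cumulative occurrences exceed half of the total (731146). After examining the statistics, however, we found that some particles occur far more often than verbs and nouns with concrete meaning, so such a subset would be dominated by function words. We therefore settled on the 0.1% threshold, and the results show that this choice is reasonable.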
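The two selection criteria discussed above can be compared with a short Python sketch. It is only an illustration under stated assumptions: a "Chinese character" is taken to be any code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF), the 0.1% threshold is treated as inclusive, and the function names are hypothetical rather than taken from the paper.

```python
from collections import Counter


def is_chinese(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is counted.
    return "\u4e00" <= ch <= "\u9fff"


def high_frequency_chars(text: str, threshold: float = 0.001):
    """Characters whose relative frequency is at least `threshold`,
    together with the share of all occurrences they cover."""
    counts = Counter(ch for ch in text if is_chinese(ch))
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    selected = [(ch, n, n / total)
                for ch, n in counts.most_common()
                if n / total >= threshold]
    coverage = sum(n for _, n, _ in selected) / total
    return selected, coverage


def chars_until_coverage(text: str, target: float = 0.5):
    """Alternative criterion: take the most frequent characters until
    their cumulative occurrences reach `target` of the total."""
    counts = Counter(ch for ch in text if is_chinese(ch))
    total = sum(counts.values())
    picked, running = [], 0
    for ch, n in counts.most_common():
        if running / total >= target:
            break
        picked.append(ch)
        running += n
    return picked


# Toy input; the threshold is raised to 0.25 because the sample is tiny.
toy = "了了了了的的不,。"
selected, coverage = high_frequency_chars(toy, threshold=0.25)
half = chars_until_coverage(toy, target=0.5)
print(selected, coverage, half)
```

In the toy input a single particle already reaches the 50% target, which mirrors the concern raised above: a cumulative criterion tends to stop among the function words, while a per-character threshold also admits less frequent content characters.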
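A total of 194 Chinese characters have a frequency of at least 0.1%. Together they account for 502347 occurrences, or 68.7% of all Chinese character occurrences in the novel (502347/731146≈0.687), yet they make up only 194/4316≈0.0449, less than 5%, of the distinct characters. All high-frequency characters of Dream of the Red Chamber are listed below in descending order of frequency. Because there are many of them, the entries are separated by semicolons. Each entry gives the character, its number of occurrences, and its relative frequency as a proportion of all Chinese character occurrences, separated by commas.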
了,21193,0.028986; 的,15720,0.0215; 不,15025,0.02055; 一,12149,0.016616; 来,11429,0.01563; 道,11059,0.0151; 人,10542,0.0144; 是,10142,0.01387; 说,9692,0.013256; 我,9173,0.012546; 这,7810,0.010682; 他,7737,0.01058; 你,7142,0.009768; 去,6186,0.00846; 着,6166,0.00843; 也,6106,0.00835; 儿,6074,0.008308; 玉,6051,0.008276; 有,5987,0.008189; 宝,5820,0.00796; 个,5656,0.007736; 子,5466,0.007476; 又,5220,0.007139; 贾,5201,0.00711; 里,5143,0.00703; 那,4909,0.00671; 们,4893,0.00669; 见,4804,0.00657; 只,4677,0.006397; 太,4302,0.00588; 便,4078,0.005578; 好,4042,0.005528; 在,4002,0.00547; 笑,3957,0.00541; 家,3917,0.005357; 上,3809,0.0052; 么,3670,0.00502; 得,3610,0.004937; 大,3466,0.00474; 姐,3443,0.004709; 头,3403,0.00465; 听,3301,0.004515; 就,3253,0.004449; 出,3225,0.00441; 回,3070,0.004199; 知,2922,0.003996; 日,2917,0.00399; 要,2903,0.00397; 下,2775,0.003795; 都,2677,0.00366; 心,2655,0.00363; 事,2641,0.00361; 二,2630,0.003597; 老,2602,0.003559; 过,2584,0.00353; 话,2504,0.003425; 还,2496,0.0034; 起,2477,0.003388; 自,2455,0.003358; 如,2357,0.0032; 看,2353,0.003218; 叫,2267,0.0031; 到,2243,0.003068; 没,2243,0.003068; 两,2230,0.00305; 母,2206,0.003017; 些,2172,0.00297; 时,2156,0.002949; 之,2139,0.002926; 今,2117,0.002895; 小,2020,0.00276; 问,2001,0.002737; 因,1977,0.0027; 凤,1949,0.002666; 奶,1947,0.00266; 等,1938,0.00265; 娘,1871,0.002559; 可,1863,0.002548; 什,1855,0.002537; 呢,1826,0.002497; 忙,1822,0.00249; 夫,1805,0.002469; 想,1792,0.00245; 面,1781,0.002436; 爷,1773,0.002425; 才,1771,0.0024; 中,1672,0.002287; 王,1661,0.00227; 打,1588,0.00217; 进,1548,0.002117; 此,1538,0.0021; 倒,1534,0.002098; 罢,1525,0.002086; 样,1507,0.00206; 吃,1455,0.00199; 和,1453,0.001987; 正,1411,0.0019; 几,1400,0.001915; 无,1400,0.001915; 姑,1395,0.001908; 后,1388,0.001898; 黛,1383,0.00189; 天,1362,0.00186; 然,1292,0.001767; 前,1281,0.00175; 为,1274,0.00174; 意,1261,0.001725; 别,1253,0.0017; 再,1253,0.0017; 门,1242,0.001699; 丫,1232,0.001685; 走,1222,0.00167; 外,1221,0.00167; 袭,1213,0.001659; 作,1212,0.001658; 怎,1206,0.001649; 三,1203,0.001645; 众,1189,0.001626; 妹,1188,0.001625; 方,1170,0.0016; 生,1170,0.0016; 多,1164,0.00159; 明,1157,0.00158; 将,1156,0.00158; 已,1150,0.00157; 身,1142,0.00156; 把,1141,0.00156; 以,1133,0.00155; 气,1125,0.001539; 钗,1119,0.0015; 何,1117,0.001528; 亲,1087,0.001487; 给,1077,0.00147; 拿,1066,0.001458; 与,1059,0.001448; 手,1054,0.00144; 坐,1054,0.00144; 年,1048,0.00143; 若,1038,0.0014; 十,1036,0.001417; 用,1036,0.001417; 请,1031,0.0014; 房,1027,0.001405; 发,993,0.001358; 薛,993,0.001358; 且,991,0.001355; 春,983,0.001344; 妈,979,0.001339; 政,978,0.001338; 命,972,0.001329; 姨,959,0.0013; 原,952,0.00130; 花,950,0.001299; 所,948,0.001297; 处,934,0.001277; 先,909,0.00124; 边,904,0.001236; 谁,902,0.001234; 己,899,0.00123; 平,899,0.00123; 瞧,895,0.001224; 琏,892,0.00122; 内,888,0.001215; 住,887,0.001213; 管,886,0.001212; 女,880,0.001204; 死,866,0.001184; 送,856,0.001171; 连,834,0.001141; 至,831,0.001137; 告,830,0.001135; 早,823,0.001126; 会,817,0.001117; 东,815,0.001115; 香,812,0.001111; 林,807,0.001104; 往,802,0.001097; 西,802,0.001097; 月,797,0.00109; 带,794,0.001086; 虽,790,0.00108; 应,785,0.001074; 必,772,0.001056; 从,770,0.001053; 口,767,0.001049; 分,765,0.001046; 怕,761,0.001041; 声,758,0.001037; 四,754,0.001031; 当,746,0.00102; 放,745,0.001019; 能,744,0.001018; 未,744,0.001018; 云,736,0.001007
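From the statistics above, we can observe the following:
1) Function words are used very frequently in Dream of the Red Chamber, including: 了, 的, 不, 着, 也, 个, 又, 得, 就, 还, 之, and others.
Although there are fewer function words than content words, their meanings are more complex. They generally modify content words and produce a variety of meanings in combination with them. Their role can only be understood and analyzed within the novel itself, according to the words they combine with.
2) Nouns also have a high share, for example: 人, 儿, 子, 玉, 宝, 贾, 家, 姐, 头, 母, 凤, 奶, 娘, 夫, 爷, 王, 姑, 黛, 丫, 妹, 薛, 妈, 姨, 女, and so on.
These high-frequency nouns show that Dream of the Red Chamber revolves mainly around people and is chiefly about the four great families of 贾 (Jia), 王 (Wang), 史 (Shi), and 薛 (Xue). Characters used in the protagonists' names, such as 宝, 玉, and 黛, are also frequent. From the relations among these nouns we can further infer that the story concerns a large family with sons and daughters, in which grandfathers, grandmothers, mothers, elder sisters, younger sisters, aunts, and husbands (爷, 奶, 母, 姐, 妹, 姑, 夫) are all present, and in which female roles make up a large share. If 丫 and 头 are combined into 丫头 (maidservant), we can also infer that the novel is about a wealthy, prominent household with many maidservants.
3) High-frequency verbs: 来, 道, 是, 去, 有, 见, 笑, 听, 出, 知, 要, 看, 叫, 到, 死, and others.
It is hard to infer the main activities of the characters in Dream of the Red Chamber from these verbs. They may belong to several parts of speech in the text, and a single character viewed in isolation is ambiguous, so its exact meaning in the text cannot be determined. The relations among the verbs, and between the verbs and the content of the novel, therefore have to be analyzed in context within the text.
4) There are also some other high-frequency nouns, such as 香, 月, 云, 花, and 春. These characters readily suggest that Dream of the Red Chamber is rich in poetic imagery and romantic love.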
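The inference in observation 2) that 丫 and 头 combine into 丫头 rests on the two characters actually appearing next to each other, which single-character counts alone do not establish. A minimal Python sketch of such an adjacency check follows. It is not part of the original study, `adjacent_pair_counts` is a hypothetical helper name, and the sample sentence is invented for illustration.

```python
def adjacent_pair_counts(text: str, first: str, second: str):
    """Return (times `first` is immediately followed by `second`,
    total count of `first`, total count of `second`)."""
    # Walk over every pair of neighbouring characters in the text.
    pair = sum(1 for a, b in zip(text, text[1:]) if a == first and b == second)
    return pair, text.count(first), text.count(second)


# Invented sample sentence, not a quotation from the novel.
sample = "丫头们都笑了,一个丫头回头说道。"
pair, n_first, n_second = adjacent_pair_counts(sample, "丫", "头")
print(pair, n_first, n_second)
```

Comparing the pair count with the two single-character counts indicates how much of each character's frequency comes from the compound. In the sample, every 丫 is followed by 头, while one 头 occurs in a different word.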
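These high-frequency characters also hint at the author's characteristic use of language. In addition, computing character frequencies separately for each chapter makes it possible, to some extent, to trace subtle developments in the story and to follow the thematic thread that runs through the novel.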
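The per-chapter count suggested above could be sketched as follows in Python. The paper does not describe how the corpus marks chapters, so the heading pattern here is an assumption: each chapter is taken to start with a line beginning with 第…回 written in Chinese numerals. The function names and the two-chapter sample are likewise hypothetical.

```python
import re
from collections import Counter

# Assumption: chapter headings look like "第一回 ..." at the start of a line.
CHAPTER_HEADING = re.compile(r"^第[零〇一二三四五六七八九十百]+回", re.MULTILINE)


def split_chapters(text: str):
    """Split the text into chapters, each starting at its heading."""
    starts = [m.start() for m in CHAPTER_HEADING.finditer(text)]
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


def per_chapter_counts(text: str, chars):
    """For each chapter, count the occurrences of the given characters."""
    rows = []
    for chapter in split_chapters(text):
        counts = Counter(chapter)
        rows.append({ch: counts[ch] for ch in chars})
    return rows


# Tiny two-chapter sample; the body sentences are invented.
sample = (
    "第一回 甄士隐梦幻识通灵\n宝玉笑道。\n"
    "第二回 贾夫人仙逝扬州城\n黛玉来了,宝玉也来了。\n"
)
rows = per_chapter_counts(sample, ["宝", "黛", "玉"])
print(rows)
```

Plotting such rows across all 120 chapters for characters like 宝, 黛, and 钗 would show where each figure comes to the fore. Text before the first heading is ignored, and a text with no matching headings yields no chapters at all.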
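IV. Summary
The statistics above show that, although the high-frequency characters are few, they play a decisive role in conveying the content of the novel. Here, a high-frequency character means one that accounts for at least 0.1% of all Chinese character occurrences in the novel. As the analysis shows, these characters together cover nearly 70% of the text, so they can reasonably be taken as representative of the novel's high-frequency characters. This also indicates their weight in the novel: few in number, they nevertheless sit at the center of how the novel conveys meaning. From the characters at or above the 0.1% threshold one can roughly infer the main figures of the novel and the general direction of its content. More precise claims and explanations require returning to the text for further analysis and more detailed data, and cannot rest on frequency alone.
Most of these characters are nouns, verbs, pronouns, and particles, which also reflects how the novel uses its written vocabulary. The use of these characters has changed little over time, and they have largely remained among the high-frequency characters.
As a next step we will take this research further, extending it from characters to words and from Dream of the Red Chamber to other classic works, in order to identify what they share and where they differ, to summarize patterns in the development and change of the language, and to explore the close connection between characters and words on the one hand and the plot on the other.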
(那日松, 吉日嘎拉; School of Broadcasting and Hosting Arts, Communication University of China)