AcroMine 'Paper' Corpus

Overview of this corpus

This evaluation corpus uses 50 short forms chosen from those discussed in papers on acronym recognition. This selection criterion measures system performance on the acronyms that previous studies chose as examples. The corpus includes 3,362 short/long-form pairs extracted manually from the 32,910 sentences that contain the 50 target short forms.

Making of this corpus

A bio-informatician colleague extracted long forms from the contextual sentences. As this was a time-consuming task, we developed a tool that lets the expert browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically removes the sentences containing that long form, which reduces the number of unexamined sentences.
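
The browsing workflow can be sketched as follows. This is only a minimal illustration, not the actual tool: the function name `remove_covered`, the case-insensitive substring match, and the sample sentences are all assumptions made for this example.

```python
def remove_covered(sentences, long_form):
    """Return the sentences that do not mention long_form (case-insensitive).

    Hypothetical helper: the real tool's matching rule is not documented here.
    """
    needle = long_form.lower()
    return [s for s in sentences if needle not in s.lower()]


# Invented contextual sentences for the short form HMM.
unexamined = [
    "We trained a hidden Markov model (HMM) on the sequences.",
    "A hidden Markov model (HMM) was used for gene finding.",
    "Heavy meromyosin (HMM) binds to actin filaments.",
]

# The expert accepts "hidden Markov model"; sentences containing it drop out,
# so only sentences with still-unknown long forms remain to be examined.
unexamined = remove_covered(unexamined, "hidden Markov model")
print(len(unexamined))
```

Each accepted long form shrinks the list, so the expert only ever reads sentences whose long form has not yet been recorded.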

The criteria for including long forms in the evaluation corpus were established as follows:

  • a long form with the minimum elements (words) necessary to produce its acronym is accepted
  • a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is rejected so that the inclusion criteria stay consistent
  • a misspelled long form, e.g., hidden markvov model (HMM), is accepted so that the acronym-recognition task stays separate from a spelling-correction task

Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.

Download

To be released.

List of acronyms

The following table lists every short form in this corpus together with its number of distinct long forms and contextual sentences. A short form may have no distinct long form for one of two reasons: every long form for that short form occurs fewer than twice, or no long form should be extracted because the short form is not a valid acronym.
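
The frequency condition can be illustrated with a short sketch. The input format (one annotated short/long-form pair per sentence), the function name `distinct_long_forms`, and the case-folding step are assumptions made for this example, and the sample data is invented rather than taken from the corpus.

```python
from collections import Counter


def distinct_long_forms(occurrences, min_count=2):
    """Map each short form to its long forms seen at least min_count times."""
    # Assumption: long forms differing only in letter case count as the same.
    counts = Counter((sf, lf.lower()) for sf, lf in occurrences)
    result = {sf: [] for sf, _ in occurrences}
    for (sf, lf), n in counts.items():
        if n >= min_count:
            result[sf].append(lf)
    return result


# Invented occurrences; "XYZ" is a placeholder short form.
occurrences = [
    ("HMM", "hidden Markov model"),
    ("HMM", "Hidden Markov model"),
    ("HMM", "heavy meromyosin"),
    ("XYZ", "example long form"),
]

result = distinct_long_forms(occurrences)
print(result)
```

In this sketch the placeholder short form ends up with an empty list, which corresponds to a table row with zero distinct long forms.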

Table 1. List of the short forms and their statistics

Rank   Short form   # distinct long forms   # contextual sentences
Total                                 657                    32910
1      ATP                             39                     4993
2      PKA                             20                     4205
3      5-HIAA                          22                     2319
4      ABC                             59                     2204
5      JNK                              8                     2203
6      EMS                             50                     1981
7      HGH                              8                     1728
8      NHS                             24                     1389
9      ADM                             23                     1314
10     MMS                             49                     1192
11     FG                              52                     1054
12     BHLH                             2                      806
13     BHA                             15                      642
14     SOD1                             2                      618
15     SRF                             18                      581
16     HSF                             27                      522
17     HMM                             14                      506
18     IND                             24                      478
19     GDP                             14                      460
20     GNRH-A                          14                      453
21     ASR                             35                      434
22     ATN                             13                      398
23     POL II                           1                      390
24     TSK                              8                      380
25     AW                              24                      376
26     TTF-1                            1                      231
27     CASR                             3                      146
28     DOP                             21                      143
29     OAC                             10                      126
30     CTH                             10                      104
31     GMP-140                          3                       79
32     TAPS                            11                       57
33     CFDA                             3                       45
34     SQDG                             4                       42
35     NLO                              4                       41
36     PTSMA                            1                       39
37     PN2                              3                       36
38     ATF3                             1                       35
39     ADRB2                            3                       31
40     ADMR                             2                       23
41     TRU                              3                       21
42     NESP                             1                       20
43     PBRO2                            2                       16
44     GNAT                             1                       15
45     PERE                             1                        9
46     EWI                              0                        7
47     14C-UBT                          1                        6
48     CSNBX                            1                        6
49     CNS1                             1                        4
50     3-NO2-TYR                        1                        2
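
Because a short form may itself contain a space (POL II), a row of Table 1 cannot be split naively on whitespace into four fields. The sketch below shows one possible way to read a row, assuming whitespace-separated rows like those above; the function name `parse_row` is hypothetical and not part of any released tool.

```python
def parse_row(line):
    """Split a table row into (rank, short_form, n_long_forms, n_sentences)."""
    tokens = line.split()
    # The two counts are always the last two whitespace-separated fields.
    n_long, n_sent = int(tokens[-2]), int(tokens[-1])
    if tokens[0] == "Total":
        # The summary row has no rank and no short form of its own.
        return None, "Total", n_long, n_sent
    # Everything between the rank and the two counts is the short form,
    # which may contain a space (e.g. "POL II").
    return int(tokens[0]), " ".join(tokens[1:-2]), n_long, n_sent


rows = [
    parse_row(r)
    for r in ["Total 657 32910", "1 ATP 39 4993", "23 POL II 1 390"]
]
print(rows)
```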