AcroMine 'Random' Corpus

Overview of this corpus

Figure 1 shows the number of contextual sentences (i.e., the number of occurrences) for each short form that occurs eight times or more. The x-axis lists the short forms in descending order of occurrence. The most frequent short form, CT, appears at the leftmost position, followed by the other frequent acronyms, which occupy a small region at the left side of the graph.


Figure 1. Distribution of short forms in MEDLINE


We chose 248 short forms by taking every 300th entry from left to right in the graph, i.e., the 1st, 301st, 601st, ..., 74101st most frequent short forms. In other words, we sampled 1/300 of the short forms that appear eight times or more in the whole of MEDLINE. We retrieved 32,910 contextual sentences for the short forms and collected 657 long forms in a similar manner to the other corpora.
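The selection described above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_every_nth` is hypothetical, and the ranked list is replaced by plain rank numbers, assuming the list has exactly 74,101 entries (the true length of the list is not stated here).

```python
def sample_every_nth(ranked_items, step=300):
    """Return every `step`-th item of a frequency-ranked list, starting with the first."""
    return ranked_items[::step]


# Stand-in for the ranked short forms: rank numbers 1..74101 (assumed length).
ranks = list(range(1, 74102))
sampled = sample_every_nth(ranks)

print(len(sampled))   # number of sampled short forms
print(sampled[:3])    # first sampled ranks
print(sampled[-1])    # last sampled rank
```

Under these assumptions the sketch selects ranks 1, 301, 601, and so on up to 74101, which gives 248 short forms.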

Making of this corpus

A bioinformatician colleague extracted long forms from the contextual sentences. As this was a time-consuming task, we developed a tool with which the expert can browse the list of contextual sentences efficiently. When the expert chooses a term as a long form, the tool automatically eliminates the sentences containing that long form, reducing the number of unexamined sentences.
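The elimination step can be illustrated with a minimal sketch. It is not the actual tool: the function `remove_covered`, the case-insensitive substring match, and the example sentences are all assumptions made for illustration.

```python
def remove_covered(sentences, long_form):
    """Drop sentences that contain the chosen long form (case-insensitive match)."""
    needle = long_form.lower()
    return [s for s in sentences if needle not in s.lower()]


# Made-up contextual sentences for the short form CT.
sentences = [
    "Computed tomography (CT) was performed.",
    "Calcitonin (CT) levels were measured.",
    "Patients underwent computed tomography (CT) scanning.",
]

# After the expert accepts "computed tomography", only unexamined sentences remain.
remaining = remove_covered(sentences, "computed tomography")
print(remaining)
```

Each accepted long form shrinks the list, so the expert only has to read sentences that no accepted long form yet explains.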

The criteria for including long forms in the evaluation corpus were as follows:

  • a long form with the minimum necessary elements (words) to produce its acronym is accepted
  • a long form with unnecessary elements, e.g., magnetic resonance imaging unit (MRI) or human immunodeficiency virus infection (HIV), is not accepted, so as to keep the criteria for inclusion consistent
  • a misspelled long form, e.g., hidden markvov model (HMM), is accepted, so as to separate the acronym-recognition task from a spelling-correction task

Expressions satisfying the above criteria were accepted regardless of their popularity or relevance, because it is hard for a human subject to determine which long forms are appropriate for inclusion in a dictionary.
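As a rough illustration of the "minimum necessary elements" criterion, the sketch below checks whether the initial letters of a candidate's words spell the short form exactly. This is a deliberately naive heuristic, not the procedure the expert followed: many valid long forms contribute letters from inside words, so the function `initials_match` (a hypothetical name) only reproduces the judgments for the examples in the list above.

```python
def initials_match(long_form, short_form):
    """Return True if the first letters of the words spell the short form exactly."""
    initials = "".join(word[0] for word in long_form.split())
    return initials.upper() == short_form.upper()


candidates = [
    ("magnetic resonance imaging", "MRI"),
    ("magnetic resonance imaging unit", "MRI"),          # extra word "unit"
    ("human immunodeficiency virus infection", "HIV"),   # extra word "infection"
    ("hidden markvov model", "HMM"),                     # misspelled, still accepted
]

for long_form, short_form in candidates:
    print(long_form, short_form, initials_match(long_form, short_form))
```

Note that the misspelled candidate passes, since the check ignores spelling beyond the first letter of each word.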

Download

To be released.

List of acronyms

The following table shows the complete list of short forms in this corpus, with the number of distinct long forms and contextual sentences for each. A short form in this corpus may have no distinct long form for one of two reasons: every long form for the short form occurs fewer than twice, or no long form should be extracted because the short form is not a valid acronym.

Table 1. List of the short forms and their statistics

Rank Short form # distinct long forms # contextual sentences
Total 657 32910
1 CT 257 32507
2 PCP 61 3606
3 CFTR 11 2079
4 USA 3 1441
5 C3 22 1085
6 ACD 42 893
7 CHX 5 734
8 QC 14 632
9 MAR 31 539
10 TUR 9 471
11 PSF 24 417
12 EPP 21 372
13 SEVERE 0 337
14 NSP 21 309
15 ARB 21 283
16 FNH 2 261
17 XI 6 240
18 MX 19 223
19 PRD 23 207
20 AOAA 4 193
21 DENA 4 181
22 16 H 0 171
23 ASPS 12 162
24 SUS 12 152
25 P70S6K 6 144
26 HACAT 1 138
27 BSH 5 132
28 1S,3R-ACPD 10 126
29 S+ 0 120
30 69 PERCENT 0 116
31 EC 3.1.1.3 0 111
32 PBGD 3 106
33 CYP7A1 1 102
34 5-10 MG/KG 0 98
35 D- 0 94
36 AZF 4 91
37 MSI-L 3 87
38 STL 14 84
39 UDC 2 81
40 DTMP 3 79
41 VICIA FABA 0 76
42 N20 0 74
43 G-A 0 72
44 GTE 2 70
45 GABA-IR 3 68
46 DDDS 3 66
47 EXCEPT ONE 0 64
48 FSU 5 62
49 VIL-10 2 60
50 HKC 3 59
51 7 MIN 0 58
52 PMCT 4 56
53 EG2 1 55
54 14L:10D 0 54
55 PYRUVATE 0 52
56 MNPCES 2 51
57 DOWNSTREAM 0 50
58 CA2 1 49
59 B-CELLS 0 48
60 AP3A 3 47
61 CARCINOMA 0 46
62 D GROUP 0 45
63 CT-SCAN 1 44
64 E-I 4 43
65 HR/HR 0 42
66 MAST CELLS 0 41
67 Q(A 0 40
68 SFV' 0 39
69 7:3, V/V 0 39
70 GENOTYPE 0 38
71 150 PPM 0 37
72 OR PLASMA 0 37
73 FASR 2 36
74 A CA(2+ 0 35
75 PRO-UPA 1 35
76 K(OW 0 34
77 SST1-5 1 33
78 CODE 1 33
79 ALLOGRAFT 0 32
80 PA1 2 32
81 AQUAPORINS 0 31
82 NIPAAM 1 31
83 N=203 0 30
84 ARK 3 30
85 PROMM 1 29
86 CT/MRI 3 29
87 30 MS 0 28
88 IHN 4 28
89 SP-I 4 28
90 OPB 4 27
91 CFG 4 27
92 IEPS 4 26
93 SEASONAL 0 26
94 AIK 2 26
95 AB- 2 25
96 IACT 4 25
97 RAT LIVER 0 25
98 RMCP II 2 24
99 AIJ 1 24
100 HPK 3 24
101 SJA 2 23
102 D(MAX 2 23
103 MAGNEVIST 0 23
104 1,000 PPM 0 23
105 HDTMA 2 22
106 ANGI 1 22
107 PERSANTINE 0 22
108 T14 1 21
109 GDIS 2 21
110 APND 2 21
111 NSILA 1 21
112 TFMS 2 20
113 C4-C6 0 20
114 HOLES 0 20
115 P<0.013 0 20
116 MG/KG/D 0 19
117 10 MA 0 19
118 ZPT 3 19
119 RHODAMINE 0 19
120 F8C 0 19
121 B-FABP 5 19
122 CHIRAL 0 18
123 NA+/H+ 0 18
124 HEMOLYSIS 0 18
125 RTES 1 18
126 3-HB 1 18
127 MERIT-HF 3 17
128 CK 19 2 17
129 TCNE 1 17
130 PITPS 1 17
131 GP30 0 17
132 50-60 GY 0 17
133 NARGHI 0 16
134 EBE 2 16
135 IL-1A 2 16
136 QMC 2 16
137 UFN 2 16
138 AUTM 2 16
139 22Q11 0 16
140 U87MG 0 15
141 ABF1 1 15
142 FR 30 1 15
143 10 BP 0 15
144 OR 2 0 15
145 CLASS 0 0 15
146 KNOBS 0 15
147 RMPS 1 15
148 R1-6 0 14
149 NWL 1 14
150 BIND 1 14
151 DLL1 1 14
152 TAU OFF 0 14
153 H2-AGONIST 0 14
154 6% EACH 0 14
155 0.005 MM 0 14
156 LEFT LOBE 0 14
157 HS- 2 13
158 CHV-1 3 13
159 EMBRYOS 0 13
160 M22 0 13
161 R=0.29 0 13
162 SW-13 0 13
163 OHDA 0 13
164 3 MUMOL 0 13
165 ALPHA-AT 2 13
166 STEEL 0 12
167 CANA 2 12
168 GABOB 2 12
169 ALDB 1 12
170 DMNT 1 12
171 OSATS 1 12
172 R/G 1 12
173 3-CB 2 12
174 VRNP 1 12
175 IONSPRAY 0 12
176 MFNS 2 12
177 MDC/CCL22 1 11
178 T11TS 1 11
179 GLUD 1 11
180 PGASE 1 11
181 RPTKS 1 11
182 EC 2.8.1.2 0 11
183 NOISY 0 11
184 8.8 MG/KG 0 11
185 CMBA 2 11
186 X-GLUC 0 11
187 20-25 G 0 11
188 IODIXANOL 0 11
189 BA6 1 11
190 THETA MAX 0 10
191 DRB1*0101 0 10
192 HEP 2 1 10
193 AEU 3 10
194 4-12 WEEKS 0 10
195 CLQ-BA 1 10
196 BETA1-4 0 10
197 SEM 0.10 0 10
198 14 HR 0 10
199 FREE DRUG 0 10
200 IWQOL-LITE 1 10
201 NIGERICIN 0 10
202 PCB 118 0 10
203 MDNCF 1 10
204 XOX 1 10
205 R2* 0 10
206 OPTIBOND 0 9
207 0.1-3 NMOL 0 9
208 SERVO NULL 0 9
209 M 1 0 9
210 PGOE 1 9
211 44 MICROM 0 9
212 TFPI1-161 0 9
213 AUTOPHAGY 0 9
214 WERNICKE 0 9
215 HERCEPTEST 0 9
216 2-3 KG 0 9
217 AAQ 0 9
218 E/G 2 9
219 CV/VC 0 9
220 FRES. 0 9
221 MTRPS 1 9
222 ISVP 3 9
223 CASP4 0 9
224 RC2 0 9
225 HUEC 1 8
226 6-13 YEARS 0 8
227 3.3 DAYS 0 8
228 JDP2 1 8
229 18:1T 0 8
230 GROUP PE 2 8
231 FIOCRUZ 1 8
232 DAPM 0 8
233 0.5 MUG 0 8
234 ASP-->GLU 0 8
235 LYS 4 0 8
236 V I 0 8
237 OTS/FOETS 0 8
238 PXAS 1 8
239 PGLUAP 1 8
240 NEK 0 8
241 BURNS 0 8
242 RPSO 0 8
243 MK-CSF 2 8
244 ADH-C2 0 8
245 TEF1 0 8
246 CIMR 1 8
247 SJS/TEN 2 8
248 EARSS 1 8
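
Because some short forms in Table 1 contain spaces or digits (e.g., 16 H or CK 19), a row cannot be split naively on whitespace. The sketch below is one possible way to read a row, assuming single spaces between fields, the first field being the rank, and the last two fields being the two counts. The names `ROW` and `parse_row` are hypothetical.

```python
import re

# rank, short form (may itself contain spaces), # long forms, # sentences
ROW = re.compile(r"^(\d+) (.+) (\d+) (\d+)$")


def parse_row(line):
    """Split one table row into (rank, short_form, n_long_forms, n_sentences)."""
    match = ROW.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized row: {line!r}")
    rank, short_form, n_long, n_sent = match.groups()
    return int(rank), short_form, int(n_long), int(n_sent)


rows = [parse_row(r) for r in ["1 CT 257 32507", "22 16 H 0 171", "128 CK 19 2 17"]]

# Short forms for which no long form was accepted.
no_long_form = [short for _, short, n_long, _ in rows if n_long == 0]
print(rows)
print(no_long_form)
```

Filtering on a zero count of long forms, as in the last step, picks out the entries that are not valid acronyms or whose long forms all occur fewer than twice.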