Recent work in machine learning for information extraction has focused on two distinct sub-problems: the conventional problem of filling template slots from natural language text, and the problem of wrapper induction, learning simple extraction procedures (wrappers) for highly structured text such as Web pages produced by CGI scripts. For suitably regular domains, existing wrapper induction algorithms can efficiently learn wrappers that are simple and highly accurate, but the regularity bias of these algorithms makes them unsuitable for most conventional information extraction tasks. Boosting is a technique for improving the performance of a simple machine learning algorithm by repeatedly applying it to the training set with different example weightings. We describe an algorithm that learns simple, low-coverage wrapper-like extraction patterns, which we then apply to conventional information extraction problems using boosting. The result is BWI, a trainable information extraction system with a strong precision bias and F1 performance better than state-of-the-art techniques in many domains.
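To make the boosting idea above concrete, the following is a minimal sketch of an AdaBoost-style reweighting loop. It is not BWI itself: the weak learner here is a hypothetical one-dimensional threshold rule rather than a wrapper-like extraction pattern, and the names `train_stump`, `boost`, and `predict` are illustrative assumptions. It only shows how a simple learner can be applied repeatedly to the same training set under different example weightings.

```python
import math


def train_stump(xs, ys, weights):
    """Weak learner (assumed for illustration): the threshold rule with the
    lowest weighted error. Predicts `polarity` when x >= threshold."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best


def boost(xs, ys, rounds=3):
    """Repeatedly fit the weak learner, upweighting misclassified examples."""
    n = len(xs)
    weights = [1.0 / n] * n  # start from a uniform weighting
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # this rule's vote strength
        ensemble.append((alpha, threshold, polarity))
        # Misclassified examples gain weight; correctly classified ones lose it.
        updated = []
        for w, x, y in zip(weights, xs, ys):
            pred = polarity if x >= threshold else -polarity
            updated.append(w * math.exp(-alpha * y * pred))
        total = sum(updated)
        weights = [w / total for w in updated]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all learned rules; returns +1 or -1."""
    score = sum(a * (p if x >= t else -p) for a, t, p in ensemble)
    return 1 if score >= 0 else -1


# Toy data that no single threshold rule can fit: negatives sit in the middle.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]
model = boost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

On this toy set a single threshold rule necessarily misclassifies some points, whereas the weighted vote of three rules fits all six, which is the sense in which boosting improves on the simple learner it wraps.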