Scraping the web with the power of the crowd
The web is the most valuable source of information, but most of that information is not automatically processable. To “scrape” it, people rely on web scraping tools that navigate a website, collect the pages of interest, and generate a wrapper that gives structure to the published information.
Fully automated approaches to scraping the web have already been proposed (e.g. RoadRunner [Crescenzi and Merialdo, AAI 2008]), but they exhibit limited accuracy. On the other hand, supervised tools have limited applicability at web scale. The crowd could be the key to scraping very large numbers of data-intensive web sites with high accuracy.
We propose ALFRED [Crescenzi et al. WWW2013], a web scraping system supervised by the crowd. To generate wrappers, the system poses sequences of simple questions that require a boolean answer (e.g. “Is ‘City of God’ the title of the movie in the page?” Y/N). The answers provided by the workers recruited on a crowdsourcing platform are exploited to generate the correct wrapper.
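The idea of learning a wrapper from boolean answers can be illustrated with a minimal sketch. It is not ALFRED's actual algorithm: it assumes a page is a simple mapping from node paths to text, that a candidate extraction rule is just one of those paths, and that the worker always answers correctly. The names `Page`, `Rule`, `extract`, `learn_wrapper` and `ask` are all hypothetical. Each yes/no answer discards the candidate rules that are inconsistent with it, until the surviving rules agree on every sample page.

```python
from typing import Callable, Dict, List, Tuple

Page = Dict[str, str]  # assumption: node path -> text found at that node
Rule = str             # assumption: a candidate rule is simply a node path


def extract(rule: Rule, page: Page) -> str:
    """Apply a candidate rule to a page (empty string if the node is missing)."""
    return page.get(rule, "")


def learn_wrapper(
    candidates: List[Rule],
    pages: List[Page],
    ask: Callable[[int, str], bool],
) -> Tuple[List[Rule], int]:
    """Prune candidate rules with yes/no questions.

    `ask(page_id, value)` stands for the question
    "Is `value` the target attribute in page `page_id`?".
    Returns the surviving rules and the number of questions posed.
    """
    alive = list(candidates)
    queries = 0
    for page_id, page in enumerate(pages):
        while True:
            # Distinct values the surviving rules extract from this page.
            values = sorted({extract(r, page) for r in alive})
            if len(values) <= 1:
                break  # surviving rules agree here: nothing left to ask
            value = values[0]
            queries += 1
            if ask(page_id, value):
                alive = [r for r in alive if extract(r, page) == value]
            else:
                alive = [r for r in alive if extract(r, page) != value]
    return alive, queries


# Toy sample pages (invented for illustration).
pages = [
    {"h1": "City of God", "title": "City of God - Movies",
     "span.director": "Fernando Meirelles"},
    {"h1": "Oldboy", "title": "Oldboy - Movies",
     "span.director": "Park Chan-wook"},
]
candidates = ["h1", "title", "span.director"]

# A simulated, perfectly reliable worker who knows the movie titles.
titles = ["City of God", "Oldboy"]
rules, n_queries = learn_wrapper(candidates, pages, lambda i, v: v == titles[i])
print(rules, n_queries)  # ['h1'] 1
```

In this toy run a single question is enough, because the first value proposed happens to be the correct title. A real system would also have to choose which question to pose next and to cope with wrong answers, which this sketch ignores.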
Preliminary results are promising:
- Only a few queries are needed to generate accurate wrappers. Even in the presence of inaccurate workers, ALFRED can generate a correct wrapper with fewer than 15 queries.
- The accuracy of the output wrapper is highly predictable, with an average F-measure close to 100% and a standard deviation below 1%, i.e., almost perfect wrappers with small variability.
- The estimation of workers’ error rates is accurate, and spammers and unreliable workers are detected early.
- Costs are contained and highly predictable thanks to a technique that dynamically engages, at runtime, a minimal number of workers: in 92% of the cases, just two workers suffice.
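One way to picture engaging workers dynamically is the sketch below. It is an illustrative assumption, not the technique used by ALFRED: it treats workers as independent, assumes each worker's error rate is already known and lies strictly between 0 and 1, and uses a hypothetical confidence threshold of 0.95. Workers are engaged one at a time, and the system stops as soon as at least two have answered and the posterior probability of one answer is high enough.

```python
from typing import Callable, List, Tuple


def posterior_yes(answers: List[Tuple[bool, float]], prior: float = 0.5) -> float:
    """Probability that the true answer is "yes".

    `answers` holds (answer, error_rate) pairs; workers are assumed to err
    independently, each with an error rate strictly between 0 and 1.
    """
    p_yes, p_no = prior, 1.0 - prior
    for answer, err in answers:
        p_yes *= (1.0 - err) if answer else err
        p_no *= err if answer else (1.0 - err)
    return p_yes / (p_yes + p_no)


def ask_crowd(
    workers: List[Tuple[Callable[[], bool], float]],
    confidence: float = 0.95,   # hypothetical stopping threshold
    min_workers: int = 2,       # hypothetical minimum redundancy
) -> Tuple[bool, float, int]:
    """Engage workers one by one until the answer is confident enough.

    Returns (decision, posterior probability of "yes", workers engaged).
    """
    answers: List[Tuple[bool, float]] = []
    p = 0.5
    for respond, err in workers:
        answers.append((respond(), err))
        p = posterior_yes(answers)
        if len(answers) >= min_workers and max(p, 1.0 - p) >= confidence:
            break
    return p >= 0.5, p, len(answers)


def make_worker(answer: bool, err: float = 0.1) -> Tuple[Callable[[], bool], float]:
    """Build a simulated worker who always gives `answer`."""
    return (lambda: answer), err


# Two agreeing workers are enough; the remaining two are never engaged.
agree = [make_worker(True) for _ in range(4)]
decision, p, engaged = ask_crowd(agree)
print(decision, engaged)  # True 2

# A disagreement forces the system to engage more workers.
mixed = [make_worker(a) for a in (True, False, True, True)]
decision2, p2, engaged2 = ask_crowd(mixed)
print(decision2, engaged2)  # True 4
```

Under these assumptions, two agreeing workers with a 10% error rate already give a posterior of about 0.988, so extra workers are paid for only when the first answers conflict.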
Many challenges are still open:
- To further reduce costs, we aim to adopt a hybrid approach that relies partly on automatic wrapper generation techniques, with light supervision by the crowd.
- Gamification is a promising direction for engaging workers and scaling out wrapper generation: people could play games while teaching ALFRED how to wrap the web.
For more, see our full paper, A framework for learning web wrappers from the crowd.