How do you write a web crawler in Java?

Csyor · 2014-04-14

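Writing a Java crawler on top of an existing API is actually not hard, but writing your own lets you tailor every method to your needs, which should be a lot of fun. Concrete information and sample code are not easy to find online, but Csyor did track down a basic crawler algorithm, which is shared here.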

First decide what you want the crawler to achieve. Whatever that is, the basic building blocks are the following:

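1. A List of unvisited URLs: initialize it with the URLs you need to visit

2. A List of visited URLs: the URLs to exclude from further visits

3. A Set of URLs you are not interested in: the crawl rules

4. Storing all of this in a database is necessary, because the program may stop and need to be restarted, and this way no state is lost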
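
The four items above could be held together roughly as follows. This is a minimal sketch, not code from the original article: the class `CrawlState` and its methods are hypothetical names, the rules are reduced to "skip any URL containing one of these substrings", and a plain text file stands in for the database.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical holder for the crawl state described in the list above.
class CrawlState {
    final Deque<String> unvisited = new ArrayDeque<>(); // item 1: URLs still to visit
    final Set<String> visited = new HashSet<>();        // item 2: URLs already taken for fetching
    final Set<String> excluded = new HashSet<>();       // item 3: substrings of URLs to skip

    // Queue a URL unless a rule excludes it or it is already known.
    boolean offer(String url) {
        for (String fragment : excluded) {
            if (url.contains(fragment)) return false;
        }
        if (visited.contains(url) || unvisited.contains(url)) return false;
        unvisited.add(url);
        return true;
    }

    // Take the next URL and mark it visited; returns null when nothing is left.
    String next() {
        String url = unvisited.poll();
        if (url != null) visited.add(url);
        return url;
    }

    // Item 4, simplified: write the state to a text file instead of a database.
    // Each line is "V <url>" (visited) or "U <url>" (unvisited).
    void save(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        for (String url : visited) lines.add("V " + url);
        for (String url : unvisited) lines.add("U " + url);
        Files.write(file, lines, StandardCharsets.UTF_8);
    }

    // Restore a state written by save(); the exclusion rules are not persisted.
    static CrawlState load(Path file) throws IOException {
        CrawlState state = new CrawlState();
        for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
            if (line.startsWith("V ")) state.visited.add(line.substring(2));
            else if (line.startsWith("U ")) state.unvisited.add(line.substring(2));
        }
        return state;
    }
}
```

Calling `save` after every few pages and `load` at startup would let an interrupted crawl pick up where it stopped. A real database would additionally make the membership checks fast for large crawls.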

The basic flow is as follows:

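while (list of unvisited URLs is not empty) { // loop until the unvisited list is empty
    // take a URL from the unvisited list and add it to the visited list
    // fetch its content
    // record the fetched content
    if (content is HTML) { // only HTML pages can contain further links
        // parse the URLs out of the <a> links
        foreach URL { // loop over the extracted URLs
            // if the URL matches the rules you have set
            // and it appears in neither the unvisited list nor the visited list
            // add it to the unvisited list
        }
    }
}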
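
The loop above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions: `CrawlLoopSketch`, `fetchLinks`, and `rule` are names made up here, and `fetchLinks` stands in for the "fetch the page and parse its `<a>` links" step so the loop can be shown without network access. A single `seen` set covers both "already visited" and "already queued".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;
import java.util.function.Predicate;

class CrawlLoopSketch {

    // Runs the crawl loop and returns the URLs in the order they were visited.
    // fetchLinks: hypothetical stand-in that returns the links found on a page
    //             (an empty list for non-HTML content).
    // rule:       returns true for URLs that are worth crawling.
    static List<String> crawl(String seed,
                              Function<String, List<String>> fetchLinks,
                              Predicate<String> rule) {
        Deque<String> unvisited = new ArrayDeque<>();
        Set<String> seen = new HashSet<>();   // every URL ever queued (visited or not)
        List<String> order = new ArrayList<>();

        unvisited.add(seed);
        seen.add(seed);

        while (!unvisited.isEmpty()) {
            String url = unvisited.poll();    // take a URL from the unvisited list
            order.add(url);                   // the "record" step, reduced to remembering the URL
            for (String link : fetchLinks.apply(url)) {
                // Set.add returns false if the link was already seen,
                // so each URL is queued at most once.
                if (rule.test(link) && seen.add(link)) {
                    unvisited.add(link);
                }
            }
        }
        return order;
    }
}
```

Because the pending URLs live in a queue rather than on the call stack, the loop does not grow deeper as the site grows. With Jsoup, `fetchLinks` could wrap `Jsoup.connect(url).get()` followed by `select("a[href]")`.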

If you do decide to build a crawler, there is one more thing to discuss: the list of starting URLs.

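Where do you find the list of sites to crawl? My guess is that you take it from an existing directory of some kind, or even compile it by hand.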

Jsoup is an HTML parser that can make the parsing part very easy and even enjoyable.

The code below probably still has problems.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStreamReader;
 
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
 
public class FileCrawler {
 
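    public static void main(String[] args) throws IOException {

        // record.txt in the working directory holds the visited URLs, one per line
        File dir = new File(".");
        String loc = dir.getCanonicalPath() + File.separator + "record.txt";
        // open in append mode so the file exists before the first checkExist call
        FileWriter fstream = new FileWriter(loc, true);
        BufferedWriter out = new BufferedWriter(fstream);
        out.newLine();
        out.close();

        processPage("http://cis.udel.edu");

        // remove the record file once the crawl has finished
        File file = new File(loc);

        if (!file.delete()) {
            System.err.println("could not delete " + loc);
        }
    }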
 
    // given a String and a File,
    // return whether some line of the File equals the String
    public static boolean checkExist(String s, File fin) throws IOException {

        // try-with-resources closes the reader (and the underlying stream) on every path
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream(fin)))) {
            String aLine = null;
            while ((aLine = in.readLine()) != null) {
                // compare whole lines: contains() would treat a URL as visited
                // whenever a longer recorded URL merely contains it
                if (aLine.trim().equals(s)) {
                    return true;
                }
            }
        }

        return false;
    }
 
    public static void processPage(String URL) throws IOException {
 
        File dir = new File(".");
        String loc = dir.getCanonicalPath() + File.separator + "record.txt";
 
        // skip links we do not want to follow (PDFs, images, e-mail addresses, explicit ports)
        if (URL.contains(".pdf") || URL.contains("@")
            || URL.contains("adfad") || URL.contains(":80")
            || URL.contains("fdafd") || URL.contains(".jpg"))
            return;
 
        // process the url first
        if (!URL.contains("cis.udel.edu")) {
            // url of other site -> do nothing
            return;
        }
        if (URL.endsWith("/")) {
            // strip the trailing slash so both spellings map to one record
            URL = URL.substring(0, URL.length() - 1);
        }
 
        File file = new File(loc);
 
        // check existence
        boolean e = checkExist(URL, file);
        if (!e) {
            System.out.println("------ :  " + URL);
            // insert to file
            FileWriter fstream = new FileWriter(loc, true);
            BufferedWriter out = new BufferedWriter(fstream);
            out.write(URL);
            out.newLine();
            out.close();
 
            Document doc = null;
            try {
                doc = Jsoup.connect(URL).get();
            } catch (IOException e1) {
                e1.printStackTrace();
                return;
            }
 
            // placeholder for page-level processing: here, pages mentioning "PhD"
            if (doc.text().contains("PhD")) {
                //System.out.println(URL);
            }
 
            Elements questions = doc.select("a[href]");
            for (Element link : questions) {
                processPage(link.attr("abs:href"));
            }
        } else {
            // do nothing
            return;
        }
 
    }
}
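
The substring checks at the top of processPage can also be written against the parsed URL instead of the raw string. The following is a sketch of that alternative using `java.net.URI` from the standard library, not part of the original listing. `UrlRules`, `normalize`, and the `allowedHost` parameter are hypothetical names, and skipping every URL with an explicit port is an assumption that only roughly mirrors the `":80"` check above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

class UrlRules {

    // Returns a normalized URL if the link should be crawled, or null if it should be skipped.
    // allowedHost is a hypothetical parameter, for example "cis.udel.edu".
    static String normalize(String url, String allowedHost) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return null; // malformed link
        }

        String scheme = uri.getScheme();
        String host = uri.getHost();
        if (scheme == null || host == null) return null; // relative or opaque URI, e.g. mailto:
        scheme = scheme.toLowerCase(Locale.ROOT);
        host = host.toLowerCase(Locale.ROOT);
        if (!scheme.equals("http") && !scheme.equals("https")) return null;
        if (uri.getPort() != -1) return null; // assumption: skip explicit ports

        // accept the host itself and its subdomains, nothing else
        if (!host.equals(allowedHost) && !host.endsWith("." + allowedHost)) return null;

        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pdf") || lower.endsWith(".jpg")) return null;

        // strip the trailing slash so both spellings map to one record
        if (path.endsWith("/")) path = path.substring(0, path.length() - 1);

        String query = uri.getRawQuery();
        return scheme + "://" + host + path + (query == null ? "" : "?" + query);
    }
}
```

Checking the host rather than the whole string avoids accepting a foreign site whose path merely contains `cis.udel.edu`, and checking the path suffix avoids rejecting a page just because `.pdf` appears somewhere in the middle of its URL.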

Original article: http://www.programcreek.com/2009/11/how-to-write-a-java-crawler/