Getting Started with Lucene

You are like stars falling into the sea, pouring ten thousand whales into a universe.
尚东峰, <鱼辞>

I. How search engines work

[Figure: how a search engine works]

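II. What is Lucene

Lucene is an open-source library for full-text indexing and search, supported and provided by the Apache Software Foundation.
Lucene offers a simple yet powerful application programming interface (API) for full-text indexing and search; in the Java ecosystem it is a mature, free, open-source tool.
Lucene is not a ready-made search engine product, but it can be used to build one.
Official site: http://lucene.apache.org/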

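1. What is full-text search, and how is it implemented?

An indexing program scans every word in a document and builds an index entry for each word, recording how many times and at which positions the word occurs. When a user submits a query, the search program looks it up in the prebuilt index and returns the matching results. The words themselves are obtained by tokenization (word segmentation).

In short: first build an index over the content to be searched, then search through that index.

Inverted index: also called a reverse index. It is keyed by character or word; the entry for each key lists all documents that contain that character or word, and each item in the list records the document's ID and the positions at which the key occurs in that document.

Summary: every word in the documents (the data) is indexed.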
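
The inverted-index idea described above can be sketched in a few lines of plain Java. This is only a minimal illustration, not Lucene's actual data structure: the class name TinyInvertedIndex, its add and lookup methods, and the naive split-on-non-letters tokenization are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy inverted index: term -> (document ID -> positions of the term in that document).
// Hypothetical sketch for illustration; Lucene's real index is far more elaborate.
class TinyInvertedIndex {
    private final Map<String, Map<Integer, List<Integer>>> postings = new TreeMap<>();

    // Lowercase the text, split it on non-letter/non-digit characters,
    // and record the position of every word under the given document ID.
    void add(int docId, String text) {
        int position = 0;
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (word.isEmpty()) {
                continue; // split() can yield a leading empty string
            }
            Map<Integer, List<Integer>> docs = postings.get(word);
            if (docs == null) {
                docs = new TreeMap<>();
                postings.put(word, docs);
            }
            List<Integer> positions = docs.get(docId);
            if (positions == null) {
                positions = new ArrayList<>();
                docs.put(docId, positions);
            }
            positions.add(position++);
        }
    }

    // Returns docId -> positions for the term, or an empty map if the term was never indexed.
    Map<Integer, List<Integer>> lookup(String term) {
        Map<Integer, List<Integer>> docs = postings.get(term.toLowerCase());
        return docs == null ? new TreeMap<Integer, List<Integer>>() : docs;
    }

    public static void main(String[] args) {
        TinyInvertedIndex index = new TinyInvertedIndex();
        index.add(1, "Lucene is a search library");
        index.add(2, "Apache Lucene powers search in Solr");
        System.out.println(index.lookup("lucene")); // {1=[0], 2=[1]}
        System.out.println(index.lookup("search")); // {1=[3], 2=[3]}
    }
}
```

Looking up "lucene" returns both documents with the positions at which the word occurs, which is the "document ID plus positions" record described above. Searching never rescans the documents; it only reads the prebuilt map.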

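2. The indexing and search flow

[Figure: indexing and search flow; the indexing path is drawn in green, the search path in red]

1. Green marks the indexing process, which builds an index from the original content to be searched. It consists of:
identify the original content to be searched --> acquire documents --> create document objects --> analyze documents --> index documents.
2. Red marks the search process, which retrieves content from the index. It consists of:
the user enters a query in the search interface --> create the query --> execute the search against the index --> render the search results.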

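III. Basic usage of Lucene

Use the Lucene API to add (create index entries), delete, update, and query (search) documents in an index.

1. Create a plain Maven project.

2. Add the dependencies to the pom.

// (If the project shows a red error marker, run Maven > Update Project.)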
<dependencies>
        <!-- JUnit for unit tests -->
        <dependency>
            <groupId>junit</groupId>
            <artifactId>junit</artifactId>
            <version>4.12</version>
        </dependency>
        <!-- Lucene core library -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-core</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- Lucene query parser -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-queryparser</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- Lucene's common analyzers (default tokenizers) -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-analyzers-common</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- Lucene highlighter -->
        <dependency>
            <groupId>org.apache.lucene</groupId>
            <artifactId>lucene-highlighter</artifactId>
            <version>4.10.3</version>
        </dependency>
        <!-- Commons IO (file I/O utilities) -->
        <dependency>
            <groupId>commons-io</groupId>
            <artifactId>commons-io</artifactId>
            <version>2.4</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- Java compiler plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.2</version>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                    <encoding>UTF-8</encoding>
                </configuration>
            </plugin>
        </plugins>
    </build>

3. Creating the index

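import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.junit.Test;

public class Lucene {
    @Test
    public void createIndex() throws Exception {
        // Directory object: where the index is stored.
        // FSDirectory keeps it on the file system; RAMDirectory keeps it in memory (rarely used):
        // Directory directory = new RAMDirectory();
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        // Analyzer that splits text into terms
        Analyzer analyzer = new StandardAnalyzer();
        // Writer configuration. Parameter 1: Lucene version (Version.LATEST); parameter 2: the analyzer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LATEST, analyzer);
        // IndexWriter. Parameter 1: the index directory; parameter 2: the configuration (includes the analyzer)
        IndexWriter indexWriter = new IndexWriter(directory, conf);
        // Read the source documents (plain text files) from disk
        File docPath = new File("D:\\lucene\\searchsource");
        File[] files = docPath.listFiles();
        if (files == null) {
            // listFiles() returns null if the path is missing or is not a directory
            files = new File[0];
        }
        for (File f : files) {
            if (!f.isFile()) {
                continue; // skip subdirectories
            }
            // File name
            String fileName = f.getName();
            // File path
            String filePath = f.getPath();
            // File content (assumes the files are UTF-8 encoded)
            String fileContent = FileUtils.readFileToString(f, "UTF-8");
            // File size in bytes
            long fileSize = FileUtils.sizeOf(f);
            // Create a document object
            Document document = new Document();
            // Create the fields.
            // Parameter 1: field name; parameter 2: field value; parameter 3: whether to store the value
            TextField fileNameField = new TextField("name", fileName, Store.YES);
            StoredField filePathField = new StoredField("path", filePath);
            TextField fileContentField = new TextField("content", fileContent, Store.NO);
            LongField fileSizeField = new LongField("size", fileSize, Store.YES);
            // Add the fields to the document
            document.add(fileNameField);
            document.add(filePathField);
            document.add(fileContentField);
            document.add(fileSizeField);
            // Write the document to the index
            indexWriter.addDocument(document);
        }
        // Close the IndexWriter
        indexWriter.close();
    }
}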

After running the test, the index is created successfully:

[Screenshot: result of running createIndex]

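4. Viewing the index files with Luke

Run start.bat to launch it.

[Screenshot: Luke main window]

Document list:

[Screenshot: Luke document list]

Search page:

[Screenshot: Luke search page]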

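5. Searching the index

// Search the index
// Lucene classes used here: DirectoryReader, IndexReader, Term (org.apache.lucene.index);
// IndexSearcher, Query, ScoreDoc, TermQuery, TopDocs (org.apache.lucene.search)
    @Test
    public void searchIndex() throws Exception {
        // 1. Specify where the index is stored
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        // 2. Open the index with an IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);
        // 3. Create an IndexSearcher; its constructor takes the IndexReader
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // 4. Create a query, specifying the field and the keyword to search for
        // Term parameter 1: the field to search; parameter 2: the keyword
        Query query = new TermQuery(new Term("name", "apache"));
        // 5. Run the search. Parameter 1: the query; parameter 2: maximum number of results to return
        TopDocs topDocs = indexSearcher.search(query, 10);
        // Total number of matching documents
        System.out.println("Total hits: " + topDocs.totalHits);
        // 6. Iterate over the results and print them
        for (ScoreDoc scoreDoc : topDocs.scoreDocs) {
            // Document ID
            int id = scoreDoc.doc;
            // Fetch the document from the index
            Document document = indexSearcher.doc(id);
            // Read the stored fields
            System.out.println(document.get("name"));
            System.out.println(document.get("size"));
            // Prints null: "content" was indexed with Store.NO, so its value is not stored
            System.out.println(document.get("content"));
            System.out.println(document.get("path"));
        }
        // 7. Close the IndexReader
        indexReader.close();
    }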

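6. The IK Chinese analyzer

Advantages: more specialized Chinese word segmentation, and the dictionaries can be customized (extension dictionary and stop-word dictionary).

<!-- IK analyzer -->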
        <dependency>
            <groupId>com.janeluo</groupId>
            <artifactId>ikanalyzer</artifactId>
            <version>2012_u6</version>
        </dependency>
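// Put the three custom-dictionary configuration files on the classpath:
// ext.dic             extension dictionary
// stopword.dic        stop-word dictionary
// IKAnalyzer.cfg.xml  IK configuration file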

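// Inspect how the IK analyzer tokenizes text
// Classes used here: TokenStream (org.apache.lucene.analysis); CharTermAttribute, OffsetAttribute
// (org.apache.lucene.analysis.tokenattributes); IKAnalyzer (org.wltea.analyzer.lucene)
    @Test
    public void testAnalyzer() throws Exception {
        // Create an analyzer object
        // Standard analyzer
        //Analyzer analyzer = new StandardAnalyzer();
        //Analyzer analyzer = new CJKAnalyzer();
        // Smart Chinese analyzer (needs the separate lucene-analyzers-smartcn dependency)
        //Analyzer analyzer = new SmartChineseAnalyzer();
        // IK Chinese analyzer
        Analyzer analyzer = new IKAnalyzer();
        // Get a TokenStream from the analyzer
        // Parameter 1: field name, may be null or ""
        // Parameter 2: the text to analyze (a Chinese sample sentence)
        TokenStream tokenStream = analyzer.tokenStream("", "设置一个引用，引用可以有多重类型，可以时候关键词的引用、偏移量的引用");
        // Register attribute references on the stream; attributes come in several types,
        // for example the term text and the character offsets
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        // Offsets
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        // reset() must be called before the first incrementToken()
        tokenStream.reset();
        // Iterate over the tokens
        while (tokenStream.incrementToken()) {
            System.out.println("start->" + offsetAttribute.startOffset());
            // Print the term
            System.out.println(charTermAttribute);
            System.out.println("end->" + offsetAttribute.endOffset());
        }
        // Signal end of stream, then close the TokenStream
        tokenStream.end();
        tokenStream.close();
    }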
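
To make the start/end offsets printed above easier to read, here is a small plain-Java sketch that produces the same kind of information for whitespace-separated text. It is only an illustration under stated assumptions: the class OffsetTokenizerSketch and its tokenize method are hypothetical, and splitting on whitespace is not how IKAnalyzer segments Chinese. As with Lucene's OffsetAttribute, the end offset is exclusive (one past the last character of the token).

```java
import java.util.ArrayList;
import java.util.List;

// Toy tokenizer (hypothetical, for illustration): splits on whitespace and reports
// each token as "term[start,end)" where start is inclusive and end is exclusive.
class OffsetTokenizerSketch {
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Skip whitespace before the next token
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume the token itself
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                tokens.add(text.substring(start, i) + "[" + start + "," + i + ")");
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [full[0,4), text[5,9), search[10,16)]
        System.out.println(tokenize("full text search"));
    }
}
```

Offsets like these are what let a highlighter mark the matched words in the original text.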

7. Adding a document

// Add a document
    @Test
    public void addDocument() throws Exception {
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        Analyzer analyzer = new IKAnalyzer();
        // Parameter 1: Lucene version; parameter 2: the analyzer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LATEST, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, conf);
        // Create a Document object
        Document document = new Document();
        // Create the fields ("测试文件.txt" means "test file.txt")
        TextField fileNameField = new TextField("name", "测试文件.txt", Store.YES);
        StoredField filePathField = new StoredField("path", "D:\\lucene\\测试文件.txt");
        document.add(fileNameField);
        document.add(filePathField);
        // Write the document to the index
        indexWriter.addDocument(document);
        // Release resources
        indexWriter.close();
    }

8. Deleting documents

    // Shared helper: opens an IndexWriter on the index directory
    public IndexWriter getIndexWriter() throws Exception {
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        Analyzer analyzer = new IKAnalyzer();
        // Parameter 1: Lucene version; parameter 2: the analyzer
        IndexWriterConfig conf = new IndexWriterConfig(Version.LATEST, analyzer);
        IndexWriter indexWriter = new IndexWriter(directory, conf);
        return indexWriter;
    }
// Delete all documents (use with caution!)
    @Test
    public void deleteAllDocument() throws Exception {
        // Get an IndexWriter
        IndexWriter indexWriter = this.getIndexWriter();
        // Delete every document in the index
        indexWriter.deleteAll();
        // Release resources
        indexWriter.close();
    }
// Delete the documents that match a query
    @Test
    public void deleteDocumentByQuery() throws Exception {
        IndexWriter indexWriter = this.getIndexWriter();
        // Query selecting the documents to delete
        Query query = new TermQuery(new Term("name", "apache"));
        // Delete the matching documents
        indexWriter.deleteDocuments(query);
        // Release resources
        indexWriter.close();
    }

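9. Updating the index

// Update the index (in essence: delete first, then add)
    @Test
    public void updateDocument() throws Exception {
        IndexWriter indexWriter = this.getIndexWriter();
        // Create the new document ("更新后的文档" means "updated document",
        // "更新后的文档内容" means "updated document content")
        Document document = new Document();
        document.add(new TextField("name", "更新后的文档", Store.YES));
        document.add(new TextField("content", "更新后的文档内容", Store.YES));
        // The Term names the field and keyword to delete by: all documents matching the term
        // are deleted, then the new document is appended.
        indexWriter.updateDocument(new Term("name", "spring"), document);
        // Release resources
        indexWriter.close();
    }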

10. Query subclasses: matching all documents

// Match all documents
public class SearchIndex {
    @Test
    public void testMatchAllDocsQuery() throws Exception {
        // Specify where the index is stored
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        // Create an IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Create an IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        // Create a Query that matches every document
        Query query = new MatchAllDocsQuery();
        System.out.println(query);
        // Search the index
        TopDocs topDocs = indexSearcher.search(query, 100);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        System.out.println("Total hits: " + topDocs.totalHits);
        // Iterate over the results
        for (ScoreDoc scoreDoc : scoreDocs) {
            int docId = scoreDoc.doc;
            // Fetch the document by its ID
            Document document = indexSearcher.doc(docId);
            // Read the stored fields
            System.out.println(document.get("name"));
            System.out.println(document.get("size"));
            System.out.println(document.get("content"));
            System.out.println(document.get("path"));
        }
        // Close the index reader
        indexReader.close();
    }
}

// Extract the repeated code into helper methods (members of SearchIndex)
    private IndexSearcher getIndexSearcher() throws Exception {
        // Specify where the index is stored
        Directory directory = FSDirectory.open(new File("D:\\lucene\\index"));
        // Create an IndexReader
        IndexReader indexReader = DirectoryReader.open(directory);
        // Create an IndexSearcher
        IndexSearcher indexSearcher = new IndexSearcher(indexReader);
        return indexSearcher;
    }
    private void printResult(IndexSearcher indexSearcher, Query query) throws Exception {
        // Search the index
        TopDocs topDocs = indexSearcher.search(query, 100);
        ScoreDoc[] scoreDocs = topDocs.scoreDocs;
        System.out.println("Total hits: " + topDocs.totalHits);
        // Iterate over the results
        for (ScoreDoc scoreDoc : scoreDocs) {
            int docId = scoreDoc.doc;
            // Fetch the document by its ID
            Document document = indexSearcher.doc(docId);
            // Read the stored fields
            System.out.println(document.get("name"));
            System.out.println(document.get("size"));
            System.out.println(document.get("content"));
            System.out.println(document.get("path"));
        }
    }

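11. Numeric range queries

    @Test
    public void testNumericRangeQuery() throws Exception {
        // Create a numeric range query
        // Parameter 1: field to search; parameter 2: minimum; parameter 3: maximum;
        // parameter 4: whether the minimum is included; parameter 5: whether the maximum is included
        Query query = NumericRangeQuery.newLongRange("size", 1000L, 10000L, false, true);
        System.out.println(query);
        // Print the results
        printResult(getIndexSearcher(), query);
    }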
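
The two boolean parameters decide whether each bound belongs to the range. The plain-Java predicate below sketches that semantics for the call above, which selects sizes in (1000, 10000]. The class RangeBoundsSketch and its inRange method are hypothetical names used only for this illustration; they are not part of Lucene.

```java
// Hypothetical sketch of inclusive/exclusive range bounds; not Lucene code.
class RangeBoundsSketch {
    // Returns true if value lies between min and max, honoring the two inclusive flags.
    static boolean inRange(long value, long min, long max,
                           boolean minInclusive, boolean maxInclusive) {
        boolean aboveMin = minInclusive ? value >= min : value > min;
        boolean belowMax = maxInclusive ? value <= max : value < max;
        return aboveMin && belowMax;
    }

    public static void main(String[] args) {
        // Same bounds as the query above: minimum excluded, maximum included
        System.out.println(inRange(1000L, 1000L, 10000L, false, true));  // false
        System.out.println(inRange(10000L, 1000L, 10000L, false, true)); // true
    }
}
```

So a file of exactly 1000 bytes is not matched by the query above, while a file of exactly 10000 bytes is.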

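12. Combined (boolean) queries

// Occur.MUST: the condition must match, like AND
// Occur.SHOULD: the condition should match but does not have to, like OR
// Occur.MUST_NOT: the condition must not match, like NOT
// Combined (boolean) query
    @Test
    public void testBooleanQuery() throws Exception {
        // Create a BooleanQuery
        BooleanQuery query = new BooleanQuery();
        // Alternative sub-query: file size between 1000 and 10000, both bounds included
        //Query query1 = NumericRangeQuery.newLongRange("size", 1000L, 10000L, true, true);
        // Sub-query 1: the file name contains the term "lucene"
        Query query1 = new TermQuery(new Term("name", "lucene"));
        // Sub-query 2: the file name contains the term "apache"
        Query query2 = new TermQuery(new Term("name", "apache"));
        // Add both to the BooleanQuery: must contain "lucene", must not contain "apache"
        query.add(query1, Occur.MUST);
        query.add(query2, Occur.MUST_NOT);
        System.out.println(query);
        // Run the query
        printResult(getIndexSearcher(), query);
    }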
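
One way to picture the three Occur values is as set operations on the document IDs matched by each sub-query. The sketch below is plain Java, not Lucene code: the class BooleanClauseSketch, its method names, and the sample document IDs are assumptions for illustration, and it ignores relevance scoring entirely.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: boolean clauses as set operations on matching document IDs.
class BooleanClauseSketch {
    // MUST a, MUST b: documents matched by both sub-queries (intersection)
    static Set<Integer> mustAndMust(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.retainAll(b);
        return result;
    }

    // SHOULD a, SHOULD b: documents matched by at least one sub-query (union)
    static Set<Integer> shouldOrShould(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.addAll(b);
        return result;
    }

    // MUST a, MUST_NOT b: documents matched by a but not by b (difference)
    static Set<Integer> mustButNot(Set<Integer> a, Set<Integer> b) {
        Set<Integer> result = new TreeSet<>(a);
        result.removeAll(b);
        return result;
    }

    public static void main(String[] args) {
        // Assumed sample data: IDs of documents whose name contains each term
        Set<Integer> lucene = new TreeSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> apache = new TreeSet<>(Arrays.asList(2, 4));
        System.out.println(mustAndMust(lucene, apache));    // [2]
        System.out.println(shouldOrShould(lucene, apache)); // [1, 2, 3, 4]
        System.out.println(mustButNot(lucene, apache));     // [1, 3]
    }
}
```

The last call corresponds to the query built in testBooleanQuery: name must contain "lucene" and must not contain "apache".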

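13. Searching with QueryParser

A Query can also be created with QueryParser. QueryParser provides a parse method that builds a query directly from Lucene query syntax. To see the query syntax a Query object represents, print it with System.out.println(query);. Parsing requires an analyzer, and the analyzer used for searching should be the same one used when the index was built.

    @Test
    public void testQueryParser() throws Exception {
        // Create a QueryParser. Parameter 1: the default search field; parameter 2: the analyzer
        QueryParser queryParser = new QueryParser("content", new IKAnalyzer());
        // parse() returns a Query object
        // Argument: the text to search for, which can be a whole sentence; it is tokenized first, then searched
        Query query = queryParser.parse("mybatis is a apache project");
//        Query query = queryParser.parse("name:lucene OR name:apache");
        System.out.println(query);
        printResult(getIndexSearcher(), query);
    }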

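14. Lucene query syntax

1. Basic syntax, keyword query:
field name + ":" + search keyword
Example: content:java
2. Range query:
field name + ":" + [min TO max]
Example: size:[1 TO 1000]
Lucene's QueryParser treats the bounds in this syntax as strings rather than numbers, so it does not work on numeric fields such as the LongField "size" above; build a NumericRangeQuery in code for those. Solr supports numeric ranges in this syntax.
3. Combined conditions:
1) +condition1 +condition2: both conditions must match (AND)
Example: +filename:apache +content:apache
2) +condition1 condition2: the first condition must match; the second should match but is optional
Example: +filename:apache content:apache
3) condition1 condition2: matching either condition is enough
Example: filename:apache content:apache
4) -condition1 condition2: condition 1 must not match; condition 2 should match
Example: -filename:apache content:apache

Occur.MUST        the condition must match, like AND        + (plus sign)
Occur.SHOULD      the condition is optional, like OR        no symbol
Occur.MUST_NOT    the condition must not match, like NOT    - (minus sign)

Alternative form:
condition1 AND condition2
condition1 OR condition2
condition1 NOT condition2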
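
The prefix rules above can be sketched as a tiny classifier that labels each clause with the Occur value its prefix implies. This is a simplified illustration in plain Java, not Lucene's QueryParser: the class ClausePrefixSketch and its classify method are hypothetical, and the sketch only handles whitespace-separated field:term clauses (no ranges, phrases, or AND/OR/NOT keywords).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: map the +, - or missing prefix of each clause to an Occur name.
class ClausePrefixSketch {
    static List<String> classify(String query) {
        List<String> result = new ArrayList<>();
        for (String clause : query.trim().split("\\s+")) {
            if (clause.isEmpty()) {
                continue; // empty input
            }
            if (clause.startsWith("+")) {
                result.add("MUST " + clause.substring(1));
            } else if (clause.startsWith("-")) {
                result.add("MUST_NOT " + clause.substring(1));
            } else {
                result.add("SHOULD " + clause);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Prints [MUST filename:apache, SHOULD content:apache]
        System.out.println(classify("+filename:apache content:apache"));
        // Prints [MUST_NOT filename:apache, SHOULD content:apache]
        System.out.println(classify("-filename:apache content:apache"));
    }
}
```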

15. Specifying multiple default search fields

@Test
public void testMultiFieldQueryParser() throws Exception {
    // Default search fields
    String[] fields = {"name", "content"};
    MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields, new IKAnalyzer());
    Query query = queryParser.parse("mybatis is a apache project");
    System.out.println(query);
    printResult(getIndexSearcher(), query);
}