Tuesday, December 8, 2015

django haystack, elasticsearch and 은전한닢 integration (2)

4. Connecting haystack to elasticsearch with 은전한닢 applied

I basically followed the method at the link below.

https://wellfire.co/learn/custom-haystack-elasticsearch-backend/

Here, the default analyzer defined in class ConfigurableElasticBackend(ElasticsearchSearchBackend): must be changed to the analyzer value configured in step 5 (in this case, korean_index).

In short, this approach subclasses the part of haystack that configures the elasticsearch backend and overrides it. It seemed like a good way to use a custom analyzer without modifying haystack itself, so I went with it.

Below is the full text of the explanation from the blog linked above.
-------------------------------------------------------------------------------------------------

A lot of feature requirements in Django projects are solved by domain specific third-party modules that smartly fit the bill and end up becoming something of a community standard. For search, Haystack is that touchstone: it supports some of the most common search engines and its API closely mirrors that of existing Django APIs, making it easy for developers to get started.
We’ve been using Haystack with a Lucene backed engine called ElasticSearch - you know, for search. Unlike the popular Solr search engine, ElasticSearch uses schema-free JSON instead of XML and runs as a binary without requiring an external Java server. For our needs it optimizes simplicity and power.
Note: ElasticSearch support is only available in Haystack 2.0.0 beta. To use it you’ll need to grab the code from source, not PyPI.

What ElasticSearch can do

Rather than simply filtering your content, a search engine performs textual matching. Unlike a LIKE query in SQL, the query and indexed content can be given different relevancy weights, language characteristics can be chosen, and even synonyms can be used. And it can do this across different types of content, or rather, different types of ‘documents’.
The search engine does so by tokenizing and filtering the content - both indexed content and query terms. ElasticSearch allows you to configure how these are used, and you can add your own as well. With the available filters and tokenizers, you can add in analyzers that reference different languages, use custom stop words, and filter on synonyms. You update the index based on the index configuration.
Here’s an example from the ElasticSearch docs for setting up an analyzer to filter on synonyms using a provided synonym file.
{
   "index" : {
       "analysis" : {
           "analyzer" : {
               "synonym" : {
                   "tokenizer" : "whitespace",
                   "filter" : ["synonym"]
               }
           },
           "filter" : {
               "synonym" : {
                   "type" : "synonym",
                   "synonyms_path" : "analysis/synonym.txt"
               }
           }
       }
   }
}
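Once an index exists with these settings, the analyzer can be sanity-checked with Elasticsearch's _analyze API. Below is a minimal sketch using the Python requests library (an assumption; it is not part of the original post). It targets the 1.x-era API this post was written against, where the analyzer is passed as a query parameter and the text to analyze is the plain request body; the index name myindex is made up.
import requests

# Run the "synonym" analyzer defined above over a sample string and
# print the tokens Elasticsearch produces.
resp = requests.post(
    'http://127.0.0.1:9200/myindex/_analyze',
    params={'analyzer': 'synonym'},
    data='quick brown fox',
)
for token in resp.json()['tokens']:
    print(token['token'])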
This looks like a pretty useful feature until you realize that Haystack’s ElasticSearch backend only supports a default setting configuration. Here’s what our index settings look like (source).
DEFAULT_SETTINGS = {
   'settings': {
       "analysis": {
           "analyzer": {
               "ngram_analyzer": {
                   "type": "custom",
                   "tokenizer": "lowercase",
                   "filter": ["haystack_ngram"]
               },
               "edgengram_analyzer": {
                   "type": "custom",
                   "tokenizer": "lowercase",
                   "filter": ["haystack_edgengram"]
               }
           },
           "tokenizer": {
               "haystack_ngram_tokenizer": {
                   "type": "nGram",
                   "min_gram": 3,
                   "max_gram": 15,
               },
               "haystack_edgengram_tokenizer": {
                   "type": "edgeNGram",
                   "min_gram": 2,
                   "max_gram": 15,
                   "side": "front"
               }
           },
           "filter": {
               "haystack_ngram": {
                   "type": "nGram",
                   "min_gram": 3,
                   "max_gram": 15
               },
               "haystack_edgengram": {
                   "type": "edgeNGram",
                   "min_gram": 2,
                   "max_gram": 15
               }
           }
       }
   }
}
And here’s the snippet showing how these are used (source).
if current_mapping != self.existing_mapping:
   try:
       # Make sure the index is there first.
       self.conn.create_index(self.index_name, self.DEFAULT_SETTINGS)
       self.conn.put_mapping(self.index_name, 'modelresult', current_mapping)
       self.existing_mapping = current_mapping
   except Exception:
       if not self.silently_fail:
           raise
The settings configure two nGram analyzers for Haystack, but we’re left without a way of changing the filter or tokenizer attributes, or of adding a new analyzer.

Using custom index settings

The solution, for the time being, is to use a custom search backend. The first step is to update the settings used for updating the index. Here’s a custom backend extending the original.
from django.conf import settings
from haystack.backends.elasticsearch_backend import ElasticsearchSearchBackend

class ConfigurableElasticBackend(ElasticsearchSearchBackend):

   def __init__(self, connection_alias, **connection_options):
       super(ConfigurableElasticBackend, self).__init__(
                               connection_alias, **connection_options)
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
       if user_settings:
           setattr(self, 'DEFAULT_SETTINGS', user_settings)
This extended backend does nothing more than look for a custom settings dictionary in your project settings file and then replace the backend settings with your own. But now we can swap out those settings.
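For example, the synonym configuration shown earlier could now be dropped into settings.py under the key the backend reads (ELASTICSEARCH_INDEX_SETTINGS, from the snippet above); a sketch:
# settings.py
ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "synonym": {
                    "tokenizer": "whitespace",
                    "filter": ["synonym"]
                }
            },
            "filter": {
                "synonym": {
                    "type": "synonym",
                    "synonyms_path": "analysis/synonym.txt"
                }
            }
        }
    }
}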

Choosing a new default analyzer

Even though we’ve updated the settings, our changes are still unavailable. Haystack assigns an analyzer to each search field based on a hard-coded value.
The default analyzer for non-nGram fields is the “snowball” analyzer. The snowball analyzer is basically a stemming analyzer, which means it helps piece apart words that might be components or compounds of others, as “swim” is to “swimming”, for instance. It also adds a stop word filter, which keeps common words, such as prepositions and articles, out of the index. The analyzer is also language specific, which could be problematic: the default language is English, and to change it you need to specify the language in the index settings.
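For instance, the index settings can declare a snowball analyzer with an explicit language, along these lines (a sketch; the analyzer name snowball_de is made up, while type and language are the stock Elasticsearch snowball options):
ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                # hypothetical name; "language" selects the stemmer
                "snowball_de": {
                    "type": "snowball",
                    "language": "German"
                }
            }
        }
    }
}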
Here’s the snippet in which the default analyzer is set in the build_schema method, with minor formatting changes for this page (source).
if field_mapping['type'] == 'string' and field_class.indexed:
   field_mapping["term_vector"] = "with_positions_offsets"

   if not hasattr(field_class, 'facet_for') and not \
           field_class.field_type in('ngram', 'edge_ngram'):
       field_mapping["analyzer"] = "snowball"
The chosen analyzer should be configurable, so let’s make it so.
class ConfigurableElasticBackend(ElasticsearchSearchBackend):

   DEFAULT_ANALYZER = "snowball"

   def __init__(self, connection_alias, **connection_options):
       super(ConfigurableElasticBackend, self).__init__(
                               connection_alias, **connection_options)

        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
        user_analyzer = getattr(settings, 'ELASTICSEARCH_DEFAULT_ANALYZER', None)

       if user_settings:
           setattr(self, 'DEFAULT_SETTINGS', user_settings)
       if user_analyzer:
           setattr(self, 'DEFAULT_ANALYZER', user_analyzer)

   def build_schema(self, fields):
       content_field_name, mapping = super(ConfigurableElasticBackend,
                                             self).build_schema(fields)

       for field_name, field_class in fields.items():
           field_mapping = mapping[field_class.index_fieldname]

           if field_mapping['type'] == 'string' and field_class.indexed:
               if not hasattr(field_class, 'facet_for') and not \
                                 field_class.field_type in('ngram', 'edge_ngram'):
                    field_mapping['analyzer'] = self.DEFAULT_ANALYZER
           mapping.update({field_class.index_fieldname: field_mapping})
       return (content_field_name, mapping)
This update closely follows how the base method is written, including iterating through the fields and ignoring nGram fields. Now, on reindexing, all of your non-nGram indexed content will be analyzed with your specified analyzer. For explicitness, the default analyzer is set directly as a class attribute.

Search analyzers by field

We’ve now set up a configurable default analyzer, but why not control this on a field-by-field basis? It should be pretty straightforward. We’ll just subclass the fields, adding an analyzer attribute via a keyword argument.
class ConfigurableFieldMixin(object):

   def __init__(self, **kwargs):
       self.analyzer = kwargs.pop('analyzer', None)
       super(ConfigurableFieldMixin, self).__init__(**kwargs)
And then define a new field class using the mixin:
from haystack.fields import CharField as BaseCharField

class CharField(ConfigurableFieldMixin, BaseCharField):
   pass
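Used in a search index, this might look something like the following sketch (the Note model and the myapp module paths are made up for illustration; the synonym analyzer is the one from the settings example above):
from haystack import indexes

from myapp.models import Note
from myapp.search_fields import CharField  # the subclass defined above

class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # Document field indexed with a custom analyzer from the index settings.
    text = CharField(document=True, use_template=True, analyzer='synonym')
    # No analyzer given: falls back to the backend's DEFAULT_ANALYZER.
    title = CharField(model_attr='title')

    def get_model(self):
        return Note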
Just be sure to import and use the new field rather than the field from the indexes module as you’d normally do. This establishes which analyzer the field should use, but doesn’t actually use the analyzer for indexing. Again, we need to extend the subclassed backend to do so, focusing on the build_schema method.
class ConfigurableElasticBackend(ElasticsearchSearchBackend):

   DEFAULT_ANALYZER = "snowball"

   def __init__(self, connection_alias, **connection_options):
       super(ConfigurableElasticBackend, self).__init__(
                               connection_alias, **connection_options)
        user_settings = getattr(settings, 'ELASTICSEARCH_INDEX_SETTINGS', None)
       if user_settings:
           setattr(self, 'DEFAULT_SETTINGS', user_settings)

   def build_schema(self, fields):
       content_field_name, mapping = super(ConfigurableElasticBackend,
                                             self).build_schema(fields)

       for field_name, field_class in fields.items():
           field_mapping = mapping[field_class.index_fieldname]

           if field_mapping['type'] == 'string' and field_class.indexed:
               if not hasattr(field_class, 'facet_for') and not \
                                 field_class.field_type in('ngram', 'edge_ngram'):
                   field_mapping['analyzer'] = getattr(field_class, 'analyzer',
                                                           self.DEFAULT_ANALYZER)
           mapping.update({field_class.index_fieldname: field_mapping})
       return (content_field_name, mapping)
If you wanted to control nGram analysis on a field-by-field basis, simply remove the conditional.

Putting it all together

When you update your project settings to use the new backend, ensure that you’re referring to an engine (BaseEngine), not a backend (BaseSearchBackend). Since we’ve just defined a new backend, we’ll also need to define a new search engine.
from haystack.backends.elasticsearch_backend import ElasticsearchSearchEngine

class ConfigurableElasticSearchEngine(ElasticsearchSearchEngine):
   backend = ConfigurableElasticBackend
Now simply update your project settings accordingly to reference your new search engine backend and you’re good to go.
HAYSTACK_CONNECTIONS = {
   'default': {
       'ENGINE': 'myapp.backends.ConfigurableElasticSearchEngine',
       'URL': env_var('HAYSTACK_URL', 'http://127.0.0.1:9200/'),
       'INDEX_NAME': 'haystack',
   },
}
ELASTICSEARCH_INDEX_SETTINGS = {
   # index settings
}
ELASTICSEARCH_DEFAULT_ANALYZER = "snowball"
Don’t forget to update your index (e.g. by running Haystack’s rebuild_index management command).



5. Finally, follow the method above, but configure settings.py as follows. (The 은전한닢 analyzer and tokenizer must be added to the settings.)


HAYSTACK_CONNECTIONS = {
 'default': {
     'ENGINE': 'elasticstack.backends.ConfigurableElasticSearchEngine',
     'URL': 'http://127.0.0.1:9200/',
     'INDEX_NAME': 'potenup',
 },
}
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'


ELASTICSEARCH_DEFAULT_ANALYZER = 'korean_index'


ELASTICSEARCH_INDEX_SETTINGS = {
 'settings': {
     "analysis": {
         "analyzer": {
             "korean_index": {
                 "type": "custom",
                 "tokenizer": "mecab_ko_standard_tokenizer"
             },
             "korean_query": {
                 "type": "custom",
                 "tokenizer": "korean_query_tokenizer"
             },
             "ngram_analyzer": {
                 "type": "custom",
                 "tokenizer": "standard",
                 "filter": ["haystack_ngram", "lowercase"]
             },
             "edgengram_analyzer": {
                 "type": "custom",
                 "tokenizer": "standard",
                 "filter": ["haystack_edgengram", "lowercase"]
             }
         },
         "tokenizer": {
             "korean_query_tokenizer": {
                 "type": "mecab_ko_standard_tokenizer",
                 "compound_noun_min_length": 100
             },
             "haystack_ngram_tokenizer": {
                 "type": "nGram",
                 "min_gram": 3,
                 "max_gram": 15,
             },
             "haystack_edgengram_tokenizer": {
                 "type": "edgeNGram",
                 "min_gram": 2,
                 "max_gram": 15,
                 "side": "front"
             }
         },
         "filter": {
             "haystack_ngram": {
                 "type": "nGram",
                 "min_gram": 3,
                 "max_gram": 15
             },
             "haystack_edgengram": {
                 "type": "edgeNGram",
                 "min_gram": 2,
                 "max_gram": 15
             }
         }
     }
 }
}
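After rebuilding the index, searching goes through the standard Haystack API. A minimal sketch (the query string is just an example):
# -*- coding: utf-8 -*-
from haystack.query import SearchQuerySet

# "content" is Haystack's alias for the main document field; the indexed
# text was analyzed with korean_index per the settings above.
results = SearchQuerySet().filter(content=u'검색')
for result in results[:10]:
    print(result.object)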

