Saturday, March 31, 2018

Apache Solr: Binary Field

In Solr, BinaryField is used to store binary data. When you need to optimize Solr for performance and speed, it is worth considering this field type.
When using a BinaryField, the data must be sent and retrieved as Base64-encoded strings.
Pay attention to the byte order you use (big-endian or little-endian).

In my case,

- I use Python to calculate some metrics and generate a vector representing each item: a vector of
8 float numbers. I encode this vector with base64.b64encode and save it to Solr. For example, for a single value:
import base64
import struct

value = 5.1
va = bytearray(struct.pack("f", value))
base64.b64encode(va)


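The byte order comes from struct.pack, not from bytearray: the "f" format uses the machine's native byte order, which is little-endian on x86. You can request big-endian order explicitly like this:
va = bytearray(struct.pack(">f", value))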
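For the full 8-float vector mentioned above, a minimal sketch could look like the following. The helper names encode_vector and decode_vector and the sample values are assumptions made for illustration; only struct and base64 from the standard library are used.

```python
import base64
import struct


def encode_vector(values, byte_order=">"):
    """Pack floats as 32-bit values and return a Base64 string."""
    fmt = f"{byte_order}{len(values)}f"
    return base64.b64encode(struct.pack(fmt, *values)).decode("ascii")


def decode_vector(encoded, byte_order=">"):
    """Reverse encode_vector: Base64 string back to a list of floats."""
    raw = base64.b64decode(encoded)
    count = len(raw) // 4  # each 32-bit float takes 4 bytes
    return list(struct.unpack(f"{byte_order}{count}f", raw))


# Sample values chosen to be exactly representable as 32-bit floats.
vector = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
encoded = encode_vector(vector)       # string to send to the BinaryField
decoded = decode_vector(encoded)      # what a reader gets back
```

Passing the byte order explicitly on both sides avoids depending on the native order of whichever machine runs the script.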

- I use Java to read the value back from the BinaryField:
java.nio.ByteBuffer.wrap(bytes).getFloat()


ByteBuffer uses big-endian order by default. Set it to the same order you used when writing to
Solr:
ByteBuffer.wrap(bytes).order(ByteOrder.BIG_ENDIAN).getFloat()
or 
ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getFloat()
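To see why the orders must match, here is a small Python sketch (Python rather than Java, to stay consistent with the writer side) that reads the same four bytes with both byte orders. The variable names are illustrative only.

```python
import struct

value = 5.1
little = struct.pack("<f", value)   # bytes as a little-endian writer produces them
big = struct.pack(">f", value)      # bytes as a big-endian writer produces them

# Reading with the same order that was used for writing recovers the value
# (up to 32-bit float precision).
matched = struct.unpack("<f", little)[0]

# Reading little-endian bytes as big-endian yields an unrelated number.
mismatched = struct.unpack(">f", little)[0]
```

The two encodings contain the same bytes in reverse order, which is why a mismatch does not raise an error and only shows up as wrong values.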




Race condition with Redis transaction

Suppose you have a limited quantity of item X to sell, and many users log into your website and buy item X at the same time. How do you sell as many as possible without selling more than you have?

There are two phases when a user wants to buy item X.
- Checking phase: check whether item X is available.
- Selling phase: if item X is available, sell it to the user and decrease the available quantity of X.
For example, you have 5 units of item X and 10 users place an order at the same time. What if all 10 orders are in the checking phase at once? Your system will accept 10 orders while you only have 5 items. So how do you solve it? This post gives a solution using Redis transactions.
To learn more about Redis transactions, you can read:
https://redis.io/topics/transactions
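The oversell scenario described above (10 simultaneous orders for 5 items) can be reproduced with a deterministic simulation. This is only an illustration in plain Python with no Redis involved; the interleaving is forced by running every check before any sale.

```python
stock = 5
users = 10

# Checking phase: every user reads the stock before anyone has bought,
# so all of them see 5 items available.
passed_check = [user for user in range(users) if stock > 0]

# Selling phase: each user who passed the check is sold an item,
# even though the check result is stale by now.
sold = 0
for user in passed_check:
    stock -= 1
    sold += 1
```

Because the check and the sale are separate steps, all 10 orders are accepted and the stock goes negative. The check and the update have to happen as one atomic unit.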

If you only need to execute a sequence of commands with no checking logic between them, you can use MULTI/EXEC. In this case, however, we have to check whether item X is available before deciding to sell it, so we need another option: WATCH.

I will use Go to show how to deal with it.
The item's status is saved in Redis under two keys:
- Quantity: the number of items available in the warehouse.
- Reserved: the number of items in successful orders that have not been delivered to the user yet.


WATCHed keys are monitored in order to detect changes against them. If at least one watched key is modified before the EXEC command, the whole transaction aborts, and EXEC returns a Null reply to notify that the transaction failed.

When the transaction fails, we call the order-processing function again, and repeat until it completes without a conflict.

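import (
    "errors"

    "github.com/go-redis/redis"
)

// client, keyQuantity and keyReserved are assumed to be declared elsewhere.
func reserve(key string) error {
    err := client.Watch(func(tx *redis.Tx) error {
        curQuantity, err := tx.Get(keyQuantity).Int64()
        if err != nil {
            return err
        }
        curReserved, err := tx.Get(keyReserved).Int64()
        if err != nil {
            return err
        }

        // Sell only while at least one item is still unreserved.
        if curQuantity-curReserved < 1 {
            return errors.New("item is sold out")
        }
        // The increment is applied only if no watched key changed in the meantime.
        _, err = tx.Pipelined(func(pipe *redis.Pipeline) error {
            pipe.Incr(keyReserved)
            return nil
        })
        return err
    }, keyQuantity, keyReserved)
    if err == redis.TxFailedErr {
        // Another client modified a watched key: process the order again.
        return reserve(key)
    }
    return err
}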
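The check-and-retry pattern behind WATCH can also be sketched without a Redis server. The Store class below is a hypothetical in-memory stand-in, not the Redis API: it keeps a version number per key, watch() takes a snapshot of those versions, and commit_incr() fails if any of them changed, much like EXEC returning a Null reply.

```python
class ConflictError(Exception):
    """Raised when a watched key changed before commit (like a failed EXEC)."""


class Store:
    """Hypothetical in-memory stand-in for Redis, with a version per key."""

    def __init__(self, **values):
        self.values = dict(values)
        self.versions = {key: 0 for key in values}

    def watch(self, *keys):
        # Snapshot the versions, as WATCH remembers the state of the keys.
        return {key: self.versions[key] for key in keys}

    def commit_incr(self, snapshot, key):
        # Apply the increment only if no watched key changed since watch().
        if any(self.versions[k] != v for k, v in snapshot.items()):
            raise ConflictError(key)
        self.values[key] += 1
        self.versions[key] += 1


def reserve(store, before_commit=None):
    """Return True if an item was reserved, False if the item is sold out."""
    while True:
        snapshot = store.watch("quantity", "reserved")
        if store.values["quantity"] - store.values["reserved"] < 1:
            return False
        if before_commit is not None:
            hook, before_commit = before_commit, None
            hook()  # simulate a competing order between check and commit
        try:
            store.commit_incr(snapshot, "reserved")
            return True
        except ConflictError:
            continue  # a watched key changed: check again with fresh values


store = Store(quantity=5, reserved=0)
results = [reserve(store) for _ in range(10)]

# One item left, and a competing order slips in between check and commit.
contested = Store(quantity=1, reserved=0)
late_order = reserve(contested, before_commit=lambda: reserve(contested))
```

With 5 items and 10 orders, only 5 reservations succeed. In the contested case the competing order wins, the first commit is rejected, and the retry finds nothing left to sell.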




Apache Solr: Personalized Rerank Query


What a re-rank query means in Solr:

Query Re-Ranking allows you to run a simple query (A) for matching documents and then re-rank the top N documents using the scores from a more complex query (B).
Sometimes, however, you need deeper customization than a more complex query (B) can offer, for example fetching some metrics from a database to calculate the new score. This post shows how to do that.

First, note that the interface Solr provides does not allow this kind of customization, so we need to define another QParserPlugin. The original plugin is org.apache.solr.search.ReRankQParserPlugin, and we will use the code of this class as the starting point for our changes.

The original plugin uses reRankQuery to calculate the new score for each document. We do not use a query for that: we use user information queried from the database, combined with some of the
document's metrics, to calculate a score per user. That means every Personalized Rerank Query needs
access to the database. Since we do not need reRankQuery, we will remove it. This is the block in the original plugin that uses it:
TopDocs rescoredDocs = new QueryRescorer(reRankQuery) {
    @Override
    protected float combine(float firstPassScore, boolean secondPassMatches,
                            float secondPassScore) {
        float score = firstPassScore;
        if (secondPassMatches) {
            score += reRankWeight * secondPassScore;
        }
        return score;
    }
}.rescore(searcher, mainDocs, mainDocs.scoreDocs.length);


We can replace the block above with custom code: access the database to get the
user's information and combine it with the document's information, read through
IndexSearcher.doc(docId), to calculate the new score.
The new scores must be written back to mainDocs.scoreDocs. If you want to get the original score from the sort formula, set fillFields = true in place of the false in the original initialization:

this.mainCollector = TopFieldCollector.create(sort, Math.max(this.reRankDocs, length),
false, true, true, true);


Each ScoreDoc will then be an instance of FieldDoc, and you can get the original score from the FieldDoc.
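The personalized re-scoring step described above can be sketched independently of Lucene. The following is an illustration in Python, not the plugin's Java code: personalized_rerank, the category-based boost formula, and the two dictionaries standing in for database and index data are all assumptions.

```python
def personalized_rerank(main_docs, rerank_docs, user_profile, doc_metrics, weight=1.0):
    """Re-score the top `rerank_docs` hits for one user and re-sort them.

    main_docs: list of (doc_id, first_pass_score), already sorted by score.
    user_profile, doc_metrics: hypothetical dicts standing in for DB and index data.
    """
    head, tail = main_docs[:rerank_docs], main_docs[rerank_docs:]
    rescored = []
    for doc_id, first_pass_score in head:
        # Hypothetical formula: boost by the user's affinity for the doc's category.
        category = doc_metrics[doc_id]["category"]
        boost = user_profile.get(category, 0.0)
        rescored.append((doc_id, first_pass_score + weight * boost))
    # Only the re-ranked head is re-sorted; the tail keeps its original order.
    rescored.sort(key=lambda hit: hit[1], reverse=True)
    return rescored + tail


main_docs = [("a", 3.0), ("b", 2.5), ("c", 2.0), ("d", 1.0)]
doc_metrics = {
    "a": {"category": "sports"},
    "b": {"category": "music"},
    "c": {"category": "music"},
    "d": {"category": "sports"},
}
user_profile = {"music": 2.0}  # this user prefers music

reranked = personalized_rerank(main_docs, 3, user_profile, doc_metrics)
```

As with the built-in re-rank, only the top N documents are touched, so the cost of the per-user lookup is bounded by N rather than by the size of the result set.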

Note:
* Pay attention to the performance cost of accessing an external system inside the re-rank query. For example:
- If you need the user's information, you can store it in Redis and cache it in the Solr process with Guava (Apache Solr itself uses Guava for caching).
- To get document metrics, you can index them with the document, or simply load them
into heap memory.
* A re-rank query cannot be used together with result grouping; use collapse and expand instead of group.
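The caching advice in the note above can be illustrated with a small in-process cache. This sketch uses Python's functools.lru_cache as a stand-in for a Guava cache, and fetch_user_from_store is a hypothetical placeholder for the Redis or database lookup.

```python
from functools import lru_cache

lookups = []  # records every call that actually reaches the external store


def fetch_user_from_store(user_id):
    """Hypothetical slow lookup, standing in for a Redis or database call."""
    lookups.append(user_id)
    return {"id": user_id, "music": 2.0}


@lru_cache(maxsize=10_000)
def get_user_info(user_id):
    # Only the first request per user reaches the external store.
    return fetch_user_from_store(user_id)


for _ in range(3):
    get_user_info("u1")
get_user_info("u2")
```

Three queries for the same user cause a single external lookup. Note that lru_cache never expires entries by time, so a real setup would also need some way to refresh user data that changes.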