What you might notice is that for this game of RBC I used a chess engine, and had to build code around it to compensate for the fact that RBC isn't really straight chess. Because of that there are some notable flaws:

- The engine assumes that it would know if it was put into check. In RBC you might not know, and this could cost you the game.
- The engine assumes that whatever board state you give it is true. This means that the "information value" of a move isn't taken into account.

If you were to try to write a tree search which took into account the sets of potential boards you would run into another problem. The imperfect information version of chess has an exponentially larger tree than regular chess, which is already a pretty massive game.

In order to even approach searching that tree you would need some way of accelerating your tree search.

To do this I developed a small algorithm which, in my very preliminary testing, gives around a 3x speedup, with logarithmic time and poly-logarithmic space complexity.

The full writeup can be found here, but I will give a quick overview.

The method came out of observing that you could lift state valuations from the perfect information variant of a game, and that the valuation of a set of states should correspond to the minimum valuation of the states in it. All this means is that if the opponent is playing optimally you will end up in your worst case scenario.

This has a nice property if the sets of states are finite. If we have two sets of states X and Y, and X is contained in Y, then the value of X is greater than or equal to the value of Y. In math:

$$ X \subseteq Y \implies val(X) \ge val(Y)$$

Which gives:

$$ X \subseteq Y \subseteq Z \implies val(X) \ge val(Y) \ge val(Z)$$

This is really useful, because actually computing val(X) requires traversing the entire game tree below X, which is an extremely expensive operation. Many tree searches can prune certain paths if we already know a better one is available, so if we can find tight enough bounds on the value of X we might not even need to search past that point in the game tree.

There are two major issues here though.

First, even figuring out if X is a subset of Y can be expensive. Computing that takes time linear in the size of the sets involved, and in the case of RBC those sets can contain millions or tens of millions of states.

Secondly, we encounter and value millions and millions of different information sets; storing all of their values and iterating across them all to find the smallest superset and largest subset would be incredibly expensive.

For those two issues there are two approximate solutions:

**1) Computing subsets and supersets**

To fix this first problem we will weaken our requirements slightly. We don't care about finding an exact subset or exact superset, in fact we don't fully need one.

All we need for our bounds to exist is that the minimally valued state in our set is in the intersection of our sets. We can estimate the likelihood of this happening if we know the size of the intersection and the size of our sets. Storing the size of our sets has very little cost, as it is a single integer, so all we need is a method of estimating the size of the intersection of two sets.

That is where Jaccard Similarity and MinHash come in.

Jaccard Similarity is the ratio of the size of the intersection of two sets to the size of their union. MinHash is a clever algorithm that can quickly approximate the Jaccard Similarity.

MinHash works by taking a single hash function, hashing all of the elements of a set, and storing the k smallest hash values as that set's signature. To estimate the Jaccard Similarity of two sets, you look at how many of the k smallest values of their combined signatures appear in both; the ratio of that overlap to k approximates the Jaccard Similarity.
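A minimal sketch of that bottom-k flavor of MinHash (using SHA-1 as the single hash function, which is an arbitrary choice on my part):

```python
import hashlib

def signature(items, k=128):
    """Bottom-k MinHash: hash every element and keep the k smallest hash values."""
    hashes = sorted(
        int(hashlib.sha1(str(x).encode()).hexdigest(), 16) for x in set(items)
    )
    return set(hashes[:k])

def jaccard_estimate(sig_a, sig_b, k=128):
    """Estimate Jaccard similarity from two bottom-k signatures."""
    # the k smallest hashes of the union, as seen through the merged signatures
    union_k = set(sorted(sig_a | sig_b)[:k])
    # fraction of those that belong to both sets
    return len(union_k & sig_a & sig_b) / len(union_k)
```

Note that the signatures are tiny (at most k integers each), so the estimate costs O(k log k) regardless of how many millions of states the underlying sets hold.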

Once we have the Jaccard Similarity, and we know the sizes of our sets, a tiny bit of algebraic manipulation gives an estimate of the size of their intersection; from that we can find the probability that a given set serves as an upper or lower bound on the value of another set.
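Writing $J$ for the Jaccard Similarity of sets $A$ and $B$, the manipulation is just:

$$ J = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|} \implies |A \cap B| = \frac{J}{1 + J}\,\left(|A| + |B|\right) $$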

**2) Finding which information sets to keep**

Just as above, once we separate our problem out into these smaller pieces we find that they have already been tackled very adeptly.

All we are searching for here is to find the sets which most often appear in our tree or appear as an upper or lower bound. That is the exact same as the APPROX-TOP problem, the problem of finding the approximate most frequent elements in a stream of data.

A very clever and cool solution to that problem is called Count Sketch. Count Sketch works by keeping a matrix of counts. Each row has two hash functions, one to assign elements to a column in that row and one to give them either a positive or negative sign.

When we encounter an element, we go through each row and find the corresponding column given by the hash function, and we increment or decrement the count there depending on the sign given by the row's second hash function.

To estimate the count of an element, we go through each row, find the corresponding column, multiply the stored count by the sign given by that row's second hash function, and take the median of those values across rows.

Each row's signed count is an unbiased estimate of the element's frequency; taking the median across rows keeps a few collisions with heavy elements from ruining the estimate.
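A small sketch of a Count Sketch (the hash construction here is my own arbitrary choice, not from any particular implementation):

```python
import hashlib

def _h(x, salt):
    """Deterministic 64-bit hash of x under a given salt."""
    data = f"{salt}|{x}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")

class CountSketch:
    def __init__(self, rows=5, cols=1024):
        self.rows, self.cols = rows, cols
        self.table = [[0] * cols for _ in range(rows)]

    def add(self, x, count=1):
        for i in range(self.rows):
            col = _h(x, f"col{i}") % self.cols                # column hash for row i
            sign = 1 if _h(x, f"sign{i}") % 2 == 0 else -1    # sign hash for row i
            self.table[i][col] += sign * count

    def estimate(self, x):
        vals = []
        for i in range(self.rows):
            col = _h(x, f"col{i}") % self.cols
            sign = 1 if _h(x, f"sign{i}") % 2 == 0 else -1
            vals.append(sign * self.table[i][col])
        vals.sort()
        return vals[len(vals) // 2]  # median across rows
```

The whole structure is `rows * cols` integers, no matter how many distinct elements stream through it.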

Using this we can quickly estimate the most frequent sets we encounter, and keep a list of only the most frequent elements which are most likely to serve as upper or lower bounds.

**Using these two solutions in conjunction I obtained a 3x speedup when searching the RBC game tree.**

While the algorithm developed still has quite a few restrictions and is bound to certain classes of games and tree searches, I believe it can be extended to a wider class, and that the speedup obtained in the preliminary trials can be improved even further.

A few weeks ago, with the competition all done, I was invited to NeurIPS by the JHUAPL to talk about how my bot worked.

I figured that it might be a bit interesting to other people, then, to hear how my bot worked.

First I should talk about what the game was.

Reconnaissance Blind Chess is an imperfect information version of chess with a slight twist from other common imperfect information games.

An imperfect information game is any game in which you don't know the full state of the board. The most common example is Poker. You know your hand, and what is on the field, but not the other players' hands.

In RBC, you know the position of your pieces, but not quite the position of the other player's. You aren't told how they move, and can't "see" the board. Instead you are told the results of your move (what move actually took place, were you blocked early, did you take a piece) and whether they captured one of your pieces, and which one. The twist is that you have another mechanism to manage uncertainty: each turn you get to choose a 3x3 square to "scan", which reveals what is on each of those tiles.

The big issue with this game is that each turn there are on the order of 20 moves available, which means that if your scan is not a "good" scan (i.e. one revealing which specific move was taken), the number of potential boards grows by a factor of around 20.

In the tournament you only had a total of 15 minutes to make all of your moves. This, combined with the exponential growth of the potential boards, meant that whatever you did had to be quite lean and efficient if you were to take into account the information across all potential boards.

So what would a bot to play this game look like?

There are essentially two tasks at play:

- **Scan**. Choose where to scan to either minimize the number of potential boards or to help choose a move.
- **Move**. Choose the move that will give you the best shot of winning across the different potential boards.

My bot was actually pretty naive for both of those.

__Scanning__

There could be quite a few potential boards, and my main goal was to minimize that number. To do this I needed a lean O(n) algorithm, since the number of potential boards could quickly reach the thousands or millions and we couldn't afford to ponder for long at this stage.

The gist of what I did was as follows:

- Make an 8x8 array of "partitions".
- For each potential board state, go across each tile and mark what is there in the associated position in the array (i.e. if we see a black knight then we increment the black knight count in that partition).
- Select the tile whose partition has the smallest largest count; this would be the center of our scan.
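A minimal sketch of that partition-count scan selection (assuming potential boards are stored as 8x8 grids of piece codes, a representation I'm inventing here for illustration):

```python
from collections import defaultdict

def choose_scan(boards):
    """Pick a 3x3 scan center: for each tile, count how often each occupant
    (piece code or None) appears across the potential boards, then choose the
    tile whose most common occupant is least dominant (most uncertain)."""
    counts = [[defaultdict(int) for _ in range(8)] for _ in range(8)]
    for board in boards:  # board: 8x8 grid of piece codes / None
        for r in range(8):
            for c in range(8):
                counts[r][c][board[r][c]] += 1
    # interior tiles only, so the 3x3 scan fits on the board
    return min(
        ((r, c) for r in range(1, 7) for c in range(1, 7)),
        key=lambda rc: max(counts[rc[0]][rc[1]].values()),
    )
```

This is a single O(n) pass over the boards, which is what the time budget demands.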

__Moving__

I'm not good at chess. In my family I am by far the worst chess player. Chess is a highly studied game, and there are people, particularly programmers, who are really amazing at it. I did not have the time or knowledge to make a chess engine that could compete with what was out there.

So despite RBC not really being chess, I (and, as I found out at NeurIPS, almost all of the competitors) used a chess engine to drive the movement policy.

The gist of what I did was the following:

- For each potential board, generate a list of the top **k** moves using the StockFish engine. Add those to a list of potential moves.
- For each potential move, simulate that move on each potential board and use StockFish to score the resulting board. Keep track of the worst possible score for each move.
- Select the move with the best worst score (the min-max result).
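The min-max selection itself is simple; in this sketch, `top_moves` and `score` are hypothetical stand-ins for the engine calls (in practice those would be StockFish evaluations), not my bot's actual functions:

```python
def best_worst_move(boards, top_moves, score, k=3):
    """Pick the move whose worst-case score across all potential boards is best.

    top_moves(board, k) -> candidate moves for one board
    score(board, move)  -> evaluation of making `move` on `board`
    """
    # gather candidate moves from every potential board
    candidates = set()
    for board in boards:
        candidates.update(top_moves(board, k))
    # for each candidate track its worst score over all boards,
    # then take the candidate with the best worst case
    return max(candidates, key=lambda m: min(score(b, m) for b in boards))
```

Note the cost: every candidate move is scored against every potential board, which is where the O(n) blowup discussed below comes from.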

There were some additional jerry-rigged heuristics that were put into play as a result of using a chess engine.

In chess you never actually capture the king, in RBC to win you do. So if there was an opportunity to capture a king then the bot would take it.

If there wasn't an opportunity to capture the king, and one of the potential boards was in check, then we would prioritize moves that would keep the potential boards out of check.

**But both of those policies are *O(n)* with regards to the number of potential states, which grows exponentially.**

The computer I was running my bot on wasn't the strongest computer in the world. Even with those policies as trimmed and fast as I could get them, there were many situations where a single turn would cause my bot to time out because there were just way too many potential boards to compute.

In addition there seemed to be a "critical point" in the number of boards: if it got too big, then the amount of information a single scan could give would be far less than the number of new boards generated each turn.

So my solution: **Uniform Random Sampling**

Seriously.


Why uniform sampling? Why not weight by the scores of the boards? It turned out that similar boards tended to have pretty similar scores, so if we weighted the chance of a board being selected (either positively or negatively) by its score, we would end up with a sample of boards that all looked very similar. The issue is that scans give far less information the more similar the boards look.

By uniformly sampling we get a selection of boards that often looked more distinct, which let us discriminate between them far easier with scans.

__How did it perform?__

Pretty well! I got 1st place in the two test-tournaments (although many bots were not fully developed by this point), and 5th place in the final tournament. The results, and replays of the games, can be found here.

All in all, for my first real outing in a game-tournament I'm pretty happy with my bot's performance.

If the number of boards got too large, I would independently and uniformly sample a smaller number of boards from the set of potential boards, and stash away the rest. If later on I found an observation showing that all of the selected boards were false, then I took the earliest stashed-away set of boards and simulated them up to the current point. The reason I took the earliest stash and not the latest (which would be far faster on a one-off recovery) was to amortize the cost of recovery: the worst-case cost to recover was pushed further up, and lessened every time the bot went through a recovery.
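A sketch of that sampling-and-stash bookkeeping (function names and structure are mine, for illustration, not the bot's actual code):

```python
import random

def sample_boards(boards, limit, stash):
    """Uniformly sample at most `limit` boards; stash the rest, in order,
    so they can be replayed later if the sample turns out to be wrong."""
    if len(boards) <= limit:
        return boards
    keep = set(random.sample(range(len(boards)), limit))
    stash.append([b for i, b in enumerate(boards) if i not in keep])
    return [b for i, b in enumerate(boards) if i in keep]

def recover(stash, replay):
    """On contradiction, pop the EARLIEST stash and simulate those boards
    forward; `replay` brings a stashed board up to the current turn."""
    oldest = stash.pop(0)
    return [replay(b) for b in oldest]
```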

I haven't cleaned it up, and it shows quite a bit. But for the curious who are willing to delve into the jerry-rigged code that evolved out of half a year of working on this, the full code for my bot is below:

A coding competition and the start of classes have delayed work on Equi, so right now multithreading lives only in the development branch, as it is about 50% feature complete. There's a persistent bug when it comes to returning values from completed tasks, but I have the solution figured out, just not the time yet to implement it.

In the meantime Equi is a fun little toy language. As soon as my schedule opens up a bit more I plan on finishing up the core parallelism of Equi and then moving on to implementing a type system and polymorphism. For now there are just primitive types, but many more features are coming to Equi.

Right now most of my time has been devoted to classes and many many math problem sets and readings, but I do have some software in the works. The main one should be making a debut here in a month or two and it is (drumroll please):

What it is right now is a scripting language that is slowly evolving; what it will be one day is an interpreted, fully parallel programming language made for prototyping asynchronous big data programs on affordable clusters. There's a lot of work to get to those buzzwords, though, and that is what I have been silently chugging away at.

And it turns out that, to no one's surprise, programming languages are hard. They actually use a ton of techniques from natural language processing, because they are in their own way a language. Taking a line of code such as:

if (a == b) print(a);

And just understanding how it breaks down as a grammar is actually really nontrivial. You can't simply read it left to right and figure out what to do (maybe for this one example you can, but when it gets more complicated you cannot), because the grammar of a programming language isn't actually regular; they are usually context-free.

What that means is that we need to first define a grammar for our language, and then we need to find a way to work backwards from that grammar and build a "syntax tree" of what we want our code to do in a way that is more understandable for the code and in line with our intentions.

(Figure: on the left, an early version of the programming language's grammar; on the right, the parse tree produced for the code `if (a==b) print(a);`. This way we can take our intentions for how the code "should run", codify them in a grammar, and then hopefully when we run and parse the code it will do what we want.)
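To make the idea concrete, here is a toy recursive-descent parser for a hypothetical grammar just big enough to handle `if (a == b) print(a);` — purely illustrative, and not Equi's actual grammar or parser:

```python
import re

# A toy tokenizer and recursive-descent parser for a tiny grammar:
#   stmt := "if" "(" expr ")" stmt | ident "(" expr ")" ";"
#   expr := ident "==" ident | ident
TOKEN = re.compile(r"\s*(==|[A-Za-z_]\w*|[();])")

def tokenize(src):
    return TOKEN.findall(src)

def parse_stmt(toks):
    if toks[0] == "if":
        assert toks[1] == "("
        expr, rest = parse_expr(toks[2:])
        assert rest[0] == ")"
        body, rest = parse_stmt(rest[1:])
        return ("if", expr, body), rest
    # call statement: ident "(" expr ")" ";"
    name = toks[0]
    assert toks[1] == "("
    expr, rest = parse_expr(toks[2:])
    assert rest[0] == ")" and rest[1] == ";"
    return ("call", name, expr), rest[2:]

def parse_expr(toks):
    if len(toks) > 1 and toks[1] == "==":
        return ("==", toks[0], toks[2]), toks[3:]
    return toks[0], toks[1:]
```

The nested tuples the parser returns are exactly the parse tree the figure shows: working backwards from the grammar to a structure the interpreter can walk.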

The current version just has logical operators and declarations working (as you can sorta make out from the CFG), but it does do some things.

More details will come in a little bit. I got most of the groundwork done on the very basic version of Equi in this very productive weekend, so progress should speed up. Right now Equi is not Turing complete, and has some very odd little quirks in the design that need to be ironed out, but as soon as it is slightly more impressive you will be sure to see it open sourced, along with several more write-ups on the process and ideas at work behind the code.

The premise of the problem was this: Using Google's Word2Vec you can take a sentence and learn "word embeddings", or N-dimensional euclidean vectors, that are supposed to encode the meaning of each word. Similar words should be embedded in similar places. What if you do this for two translations of a text? Can you use the word embeddings in one language to predict the embeddings in another?

The answer is a solid sorta.

First for why word embeddings are important at all.

Usually when you have a sentence, you encode it as a sequence of really high dimensional one-hot vectors. In other words you'll have a V-dimensional vector - where V is the total number of *unique* words, i.e. in the millions for any substantial text - and each vector will be all zeroes with just one 1 in the spot corresponding to whichever word it is. There are a few problems with this. First off, the dimensionality is probably way too high -- which is something we can take advantage of (this is very similar to the idea of Self Organizing Maps, which I've covered before). Second off, the dimensionality is way too high -- to build a neural model we would need an insane number of nodes to reduce it to a reasonable point. And third off, those vectors tell us nothing. The distance between any two words is identical -- there's absolutely no semantic content loaded in those word embeddings.
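For concreteness, here is a one-hot encoding over a tiny toy vocabulary; note that every pair of distinct words comes out exactly the same distance apart:

```python
def one_hot(word, vocab):
    """Encode `word` as a V-dimensional one-hot vector over `vocab`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def sq_dist(u, v):
    """Squared euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))
```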

So what we do is we run it through a fancy neural network black box, and then figure out a reduced dimensional representation of the word that hopefully encodes some of the relations between words.

Now that we have word embeddings though, can we do anything fancy with them? Ideally across languages words should embed in the same spots if they have the same idea, unfortunately they don't with our current methods -- but we sort of hope that we can use the spatial relations in one language to be able to help translate in another as a sort of language model.

There has been some work on finding maps between parallel word embeddings in the past, but all of the methods my partners and I found were very linear -- the primary one being to find the best fitting rotation between the data. That didn't sit well with me -- linearity is a really big assumption to make, and the way word embeddings are generated there's no guarantee of a linear relation.

So with my love of manifolds and my partner's begrudging allowance, we decided for our Machine Translation final project to take on this task, and try and show that a very nonlinear transformation would work better than the current linear method.

And....

It did! By a lot. (That error bar for Bulgarian on the neural model is right by the way, and I'll touch on that later)

What did we do though? Well, if you remember me mentioning Self Organizing Maps earlier, we tried something very similar. What we modified was an algorithm called an Elastic Map. Instead of just uniformly pulling on the closest representation node as we did for the SOM, we create a "Springy Mesh" over the space, and pull that mesh towards our training data. We had to modify the algorithm, however: like a SOM, Elastic Maps find *a* manifold over the data, and we don't want a manifold, we want a transformation.

So we modified the Elastic Map to approximate the specific transformation we wanted, as well as encode the information we needed to make a transformation. The precise details you can read about in our writeup (which will be posted here) or look through in our code (here), but the gist of it is that we defined three forces on our springy mesh: a deformation force that pulls the nodes we want mapped to specific spots towards those spots, a "continuity" force that pulls neighboring nodes towards one another, and a "linearity" force that takes "ribs" of triplets of nodes and pulls the middle node towards the midpoint of the two outer nodes (resisting the map getting way too twisted). Once we have those three forces we can run gradient descent to optimize our map.
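As a rough illustration (this is my own minimal reconstruction of the idea, not our project code), one gradient-descent step over the three quadratic penalties might look like:

```python
def elastic_map_step(nodes, anchors, edges, ribs, lr=0.1, lam=1.0, mu=1.0):
    """One gradient-descent step on the three elastic-map penalties.

    nodes:   list of [x, y, ...] positions (the mesh)
    anchors: {node_index: target_position}  -- deformation force
    edges:   list of (i, j) pairs           -- continuity force
    ribs:    list of (a, b, c) triplets     -- linearity force
    """
    dim = len(nodes[0])
    grad = [[0.0] * dim for _ in nodes]

    # deformation: pull anchored nodes toward their target spots
    for i, t in anchors.items():
        for d in range(dim):
            grad[i][d] += 2 * (nodes[i][d] - t[d])

    # continuity: pull connected nodes toward one another
    for i, j in edges:
        for d in range(dim):
            diff = nodes[i][d] - nodes[j][d]
            grad[i][d] += 2 * lam * diff
            grad[j][d] -= 2 * lam * diff

    # linearity: pull each rib's middle node toward the midpoint of its ends
    for a, b, c in ribs:
        for d in range(dim):
            diff = nodes[b][d] - (nodes[a][d] + nodes[c][d]) / 2
            grad[b][d] += 2 * mu * diff
            grad[a][d] -= mu * diff
            grad[c][d] -= mu * diff

    for i in range(len(nodes)):
        for d in range(dim):
            nodes[i][d] -= lr * grad[i][d]
    return nodes
```

Iterating this step drives the mesh toward a configuration that satisfies the anchors while staying smooth and untwisted, which is what lets the fitted mesh act as a transformation rather than an arbitrary manifold.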

And the map performed really well! You can see in the above graph that the average error is almost consistently less than a *quarter* of the linear method's, and by the error bars (which represent one standard deviation of MSE) even the worst case within one standard deviation still outperforms the average case of the linear method.

We also ran a neural network, because you can never talk about maps without doing neural networks. What we found is that while our method is good (although lacking in tuning, as time was running short on our final project), on average the neural network beats it. However, something interesting happened: in the above model, on Bulgarian the neural model had a standard deviation of over 400. On just about every neural model we ran, there was some dataset on which the model performed well on average but had a long tail of extreme outliers, whereas the elastic map performed very consistently.

So while there's definitely room for improvement with the elastic map model, this first foray into implementing it I think definitely showed that it is a method worth looking into.
