Hot questions for using neural networks with coordinates


I’m trying to build a neural network that takes the vertex positions of a 3D mesh as inputs and outputs the coordinates of two points on the inside.

For testing purposes I have a dataset containing a geometry with 20 points and the two interior points for each one.

Every file of the dataset contains the coordinates of the vertices in a rank-2 array of shape [3,20] for the objs, and shape [3,3] for the resulting points.

I’ve built a linear model, but the accuracy is always very low (0.16), no matter whether I train it with 1,000, 100,000 or 500,000:

import tensorflow as tf
import numpy as np

objList    = np.load('../testFullTensors/objsArray_00.npy')
guideList  = np.load('../testFullTensors/drvsArray_00.npy')

x  = tf.placeholder(tf.float32, shape=[None, 60])
y_ = tf.placeholder(tf.float32, shape=[None, 6])

W = tf.Variable(tf.zeros([60,6],tf.float32))
b = tf.Variable(tf.zeros([6],tf.float32))

y = tf.matmul(x,W) + b

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y))

train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(train_step, feed_dict={x: objList, y_: guideList})
    correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    print(accuracy.eval(feed_dict={x: objList, y_: guideList}))

Should I build a different kind of network?

Thanks E


First, thanks for the clarification of the question in the comments, it really helps understand the problem.

The problem as I understand it is (at least similar to): given a bounding set of 3D points on the outside of an arm, identify

  • A, the 3D point on the Humerus that is closest to the body
  • B, the 3D point on the Humerus that is furthest from the body

What we need is a model that has enough expressivity to do this. Let us first consider how a human would find this problem easiest. If a human were given a 3D model that they could look at and rotate, it would be a visual problem and they would probably get it right away.

If it were a list of 60 numbers, and they were not told what those numbers meant, and they had to produce 6 numbers as an answer, then it may not be possible.

We know that TensorFlow is good at image recognition, so let's turn the problem into an image recognition problem.

Let's just start with the MNIST network and talk about what it would take to change it to our problem!

Convert your input to voxels such that each training example is one 3D image of size [m,m,m], where m is the resolution you need (start with 30 or so for initial testing, and maybe go as high as 128). Initialize your 3D matrix with 0s, then for each of the 20 data points set the corresponding voxel to 1 (or a probability).

That is your input, and since you have lots of training examples you will have a tensor of shape [batch,m,m,m].

Do the same for your expected output.
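As a concreteness check, the voxelization step above might look like this in numpy (the point data and the resolution m here are made up for illustration):

```python
import numpy as np

def voxelize(points, m=30):
    """points: array of shape [3, n]; returns an m x m x m occupancy grid."""
    grid = np.zeros((m, m, m), dtype=np.float32)
    # Scale each axis into [0, m-1] so every point falls inside the grid.
    mins = points.min(axis=1, keepdims=True)
    maxs = points.max(axis=1, keepdims=True)
    scaled = (points - mins) / np.maximum(maxs - mins, 1e-8) * (m - 1)
    idx = np.round(scaled).astype(int)
    grid[idx[0], idx[1], idx[2]] = 1.0   # or a probability instead of 1
    return grid

np.random.seed(0)
points = np.random.rand(3, 20) * 100     # stand-in for 20 mesh vertices
vol = voxelize(points, m=30)
# Stacking one grid per training example gives the [batch, m, m, m] tensor.
```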

Send that through layers of convolution (start with 2 or 3 for testing) such that your output size is [batch,m,m,m].

Use back propagation to train your output layer to predict your expected output.

Finally you will have a network that doesn't return a 3D coordinate of the Humerus but instead returns a probability map of where it is in 3D space. You can scan the output for the highest probabilities and read off the coordinates.
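Reading the coordinates off such a probability volume is then a simple argmax; a sketch with an illustrative 30×30×30 volume:

```python
import numpy as np

# Pretend this is the network's output: one probability per voxel.
prob = np.zeros((30, 30, 30), dtype=np.float32)
prob[12, 4, 27] = 0.9                 # the network's most confident voxel

# Grid index of the highest-probability voxel.
coord = np.unravel_index(np.argmax(prob), prob.shape)
# Map this back to world units by inverting the voxelization scaling.
```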

This is very similar to how AlphaGo plays Go.

Suggested improvement: train one network to predict A and a separate network to predict B.


I am trying to implement a classification NN in Matlab. My inputs are clusters of coordinates from an image (corresponding to Delaunay triangulation vertices). There are 3 clusters (results of the OPTICS algorithm) in this format:

(Not all clusters are of the same size.) The elements represent coordinates in Euclidean 2D space, so (110,12) is a point in my image, and the matrix depicted represents one cluster of points. Clustering was done on image edges, so the coordinates refer to logical values (always 1s in this case) in the image matrix. (After edge detection there are 3 "dense" areas in the image, and these collections of pixels are used for classification.) There are 6 target classes.

So, my question is: how can I format them into single-column vector inputs to use in a neural network? There is a relevant answer here, but I would like some elaboration if possible. (I am probably too tired right now from 12 hours of trying stuff and don't get it 100%.) Remember, there are 3 different coordinate matrices for each picture, so my initial thought was to create an NN with 3 inputs (of different lengths). But how do I serialize this?

Here's a cluster with its tags on in case it helps:


For you to train the classifier, you need a matrix X where each row corresponds to an image. If you want to use a coordinate representation, all images will have to be of the same size, say M by N. The row for an image will then have M times N elements (features), and the corresponding feature values will be the cluster assignments. The class vector y will be whatever labels you have, that is, one of the six different classes you mentioned in the comments above.

Keep in mind that with a coordinate representation, X can get very high-dimensional, and unless you have a large number of images, chances are your classifier will perform very poorly. If you have few images, consider using the fractions of pixels belonging to clusters that I suggested in one of the comments: this gives a shorter feature description that is invariant to rotation and translation, and may yield better classification.
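For concreteness, building X from per-pixel cluster assignments could look like this (shown in numpy; the Matlab version is analogous, and all sizes and labels here are made up):

```python
import numpy as np

M, N = 4, 5                      # all images must share this size
n_images = 2

X = np.zeros((n_images, M * N))  # one row of M*N features per image
y = np.zeros(n_images, dtype=int)

# Per-pixel cluster assignment for image 0: 0 = background, 1..3 = cluster id.
assign = np.zeros((M, N), dtype=int)
assign[1, 1:3] = 1               # two pixels belonging to cluster 1
assign[3, 0] = 2                 # one pixel belonging to cluster 2

X[0] = assign.ravel()            # flatten the assignment map into row 0 of X
y[0] = 5                         # its class label, one of the six classes
```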


Neural networks are mostly used to classify, so the activation of a neuron in the output layer indicates the class of whatever you are classifying.

Is it possible (and correct) to design an NN to get 3D coordinates? That is, three output neurons, each with values in a range such as [-1000.0, 1000.0]?


Yes. You can use a neural network to perform linear regression, and more complicated types of regression, where the output layer has multiple nodes that can be interpreted as a 3-D coordinate (or a much higher-dimensional tuple).

To achieve this in TensorFlow, you would create a final layer with three output neurons, each corresponding to a different dimension of your target coordinates, then minimize the root mean squared error between the current output and the known value for each example.
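Framework aside, the underlying idea is just multi-output regression. Here is a minimal numpy sketch with synthetic data (a TensorFlow version would replace the closed-form fit with a trained three-neuron output layer):

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 8)                 # 100 examples, 8 input features each
W_true = rng.rand(8, 3)              # secret map from inputs to (x, y, z)
Y = X @ W_true                       # known 3-D target coordinates

# Fit one weight matrix with three output columns, one per dimension.
W_fit, *_ = np.linalg.lstsq(X, Y, rcond=None)
pred = X @ W_fit                     # predicted (x, y, z) for each example
```

Since the synthetic targets are exactly linear in the inputs, the fit recovers them; a real network would add hidden layers and minimize the squared error iteratively.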


I'd like to get the coordinates of all areas containing any text in scans of documents like the one shown below (in reduced quality; the original files are of high resolution):

I'm looking for something similar to these (GIMP'ed-up!) bounding boxes. It's important to me that the paragraphs be recognized as such. If the two big blocks (top box on left page, center block on right page) would get two bounding boxes each, though, that would be fine:

The way of obtaining these bounding box coordinates could be through some kind of API (scripted languages preferred over compiled ones) or through a command line command, I don't care. What's important is that I get the coordinates themselves, not just a modified version of the image where they're visible. The reason for that is that I need to calculate the area size of each one of them and then cut out a piece at the center of the largest.

What I've already tried, so far without success:

  • ImageMagick - it's just not meant for such a task
  • OpenCV - either the learning curve is too steep or my google-fu is too weak
  • Tesseract - from what I've been able to gather, it's the odd one out among OCR tools in that, for historical reasons, it doesn't do page layout analysis before attempting character shape recognition
  • OCRopus/OCRopy - should be able to do it, but I can't find out how to tell it I'm interested in paragraphs as opposed to words or characters
  • Kraken ibn OCRopus - a fork of OCRopus with some rough edges; still fighting with it
  • Statistics, specifically a clustering algorithm (OPTICS seems the most appropriate for this task) after binarization of the image - both my maths and my coding skills are insufficient for it

I've seen images around the internet of document scans being segmented into parts containing text, photos, and other elements, so this problem seems to have already been solved academically. How do I get to the goodies, though?


In ImageMagick, you can threshold the image to keep from getting too much noise, then blur it and threshold again to make large regions of black connected. Then use -connected-components to filter out small regions (especially white ones) and find the bounding boxes of the black regions (Unix bash syntax):

convert image.png -threshold 95% \
-shave 5x5 -bordercolor white -border 5 \
-blur 0x2.5 -threshold 99% -type bilevel \
-define connected-components:verbose=true \
-define connected-components:area-threshold=20 \
-define connected-components:mean-color=true \
-connected-components 4 \
+write tmp.png null: | grep "gray(0)" | tail -n +2 | sed 's/^[ ]*//' | cut -d\  -f2

This is the tmp.png image that was created. Note that I have discarded regions smaller than 20 pixels in area. Adjust as desired. Also adjust the blur as desired. You can make it larger to get bigger connected regions or smaller to get closer to individual lines of text. I shaved 5 pixels all around to remove spot noise at the top of your image and then padded with a border of 5 pixels white.

The command also prints the list of bounding boxes, one WxH+X+Y geometry per line.


We can go one step further and draw boxes around the regions:

bboxArr=(`convert image.png -threshold 95% \
-shave 5x5 -bordercolor white -border 5 \
-blur 0x2.5 -threshold 99% -type bilevel \
-define connected-components:verbose=true \
-define connected-components:area-threshold=20 \
-define connected-components:mean-color=true \
-connected-components 4 \
+write tmp.png null: | grep "gray(0)" | sed 's/^[ ]*//' | cut -d\  -f2`)
num=${#bboxArr[*]}
boxes=""
for ((i=0; i<num; i++)); do
WxH=`echo "${bboxArr[$i]}" | cut -d+ -f1`
xo=`echo "${bboxArr[$i]}" | cut -d+ -f2`
yo=`echo "${bboxArr[$i]}" | cut -d+ -f3`
ww=`echo "$WxH" | cut -dx -f1`
hh=`echo "$WxH" | cut -dx -f2`
x2=$((xo+ww-1))
y2=$((yo+hh-1))
boxes="$boxes rectangle $xo,$yo $x2,$y2"
done
convert image.png -fill none -strokewidth 2 -stroke red -draw "$boxes" -alpha off image_boxes.png

Increase the area-threshold from 20 a little more and you can get rid of the tiny box on the lower left side around a round dot, which I think is noise.


I'm trying to train a multi-output model. I'm loading the images in batches as follows:

def get_batch_features(self, idx):
    return np.array([load_image(im) for im in self.im_list[idx * self.bsz: (1 + idx) * self.bsz]])

Following is my load_image function, where I'm normalizing the images to the range between 0 and 1:

def load_image(im):
    return img_to_array(load_img(im, target_size=(224, 224))) / 255.

I'm loading the labels, which are the target coordinates of 4 (x, y) pairs:

def get_batch_labels(self, idx):
    return self.labels[idx * self.bsz: (idx + 1) * self.bsz,:]

How do I normalize the target coordinates by scaling them to [-1, 1]? Since I'm not scaling them, I'm getting a huge validation loss and the model is overfitting. Is there any way to scale the target coordinates to [-1, 1]?


Assuming that your target coordinates are somewhere in the interval [0, 223], since that is how many pixels your images have, what about just shifting this to [-111.5, 111.5] by subtracting 111.5, and then dividing by 111.5 to land in [-1, 1]?

return (self.labels[idx * self.bsz: (idx + 1) * self.bsz,:]-111.5)/111.5

Actually, from my experience, you don't need to hit [-1, 1] precisely; it should be sufficient to just divide by 100 so that the values are in the right order of magnitude. Besides that, you could also compute statistics over all labels and normalize them to zero mean/unit variance, which is a common strategy.
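The zero-mean/unit-variance variant could be sketched like this (labels here is synthetic data standing in for the full coordinate array):

```python
import numpy as np

# 500 samples of 4 (x, y) pairs = 8 coordinates, in pixel units [0, 223].
labels = np.random.RandomState(1).uniform(0, 223, size=(500, 8))

mean = labels.mean(axis=0)
std = labels.std(axis=0)
normed = (labels - mean) / std       # per-coordinate z-scores

# At prediction time, invert the transform: pred * std + mean
```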


I have the following task: I'm supposed to find the coordinates of a target point. The features that are given are the distances from anchors to that target point. See img 1 (distances from anchors to target).

I planned to create a simple neural network first, with just an input and an output layer. The cost function I try to minimize is: correct_coordinate - mean of square(summed_up_distances * weights). But now I'm kind of stuck on how to model the neural network so that it outputs coordinates [x, y], as the current model would just output a single value. See img 2 (current model).

Right now I would just train 2 neural networks: one that outputs the x-value and one that outputs the y-value. I'm just not sure if that is best practice with TensorFlow.

So I would like to know, how would you model the NN with tensorflow?


You can build the network with 2 nodes in the output layer; there is no need to train 2 neural networks for the same task.
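A minimal numpy sketch of that shape — a hidden layer feeding an output layer with 2 nodes, so one forward pass yields [x, y] (all sizes and weights here are made up):

```python
import numpy as np

np.random.seed(0)
n_anchors = 5                        # number of distance features
W1 = np.random.rand(n_anchors, 16)   # input -> hidden weights
W2 = np.random.rand(16, 2)           # hidden -> 2 output nodes (x and y)

def forward(distances):
    h = np.tanh(distances @ W1)      # hidden activations
    return h @ W2                    # predicted coordinates [x, y]

xy = forward(np.random.rand(1, n_anchors))
```

Training then fits W1 and W2 against the known [x, y] targets with a squared-error loss, exactly as with a single-output regression.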