shark::Data< Type > Class Template Reference

Data container.

#include <shark/Data/Dataset.h>

Public Types

typedef batch_type & batch_reference
 
typedef batch_type const & const_batch_reference
 
typedef Batch< element_type >::reference element_reference
 
typedef Batch< element_type >::const_reference const_element_reference
 
typedef std::vector< std::size_t > IndexSet
 
typedef boost::iterator_range< detail::DataElementIterator< Data< Type > > > element_range
 
typedef boost::iterator_range< detail::DataElementIterator< Data< Type > const > > const_element_range
 
typedef detail::BatchRange< Data< Type > > batch_range
 
typedef detail::BatchRange< Data< Type > const > const_batch_range
 

Public Member Functions

 BOOST_STATIC_CONSTANT (std::size_t, DefaultBatchSize=256)
 Defines the default batch size of the Container.
 
template<class T >
bool operator== (const Data< T > &rhs)
 Two containers compare equal if they share the same data.
 
template<class T >
bool operator!= (const Data< T > &rhs)
 Two containers compare unequal if they don't share the same data.
 
const_element_range elements () const
 Returns the range of elements.
 
element_range elements ()
 Returns the range of elements.
 
const_batch_range batches () const
 Returns the range of batches.
 
batch_range batches ()
 Returns the range of batches.
 
std::size_t numberOfBatches () const
 Returns the number of batches of the set.
 
std::size_t numberOfElements () const
 Returns the total number of elements.
 
Shape const & shape () const
 Returns the shape of the elements in the dataset.
 
Shape & shape ()
 Returns the shape of the elements in the dataset.
 
bool empty () const
 Check whether the set is empty.
 
element_reference element (std::size_t i)
 
const_element_reference element (std::size_t i) const
 
batch_reference batch (std::size_t i)
 
const_batch_reference batch (std::size_t i) const
 
 Data ()
 Constructor which constructs an empty set.
 
 Data (std::size_t numBatches)
 Construct a dataset with empty batches.
 
 Data (std::size_t size, element_type const &element, std::size_t batchSize=DefaultBatchSize)
 Construction with size and a single element.
 
void read (InArchive &archive)
 Read the component from the supplied archive.
 
void write (OutArchive &archive) const
 Write the component to the supplied archive.
 
virtual void makeIndependent ()
 This method makes the vector independent of all siblings and parents.
 
void splitBatch (std::size_t batch, std::size_t elementIndex)
 
Data splice (std::size_t batch)
 Splits the container into two independent parts. The front part remains in the container, the back part is returned.
 
void append (Data const &other)
 Appends the contents of another data object to the end.
 
void push_back (const_batch_reference batch)
 
template<class Range >
void repartition (Range const &batchSizes)
 Reorders the batch structure in the container to that indicated by the batchSizes vector.
 
std::vector< std::size_t > getPartitioning () const
 Creates a vector with the batch sizes of every batch.
 
template<class Range >
void reorderElements (Range const &indices)
 Reorders elements across batches.
 
void indexedSubset (IndexSet const &indices, Data &subset, Data &complement) const
 Fill in the subset defined by the list of indices as well as its complement.
 
Data indexedSubset (IndexSet const &indices) const
 
- Public Member Functions inherited from shark::ISerializable
virtual ~ISerializable ()
 Virtual d'tor.
 
void load (InArchive &archive, unsigned int version)
 Versioned loading of components, calls read(...).
 
void save (OutArchive &archive, unsigned int version) const
 Versioned storing of components, calls write(...).
 
 BOOST_SERIALIZATION_SPLIT_MEMBER ()
 

Protected Types

typedef detail::SharedContainer< Type > Container
 

Protected Attributes

Container m_data
 data
 
Shape m_shape
 shape of a datapoint
 

Friends

template<class InputT , class LabelT >
class LabeledData
 
void swap (Data &a, Data &b)
 

Detailed Description

template<class Type>
class shark::Data< Type >

Data container.

The Data class is Shark's container for machine learning data. This container (and its sub-classes) is used for input data, labels, and model outputs.

The Data container organizes the data it holds in batches. This means that it tries to find a good data representation for a whole set of data points, for example 100, at the same time. If the type of data it stores is, for example, RealVector, the batches of this type are RealMatrices. This is good because operations on the whole matrix are most often faster than operations on the separate vectors. Nearly all operations of the set have to be interpreted in terms of the batch. Therefore the iterator interface gives access to the batches but not to single elements. For single elements, the separate element ranges returned by elements() (element_range and const_element_range) can be used.
When you need to explicitly iterate over all elements, you can use:
for(auto elem: data.elements()){
    std::cout<<elem<<" ";        // print the element
    elem*=2;                     // scale it through the element reference
    std::cout<<elem<<std::endl;  // print the scaled element
}
Element-wise access is usually slower than accessing the batches. If possible, use direct batch access, or at least use the element range or the for loop above to iterate over all elements. Random access to single elements takes linear time, so use it wisely. Of course, when you want to use batches, you need to know the actual batch type, which depends on the type of the input. Here are the rules: if the input is an arithmetic type like int or double, the batch is a vector of this type (i.e. double->RealVector or int->IntVector). For vectors, the batches are matrices, as mentioned above. If the vector is sparse, so is the matrix. For everything else, the batch type is just a std::vector of the type, so no optimization can be applied.
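The batch-type rules above can be sketched as a small type trait. This is a standalone illustration and not Shark's implementation: BatchOf, DenseVector and DenseMatrix are hypothetical stand-ins for Shark's batch machinery and its RealVector/RealMatrix types, and the sparse case is left out.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical stand-ins for a dense vector and a dense matrix (NOT Shark types).
template <class T> struct DenseVector { std::vector<T> values; };
template <class T> struct DenseMatrix { std::vector<T> values; std::size_t rows; std::size_t cols; };

// Fallback rule: any other element type is batched as a plain std::vector.
template <class T, class Enable = void>
struct BatchOf { using type = std::vector<T>; };

// Rule 1: an arithmetic element type is batched as a vector of that type.
template <class T>
struct BatchOf<T, typename std::enable_if<std::is_arithmetic<T>::value>::type> {
    using type = DenseVector<T>;
};

// Rule 2: a vector element type is batched as a matrix (one element per row).
template <class T>
struct BatchOf<DenseVector<T>, void> {
    using type = DenseMatrix<T>;
};

// Compile-time checks of the three rules.
static_assert(std::is_same<BatchOf<double>::type, DenseVector<double>>::value,
              "arithmetic -> vector");
static_assert(std::is_same<BatchOf<DenseVector<double>>::type, DenseMatrix<double>>::value,
              "vector -> matrix");
static_assert(std::is_same<BatchOf<std::string>::type, std::vector<std::string>>::value,
              "everything else -> std::vector");
```

The point of the sketch is only the dispatch: the element type alone decides the batch type, which is why code working on batches has to know which of the three cases applies.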
When constructing the container, the batchSize can be set. If it is not set by the user, the default batch size (DefaultBatchSize, 256) is chosen. A batchSize of 0 corresponds to putting all data into a single batch. Beware that not only the data needs storage but also the various models during computation, so the actual amount of memory needed to process a batch can greatly exceed the size of the batch itself.
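To make the effect of batchSize concrete, here is a minimal standalone sketch of how a number of elements might be split into batches, and why looking up a single element by index costs a walk over the batches. The helpers partitionSizes and locate and the struct Position are hypothetical and not part of Shark. The sketch assumes the simplest scheme (full batches first, remainder last); the library may distribute elements differently, for example by balancing batch sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: split numElements into batches of at most batchSize
// elements. A batchSize of 0 means "one batch holding everything".
std::vector<std::size_t> partitionSizes(std::size_t numElements, std::size_t batchSize) {
    std::vector<std::size_t> sizes;
    if (numElements == 0) return sizes;           // empty container: no batches
    if (batchSize == 0) batchSize = numElements;  // 0 -> single batch
    for (std::size_t remaining = numElements; remaining > 0;) {
        std::size_t current = remaining < batchSize ? remaining : batchSize;
        sizes.push_back(current);
        remaining -= current;
    }
    return sizes;
}

// Position of an element: which batch it lives in and where inside that batch.
struct Position {
    std::size_t batch;
    std::size_t offset;
};

// Hypothetical lookup: walks the batches in order until the index falls inside
// one, so the cost grows with the number of batches in front of the element.
Position locate(std::vector<std::size_t> const& sizes, std::size_t index) {
    for (std::size_t b = 0; b < sizes.size(); ++b) {
        if (index < sizes[b]) return Position{b, index};
        index -= sizes[b];
    }
    assert(false && "element index out of range");
    return Position{sizes.size(), 0};  // sentinel when assertions are disabled
}
```

Under this scheme, 1000 elements with a batch size of 256 give batches of 256, 256, 256 and 232 elements, and element 300 is found at offset 44 of batch 1.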

An additional feature of the Data class is that it can be used to create lazy subsets: the batches of a dataset can be shared between various instances of the Data class without additional memory overhead.
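The sharing idea can be illustrated with a small standalone model. This is a sketch under the assumption that batches are held through shared pointers; SharedBatches is a hypothetical type and not Shark's detail::SharedContainer, although its makeIndependent member is named after the one documented above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical model (not Shark code): batches are held through shared
// pointers, so subsets alias the parent's storage instead of copying it.
struct SharedBatches {
    using Batch = std::vector<double>;
    std::vector<std::shared_ptr<Batch>> batches;

    // Lazy subset: copies only the pointers of the selected batches.
    SharedBatches subset(std::vector<std::size_t> const& batchIndices) const {
        SharedBatches result;
        for (std::size_t i : batchIndices) {
            assert(i < batches.size());
            result.batches.push_back(batches[i]);
        }
        return result;
    }

    // Deep-copies every batch so that later writes through other holders
    // no longer show up in this container.
    void makeIndependent() {
        for (auto& b : batches) b = std::make_shared<Batch>(*b);
    }
};
```

In this model a subset costs one pointer per batch, and a write through the parent is visible in the subset until makeIndependent() is called on one of them.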

Warning
Be aware, especially for derived containers like LabeledData, that the set does not enforce structural consistency. When you change the structure of the data part, for example by directly changing the size of the batches, the labels are not forced to change accordingly. Also, when creating subsets of a set, changing the parent will change its siblings and conversely. The programmer needs to ensure structural integrity! For example, this is dangerous:
void function(Data<unsigned int>& data){
    Data<unsigned int> newData(...);
    data=newData;
}
When data was originally part of a LabeledData object, and newData has a different batch structure than data, this will lead to structural inconsistencies. When function is rewritten such that newData has the same structure as data, this code is perfectly fine. The best way to get around this problem is to rewrite the code as:
Data<unsigned int> function(){
    Data<unsigned int> newData(...);
    return newData;
}
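Where in-place assignment cannot be avoided, one option is to check the batch structure before assigning. The sketch below is an assumption about how such a guard could look, not something Shark provides: sameStructure and assignChecked are hypothetical helpers that rely only on a getPartitioning() member returning the batch sizes, as documented above, and MockData is a stand-in used to exercise them without the library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Two containers have the same structure when their batch sizes match
// batch by batch. Works with any type that offers getPartitioning().
template <class A, class B>
bool sameStructure(A const& a, B const& b) {
    return a.getPartitioning() == b.getPartitioning();
}

// Assignment guarded by a debug-time structure check.
template <class D>
void assignChecked(D& target, D const& replacement) {
    assert(sameStructure(target, replacement) && "batch structure must not change");
    target = replacement;
}

// Hypothetical stand-in for a data container, used only for demonstration.
struct MockData {
    std::vector<std::size_t> sizes;
    std::vector<std::size_t> getPartitioning() const { return sizes; }
};
```

With such a guard, the dangerous variant of function above would fail loudly in debug builds whenever newData has a different batch structure, instead of silently desynchronizing inputs and labels.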
Todo:
expand docu

Definition at line 128 of file Dataset.h.


The documentation for this class was generated from the following file: