pyrfm
A library for random feature maps and linear models with random feature maps in Python.
Installation
Download the source code::

    git clone https://github.com/neonnnnn/pyrfm.git

or download it as a ZIP from GitHub.

Install the dependencies::

    cd pyrfm
    pip install -r requirements.txt

Finally, build and install pyrfm::

    python setup.py install
To run the example code (in pyrfm/benchmarks), jupyter and matplotlib are required.
What are random feature maps?
Using random feature maps is a promising approach to large-scale kernel methods. A random feature map is a map from the original feature space to a randomized feature space that approximates a kernel-induced feature space. The basic idea is to run linear models on such a randomized feature space for classification, regression, clustering, etc. When the dimension D of the random feature map is not too high and the number of training examples N is large, this approach is very efficient compared to canonical kernel methods.
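For example, the idea can be sketched with scikit-learn alone: map the data with a random Fourier feature map (RBFSampler) and train a linear classifier on the mapped features. The data and hyperparameters below are illustrative only::

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier

    rng = np.random.RandomState(0)
    X = rng.randn(1000, 20)
    y = (X[:, 0] * X[:, 1] > 0).astype(int)

    # z(x) is a D-dimensional randomized map with z(x).z(y) ~ k(x, y) (RBF kernel)
    rbf_map = RBFSampler(gamma=0.5, n_components=256, random_state=0)
    Z = rbf_map.fit_transform(X)

    # a linear model on the randomized features approximates a kernel machine
    clf = SGDClassifier(loss="hinge", random_state=0).fit(Z, y)
    print(clf.score(Z, y))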
Random Feature Maps Implemented
pyrfm follows the scikit-learn API and now supports the following random features:
- RandomFourier: random Fourier feature (for the RBF kernel) [RR08]
- RandomMaclaurin: random Maclaurin feature (for the polynomial kernel, exp kernel, and user-specified dot product kernels) [KK12]
- TensorSketch: tensor sketching (for the polynomial kernel) [PP13]
- RandomKernel: random kernel feature (for the ANOVA kernel and all-subsets kernel) [ASO19]
- MB: S.Maji and A.Berg feature (for the intersection (min) kernel; this feature is not random) [MB09]
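Because the transformers follow the scikit-learn API, they are used through the usual fit/transform interface. The snippet below is a minimal sketch; the constructor arguments (n_components, random_state) are assumptions and may differ from the actual signatures::

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from pyrfm import RandomFourier  # assumed import path

    rng = np.random.RandomState(0)
    X = rng.randn(500, 10)
    y = (X.sum(axis=1) > 0).astype(int)

    # fit learns the random weights, transform produces the randomized features
    transformer = RandomFourier(n_components=128, random_state=0)
    Z = transformer.fit_transform(X)

    # any linear model can then be trained on Z
    clf = LogisticRegression().fit(Z, y)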
In other words, pyrfm now provides approximators for the following kernels:
- RBF kernel (RandomFourier)
- polynomial kernel (RandomMaclaurin, TensorSketch)
- exponential kernel (RandomMaclaurin)
- user-specified dot product kernel (RandomMaclaurin; requires the Maclaurin coefficients of the specified kernel)
- ANOVA kernel (RandomKernel)
- all-subsets kernel (RandomKernel)
- intersection (min) kernel (MB)
The random Fourier feature is also implemented in scikit-learn (kernel_approximation.RBFSampler).
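To see what "approximating a kernel" means concretely, the inner products of the mapped features can be compared with the exact kernel values. The check below uses only scikit-learn's RBFSampler and rbf_kernel, with arbitrary example data::

    import numpy as np
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.metrics.pairwise import rbf_kernel

    rng = np.random.RandomState(0)
    X = rng.randn(100, 5)

    K_exact = rbf_kernel(X, gamma=0.5)
    Z = RBFSampler(gamma=0.5, n_components=2000, random_state=0).fit_transform(X)
    K_approx = Z @ Z.T

    # the mean absolute error shrinks as n_components grows
    print(np.abs(K_exact - K_approx).mean())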
Furthermore, pyrfm supports the following structured random features:
- SignedCirculantRandomMatrix: signed circulant random matrix (for the dot product / RBF kernel) [FHL15]
- SignedCirculantRandomKernel: signed circulant random kernel feature (for the ANOVA kernel) [ASO19]
- SubsampledRandomHadamard: subsampled random Hadamard transform (for the dot product) [Tro11]
- FastFood: fastfood (for the dot product / RBF kernel) [LSarlosS13]
- CompactRandomFeature: compact random features [HXGD14] (with the subsampled random Hadamard transform or random projection [LHC06])
- OrthogonalRandomFeature / StructuredOrthogonalRandomFeature: orthogonal random feature / structured orthogonal random feature (for the dot product / RBF kernel) [YSC+16]
These methods are faster and more memory-efficient than canonical random features such as random Fourier, random kernel, etc.
You can use these structured random features as subroutines in your own random features; SignedCirculantRandomKernel is an example of this (it uses SignedCirculantRandomMatrix as a subroutine).
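Since the structured maps are ordinary transformers, they can also be swapped in wherever a dense random map would be used. The sketch below assumes a scikit-learn-style constructor for FastFood; the parameter names are assumptions and may not match the actual API::

    import numpy as np
    from pyrfm import FastFood  # assumed import path

    rng = np.random.RandomState(0)
    X = rng.randn(200, 64)  # structured transforms typically assume (or pad to)
                            # a power-of-two input dimension

    Z = FastFood(n_components=256, random_state=0).fit_transform(X)
    print(Z.shape)  # expected: (200, 256)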
Moreover, pyrfm provides data-dependent random feature methods [SD16] and additional random projection methods [LHC06] [WDA+09].
Linear Models Implemented
pyrfm supports the following solvers for linear models with random features:
- SparseMBRegressor / SparseMBClassifier: primal coordinate descent for the sparse S.Maji and A.Berg feature [MB09] [CHL08]
- AdaGradRegressor / AdaGradClassifier: AdaGrad for very-large-scale datasets; does not compute the random feature map of all examples at the same time (space efficient but slow) [DHS11]
- AdamRegressor / AdamClassifier: Adam for very-large-scale datasets; does not compute the random feature map of all examples at the same time (space efficient but slow) [KB15]
- DoublySGDRegressor / DoublySGDClassifier: doubly stochastic gradient method proposed in [DXH+14]
- SDCARegressor / SDCAClassifier: SDCA for very-large-scale datasets; does not compute the random feature map of all examples at the same time (space efficient but slow) [SSZ13]
- SGDRegressor / SGDClassifier: SGD/ASGD for very-large-scale datasets; does not compute the random feature map of all examples at the same time (space efficient but slow) [Bot10] [Bot12]
- SAGARegressor / SAGAClassifier: SAG/SAGA for very-large-scale datasets; does not compute the random feature map of all examples at the same time (space efficient but slow) [DBLJ14] [SLRB17]
All methods support squared loss for regression and hinge loss, squared hinge loss, and logistic loss for classification.
AdaGrad, Adam, SDCA, SGD/ASGD, and SAG/SAGA in pyrfm are intended for very-large-scale datasets for which computing the random feature matrix (i.e., computing the random features of all instances at the same time) causes a MemoryError. If you can allocate memory for the random feature matrix of your training data, you should use other implementations of linear models (linear_model in scikit-learn, sklearn-contrib-lightning, etc.). The DoublySGD solvers do not keep the random weights but sample them at each iteration, and they increase the number of random weight vectors at every iteration. We recommend batch_size > 1 and n_bases_sampled > 1. For more details, please see [DXH+14].
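As a hedged sketch of how these solvers are meant to be used: a transformer is passed to the solver, which then maps each mini-batch on the fly instead of materializing the full random feature matrix. The transformer keyword and the other arguments below are assumptions about the pyrfm API::

    import numpy as np
    from pyrfm import RandomFourier, AdaGradClassifier  # assumed import path

    rng = np.random.RandomState(0)
    X = rng.randn(10000, 50)
    y = np.sign(rng.randn(10000))  # labels in {-1, +1}

    # the random feature map of each mini-batch is computed inside fit,
    # so the full n_samples x n_components matrix is never allocated
    clf = AdaGradClassifier(transformer=RandomFourier(n_components=256,
                                                      random_state=0))
    clf.fit(X, y)
    print(clf.score(X, y))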
These stochastic solvers (except the DoublySGD solvers) run efficiently for the following random features:
- RBFSampler and SkewedChi2Sampler (in sklearn.kernel_approximation)
- SignedCirculantRandomProjection
- SubsampledRandomHadamardTransform
- AdditiveChi2Sampler (wrapper of AdditiveChi2Sampler in sklearn.kernel_approximation)
To improve efficiency, implement a cdef class and a cdef transform method for your transformer; see random_feature/random_features_fast.pyx/pxd. Although these stochastic solvers support any transformer, they might run unbelievably slowly when random_features_fast.pyx contains no cdef class and cdef transform method for your transformer.
The DoublySGD solvers support only RBFSampler, RandomFourier, RandomMaclaurin, RandomKernel, and SkewedChi2Sampler.
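A similarly hedged sketch for the DoublySGD solvers, which resample the random weights at each iteration instead of storing them. batch_size and n_bases_sampled are the parameters recommended to be larger than 1 above; the transformer keyword is an assumption about the pyrfm API::

    import numpy as np
    from pyrfm import RandomFourier, DoublySGDClassifier  # assumed import path

    rng = np.random.RandomState(0)
    X = rng.randn(5000, 30)
    y = np.sign(rng.randn(5000))

    # random weight vectors are sampled per iteration and their number grows,
    # so batch_size > 1 and n_bases_sampled > 1 keep the cost manageable
    clf = DoublySGDClassifier(transformer=RandomFourier(random_state=0),
                              batch_size=32, n_bases_sampled=8)
    clf.fit(X, y)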
We believe that these implementations are useful for research.
References
- ASO19
Kyohei Atarashi, Subhransu Maji, and Satoshi Oyama. Random feature maps for the itemset kernel. In Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence. 2019.
- Bot10
Léon Bottou. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT’2010, pages 177–186. Springer, 2010.
- Bot12
Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012.
- CHL08
Kai-Wei Chang, Cho-Jui Hsieh, and Chih-Jen Lin. Coordinate descent method for large-scale l2-loss linear support vector machines. Journal of Machine Learning Research, 9(Jul):1369–1398, 2008.
- CCFC02
Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. In International Colloquium on Automata, Languages, and Programming, 693–703. Springer, 2002.
- DXH+14
Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, 3041–3049. 2014.
- DBLJ14
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: a fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, 1646–1654. 2014.
- DHS11
John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
- FHL15
Chang Feng, Qinghua Hu, and Shizhong Liao. Random feature mapping with signed circulant matrix projection. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
- HXGD14
Raffay Hamid, Ying Xiao, Alex Gittens, and Dennis DeCoste. Compact random feature maps. In Proceedings of the International Conference on Machine Learning, 19–27. 2014.
- KK12
Purushottam Kar and Harish Karnick. Random feature maps for dot product kernels. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 583–591. 2012.
- KB15
Diederik P Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the International Conference on Learning Representations. 2015.
- LSarlosS13
Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the International Conference on Machine Learning, volume 85. 2013.
- LHC06
Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 287–296. ACM, 2006.
- MB09
Subhransu Maji and Alexander C Berg. Max-margin additive classifiers for detection. In Proceedings of the IEEE International Conference on Computer Vision, 40–47. IEEE, 2009.
- PP13
Ninh Pham and Rasmus Pagh. Fast and scalable polynomial kernels via explicit feature maps. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 239–247. ACM, 2013.
- RR08
Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Proceedings of the Advances in Neural Information Processing Systems, 1177–1184. 2008.
- SLRB17
Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.
- SSZ13
Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.
- SD16
Aman Sinha and John C Duchi. Learning kernels with random features. In Advances in Neural Information Processing Systems, 1298–1306. 2016.
- Tro11
Joel A Tropp. Improved analysis of the subsampled randomized hadamard transform. Advances in Adaptive Data Analysis, 3(01n02):115–126, 2011.
- WDA+09
Kilian Weinberger, Anirban Dasgupta, Josh Attenberg, John Langford, and Alex Smola. Feature hashing for large scale multitask learning. arXiv preprint arXiv:0902.2206, 2009.
- YSC+16
Felix Xinnan X Yu, Ananda Theertha Suresh, Krzysztof M Choromanski, Daniel N Holtmann-Rice, and Sanjiv Kumar. Orthogonal random features. In Proceedings of the Advances in Neural Information Processing Systems, 1975–1983. 2016.