# Bundles

One of the initial steps is to seed CodaLab with the standard and
state-of-the-art algorithms, as well as popular datasets, in machine learning,
NLP, and computer vision. This document keeps track of the programs and
datasets to be uploaded to CodaLab, and provides guidelines on how to do this.

This list is necessarily incomplete. One strategy is to read a couple of
papers in an area (e.g., collaborative filtering), see what the standard
datasets are, acquire those, and then obtain the implementations from those
papers and reproduce the results.

Recall that each program and dataset is a Bundle, which is either:

- Just a directory that contains files, or
- References to other Bundles (typically a program and a dataset) and a command
  to run.

Here are some guidelines:

- Document everything you do in the description of the Bundle.
- Create one Bundle that represents the raw data or code from the source
  without modifications; it should basically be just an unpacking of the
  downloaded zip file.
- If code needs to be compiled, create a Bundle to do that, where the command
  is the compilation command (e.g., `make`).
- If the data is in a non-standard format (for that task), then create another
  Bundle whose command does the conversion. For example, sequence tagging
  should use the CoNLL shared task format.
- Programs often have many ways of invoking them. Pick a few representative
  settings and a small sample dataset, create a run, and document it.

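To make the format-conversion guideline concrete, here is a minimal Python sketch that writes sentences of (token, tag) pairs in the tab-separated, blank-line-delimited column layout used by CoNLL-style sequence-tagging tasks. The two-column layout, function name, and input structure are simplifications chosen for illustration, not the exact shared-task specification.

```python
def to_conll(sentences):
    """Render sentences as CoNLL-style columns: one token per line,
    token and tag separated by a tab, sentences separated by a blank line.

    sentences: list of sentences, each a list of (token, tag) pairs.
    (Hypothetical input structure, for illustration only.)
    """
    lines = []
    for sentence in sentences:
        for token, tag in sentence:
            lines.append(f"{token}\t{tag}")
        lines.append("")  # blank line marks the sentence boundary
    return "\n".join(lines)

example = [[("John", "B-PER"), ("lives", "O")], [("Bye", "O")]]
print(to_conll(example))
```

A conversion Bundle would wrap a script like this, with a reader for the source format on the input side.
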
### Utilities

- A converter between CSV and TSV formats.
- Programs that plot curves.

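The CSV-to-TSV direction of the converter can be sketched with Python's standard `csv` module, which handles quoted fields correctly. Operating on in-memory strings here is just for illustration; a real utility Bundle would take file paths on the command line.

```python
import csv
import io

def csv_to_tsv(csv_text):
    """Re-serialize CSV text as TSV, preserving quoted fields."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    csv.writer(out, delimiter="\t", lineterminator="\n").writerows(rows)
    return out.getvalue()

# A field containing a comma needs quotes in CSV but not in TSV.
print(csv_to_tsv('a,b\n"1,5",2\n'))
```
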
### Learning algorithms

- [Weka](http://www.cs.waikato.ac.nz/ml/weka/): a comprehensive Java library
  with many different algorithms.
- [scikit-learn](http://scikit-learn.org/stable/): a popular Python library,
  good for prototyping.
- [R](http://cran.us.r-project.org/)
- [Matlab](http://www.mathworks.com/discovery/machine-learning.html): licensing is tricky.
- [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/)

We should make sure we have solid implementations of the following algorithms:

- Naive Bayes
- K-nearest neighbors
- Boosted decision trees
- Logistic regression (batch or stochastic updates)
- SVM (batch or stochastic updates)

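As a sanity check that these baselines are easy to express, here is a toy k-nearest-neighbors classifier in pure Python. This is a sketch only, with names made up for the example; the actual bundles should wrap mature implementations such as the libraries listed above.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest
    to query (Euclidean distance).

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b")]
print(knn_predict(train, (0.2, 0.5)))  # nearest neighbors vote "a"
```
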
### Standard machine learning datasets

- [UCI repository](http://archive.ics.uci.edu/ml/): contains many classification and regression datasets.
- Collaborative filtering?
- Ranking?

### NLP datasets

- [CoNLL](http://www.clips.ua.ac.be/conll2003/): each year, CoNLL runs a Shared
  Task, which is a competition with a dataset.

We should get coverage of the following tasks:

- Named-entity recognition (CoNLL shared tasks 2002, 2003)
- Semantic role labeling (CoNLL shared tasks 2004, 2005)
- Dependency parsing (CoNLL shared task 2006)
- Coreference resolution (MUC, CoNLL)
- Text classification (Reuters, 20 Newsgroups, sentiment, spam)
- Constituency parsing (Wall Street Journal, [Google Web Treebank Weblogs](http://mlcomp.org/datasets/1002))
- Machine translation ([NIST competition](http://www.nist.gov/itl/iad/mig/openmt12.cfm))

### Vision datasets

- [CIFAR](http://www.cs.toronto.edu/~kriz/cifar.html)
- [STL10](http://www.stanford.edu/~acoates/stl10/)
- [CV papers](http://www.cvpapers.com/datasets.html): an impressive list of
  computer vision datasets for detection, classification, recognition,
  segmentation, etc.