
Commit 03b52b6

committed Sep 25, 2013
add bundles list
1 parent eb27cdd commit 03b52b6

3 files changed: +102 -15 lines changed

‎.gitignore

-1
@@ -6,7 +6,6 @@ codalab/media/**/**
 
 # Bundles prototype #
 #####################
-bundles/pliang
 bundles/generated
 bundles/html_out
 bundles/README.html

‎bundles/BUNDLES.md

+85
@@ -0,0 +1,85 @@
+# Bundles
+
+One of the initial steps is to seed CodaLab with all the standard and
+state-of-the-art algorithms as well as popular datasets in machine learning,
+NLP, and computer vision. This document keeps track of the programs and
+datasets which are to be uploaded to CodaLab, as well as providing guidelines
+on how to do this.
+
+This is necessarily going to be an incomplete list. One strategy is to read a
+couple of papers in an area (e.g., collaborative filtering), see what the
+standard datasets are, acquire those, and then obtain the implementations from
+those papers and reproduce the results.
+
+Recall that each program and dataset is a Bundle, which is either:
+
+- Just a directory that contains files, or
+- References to other Bundles (typically a program and a dataset) and a command
+  to run.
+
+Here are some guidelines:
+
+- Document everything you do in the description of the Bundle.
+- Create one Bundle which just represents the raw data or code from the source
+  without modifications. It basically should be just an unpacking of the zip
+  file that is downloaded.
+- If code needs to be compiled, create a Bundle to do that, where the command
+  is the compilation command (e.g., `make`).
+- If the data is in a non-standard format (for that task), then create another
+  Bundle where the command does the conversion. For example, sequence tagging
+  should use the CoNLL shared task format.
+- Programs will often have many ways of invoking them. Pick a few
+  representative settings, and a small sample dataset, and create a run and
+  document this.
+
+### Utilities
+
+- Converter between csv, tsv formats.
+- Programs that plot curves.
+
+### Learning algorithms
+
+- [Weka](http://www.cs.waikato.ac.nz/ml/weka/): a comprehensive Java library
+  with many different algorithms.
+- [scikit-learn](http://scikit-learn.org/stable/): a Python library which is
+  popular and good for prototyping.
+- [R](http://cran.us.r-project.org/)
+- [Matlab](http://www.mathworks.com/discovery/machine-learning.html): licensing is tricky.
+- [Vowpal Wabbit](https://github.com/JohnLangford/vowpal_wabbit/)
+
+We should make sure we have good solid implementations for the following algorithms:
+
+- Naive Bayes
+- K-nearest neighbors
+- Boosted decision trees
+- Logistic regression (batch or stochastic updates)
+- SVM (batch or stochastic updates)
+
+### Standard machine learning datasets
+
+- [UCI repository](http://archive.ics.uci.edu/ml/): contains many classification and regression datasets.
+- Collaborative filtering?
+- Ranking?
+
+### NLP datasets
+
+- [CoNLL](http://www.clips.ua.ac.be/conll2003/): each year, CoNLL runs a Shared
+  Task, which is a competition with a dataset.
+
+We should get coverage on the following tasks:
+
+- Named-entity recognition (CoNLL shared task 2002, 2003)
+- Semantic role labeling (CoNLL shared task 2004, 2005)
+- Dependency parsing (CoNLL shared task 2006)
+- Coreference resolution (MUC, CoNLL)
+- Text classification (Reuters, 20 news groups, sentiment, spam)
+- Constituency parsing (Wall Street Journal, [Google Web Treebank Weblogs](http://mlcomp.org/datasets/1002))
+- Machine translation [NIST competition](http://www.nist.gov/itl/iad/mig/openmt12.cfm)
+
+### Vision datasets
+
+- [CIFAR](http://www.cs.toronto.edu/~kriz/cifar.html)
+- [STL10](http://www.stanford.edu/~acoates/stl10/)
+- [CV papers](http://www.cvpapers.com/datasets.html): an impressive list of
+  computer vision datasets for detection, classification, recognition,
+  segmentation, etc.
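
The Bundle definition in `bundles/BUNDLES.md` above (either a directory of files, or references to other Bundles plus a command to run) can be illustrated with a small data model. This is only a sketch under stated assumptions: the class layout, the field names (`path`, `dependencies`, `command`), and the example Bundle names are hypothetical and are not taken from the CodaLab code.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Bundle:
    """Hypothetical Bundle record; every field name here is an assumption."""

    name: str
    description: str = ""
    # Directory Bundle: path to the directory that contains the files.
    path: Optional[str] = None
    # Run Bundle: named references to other Bundles plus a command to run.
    dependencies: Dict[str, "Bundle"] = field(default_factory=dict)
    command: Optional[str] = None

    def __post_init__(self) -> None:
        has_path = self.path is not None
        has_command = self.command is not None
        # A Bundle is exactly one of the two kinds described in BUNDLES.md.
        if has_path == has_command:
            raise ValueError("set exactly one of `path` or `command`")
        if has_path and self.dependencies:
            raise ValueError("a directory Bundle has no dependencies")

    @property
    def is_run(self) -> bool:
        """True when this Bundle is a command over other Bundles."""
        return self.command is not None


# Raw code, unpacked from the downloaded archive without modifications.
raw_code = Bundle(
    name="tagger-raw",
    description="Unmodified unpacking of the source zip.",
    path="tagger-raw/",
)

# Compilation is its own Bundle whose command is the build step.
compiled = Bundle(
    name="tagger-build",
    description="Built from tagger-raw.",
    dependencies={"src": raw_code},
    command="make",
)
```

Keeping the raw download and the compile step as separate Bundles follows the guidelines in the file: the raw Bundle stays an untouched copy of the source, and each later step is documented by its own command.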

‎bundles/README.md

+17 -14
@@ -1,18 +1,9 @@
 ### Standard programs and datasets
 
-To download some relevant programs and datasets, type:
+To get started, download some relevant programs and datasets, type:
 
     ./download.sh
 
-### Worksheet prototype
-
-To run the test the Worksheet system, type:
-
-    python controller.py
-
-This should generate use the Bundles/Worksheets in `pliang` and generate new
-ones in `generated`. The HTML visualization of the output is in `html_out`.
-
 ### Command-line utility prototype
 
 To start the worker process:
@@ -24,7 +15,7 @@ Now you can run commands to upload programs/datasets and run them.
     ./basic_ml.sh
 
 There are several design changes in the new command-line utility prototype
-(compared to the worksheet prototype):
+(compared to the worksheet prototype `controller.py`, see below):
 
 - Bundles have a much more uniform design more centered around running commands
   and dependency management. Now, a Bundle can independently have
@@ -35,6 +26,18 @@ There are several design changes in the new command-line utility prototype
 - Programs are *uploaded* from the `pliang` directory rather than just running
   in place.
 - A sqlite database is used to store all the Bundle information (so it will be
-  more similar to the final version).
-- All execution is done in a scratch directory, which is more safe. This also
-  makes the data copying logic more explicit.
+  more similar to the final web-based version).
+- All execution is done in a scratch directory. This also makes the data
+  copying logic explicit.
+
+### Worksheet prototype
+
+The Worksheet prototype is outdated and needs to be integrated with the new
+command-line prototype schema.
+
+But if you still want to run the Worksheet system, type:
+
+    python controller.py
+
+This should generate use the Bundles/Worksheets in `pliang` and generate new
+ones in `generated`. The HTML visualization of the output is in `html_out`.
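
The README change above states that all execution happens in a scratch directory, which makes the data copying explicit. One way this could look is sketched below. The README gives no implementation details, so the function name `run_in_scratch`, its parameters, and the policy of copying back only newly created files are all assumptions made for illustration.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Dict, List


def run_in_scratch(command: List[str], inputs: Dict[str, Path], output_dir: Path) -> int:
    """Run `command` in a fresh scratch directory (hypothetical helper).

    `inputs` maps the name an input gets inside the scratch directory to
    its source path. Returns the exit code of the command.
    """
    with tempfile.TemporaryDirectory(prefix="bundle-scratch-") as scratch:
        scratch_path = Path(scratch)

        # Explicit copy-in: the command only ever sees copies of its inputs.
        for name, source in inputs.items():
            target = scratch_path / name
            if source.is_dir():
                shutil.copytree(source, target)
            else:
                shutil.copy2(source, target)

        result = subprocess.run(command, cwd=scratch_path)

        # Explicit copy-out: keep whatever the command created, skip the inputs.
        output_dir.mkdir(parents=True, exist_ok=True)
        for entry in scratch_path.iterdir():
            if entry.name in inputs:
                continue
            destination = output_dir / entry.name
            if entry.is_dir():
                shutil.copytree(entry, destination, dirs_exist_ok=True)
            else:
                shutil.copy2(entry, destination)

        return result.returncode
```

Because the command runs with the scratch directory as its working directory, it cannot accidentally modify the uploaded originals, and the scratch directory is deleted when the `with` block exits.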
